Social Media as a Leading Indicator of Markets and Predictor of Voting Patterns


Social Media as a Leading Indicator of Markets and Predictor of Voting Patterns

Mattias Lidman

April 19, 2011
Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Martin Berglund
Supervisor at Nomura Umeå: Daniel Brändström
Examiner: Fredrik Georgsson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

In recent years the use of social media has risen significantly, and much of the content, produced by millions of individual users, is freely accessible. This suggests that there may exist methods for determining aggregate sentiment and levels of interest for these millions of users. It may also be possible to use social media data to find correlations with, or even predict, real-world phenomena governed by human behavior. This thesis project investigates some such phenomena, with special attention paid to stock market movements. To this end, more than 9 million messages posted to Twitter over a five-month period are analyzed. Social media indicators are constructed based on message volume as well as aggregate sentiment, as determined by a new sentiment analysis method using tiered lexica. The project demonstrates that there are clear correlations between social media chatter and movements of the NASDAQ stock exchange, as well as indications that social media data may at times be used to predict such market movements. A method of automating a certain class of market predictions based on social media data is also proposed and implemented. The project also demonstrates that social media data may be used to predict competitions determined by public voting, in this case the music competition Idol. The report concludes that there is major potential for further developments, both in the specific use-cases and methods featured in this project and in the field of social media analysis in general.


Contents

1 Introduction
   1.1 Hypotheses and Problem Statements
   1.2 Why Twitter?
   1.3 Thesis Outline
2 Background
   2.1 Twitter
      2.1.1 The Twitter API and Third Party Apps
      2.1.2 The Demographics of Twitter Users
   2.2 Predicting Financial Market Developments
      2.2.1 The Efficient-Market Hypothesis and the Role of Sentiment in Financial Markets
   2.3 Sentiment Analysis and Aggregate Opinion Mining
      2.3.1 A Review of Approaches
      2.3.2 Notable Results
      2.3.3 Summary of Findings
3 System Design and Implementation
   3.1 Introduction
   3.2 Twitter Client
      3.2.1 Redundancy, Monitoring and Notification
   3.3 Tweet Parser
      3.3.1 Description of Parsing Process
      3.3.2 Description of Output Data
      3.3.3 Tiered Lexica
   3.4 Sentiment Analyzer
      3.4.1 Algorithmic Description of Sentiment Analysis
      3.4.2 Examples
      3.4.3 Description of Lexica
   3.5 Query Processing
      3.5.1 Description of Raw Statistics
      3.5.2 Determining Number of Daily Tweets and the Difference Between hype and tot cnt
   3.6 Statistical Analysis Tools
   3.7 Visualization Tools
4 Predicting Idol Results
   4.1 Hypothesis
   4.2 Method
   4.3 Result
   4.4 Statistical Evaluation of Results
      4.4.1 Determining the Probability Mass Function of a Single Competition
      4.4.2 Determining the Probability Mass Function of a Set of Competitions
   4.5 Conclusions and Discussion
5 Correlating Twitter Message Volume and News Events
   5.1 Hypothesis
   5.2 Method
   5.3 Results
      5.3.1 Google
      5.3.2 Microsoft
      5.3.3 Cisco
      5.3.4 Oracle
      5.3.5 Intel
   5.4 Conclusions and Discussion
6 Stock Market Correlations
   6.1 Hypotheses and Problem Statement
   6.2 Method
      6.2.1 Description of Gathered Data
      6.2.2 Summary of Time Series
      6.2.3 Vector Notation
      6.2.4 Derivatives
      6.2.5 Correlation Tests
      6.2.6 Heuristics for Evaluating Correlations
      6.2.7 Meta Analyses
   6.3 Results
      6.3.1 Stock Market Correlations Related to Trade Volume
      6.3.2 Stock Market Correlations Related to Stock Value
      6.3.3 Results of Meta Analyses
   6.4 Conclusions and Discussion
      6.4.1 On the Interpretation of Statistical Results
      6.4.2 Meta Analyses
7 Predicting Market Movements
   7.1 Problem Statement
   7.2 Method
   7.3 Results
      7.3.1 Example Using Hypothetical Data
      7.3.2 Examples Using Real-World Data
   7.4 Conclusions and Discussion
      7.4.1 Evaluating the Prediction Method
8 Conclusions and Discussion
   8.1 Summary of Results
      8.1.1 Can Twitter Indicators Predict Market Behavior?
   8.2 Some Potential Dataset Problems in Social Media Analysis
      8.2.1 Demographic Tilting
      8.2.2 Tilting Due to Automatic Postings
      8.2.3 Tilting Due to Public Relations Efforts and Targeted Attacks
      8.2.4 Summary
   8.3 Evaluations of Methods
      8.3.1 Choice of Search Terms
      8.3.2 Sentiment Analysis
      8.3.3 Filtering Levels
   8.4 On the Tailedness of Markets and Social Media Data
   8.5 Some Final Thoughts
      8.5.1 Twitter's Reaction Time
      8.5.2 Fuzzy Correlations
      8.5.3 Buying on Rumor and Selling on News
9 Acknowledgments
   9.1 Thanks!
   9.2 Open Source Acknowledgments
References
A Further Examples of Twitter/Market Correlations


List of Figures

3.1 The primary software components developed during the project
4.1 Number of Twitter mentions of each Idol contestant; the bars of contestant(s) who received the fewest votes are marked red
5.1 Message volume concerning GOOG
5.2 Message volume concerning MSFT
5.3 Message volume concerning CSCO
5.4 Message volume concerning ORCL
5.5 Message volume concerning INTC
6.1 Relationship between Twitter indicator Total count, no URLs and market indicator Volume for Apple
6.2 Relationship between Twitter indicator Total count, no URLs or RTs and market indicator Volume for Google
6.3 Relationship between Twitter indicator Total count, no URLs and market indicator Volume for Intel
6.4 Relationship between Twitter indicator Total count and market indicator Volume for Cisco
6.5 Relationship between Twitter indicator Median differential of positive value and market indicator Median differential of volume for Amgen
6.6 Relationship between Twitter indicator Positive value, no URLs or RTs and market indicator Volume for Qualcomm
6.7 Relationship between Twitter indicator Median differential of positive value, no URLs or RTs and market indicator Median differential of volume for Qualcomm
6.8 Relationship between Twitter indicator Median differential of positive proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple
6.9 Relationship between Twitter indicator Median differential of negative proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple
6.10 Relationship during November 2010 between Twitter indicator Median differential of positive proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple
6.11 Relationship during November 2010 between Twitter indicator Median differential of negative proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple
6.12 Relationship between Twitter indicator Positive value, no URLs or RTs and market indicator Median differential of close for Qualcomm
6.13 Proportion of high-value correlations related to each market statistic using three different limits
6.14 Proportion of high-value correlations related to each market statistic using the alternate method
6.15 Proportion of high-value correlations related to each filtering level using three different limits
6.16 Proportion of high-value correlations related to each filtering level using the alternate method
6.17 Number of significant correlations at the 0.1% significance level for 0-7 days of lag
6.18 Anscombe's quartet: an example of four very different datasets with very similar summary statistics
7.1 Example data showing three event types: market leads Twitter reaction, Twitter leads market reaction, and both react simultaneously
7.2 The prediction method applied to example data
7.3 Prediction indicator for Qualcomm
7.4 Prediction indicator for DirecTV
7.5 Prediction indicator for Cisco, first time period
7.6 Prediction indicator for Cisco, second time period
8.1 Episode of the web-comic xkcd illustrating Twitter's reaction time
A.1 Relationship between Twitter indicator Median differential of hype, no URLs and market indicator Median differential of volume for Apple
A.2 Relationship between Twitter indicator Bollinger bandwidth of positive value, no URLs and market indicator Bollinger bandwidth of volume for Apple
A.3 Relationship between Twitter indicator Positive value and market indicator Volume for Amgen
A.4 Relationship during December 2010 between Twitter indicator Total count and market indicator High for Amazon
A.5 Relationship between Twitter indicator Rate-of-change of total count, no URLs and market indicator Rate-of-change of volume for Amazon
A.6 Relationship between Twitter indicator Absolute median differential of total count and market indicator Absolute median differential of volume for Cisco
A.7 Relationship between Twitter indicator Median differential of total count, no URLs or RTs and market indicator Median differential of volume for Google
A.8 Relationship between Twitter indicator Absolute median differential of total count, no URLs or RTs and market indicator Absolute median differential of volume for Google
A.9 Relationship during January 2011 between Twitter indicator Median differential of total count, no URLs and market indicator Median differential of volume for Microsoft
A.10 Relationship between Twitter indicator Total count, no URLs and market indicator Volume for Oracle


List of Tables

3.1 Examples of three sentiment lexica L1, L2 and L3 with different priorities and weights
3.2 Examples of sentiment analysis using tiered and weighted lexica
3.3 Default priority and weight for the lexica used in the project
6.1 Summary statistics for correlations related to trade volume, using unlagged data
6.2 Summary statistics for correlations related to trade volume, with the market indicator lagged by one day compared to the Twitter indicator
6.3 Summary statistics for correlations related to stock value, using unlagged data
6.4 Summary statistics for correlations related to stock value, with the market indicator lagged by one day compared to the Twitter indicator
6.5 Number of high-value correlations related to each market statistic using three different limits with the standard method, as well as the alternate method
6.6 Number of high-value correlations related to each filtering level using three different limits with the standard method, as well as the alternate method
6.7 Number of significant correlations at the 0.1% significance level for 0-7 days of lag
6.8 Number of high-value correlations related to each sentiment-related indicator set using three different limits with the standard method, as well as the alternate method
A.1 Summary statistics for Figures A.1 to A.10 using unlagged indicators
A.2 Summary statistics for Figures A.1 to A.10, with the market indicator lagged by one day compared to the Twitter indicator


Chapter 1
Introduction

The goal of this project is to investigate correlations between content posted to social media and various real-life phenomena, and to what extent such correlations can be used to predict future events. More specifically, large numbers of messages sent to the Twitter^1 microblogging service will be gathered and analyzed, both with respect to content and message volume. The goal of these analyses is to investigate if there is a link between message streams and outcomes of televised competitions determined by public voting, major news events concerning certain companies, and stock market movements.

The following chapter describes in broad terms the goals of this project. The chapter also provides arguments for why Twitter presents a particularly suitable source of data for a project such as this. The chapter ends with an outline of the remaining chapters of this thesis.

1.1 Hypotheses and Problem Statements

The overarching goal of this project is to investigate ways in which Twitter data can be found to correlate with real-world phenomena, and to what extent such correlations can be used to predict future events. In practical terms, this is tested by formulating and investigating a number of hypotheses. More formal definitions of these hypotheses are deferred until they can be placed in their proper contexts in later chapters. However, in order to provide an overview of what this project aims to test, they are listed here in more general terms:

1. Twitter data can be used to predict public voting patterns. This is tested in Chapter 4, by attempting to predict outcomes of the Idol television program.

2. News events concerning a subject will result in a significantly higher Twitter message volume concerning that subject. This is tested in Chapter 5.

3. Indicators derived from Twitter data can be found to correlate with market data, such as stock prices. This is tested in Chapter 6.

^1 Twitter is a trademark of Twitter, Inc.

4. Indicators derived from Twitter data will sometimes foreshadow market movements, such as changes in stock prices. This is also tested in Chapter 6.

In addition to the listed hypotheses, the following problem statements are also central to this work:

1. Statistically evaluate the methods used to find correlations between Twitter data and market movements. This is done in Chapter 6.

2. Given the occurrence of instances where Twitter data appears to foreshadow market movements, implement a method for automatically detecting such instances. This is carried out in Chapter 7.

1.2 Why Twitter?

The following section presents reasons why Twitter provides a compelling source of data for social media analysis in general, and for this project in particular, given that its use-cases include financial applications. To this end, Twitter is compared to other available social media services.

To begin with, Twitter is an open service, in the sense that all messages are freely available. This is in contrast with services such as Facebook, where status updates are typically only available to Facebook friends of the poster. Twitter also has a quite permissive API. The API makes the data gathering process technically easy when compared to, for example, scraping data from blogs, as blogs may be formatted in any number of ways.

Further, Twitter's userbase is well suited for a project such as this. The service must be widely used, because large amounts of data are required. As is shown in Section 3.5.2, at the beginning of this project Twitter carried roughly 100 million messages every day. It also matters who the users are. According to [38], Twitter users are much more likely to follow brands or companies than the average social network user; a full 51% of Twitter users answered yes to the question "Do you follow/friend any brands or companies on social networks?", compared to only 16% among social network users in general.
This suggests that Twitter users may be likely both to be sensitive to and to help shape market trends. Finally, Twitter's limit of 140 characters and the ease of submitting a message provide two benefits. First, messages are unlikely to contain complex trains of thought, which simplifies automated sentiment analysis. Second, the format provides a low threshold for posting, as the expectation is for a tweet to contain off-the-cuff thoughts or summary information concerning an unfolding event. The threshold is lowered further by the rising ubiquity of mobile devices featuring Twitter clients. This is an advantage in particular when compared to using blogs as a data source.

1.3 Thesis Outline

Chapter 2 provides background information. This includes information about Twitter, a review of earlier research on the topics of social media analysis and sentiment analysis, and a description of the problem of predicting market movements.

Chapter 3 describes the implemented software components. This includes a description of some of the general methods used in later chapters, including sentiment analysis.

Chapters 4-7 describe a set of experiments. Each chapter follows a hypothesis-method-results-conclusion outline. The subject of each chapter is as follows:

Chapter 4 investigates if Twitter message flows can be used to predict the outcomes of the televised music competition Idol before the official results are announced.

Chapter 5 investigates how the publication of news stories concerning certain companies affects the message volume concerning those companies.

Chapter 6 investigates if, and to what extent, Twitter message flows can be found to correlate with stock market movements. The chapter also investigates the possibility that social media data may be used to predict future stock market movements.

Chapter 7 proposes and evaluates a method for predicting a certain class of swings in market indicators based on historic information derived from social media.

Chapter 8 provides conclusions and discussions related to the project as a whole, in addition to the experiment-specific conclusions and discussion in Chapters 4-7. The chapter also discusses some issues related to social media analysis in general.

Chapter 9 lists acknowledgments.

Appendix A provides some further examples of correlations between Twitter data and stock market indicators, in addition to those given in Chapter 6.


Chapter 2
Background

The following chapter presents background information on several topics concerning this project. This includes a description of the Twitter service and the demographics of its users, background information on some approaches to predicting market developments and the concept of informationally efficient markets, and a review of different approaches to sentiment analysis, with special focus on determining aggregate sentiment.

2.1 Twitter

Twitter is a service for social networking and microblogging. The service allows users to post short, publicly readable text messages, called tweets, of up to 140 characters, which are then displayed on the user's profile page. The tweets are also available through searching, and may be grouped into different topics by using so-called hashtags: prefixing a word with the symbol # indicates that the tweet belongs to that particular topic. For example, the message "#Apple announced the new #iphone today" would belong to the topics Apple and iphone. Similarly, @ followed by a username may be used to mention or direct a tweet to a particular user. The service was launched in 2006 and is reported to have 175 million registered users as of late October 2010 [34]. It was found in this study that an average of slightly below 100 million tweets are posted each day (see Section 3.5.2).

2.1.1 The Twitter API and Third Party Apps

Twitter exposes a quite permissive API [37] for the use of third party developers. This has led to a plethora of client applications for a number of different platforms, as well as analysis, statistics and aggregation tools. This openness, coupled with the huge Twitter user base, means that large amounts of text messages can be gathered reliably in a way that is technically non-prohibitive.

2.1.2 The Demographics of Twitter Users

When analyzing a community (such as the set of Twitter users) in order to draw conclusions about some wider, underlying population (such as the public at large) or some subset thereof

(such as potential smartphone buyers, or stock market traders), it is important to know to what extent the former models the latter. One might suspect that the members of a community such as Twitter would be demographically tilted in various ways, for example with respect to age, affluence and geographic location. If this is the case, it may be important to keep in mind when trying to draw conclusions from a corpus of tweets.

Demographics of American Users

Several studies have looked at the demographics of American Twitter users. Some of the (partly contradictory) results are listed below:

- 19% of Internet users said in October 2009 that they use Twitter or some other service for sharing updates about themselves, marking a sharp increase since April 2009, when the corresponding figure was 11% [19]. For the 18-29 age group the figure is 33%, while for ages 50-64 it is only 9% [19]. These figures are, however, seemingly contradicted by [14], which claims that the percentages of Twitter users in these age groups are nearly equal.

- The median age of Twitter users is 31 [19].

- Twitter users are more likely to live in lower-income households, according to [27]. The authors attribute this to the relative youth of Twitter users. This is however contradicted by [38], which claims that Twitter users are more likely to come from higher-income households.^1

- Twitter users are somewhat more likely than the average Internet user to live in urban areas [27].

- Twitter users are much more likely than the average Internet user to connect wirelessly to the Internet and to own more than one device capable of connecting to the Internet [19, 27, 38].

- Twitter users are better educated than the public at large, with 63% having attended at least four years of college, compared to 40% for the total American population [38].
The possible effects of demographic tilting are discussed in Chapter 8.

Demographics of Swedish Users

Since Chapter 4 will deal with predicting the results of a Swedish televised competition, it is useful to also have a sense of Twitter usage in Sweden. This is investigated in [12], a study conducted in December 2010. The study concludes that there are roughly 36,000 active Twitter users in Sweden, where an active user is defined as a user with at least one tweet in the last 30 days and at least three tweets in total. With a population of 9.3 million people, this means that approximately 0.39% of all Swedes are active Twitter users. Further, the study concludes that the message frequency is very unevenly distributed among users, with 6% of users (or 0.023% of the full population) contributing 68% of the message volume.

^1 One possible explanation for this disparity is that the latter study compensates for age factors; this is not, however, stated by the authors.
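The hashtag and @-mention conventions described in Section 2.1 are simple enough to extract programmatically. The following sketch is illustrative only and not the thesis implementation; the regular expressions are simplifying assumptions (real tweet tokenization must also handle URLs, punctuation and Unicode edge cases):

```python
import re

# Hypothetical helper: extract the topic (#hashtag) and mention (@username)
# conventions from a tweet's text, lowercased for aggregation.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

def extract_tags(tweet):
    """Return (hashtags, mentions) found in a tweet."""
    hashtags = [m.lower() for m in HASHTAG_RE.findall(tweet)]
    mentions = [m.lower() for m in MENTION_RE.findall(tweet)]
    return hashtags, mentions

tags, users = extract_tags("#Apple announced the new #iPhone today, says @analyst")
# tags  -> ['apple', 'iphone']
# users -> ['analyst']
```

Grouping millions of tweets by the extracted topics is then a straightforward counting exercise, which is the basis of the volume indicators used later in the thesis.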

2.2 Predicting Financial Market Developments

At the heart of attempts at predicting financial market developments is the goal of "beating the market": using available information to consistently achieve a higher rate-of-return than the market average. One approach is technical analysis, the goal of which is to forecast future market developments using historical market data such as price and trading volume, and to define trading rules based on those forecasts. While the use of forecasting strategies is widespread, opinions of their usefulness are mixed within the scientific community, as is the viability of the very idea of beating the market. One notable example of such criticism is the book A Random Walk Down Wall Street by Princeton economist Burton Malkiel [29]. Malkiel analyzes investment techniques including technical analysis and concludes that a buy-and-hold strategy is the preferable option for most investors over the long run. Malkiel's book is also the origin of the colorful assertion that a blindfolded chimpanzee throwing darts at the Wall Street Journal would be just as effective at selecting a stock portfolio as a financial expert.

A survey of 95 empirical studies of technical analysis conducted between 1988 and 2004 concluded that while a majority of the studies found positive results, most of the studies suffered from various problems in methodology. The survey also found significant differences between markets; in particular, studies indicate that technical trading rules did not generate economic benefits in US stock markets after the 1980s [36].

2.2.1 The Efficient-Market Hypothesis and the Role of Sentiment in Financial Markets

A theoretical underpinning of much of the criticism of financial forecasting is the efficient-market hypothesis (EMH).
The EMH states that prices in financial markets accurately reflect all currently available information [18], thereby making it impossible to consistently generate returns in excess of the market average [36]. A stricter definition is proposed by Jensen in [25]: "A market is efficient with respect to information set θ_t if it is impossible to make economic profits by trading on the basis of information set θ_t."

Formulations of the EMH differ with respect to the nature of the information said to be in θ_t. Weak-form efficiency states that θ_t contains the full price history of the market, and by extension that technical analysis cannot generate returns in excess of the market average. Semi-strong-form efficiency states that θ_t contains all publicly available information, and that prices will adapt rapidly to reflect new information. Strong-form efficiency asserts that even insider information is reflected in market prices. This form is, unlike the other two, rarely held to have basis in fact. This has the key implication that if information can be gleaned that is not in θ_t, it may be possible to use this information to generate an economic advantage.

While Jensen in [25] asserts that the empirical evidence is solidly in favor of the EMH, more recent work such as [6] offers a more mixed picture. The hypothesis, critics argue, struggles to explain phenomena such as sudden stock market crashes and speculative bubbles, as well as what former Federal Reserve chairman Alan Greenspan called "irrational exuberance" [23]. A rebuttal to these criticisms is offered by Malkiel in [30], where he argues that even though investors may act irrationally and produce bubbles, this offers no opportunities for rational investors to profit.
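The weak-form claim can be illustrated with a small simulation, which is not from the thesis; all parameters (drift-free walk, volatility, window size, seed) are arbitrary assumptions. On a simulated random-walk price series, the price process implied by weak-form efficiency, a simple moving-average trading rule has no systematic edge over buy-and-hold, which is precisely the EMH's point about technical analysis:

```python
import random

random.seed(42)

def random_walk(n=1000, start=100.0, sigma=1.0):
    """Simulate a drift-free random-walk price series (clamped above zero)."""
    prices, p = [], start
    for _ in range(n):
        p = max(0.01, p + random.gauss(0, sigma))
        prices.append(p)
    return prices

def moving_average_rule(prices, window=20, capital=100.0):
    """Hold the asset only while price is above its trailing mean."""
    cash, units = capital, 0.0
    for i in range(window, len(prices)):
        ma = sum(prices[i - window:i]) / window
        if prices[i] > ma and units == 0:      # buy signal
            units, cash = cash / prices[i], 0.0
        elif prices[i] <= ma and units > 0:    # sell signal
            cash, units = units * prices[i], 0.0
    return cash + units * prices[-1]

prices = random_walk()
buy_and_hold = 100.0 * prices[-1] / prices[0]
print(moving_average_rule(prices), buy_and_hold)
```

Averaged over many simulated walks, the two final values are statistically indistinguishable; a single run can favor either strategy by chance. The thesis's hypotheses 3 and 4 amount to asking whether Twitter data contains information that is not yet in θ_t.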

The criticism is often rooted in the field of behavioral finance, which argues that the decisions of investors are frequently based on cognitive biases when assessing information. Behavioral finance holds that investor sentiment, in addition to observed facts, plays a crucial role in investor decision making. This leads to over- or under-reactions to new information when compared to those of a fully rational observer. Examples of cognitive biases include overconfidence in one's own judgment, tendencies to disregard new information in clinging to previously held opinions, and wishful thinking [5]. Another example is the concept of home bias, described in [2], which is the tendency of investors to prefer domestic equity to a higher extent than the objective data would suggest is optimal. Moreover, these biases are held to produce a degree of predictability which may be used in forecasting. One such model of investor sentiment is proposed in [4]. An emerging field is quantitative behavioral finance, which attempts to use mathematical and statistical methods to explain patterns that seem inconsistent with traditional economic theories such as the EMH. An example of such work is [16].

2.3 Sentiment Analysis and Aggregate Opinion Mining

Sentiment analysis, or opinion mining, is the process of programmatically determining the opinion of a writer or set of writers with respect to some object or topic. The term aggregate opinion mining refers to approaches that are not primarily concerned with the individual sentiment of some well-defined set of authors or documents, but rather with questions such as what percentage of users in some forum, such as Twitter or even (adopting a slightly looser definition of "forum") the entire blogosphere, have a favorable or unfavorable view of something. These methods often aim to use aggregate sentiment as a proxy for, e.g., public opinion polls or stock market predictions.
As these are by no means solved problems, a number of different approaches have been adopted in recent articles. The following section describes some of them and discusses some key results. The section is concluded by a brief summary of findings.

2.3.1 A Review of Approaches

Lexical Approaches

Lexical approaches to sentiment analysis are methods that are at their core based upon matching a piece of text against a lexicon of words that have been tagged as expressing some sentiment. In [35], by O'Connor et al., Twitter messages are classified as either positive or negative by matching them against a subjectivity lexicon of positive and negative value words. The relatively simple approach nevertheless proves effective when the authors attempt to correlate results with public opinion polls, as is described below. The approach allows messages to be tagged as both positive and negative; this suggests that the method would not be as effective in evaluating larger bodies of text, as they would almost surely contain at least one value word of each polarity.

It may be desirable to adopt a different set of value words depending on the topic to be analyzed; in [21], Namrata et al. adopt a method of automatically generating sentiment lexica for different topics. A small set of topic-specific seed words is manually assigned polarity (positive or negative). The list is then expanded by recursively querying a lexical database for synonyms and antonyms of the seed words, assigning the same polarity to synonyms and opposite polarity to antonyms. A score is then assigned to each word which

decreases in relation to the number of steps separating the derived word from a seed word, and increases for each unique path found between the derived word and the seed word.

The authors of [21] also propose the use of polarity reversal when a word is preceded by a negation, and increasing or decreasing polarity strength when a word is preceded by a modifier. E.g., good = +1; not good = -1; very good = +2. The resulting lexicon was used in analyzing news and blog sentiment regarding a number of public figures. The results generally agreed with expectations, registering negative sentiment for criminals such as Slobodan Milosevic and Zacarias Moussaoui, positive sentiment for entertainers and sports stars, and mixed results for politicians.

In [15], Ding et al. seek to augment the lexical approach further by making it context sensitive. For example, consider a sentence such as "This camera takes great pictures and has long battery life." A naive lexical method would be able to infer that the sentence carries a positive sentiment about the camera's picture quality, since the word "great" is usually^2 an unambiguously positive marker. However, since a word like "long" is not intrinsically positive or negative, a naive lexical approach would not be able to determine if the view on the camera's battery life is positive or negative. To address this, the authors propose the intra-sentence conjunction rule, which states that a sentence expresses a single sentiment polarity unless it contains a word such as "but". Note for example that a sentence such as "This camera takes great pictures and the battery life is short" seems somewhat peculiar compared to "This camera takes great pictures but the battery life is short." Several other, similar rules are also proposed which exploit other linguistic conventions.
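The polarity-reversal and modifier rules described above can be sketched in a few lines. This is an illustration only, not the method of [21] or of this thesis; the tiny lexicon, negation list and modifier weights are made-up assumptions:

```python
# Illustrative lexicon and modifier tables (assumptions, not real lexica).
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 2, "extremely": 3}

def score(sentence):
    """Sum word polarities, flipping on negations and scaling on intensifiers."""
    total, flip, scale = 0, 1, 1
    for word in sentence.lower().split():
        if word in NEGATIONS:
            flip = -1                      # reverse polarity of the next value word
        elif word in INTENSIFIERS:
            scale = INTENSIFIERS[word]     # amplify the next value word
        elif word in LEXICON:
            total += flip * scale * LEXICON[word]
            flip, scale = 1, 1             # modifiers apply to one value word only
        else:
            flip, scale = 1, 1             # any other word resets pending modifiers
    return total

# score("good") == 1; score("not good") == -1; score("very good") == 2
```

The examples in the comment reproduce the good / not good / very good scoring from the text; a production system would also need stemming, multi-word negation scope and the context-sensitivity rules of [15].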
Other Approaches: Classifiers and Machine Learning

In contrast to the lexical approaches described above, Asur and Huberman in [3] constructed a sentiment analysis classifier using a linguistic analysis package. The classifier takes as input a large number of messages that have been manually tagged as expressing positive or negative sentiment. Using this training data, the classifier learns to classify other messages. Melville et al. in [32] investigate the possibility of combining the lexical approach with machine learning using training data, and present a combined model which, when tested, outperforms the other approaches in isolation.

Differences in Sentiment Modeling

Approaches also differ with respect to how they choose to model sentiment itself. While most authors seek to classify text only as positive or negative with respect to some subject, in [9] and later in [8] (discussed more extensively in the following section), Bollen et al. adopt a more complex model of writer sentiment in analyzing Twitter data. Rather than simply labeling each tweet as positive or negative, a 6-dimensional mood profile is assigned to each 140-character message.

2.3.2 Notable Results

Linking Sentiment on Social Media to Public Opinion

In [35], O'Connor et al. investigate the possibility of using Twitter sentiment as a proxy for public opinion polls.

Chapter 2. Background

Each tweet is first tagged as positive, negative, or both. Then, for a given topic word, a sentiment score is defined for each day t as:

x_t = p_pos / p_neg, (2.1)

where p_pos is the proportion of tweets mentioning the topic word that are marked as positive, and p_neg is the proportion marked as negative. Results are then smoothed using a moving average. The smoothed scores of certain topic words are then compared to public opinion polls spanning the same time period. The authors found significant correlations between sentiment levels for the topic word "jobs" and polls tracking consumer confidence between early 2008 and late 2009, and also between the topic word "Obama" and presidential approval ratings from 2009. The results are particularly impressive considering the long time frame of the observations.

Predicting Movie Box-Office Returns

In Predicting the Future With Social Media [3] the authors try to forecast box-office revenues of upcoming movies, testing the hypothesis that movies that are well talked about will be well watched. By analyzing 2.89 million tweets relating to 24 different movies, the authors manage to consistently outperform the predictive accuracy of the Hollywood Stock Exchange, an online play-money prediction market. The authors propose a linear regression model that incorporates both message volume and sentiment; message volume was found to be the most significant indicator, while sentiment gained in predictive power following the initial release.

Predicting the Stock Market Using Twitter

In Twitter Mood Predicts the Stock Market [8] the authors analyze Twitter messages gathered between February 28 and November 3, 2008 and compare the results to the Dow Jones Industrial Average (DJIA) for the same time period. The analysis is limited to tweets that contain phrases such as "I feel" and "I'm feeling" in order to identify messages which express writer sentiment. No filtering is done to limit the analysis to a particular topic.
Each 140-character message is assigned a 6-dimensional mood profile; the dimensions are calm, alert, sure, vital, kind and happy. Messages are also assigned a more general positive/negative value. The mood state time series were then normalized to a standard score based on a sliding window measure of sample mean and sample standard deviation; for day t the sliding window consists of measures from the date range [t − k, t + k]. The normalized time series were compared to DJIA stock market data using a Granger causality test. A Granger causality test is a measure of the predictive power of one time series over another. Informally, time series A is said to Granger cause time series B if A offers predictive power over future values of B in excess of what is contained in B alone. For each of the 7 measures of Twitter mood, the test was performed with the mood state time series lagged by 1–7 days compared to the DJIA data, for a total of 49 statistical tests. The authors report p-values (statistical significance) for the tests, finding that 4 tests were statistically significant at the 5% level; these results were all found in the mood dimension calm, using a lag of 2–5 days. Finally, the predictive power of the data is evaluated using a Self-organizing Fuzzy Neural Network model (SOFNN), which can use either only historic DJIA data as input,

or a combination of historic DJIA data and Twitter mood dimensions. The output of the model is a daily prediction of whether the DJIA will move up or down on the following day. Out of the entire dataset, spanning from February 28 to December 20, they isolate a period of 15 trading days where the prediction accuracy using only historic DJIA data is 73.3%, and the addition of the mood dimension calm improves the prediction accuracy to 86.7%. It is asserted that the probability of finding such a result by mere chance for a given 15 day period is 0.32%, and 3.4% over the full time period. Based on these results, it is suggested that the calm time series has predictive power over the DJIA. However, there are some reasons to maintain a degree of skepticism:

- The authors did not filter messages in any way which would single out tweets that are related to the stock market. This means that the dataset contains tweets concerning how people feel about the dinner they had yesterday, the latest Justin Bieber album, the US presidential campaign which was in full swing at the time, and a myriad other things. It is somewhat difficult to conceive of a model which predicts a causal relationship between Twitter calm as measured by the authors regarding this wide range of subjects, and stock market value at some time in the future.

- Of the p-values reported for the calm time series the most significant is the 2 day lag, and the 1 day lag is the least significant by a wide margin. This would imply, in effect, that investment decisions are affected by calm levels two days ago, but not yesterday's calm levels.

- Recall that a p-value is the probability of obtaining a result at least as extreme as observed if the null-hypothesis is true.
This means in effect that if we use entirely random data, we would expect 5% of the tests to be significant at the 5% level; in the case under consideration this translates to 49 · 0.05 = 2.45 cases, compared to the 4 cases that are actually observed. This result is therefore not a great deal better than what one would expect from pure chance. Different lags of the same time series are not independent, meaning that if a fluke causes a particular time series at a given lag to be statistically significant it would not be surprising to find that other lags are also significant.

- Note that the authors use a normalization method which for day t uses information from days t ± k; in effect using data from the very future they are attempting to predict.

- The evaluation of the significance of the SOFNN prediction results is statistically curious in two ways:
  - The authors use the odds of obtaining exactly the observed result rather than the odds of getting the observed result or better, which is the measure that is typically used.
  - The authors use a probability of 5% of correctly guessing any up or down movement. However, over this period the accuracy when using only historic market data was 73.3%, and this is therefore more appropriately the figure to beat.

Adopting these two changes, the probability of obtaining a result on par with what is observed over this particular time period by pure chance jumps from 0.32% to 19.32%. The probability of obtaining such a result anywhere over the full testing period will be higher still, but cannot be evaluated fully without access to the data.
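The multiple-testing point can be made concrete with a back-of-the-envelope calculation (not from the thesis): if the 49 tests were treated as independent under the null hypothesis, which the lag-dependence caveat above shows is optimistic, the chance of seeing at least 4 nominally significant results is sizable.

```ruby
# Rough check of the multiple-testing argument: assuming 49 independent
# tests under the null hypothesis, how often would 4 or more come out
# "significant" at the 5% level? (Independence is an assumption; the lags
# are in fact dependent, as noted above.)
def binomial_tail(n, k, p)
  # P(X >= k) for X ~ Binomial(n, p), summing the exact pmf
  (k..n).map do |i|
    choose = (1..i).reduce(1.0) { |c, j| c * (n - i + j) / j }
    choose * p**i * (1 - p)**(n - i)
  end.sum
end

expected = 49 * 0.05               # 2.45 expected false positives
tail = binomial_tail(49, 4, 0.05)  # roughly 0.23, about a one-in-four chance
```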

According to news articles these results will form the basis for a recently launched hedge fund set to start trading in February 2011 with an initial £25 million (approximately $39 million, or 264 million SEK) of real money [11, 26].

Summary of Findings

In summary:

- Approaches to sentiment analysis include the use of machine learning [32], classifiers [3], lexical approaches [35, 21], or some combination of methods.
- Measures of sentiment range from simple polarity (positive or negative) [3] or polarity adjusted by a strength factor [21] to more complex measures that attempt to model mood dimensions such as anger, tension and arousal [22, 9, 8].
- A quite simplistic lexical approach has proved effective when applied to Twitter data [35].
- Notable results of large scale sentiment analysis of online material include correlations with consumer confidence, public opinion polls, influenza rates in the US, movie box-office returns, and important news events [35, 14, 9, 3].

Chapter 3

System Design and Implementation

The following chapter describes the software components that have been developed as a part of the project. An overview is provided in Section 3.1, and succeeding sections describe specific components in more detail. All software has been developed and tested for the Linux platform. In spite of this, most code should with modest effort be portable to any platform supporting the programming languages Ruby and Matlab, and the database system KDB+.¹

3.1 Introduction

Figure 3.1 shows the primary software components that have been developed as a part of this project and the interactions between them, as well as the interaction with the Twitter Search API.

1. The Twitter Client continuously queries the Twitter Search API and writes the resulting tweets to disk. Tweets are encoded using JSON (JavaScript Object Notation), a structured data format.
2. The Tweet Parser transforms the raw JSON data into a CSV format more suitable for time series processing, and also uses the Sentiment Analyzer to determine message sentiment polarity.
3. The Query Processing aggregates the (typically many thousands of) tweets into time-sliced (usually daily) statistics suitable for statistical analysis.
4. The Statistical Analysis Tools are primarily concerned with determining levels of correlation between the statistics derived from tweets and some real world phenomenon (e.g., financial market data).
5. The Visualization Tools ensure that visualization (e.g., plotting of correlating time series) is done in a way that is simple, dynamic and reproducible.

¹ A Kx Systems trademark.
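As an illustration of step 2 in the list above, a minimal sketch of turning one JSON-encoded tweet into one CSV line. The JSON field names and the stubbed-out sentiment scores (0, 0) are assumptions for the example, not the project's actual parsing code:

```ruby
require 'json'

# Sketch of the JSON-to-CSV transformation performed by the Tweet Parser.
# Field names ("from_user_id", "created_at", ...) are assumed for the
# example; sentiment scoring is stubbed out as 0, 0.
def tweet_to_csv(json_line)
  t = JSON.parse(json_line)
  text = t["text"].to_s.upcase.tr(",", "-")  # uppercase for search, comma-safe
  [t["from_user_id"], t["to_user_id"], t["id"],
   t["created_at"], 0, 0, text, t["source"]].join(",")
end
```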

Figure 3.1: The primary software components developed during the project. The Twitter bird is a trademark of Twitter and is used with permission.

3.2 Twitter Client

The Twitter Client is responsible for collecting tweets and storing them for later processing. It does so by continuously querying the Twitter Search API for a given set of search terms. It is developed in the Ruby programming language. The behavior of the application is governed by a configuration file which in particular specifies the following parameters:

- A set of search terms.
- An ISO language code. Optional; if specified, searches will be restricted to retrieving tweets in the given language.
- A list of administrator e-mail addresses and a failure tolerance (see Section 3.2.1).
- A sleeptime. Specifies the time the Twitter Client will wait after each set of queries is issued. This is necessary not only in order to reduce load on the host where the application is running, but also because Twitter enforces an (undisclosed) search rate limit.

Each configuration file is in practice used to specify a topic (e.g., consumer electronics, NASDAQ symbols) by listing a number of search terms related to that topic. This separation of data is valuable at later processing stages because it allows queries to be run over only a subset of the gathered data. The fetching of messages is performed by issuing a query to the Twitter Search API for each listed search term. In order to avoid fetching (and filtering out) the same tweets over and over again the highest encountered tweet identifier associated with each search term is retained between queries and is used for the since id field in subsequent queries. The

since id field has the effect of restricting the message set returned by the API to tweets with a higher message ID, effectively only returning previously unseen messages. Tweets are received in a JSON encoded format. The Twitter Client defers all decisions of what to do with the data to later processing stages, and therefore writes the data exactly as received directly to disk. The application creates a new output file for each day.

3.2.1 Redundancy, Monitoring and Notification

Since the Twitter Client operates at near real-time it is imperative in a project such as this that the data collection be reliable, since a prolonged down-time would result in a loss of data that would be difficult to recover. Therefore, mechanisms of redundancy, monitoring and notification are used to ensure the reliability of the data gathering. Strictly speaking only notification is a part of the Twitter Client proper, but all three aspects are nevertheless described here:

- Redundancy: Each base configuration of the Twitter Client was at all times deployed on four different physical hosts.
- Monitoring: In order to make it as easy as possible to ensure that all sessions are operating normally a simple CGI script was developed that lists the Twitter Client processes on each host, and the filenames and sizes of the output data files.
- Notification: The Twitter API will sometimes return an error message rather than valid data even during normal operation, but if errors become too frequent this may be a sign that the attention of a human administrator is needed. Therefore, when a given number of subsequent error messages are received (specified as the failure tolerance in the configuration files) the Twitter Client will automatically submit an e-mail notification to a predefined set of addresses.
To avoid swamping these inboxes (say, if the Twitter API returns nothing but error messages for an entire night) the limit for the number of subsequent errors is raised each time an e-mail is sent, and must be manually reset.

3.3 Tweet Parser

The Twitter Client writes the received JSON encoded tweets directly to disk without doing any processing. The primary advantage of this approach is that it is robust against unexpected format changes. It also delays the decision of what information should be discarded or retained. But the raw data is unfortunately not well suited for later processing stages; parsing it is computationally expensive, and it consumes much more space than necessary. For these reasons the Tweet Parser, written in Ruby just like the Twitter Client, acts as a preprocessing stage that transforms the JSON encoded messages into a CSV (comma-separated values) format more suitable for time series processing.

3.3.1 Description of Parsing Process

The Tweet Parser begins by reading a configuration file which in particular specifies an input directory containing data stored by the Twitter Client, and an output directory where the resulting CSV files will be written. For each tweet found in the set of input files, the application performs the following steps:

1. Extract key-value pairs from the message using regular expressions.
2. Optionally validate the parsed data by checking that all expected keys have been found. May be disabled for performance.
3. Optionally skip non-English tweets by checking the field iso language code.
4. Determine tweet sentiment by calling the Sentiment Analyzer described in Section 3.4.
5. Make the text field uppercase to simplify case-insensitive search, and replace commas with dashes.
6. Write the result to a CSV file. See Section 3.3.2 below for a description of the data format.

3.3.2 Description of Output Data

Each line of the Tweet Parser output represents one tweet and has the following format:

1. from user id: the sender ID number.
2. to user id: the receiver ID number, if specified. Usually left blank.
3. id: a unique message ID number.
4. datetime: date and time when the tweet was posted.
5. positive score and negative score: sentiment score as determined by the Sentiment Analyzer. The value range depends on the configuration used by the Tweet Parser, but in this project it is restricted to integers.
6. text: the message text.
7. source: the application used for submitting the message.

The Tweet Parser may direct all output to a single file, or use different files for each day of gathered data. In practice, the latter approach is usually wise on current hardware since the output data can easily consume several gigabytes if a much discussed topic is monitored for more than a few weeks.

3.4 Tiered Lexica Sentiment Analyzer

The sentiment analyzer assigns a positive and negative sentiment score to a piece of text by matching the text against one or more tiered sentiment lexica. A sentiment lexicon consists of a set of words (or word-like tokens) that are tagged as being either positive or negative. Both [35] and [21] have demonstrated that quite simple lexical approaches to sentiment analysis can be effective when applied to determining large scale aggregate sentiment.
However, when constructing a sentiment lexicon one quickly comes across a problem: there is a trade-off between the number of false positives (i.e., texts that are marked as having a positive or negative score above 0 even though they express no positive or negative sentiment) and false negatives (i.e., texts that are assigned a positive or negative score of 0 in spite of expressing positive or negative sentiment).

This is at least in part due to the fact that different terms are more or less frequently used to express sentiment; a word such as "abandon" is probably more often used to express negative sentiment than positive, but it is also quite often used in a way that is neutral or ambiguous (e.g., "Company x today abandoned attempts to acquire company b"). A word such as "sucks", on the other hand, is usually a fairly clear indicator that the author is expressing negative sentiment about something. Therefore, if you include too many terms that are weak indicators of sentiment you will end up with a lot of false positives, whereas if you restrict the lexicon to strong indicators you will have a lot of false negatives. To mitigate this problem, the sentiment analyzer is designed to use several different tiered lexica. The lexica are tiered in the sense that they may be given different priorities (so that a high-priority lexicon can override a low-priority one) and weights (so that the positive and negative score given to a piece of text can differ depending on what lexicon it matches). This means that one can have an expansive lexicon that includes weak sentiment indicators, and one or more higher-priority lexica that are restricted to terms that are strong indicators of sentiment.

3.4.1 Algorithmic Description of Sentiment Analysis

Assume an array of sentiment lexica L = [L_1, L_2, ..., L_n], n ≥ 1, where each lexicon L_i is defined by:

- A set of positive sentiment indicators: L_i.positive.
- A set of negative sentiment indicators: L_i.negative.
- Priority: L_i.priority determines which lexica to use if a string matches more than one.
- Weight: L_i.weight > 0 determines how many positive or negative sentiment points to award a string if it matches one of the sets of sentiment words, but does not match any lexica with a higher priority.
Further, assume that the array L is in a non-decreasing order with respect to priority; in other words, L_i.priority ≤ L_{i+1}.priority for all i < n. The following algorithm, which visits the lexica from highest to lowest priority, can then be used to determine the sentiment score s of the string str:

s.positive := 0
s.negative := 0
t := tokenize(str)
for i = n downto 1 do
    if (s.positive ≠ 0 ∨ s.negative ≠ 0) ∧ i < n ∧ L_i.priority < L_{i+1}.priority then
        return s
    else
        for all t_j ∈ t do
            if t_j ∈ L_i.positive ∧ s.positive < L_i.weight then
                s.positive := L_i.weight
            end if
            if t_j ∈ L_i.negative ∧ s.negative < L_i.weight then
                s.negative := L_i.weight
            end if
        end for

    end if
end for
return s

The specific behavior of the tokenize() function may vary depending on the content of the lexica. For example, in this implementation the function must in addition to word-tokens also return emoticons (such as ":-)" and ":/") while ignoring other patterns that may contain the emoticon as a substring (in particular URLs, where ":/" occurs in "http://").

3.4.2 Examples

This section provides a few examples of how the sentiment analysis algorithm operates. Assume that we have a lexicon L_1 of weak sentiment indicators and two lexica L_2 and L_3 of strong indicators, as shown in Table 3.1. Table 3.2 then shows the results of applying the sentiment analysis algorithm to several different strings. Note in particular the second example, which demonstrates how the use of tiered lexica helps in avoiding certain types of spurious results. The last example also illustrates the difficulty of handling irony, which is described in [39] as the big challenge of handling negations in automated sentiment analysis.

3.4.3 Description of Lexica

The version of the sentiment analyzer used in this project includes the following lexica:

- Subjectivity clues: developed by the authors of [4] for the system OpinionFinder and also used in [35], subjectivity clues contains more than 78 words that are marked as either positive or negative. Some terms from the original lexicon are excluded in this implementation because they were found to cause large numbers of false positives.
- Emoticons: includes commonly used textual representations of facial expressions such as ":-)" and ":-(".
- Slang: includes a small number of commonly used slang terms.
- Idol: a small lexicon of Swedish terms used to predict Idol results (see Section 4.2).

The priority and weight of each lexicon may be set in the configuration of the Tweet Parser.
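For concreteness, the tiered-lexica algorithm can be rendered in Ruby as below. This is an illustrative sketch, not the project's implementation; the lexica mirror those of Table 3.1, and the tokenizer is deliberately simple.

```ruby
# Illustrative Ruby rendering of the tiered-lexica algorithm (a sketch, not
# the project code). Lexica are visited from highest to lowest priority;
# once any score has been set, strictly lower tiers cannot override it,
# while lexica sharing a priority all contribute.
Lexicon = Struct.new(:positive, :negative, :priority, :weight)

def sentiment(str, lexica)
  pos = neg = 0
  # Deliberately simple tokenizer: words plus the emoticons :-) and :-(
  tokens = str.downcase.scan(/[\w']+|:-[)(]/)
  lexica.group_by(&:priority).sort_by { |p, _| -p }.each do |_, tier|
    tier.each do |l|
      tokens.each do |t|
        pos = l.weight if l.positive.include?(t) && pos < l.weight
        neg = l.weight if l.negative.include?(t) && neg < l.weight
      end
    end
    # A score from this tier blocks all lower-priority tiers.
    return [pos, neg] if pos != 0 || neg != 0
  end
  [pos, neg]
end

# Example lexica in the spirit of Table 3.1:
L1 = Lexicon.new([], ["abandons", "abandoned"], 1, 1)
L2 = Lexicon.new(["wonderful"], ["sucks"], 2, 2)
L3 = Lexicon.new([":-)"], [":-("], 2, 3)
```

Applied to the strings of Table 3.2 this reproduces the scores listed there, for example a positive score of 3 and a negative score of 0 for "Nokia abandons Symbian. About time! :-)".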
If no values are specified, the default values will be as specified in Table 3.3.

3.5 Query Processing

The gathered data may contain millions of tweets (for instance, in this project over 25 million messages concerning consumer electronics were gathered) and the CSV files produced by the Tweet Parser can consist of several gigabytes of text data (6.6 GB in the example above). For these reasons, and because typically only a small subset of the messages will be of interest, it is not efficient to do statistical analysis over the set of all tweets. Therefore Query Processing is used to produce time-sliced aggregate statistics concerning messages containing one or more search terms (e.g., NASDAQ symbols, names of Idol contestants). In practice, using the CSV data produced by the Tweet Parser as input, the Query Processing produces a set of statistics for each day covered by the data based on all tweets

Table 3.1: Examples of three sentiment lexica L_1, L_2 and L_3 with different priorities and weights.

Lexicon | Positive indicators | Negative indicators | Priority | Weight
L_1     |                     | abandons, abandoned | 1        | 1
L_2     | wonderful           | sucks               | 2        | 2
L_3     | :-)                 | :-(                 | 2        | 3

Table 3.2: Examples of sentiment analysis using tiered and weighted lexica.

String | Positive score | Negative score | Comment
"I feel abandoned." | 0 | 1 | The string only matches a negative indicator from L_1 and therefore gets a negative score given by the weight of L_1.
"Nokia abandons Symbian. About time! :-)" | 3 | 0 | The string matches both a negative indicator from L_1 and a positive indicator from L_3. Since L_3.priority > L_1.priority, the string only obtains the positive score given by L_3. This example shows how the use of tiered lexica can help avoid spurious results due to weak sentiment indicators.
"I feel wonderful tonight! :-)" | 3 | 0 | If the string matches the same sentiment polarity of two lexica with the same priority, the one with the highest weight is used.
"The dinner was wonderful, unlike the movie afterwards. :-(" | 2 | 3 | If the string matches different sentiment polarity of two lexica with the same priority, both are used to determine the sentiment score.
"Just busted the screen on my iphone. Wonderful." | 2 | 0 | This approach to sentiment analysis can clearly not avoid all spurious results, as this example shows.

Table 3.3: Default priority and weight for the lexica used in the project.

Lexicon | Priority | Weight
Subjectivity clues | 1 | 1
Emoticons | 2 | 2
Slang | 2 | 2
Idol | |

that match a non-empty set of search terms and a possibly empty set of terms to exclude. In the Idol results use-case the time was also restricted to the voting period (see Section 4.2 for details). These statistics will be referred to as the raw statistics, and are described in the following section. The processing is done using the Q programming language, which is the query language used in KDB+.

3.5.1 Description of Raw Statistics

Based on the positive and negative sentiment scores s.positive and s.negative awarded by the Sentiment Analyzer described above, a tweet is defined as positive, negative, mixed or neutral as follows:

Definition 3.5.1. A tweet is defined as:

- positive if s.positive > 0 and s.negative = 0
- negative if s.positive = 0 and s.negative > 0
- mixed if s.positive > 0 and s.negative > 0
- neutral if s.positive = 0 and s.negative = 0

Using these definitions, the following raw statistics can be defined:

Definition 3.5.2. The basic raw statistics:

- hype: Approximately the proportion of all tweets that match the search.
- tot cnt: The total number of tweets matching the search.
- pos cnt: The number of positive tweets matching the search.
- pos prop: The proportion of tweets in tot cnt that are positive (i.e., pos prop = pos cnt / tot cnt).
- pos val: The sum of all positive scores of all positive tweets. The difference compared to pos cnt is that it reflects the different weights of the sentiment lexica that are used. This value will typically be close to pos cnt if the default weights and priorities of the Sentiment Analyzer are used.
- pos adj val: The size-adjusted value pos val / tot cnt. Note that this is not a proper proportion as it is not normalized to the range 0–1.
- neg cnt, neg prop, neg val and neg adj val: Defined similarly to pos cnt, pos prop, pos val and pos adj val, but using the number of negative tweets rather than positive.
- mix cnt and mix prop: Defined similarly to pos cnt and pos prop, but using the number of mixed-sentiment tweets rather than positive.
- neu cnt and neu prop: Defined similarly to pos cnt and pos prop, but using the number of neutral tweets rather than positive.
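As a sketch of how the count-based statistics above follow from the tweet-polarity definition, assuming a hypothetical in-memory layout where each tweet is a hash of sentiment scores (the real system computes these in Q over the stored CSV data):

```ruby
# Sketch of the count-based raw statistics, assuming each tweet is a hash
# with :pos and :neg scores from the Sentiment Analyzer (a hypothetical
# layout; the project computes these statistics in Q).
def raw_statistics(tweets)
  pos = tweets.select { |t| t[:pos] > 0 && t[:neg] == 0 }
  neg = tweets.select { |t| t[:pos] == 0 && t[:neg] > 0 }
  mix = tweets.select { |t| t[:pos] > 0 && t[:neg] > 0 }
  n = tweets.size.to_f
  { tot_cnt: tweets.size,
    pos_cnt: pos.size, pos_prop: pos.size / n,
    pos_val: pos.sum { |t| t[:pos] },
    pos_adj_val: pos.sum { |t| t[:pos] } / n,
    neg_cnt: neg.size, neg_prop: neg.size / n,
    mix_cnt: mix.size,
    neu_cnt: tweets.size - pos.size - neg.size - mix.size }
end
```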

Further, it is plausible that some degree of filtering (i.e., excluding certain types of tweets) will cause a raw statistic to correlate better with the underlying phenomenon of interest. For this reason, the following two default filtering levels are defined:

Definition 3.5.3. Default filtering levels:

- no urls: Excludes all tweets containing "www" or "http". This is meant to exclude messages that are primarily informational or promotional.
- no urls or rts: Excludes all tweets containing "www", "http" or "rt". The letters "rt" are an abbreviation of re-tweet and signify that the message is a forwarded version of another user's message.

Combining the basic raw statistics in Definition 3.5.2 with the filtering levels in Definition 3.5.3, the full set of raw statistics is:

1. hype
2. tot cnt
3. pos cnt
4. pos prop
5. pos val
6. pos adj val
7. neg cnt
8. neg prop
9. neg val
10. neg adj val
11. mix cnt
12. mix prop
13. neu cnt
14. neu prop
15. no urls hype
16. no urls tot cnt
17. no urls pos cnt
18. no urls pos prop
19. no urls pos val
20. no urls pos adj val
21. no urls neg cnt
22. no urls neg prop
23. no urls neg val
24. no urls neg adj val
25. no urls mix cnt
26. no urls mix prop
27. no urls neu cnt
28. no urls neu prop
29. no urls or rts hype
30. no urls or rts tot cnt
31. no urls or rts pos cnt
32. no urls or rts pos prop
33. no urls or rts pos val
34. no urls or rts pos adj val
35. no urls or rts neg cnt
36. no urls or rts neg prop
37. no urls or rts neg val
38. no urls or rts neg adj val
39. no urls or rts mix cnt
40. no urls or rts mix prop
41. no urls or rts neu cnt
42. no urls or rts neu prop
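The two filtering levels can be sketched as simple predicates over message texts. This is an illustrative rendering; in particular, the word-boundary treatment of "rt" is an assumption, as the thesis does not specify the exact matching rule:

```ruby
# Sketch of the default filtering levels. The case-insensitive substring
# match for "www"/"http" follows the definition; the word-boundary match
# for "rt" is an assumption.
def no_urls(texts)
  texts.reject { |text| text =~ /www|http/i }
end

def no_urls_or_rts(texts)
  no_urls(texts).reject { |text| text =~ /\brt\b/i }
end
```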

3.5.2 Determining the Number of Daily Tweets and the Difference Between hype and tot cnt

Since the raw statistic hype is defined as the proportion of all tweets posted to Twitter that match a given query it is necessary to determine with some accuracy the total number of daily tweets. For the first few weeks of this project this was fairly straightforward. Each posted tweet would be assigned a consecutively increasing ID number, which meant that the Twitter-wide number of tweets for day n could be accurately estimated by taking the maximum observed tweet ID for that day, and subtracting the maximum observed ID for day n − 1. This property also allowed for more fine-grained (e.g., hourly) estimates of tweet frequency, and could have been used to adjust accurately for various types of cyclical behavior. Unfortunately the definition of the ID number was changed by Twitter on November 3rd 2010, making this no longer possible. The number of daily tweets for the period following this date therefore had to be estimated from the previously gathered data. A clear trend was a reduction in message volume on Saturdays, and the mean for Saturdays was therefore estimated separately to with a sample standard deviation of For all other days the mean was estimated at with a sample standard deviation of These numbers are then used to produce the hype raw statistic. This means in effect that hype takes into account the reduction in overall message volume on Saturdays, but that for other weekdays it will only differ from tot cnt by orders of magnitude.

3.6 Statistical Analysis Tools

The Statistical Analysis Tools are a set of tools developed in Matlab, primarily to efficiently find and evaluate a large number of correlations between bodies of data.
Since the precise functionality of most of these tools is closely related to the statistical evaluations performed in relation to each set of experiments, a more detailed description is deferred to the respective chapters of each experiment. Two tools of a more general nature are nevertheless described here. The first of these supplies functionality for the hassle-free creation of time series structs from data stored as CSV files. This expands on the rather limited Matlab function csvread. The second tool is responsible for making such time series structs congruent, by removing from each struct all elements which do not have a corresponding element in the other.

3.7 Visualization Tools

As this project features a fair amount of data visualization in the form of plotting, it has been useful to have tools for making the creation of such plots simple, dynamic and reproducible. The primary example of the need for this automation is given by the process for large-scale correlation analysis described in Section 6.2.5, which produces thousands of plots as output. As a result of this work, all plots in this document are fully created using scripts and can with relative ease be recreated from the raw data.
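Returning briefly to the second general-purpose tool of Section 3.6, the congruence operation (keeping only the dates present in both series) can be sketched as follows, in Ruby rather than Matlab and with a hypothetical hash-based series representation standing in for the time series structs:

```ruby
# Sketch of the congruence tool of Section 3.6: given two date-indexed
# series, keep only the dates present in both. The hash representation is
# a stand-in for the Matlab time series structs.
def congruent(a, b)
  shared = (a.keys & b.keys).sort
  [shared.map { |d| [d, a[d]] }.to_h,
   shared.map { |d| [d, b[d]] }.to_h]
end
```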

Chapter 4

Predicting Idol Results

Idol is a Swedish televised music-competition show based on the same format as American Idol, the UK show Pop Idol, as well as numerous others. In each episode the contestants perform one or more songs, and the TV audience can call in to cast votes for their favorite contestant. Towards the end of the show the names of the two contestants who have received the fewest votes are announced. These two are then subjected to a second round of voting, and the contestant who receives the fewest votes in this second round is eliminated from the competition. The exception to this is the final episode, when there are only two contestants left and therefore no need for a second round of voting.

4.1 Hypothesis

One can suspect that popular contestants will generate an online buzz as well as votes in the competition. If this is the case, the message volume mentioning a contestant will correlate with their competition results. However, it is also reasonable to assume that some contestants will generate a degree of negative sentiment which translates into tweets but not into votes. These notions can be formalized into the following hypothesis:

Hypothesis 4.1: Discounting tweets that are deemed negative, the number of times a contestant is mentioned on Twitter between the beginning of a show and the end of the first round of voting will correlate with the number of votes the contestant receives.

This means that contestants with fewer mentions will have a higher risk of being among the bottom two in the first round, risking elimination. The hypothesis is restricted to the first round of voting because the second round only lasts for a few minutes. It is restricted to predicting the bottom two since no other information is made public concerning the placement of the contestants.
4.2 Method

The base dataset consists of all tweets posted during the testing period containing the terms #idol, #idol2010, idol or idol2010 and which were written in Swedish. A special sentiment lexicon was assembled by inspecting the dataset and identifying a small number of words

that were generally linked to a negative sentiment concerning a contestant or a belief that the contestant was about to be eliminated. For each round of competition, the score for each contestant was calculated as the number of tweets that fulfilled the following requirements:

- Contains the name of the contestant.
- Posted between 19:00 (the starting time of the show) and 21:40 (a few minutes before the end of the first round of voting).
- Does not contain any of the negative words from the sentiment lexicon.

4.3 Result

Figure 4.1 shows the results of applying the method described above to the six final rounds of competition. Hypothesis 4.1 predicts that the two contestants (one in the final round) with the fewest mentions will get the fewest votes, risking elimination. The bars of the contestant(s) who actually received the fewest votes are marked red. The names of the contestants are:

1. Jay Smith.
2. Minnah Karlsson.
3. Olle Hedberg.
4. Linnea Henriksson.
5. Andreas Weise.
6. Elin Blom.
7. Geir Rönning.

Summarizing the results, the method correctly identified 8 out of 11 contestants risking elimination. (In the competition of November 12, contestant number 2 beat contestant number 6 by a slim margin, 94 to 95.) The probability of getting a result of 8 or more by simply guessing is 2.7%; the proof of this is given in the section below.

4.4 Statistical Evaluation of Results

In order to gauge the significance of the results, let us consider the possibility that they would occur by chance alone. This effort is divided into determining the probability of correctly guessing the outcome of a single competition (episode), and the probability of obtaining a certain number of correct guesses over a set of episodes.
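For illustration, the per-contestant scoring rule of Section 4.2 can be sketched in a few lines of Python. The tweet format and the negative-word entries below are hypothetical; the project's actual lexicon was a small hand-assembled list of Swedish words, not reproduced here.

```python
from datetime import time

# Hypothetical negative-sentiment lexicon (illustrative entries only).
NEGATIVE_WORDS = {"utslagen", "dalig", "trakig"}

def contestant_score(tweets, contestant_name):
    """Count tweets that mention the contestant, were posted during the
    show's first voting round, and contain no negative-lexicon word.

    `tweets` is a list of (posted_at, text) pairs, where posted_at is a
    datetime.time on the day of the episode.
    """
    score = 0
    for posted_at, text in tweets:
        words = text.lower().split()
        if contestant_name.lower() not in words:
            continue
        # 19:00 (show start) to 21:40 (just before the first round closes).
        if not (time(19, 0) <= posted_at <= time(21, 40)):
            continue
        if any(w in NEGATIVE_WORDS for w in words):
            continue
        score += 1
    return score
```

The contestant with the lowest score is then predicted to be at risk of elimination.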

Figure 4.1: Number of Twitter mentions of each Idol contestant; the bars of contestant(s) who received the fewest votes are marked red.

4.4.1 Determining the Probability Mass Function of a Single Competition

Given a competition with n contestants (n \in \{3, 4, \ldots\}) where two are to be identified for potential elimination, let A be the event of correctly identifying such a contestant by the first guess, and let \neg A be the event of misidentifying such a contestant by the first guess. Then:

P(A) = \frac{2}{n}, (4.1)

and

P(\neg A) = \frac{n - 2}{n}. (4.2)

Further, let B be the event of correctly identifying such a contestant by the second guess. Notice that the probabilities of B and \neg B will by necessity depend on the outcome of the first guess, since it will have eliminated an option. Therefore, by conditional probability:

P(B \mid A) = \frac{1}{n - 1}, (4.3)

P(B \mid \neg A) = \frac{2}{n - 1}, (4.4)

and

P(\neg B \mid A) = \frac{n - 2}{n - 1}, (4.5)

P(\neg B \mid \neg A) = \frac{n - 3}{n - 1}. (4.6)

By Equations 4.2 and 4.6, the probability that neither guess will be correct is:

P(\neg A \cap \neg B) = P(\neg A) \, P(\neg B \mid \neg A) = \frac{n - 2}{n} \cdot \frac{n - 3}{n - 1} = \frac{n^2 - 5n + 6}{n^2 - n}. (4.7)

Similarly, the probability that both guesses will be correct is:

P(A \cap B) = P(A) \, P(B \mid A) = \frac{2}{n} \cdot \frac{1}{n - 1} = \frac{2}{n^2 - n}. (4.8)

Finally, the probability that exactly one guess will be correct is then:

P((A \cap \neg B) \cup (\neg A \cap B)) = 1 - (P(A \cap B) + P(\neg A \cap \neg B)) = 1 - \frac{n - 2}{n} \cdot \frac{n - 3}{n - 1} - \frac{2}{n^2 - n} = \frac{4n - 8}{n^2 - n}. (4.9)

Equations 4.7–4.9 constitute the probability mass function that fully describes the possible outcomes of guessing which contestants will be in the bottom two, when the number of contestants is three or more. By also considering the special case that occurs in the final episode, when n = 2 and there is a 50% chance of guessing correctly, the chance of obtaining x correct guesses in a contest with n contestants (n \in \{2, 3, \ldots\}) can thus be summarized as:

f_X(x, n) =
\begin{cases}
\frac{1}{2}, & x \in \{0, 1\}, \; n = 2, \\
\frac{n^2 - 5n + 6}{n^2 - n}, & x = 0, \; n \ge 3, \\
\frac{4n - 8}{n^2 - n}, & x = 1, \; n \ge 3, \\
\frac{2}{n^2 - n}, & x = 2, \; n \ge 3, \\
0, & \text{otherwise.}
\end{cases} (4.10)

4.4.2 Determining the Probability Mass Function of a Set of Competitions

Given Equation 4.10, what is the probability of obtaining a certain number of correct guesses from several rounds of competition? Let f_{X_n}(x) and f_{X_m}(x) be two discrete distributions as described by Equation 4.10 for some given numbers of contestants n and m. The probability of obtaining some given number of correct guesses X_{n,m} = X_n + X_m is then given by the convolution of f_{X_n}(x) and f_{X_m}(x):

f_{X_{n,m}}(j) = \sum_k f_{X_n}(k) \, f_{X_m}(j - k). (4.11)

For example, the convolution f_{X_{2,3}}, which describes the final two competitions f_{X_2} and f_{X_3}, is:

f_{X_{2,3}}(0) = \sum_k f_{X_2}(k) \, f_{X_3}(0 - k) = \frac{1}{2} \cdot 0 = 0,

f_{X_{2,3}}(1) = \sum_k f_{X_2}(k) \, f_{X_3}(1 - k) = \frac{1}{2} \cdot \frac{2}{3} + \frac{1}{2} \cdot 0 = \frac{1}{3},

f_{X_{2,3}}(2) = \sum_k f_{X_2}(k) \, f_{X_3}(2 - k) = \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{2},

f_{X_{2,3}}(3) = \sum_k f_{X_2}(k) \, f_{X_3}(3 - k) = \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{6}. (4.12)

This notion can be expanded to cover as many rounds of competition as desired. For instance, if f_{X_{n \ldots m}} (describing the probabilities of obtaining x correct guesses in a series of competitions with n through m contestants) has already been computed, and if p = m + 1, then

f_{X_{n \ldots p}}(j) = \sum_k f_{X_{n \ldots m}}(k) \, f_{X_p}(j - k) (4.13)

describes the distribution of correct guesses from a set of competitions with n through p contestants. By successive applications of Equation 4.13 one can then compute a function f_{X_{2 \ldots 7}}(x), describing the probabilities of obtaining x correct guesses over six competitions with 2–7 contestants:

f_{X_{2 \ldots 7}}(x) =
\begin{cases}
360/113400, & x = 1, \\
3540/113400, & x = 2, \\
13656/113400, & x = 3, \\
27568/113400, & x = 4, \\
32344/113400, & x = 5, \\
22971/113400, & x = 6, \\
9949/113400, & x = 7, \\
2590/113400, & x = 8, \\
390/113400, & x = 9, \\
31/113400, & x = 10, \\
1/113400, & x = 11, \\
0, & \text{otherwise.}
\end{cases} (4.14)

The probability of x or more correct guesses is therefore:

p = \sum_{i=x}^{11} f_{X_{2 \ldots 7}}(i). (4.15)

Equation 4.15 gives p = 0.027 for x = 8, and thus a 2.7% chance of getting 8 or more correct guesses if there is no correlation between message volume and competition outcomes. The p-value has also been evaluated using stochastic simulation, and the results agree excellently with the theoretical result above.
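The derivation above is easy to verify mechanically. The following sketch (an independent re-implementation, not the project's code) builds Equation 4.10 with exact rational arithmetic, convolves the six per-episode distributions for 2 through 7 contestants, and recovers the 2.7% tail probability:

```python
from fractions import Fraction

def pmf(n):
    """f_X(x, n) from Equation 4.10, as a list indexed by x."""
    if n == 2:
        return [Fraction(1, 2), Fraction(1, 2)]
    d = n * n - n
    return [Fraction(n * n - 5 * n + 6, d),  # x = 0
            Fraction(4 * n - 8, d),          # x = 1
            Fraction(2, d)]                  # x = 2

def convolve(f, g):
    """Convolution of two finite discrete distributions (Equation 4.11)."""
    out = [Fraction(0)] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

# Distribution over six episodes with 2, 3, ..., 7 contestants.
dist = [Fraction(1)]
for n in range(2, 8):
    dist = convolve(dist, pmf(n))

# Probability of 8 or more correct guesses by chance (Equation 4.15).
p_value = sum(dist[8:])
```

Running this reproduces the distribution of Equation 4.14 and gives p_value = 3012/113400, i.e. approximately 0.027.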

4.5 Conclusions and Discussion

The results presented above demonstrate that it is possible to use Twitter data to predict the outcomes of televised popularity contests before the official results are announced, and to do so with a fairly high degree of accuracy. Analysis shows that the probability of achieving the same success rate by chance alone is only 2.7%.

It is worth noting that this was possible despite several complicating factors. To begin with, only a very small percentage of Swedes (0.39%) are active Twitter users, as is noted earlier in this report. With no way of compensating for, or even detecting, demographic factors, this raises the concern that the Twitter user base may be statistically tilted when compared to the population at large. Further, in absolute numbers quite few tweets were posted during each round of the competition. Add to this that a tweet cannot be compared to a response in an opinion poll; a bare mention cannot automatically be equated with an endorsement, even if some obviously negative messages are filtered out. This suggests that there is a significant potential for increasing the accuracy of social media based predictions and polling if the growth of the types of social media where all content is public, such as Twitter, continues.

Chapter 5

Correlating Twitter Message Volume and News Events

It is perhaps not unreasonable to suppose that highly publicized news stories concerning a topic will result in an increased level of social media chatter. If true, this would suggest a means of measuring public interest and concern in a given news story. The following chapter will attempt to evaluate that supposition by comparing spikes in Twitter message volume regarding certain subjects with news archives from the same days. The subjects in this case are the NASDAQ symbols of a number of technology companies.

5.1 Hypothesis

The main formal hypothesis of this chapter is:

Hypothesis 5.1: In the case of a major breaking news story regarding a particular company, this will coincide with a marked increase in Twitter message volume regarding the NASDAQ stock symbol of that company.

Coincide shall in this context mean that the two events are not separated by more than one day. This is to allow for the possibility that stories might spread as a rumor on Twitter before being published, or that there may be a delayed response by Twitter users (e.g., if the news story is published just before midnight).

5.2 Method

The base dataset used in this experimental setup consists of tweets that match one of the NASDAQ symbols of a major technology company. The data was gathered and processed using the basic methods described in Sections 3.2 to 3.5. Two separate methods were used to identify events that might confirm or falsify the hypothesis:

1. Major news events that were registered throughout the course of the project were later compared against the gathered Twitter data.

2. Notable spikes in the gathered Twitter data were compared to news archives using Google News Archive Search.

Comparisons were made using unfiltered Twitter data as well as the two previously defined filtering levels. Notice that what constitutes a major news story has not been rigorously defined. Such a definition would, in order to be useful in this context, need to take into account the fact that what would arguably be a major story for a small company might be fairly insignificant for a large company. This has therefore been deemed to lie outside the scope of this project.

5.3 Results

The following sections show the results of applying the method described above to several different technology companies. Two of the companies (Google and Microsoft) have products that are used by a large number of end-users. Two others (Cisco and Oracle) can be expected to primarily generate interest from professionals (e.g., business analysts and CIOs). The last one (Intel) can be expected to lie somewhere in between.

In each test case, a time series plot is shown covering a given test period. The plot shows daily message volume for the NASDAQ symbol of the company in question using both unfiltered data and two levels of filtering. Each plot also features a number of marked points-of-interest that are described in the text. Notice also that in several cases the data exhibits a clear cyclical behavior, with message volume dropping sharply during weekends.

5.3.1 Google

Figure 5.1 shows message volume for the search-term GOOG from October through December 2010. Two dates stand out in particular:

1. Google presents an unexpectedly strong income increase of 32%.
Date: October 14th, 2010.
Example: Google's Income Rises 32%, Topping Forecast.
Publication: New York Times.
URL:

2. The European Union launches an antitrust inquiry targeting Google.
Date: November 30th, 2010.
Example: E.U.
launches formal antitrust investigation into Google.
Publication: The Washington Post.
URL: formal_antitrust_i.html.

This example in particular shows the value of filtering, as these peaks are much less prominent in the unfiltered data.

Figure 5.1: Message volume concerning GOOG.

5.3.2 Microsoft

Figure 5.2 shows the message volume for the search-term MSFT from late September to the end of November 2010. Three peaks in particular stand out:

1. This spike appears to be primarily caused by rumors concerning a possible acquisition of Adobe by Microsoft:
Date: October 7th, 2010.
Example: NYT: Adobe, Microsoft executives discussed sale of Adobe to Microsoft.
Publication: Reuters.
URL: idus

Another significant news story of the day is pre-launch buzz concerning Microsoft's upcoming mobile OS:
Date: October 11th, 2010.
Example: Microsoft's mobile operating system: Windows or curtains.
Publication: The Economist.

URL:

2. Microsoft officially unveils the new mobile operating system Windows Phone 7.
Date: October 11th, 2010.
Example: Microsoft's Windows Phone 7 to replace Windows Mobile.
Publication: USA Today.
URL:

3. Microsoft reports a 51% increase in profits, exceeding market expectations.
Date: October 28th, 2010.
Example: Microsoft quarterly profit jumps 51 percent.
Publication: The Washington Post.
URL: 28/AR html

Figure 5.2: Message volume concerning MSFT.

5.3.3 Cisco

Figure 5.3 shows message volume for the search-term CSCO during November 2010. In this example the plot shows a single prominent spike:

1. Cisco releases a revenue forecast far below market expectations on November 10th, translating into a sharp drop in stock price as the market opens on November 11th. Most news articles on the subject appear to have been published on November 11th, explaining why the level of chatter appears to grow on the second day of the story.
Date: November 10th, 2010.
Example: Cisco's outlook disappoints as tech shares wane.
Publication: Reuters.
URL:

Figure 5.3: Message volume concerning CSCO.

The low message volume on days other than November 10th and 11th extends beyond the period shown in the plot, and in fact looks much the same throughout the period for which there exists sample data. This illustrates the dramatic percentage-wise impact that a news story can have on message volume when the company is one that is normally not much talked about.

5.3.4 Oracle

Figure 5.4 shows message volume for the search-term ORCL from October 30th to December 24th, 2010. Four prominent spikes were found that seem to coincide with significant news stories:

1. Oracle announces a deal to buy e-commerce software provider Art Technology Group for $1 billion.
Date: November 2nd, 2010.
Example: Oracle to Buy Art Technology Group for $1 Billion.
Publication: New York Times.
URL: 1-billion/.

2. Oracle wins a long legal fight against SAP over alleged copyright infringement and is awarded $1.3 billion in damages.
Date: November 24th, 2010.
Example: Oracle awarded $1.3 billion in copyright infringement suit.
Publication: Los Angeles Times.
URL:

3. The enterprise cloud computing company Salesforce.com, founded by a former Oracle executive, announces the service Database.com to compete with Oracle.
Date: December 7th, 2010.
Example: Salesforce.com takes on Oracle in database market.
Publication: Reuters.
URL:

4. Oracle trumps market expectations by reporting a 28% jump in profits and a 47% jump in revenue.
Date: December 16th, 2010.
Example: Oracle's Profit Up 28%, Beating Forecast.
Publication: New York Times.
URL:

Figure 5.4: Message volume concerning ORCL.

5.3.5 Intel

Figure 5.5 shows message volume for the search-term INTC from December 1st 2010 to February 9th 2011. This example is instructive because it contains spikes caused by what is, for purposes of this project, noise, in addition to spikes correlated with news events.

1–3. The higher-than-normal message volume on these days (and to a lesser extent on some other occasions) was largely caused by a single user sending a large number of tweets containing the word intc that were unrelated to Intel.
Dates: December 6th, 7th and 9th, 2010.

4. This spike was caused by a large number of promotional messages. The messages all contained the same text but were posted by different user accounts.
Date: December 30th, 2010.

5. Intel reports a 48% jump in profits.
Date: January 13th, 2011.
Example: Intel Reports Record Profit and Exudes Confidence.
Publication: New York Times.
URL:

6. Intel stops shipments of one of its chips after a design flaw is discovered. The total cost of correcting the error is estimated at $700 million.
Date: January 31st, 2011.
Example: Instant View: Intel cuts revenue forecast after chip flaw.
Publication: Reuters.
URL:

Figure 5.5: Message volume concerning INTC.

5.4 Conclusions and Discussion

The results above show several examples where an increased volume of Twitter messages coincides with high-profile news stories. While the stated hypothesis does not lend itself well to formal hypothesis testing (in part because of the aforementioned difficulty of clearly defining what constitutes a major news story), it has been the impression of this author that whenever a big news story broke concerning one of the tracked companies, this would indeed be reflected in the gathered data. Further, where the raw data has been examined it has been found to contain references to the news stories indicated above.

Given this, I find it reasonable to conclude, even absent hard statistical evidence, that Twitter message volume does indeed correlate with major news stories.

A secondary purpose of this chapter has been to review the gathered data, as it will also form the basis for subsequent chapters. This review has revealed several notable features of the dataset. The first of these is the importance of filtering. Particularly in the case of Google, the unfiltered indicator looks more or less like noise, with the two points-of-interest only becoming clearly discernible once tweets containing URLs are filtered out. Also in the cases of Microsoft and Intel there are points-of-interest that do not stand out particularly. These two, along with the case of Oracle, also exhibit the converse of this behavior: there are apparent message volume peaks in the unfiltered indicator that disappear when the filtering is applied.

It seems clear that excluding messages containing URLs is a simple and effective way of greatly reducing the noise level, in particular as promotional messages (as exemplified by the case of Intel) are more or less guaranteed to contain one. However, it is also obvious that this greatly reduces the amount of (non-noise) data available for analysis, as the inclusion of a link is a common practice also in legitimate traffic.

The case of Intel, where the first three points-of-interest consisted almost entirely of tweets unrelated to the company, illustrates that even when using generic filtering it may be useful to examine the actual data manually from time to time. A possible mechanism for dealing with this particular dataset problem might be to institute a per-user limit, based on the assumption that if a single user produces hundreds of messages on the same topic during a single day this is likely to be, for practical purposes, noise.
Notice however that this would not address the problem of the fourth point-of-interest in the Intel data, as that noise peak was distributed across many user accounts, one might suspect in order to avoid triggering a Twitter spam filter.

The filtering also revealed a somewhat curious phenomenon: in several instances the message level directly following a peak drops off much faster in the filtered indicators than in the unfiltered one. Examples include point three in the case of Microsoft, point four in the case of Oracle, and point five in the case of Intel. In each of these cases the peak coincides with the release of a financial statement by the company. It seems plausible that this behavior can be explained by the first day consisting of discussion and reactions relating to the report itself, and the second day being more concerned with (and thus linking to) the related news stories.

These results also suggest that there is a general interest component to the flow of Twitter messages. This is useful to keep in mind for later chapters, as it highlights that even if there are strong correlations between market indicators and Twitter data, these correlations cannot be expected to be consistent in magnitude; sometimes there will be news stories that are highly talked about without affecting the market a great deal, and vice versa. The general interest component is also evident in the day-to-day message volume between peaks; even though all of these companies have market values in the same order of magnitude, there is a considerable difference in message volume between companies that produce consumer goods and services and the ones that don't.

It would be interesting to investigate the possible correlation between the size of a message spike and the number of published news stories. Unfortunately, Google News appears to treat date limits as helpful hints rather than hard restrictions, making this endeavor somewhat cumbersome.
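A per-user limit of the kind suggested above could be sketched as follows. The daily cap of 50 and the tuple format are arbitrary illustrations, not values or structures used in the project:

```python
from collections import Counter

def apply_per_user_limit(tweets, max_per_day=50):
    """Drop a user's tweets beyond a daily cap, on the assumption that
    hundreds of same-topic messages from one account in one day are,
    for practical purposes, noise.

    `tweets` is a list of (user, day, text) tuples in posting order.
    """
    counts = Counter()
    kept = []
    for user, day, text in tweets:
        counts[(user, day)] += 1
        if counts[(user, day)] <= max_per_day:
            kept.append((user, day, text))
    return kept
```

As noted, a scheme like this would not catch the distributed promotional spam of the fourth Intel point-of-interest, since no single account exceeds the cap there.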


Chapter 6

Stock Market Correlations

The following chapter will investigate if and to what extent Twitter data can be found to correlate with, or even foreshadow, stock market movements. This is accomplished by analyzing over 9 million tweets relating to specific stocks gathered over a period of more than five months. The data is compared to daily movements of the NASDAQ stock exchange using large scale automated correlation analysis. The analysis is aimed both at finding simple correlations (i.e., that Twitter data co-varies with stock market data), as well as instances where Twitter activity appears to foreshadow stock market movements. Finally, the results of the correlation analysis are used to evaluate some aspects of the implemented methods, as well as which aspects of the market data have the highest levels of correlation with the social media data.

6.1 Hypotheses and Problem Statement

In essence, the purpose of this chapter is to test the following two hypotheses:

Hypothesis 6.1: There exist statistically significant correlations between indicators derived from Twitter data, and market data or indicators derived from market data.

Hypothesis 6.2: For a given Twitter indicator and a given market indicator that have been found to correlate, the Twitter indicator will at times act as a leading indicator of the market indicator.

A secondary goal is to evaluate how the high-value correlations are distributed among, say, various market statistics or lag-levels:

Problem Statement 6.1: Evaluate the distributions of high-value correlations with respect to various method components used for constructing the Twitter indicator and market indicator used in those correlations.

6.2 Method

6.2.1 Description of Gathered Data

The Twitter data used in these analyses consists of tweets containing the stock symbol of one of 14 companies listed on the NASDAQ stock exchange. The chosen companies were the ones with the highest market value as of September 27th 2010, excluding LM Ericsson and Vodafone. These were excluded because their stock symbols (ERIC and VOD) caused too many false positives. Searches included the symbols themselves as well as symbols preceded by the hash-symbol (#). The included stock symbols are:

AAPL: Apple Inc.
MSFT: Microsoft Corporation.
ORCL: Oracle Corporation.
GOOG: Google Inc.
CSCO: Cisco Systems, Inc.
INTC: Intel Corporation.
AMZN: Amazon.com, Inc.
QCOM: QUALCOMM Incorporated.
AMGN: Amgen Inc.
TEVA: Teva Pharmaceutical Industries Limited.
CMCSA: Comcast Corporation.
INFY: Infosys Technologies Limited.
DTV: DIRECTV.
EBAY: eBay Inc.

The data covers a period from September 29th 2010 (for some symbols a few days earlier) until March 4th 2011, and includes over 9 million tweets. The data gathering is described in more technical detail in Section 3.2. The gathered data is then pre-processed according to the general methods described in Section 3.3, and message-level sentiment is determined in the manner described in Section 3.4. The method described in Section 3.5 is then used to create time-sliced aggregates of the data, creating the raw statistics described there. These statistics are what will be used as input to the analyses described below.

The stock market data used has been gathered from Yahoo Finance. For each trading day, the data contains the following information:

Open: the stock's opening price.
Close: the stock's closing price.

High: the highest figure at which the stock was traded during the day.
Low: the lowest figure at which the stock was traded during the day.
Adjusted Close: usually equal to Close, but may be adjusted to reflect some types of information surfacing after the market closes (e.g., splits).
Volume: the number of shares traded during the day.

For purposes of this report these time series may also be referred to as raw statistics, similar to the Twitter raw statistics defined earlier.

6.2.2 Summary of Time Series Vector Notation

In the following subsections, as well as later in Section 7.2, a vector notation is used to describe time series data, such as the daily market statistics described above. For convenience, this notation is summarized in the following definitions.

Definition: A time series vector x is a series of n observations [x_1, x_2, \ldots, x_n], where each observation x_i is the value of the time series at day i.

Definition: A subset x_{a,b} (a < b \le n) of a time series vector x consists of all observations between day a and day b: [x_a, x_{a+1}, \ldots, x_{b-1}, x_b].

6.2.3 Derivatives

Suppose that news surfaces on one day that causes a surge in a stock's value, followed a few weeks later by news that sends the stock crashing down. In both instances one would, however, expect to observe a surge in message volume mentioning the stock. Further, the effect of the news events (i.e., the change in the valuation of the stock) may be sustained even though the online chatter (i.e., Twitter message volume) quickly subsides.

These examples suggest that even though there may be a significant pattern relating market behavior to a series of Twitter data, the pattern may not be immediately obvious. In particular, it may not produce a significant result when testing the level of correlation between the datasets. For this reason the concept of derivatives¹ is introduced.
A derivative is a transformation of a time series that is intended to improve the correlation with another time series under certain circumstances. Formally, the notion can be defined as follows:

Definition: A derivative x' = f(x) of a time series x shall refer to the output of some derivative function f such that x'_i is calculated using at most the values \{x_1, x_2, \ldots, x_i\} of x.

Further, the concepts of raw statistics and derivatives are combined as follows:

Definition: The term indicator shall refer generally to either a Twitter or market raw statistic, or to some derivative thereof.

Excluding some experimental functions, the derivative functions used in this project are described in the following subsections.

¹ Not to be confused with the concepts in mathematics and finance bearing the same name.

Median Differential

The median differential for day i is the difference between the current value x_i and the sliding median over the last n days. Formally, each element x'_i of the derivative x' is given by:

x'_i = x_i - \tilde{x}_{i-n,i}, (6.1)

where \tilde{x}_{i-n,i} is the median of the values \{x_{i-n}, x_{i-n+1}, \ldots, x_i\}.

Absolute Median Differential

The absolute median differential is simply the absolute value of the median differential function defined above:

x'_i = |x_i - \tilde{x}_{i-n,i}|. (6.2)

This function is designed to address situations where a movement of one indicator in either direction is likely to correspond to a movement in a particular direction of the other indicator.

Rate-of-Change

The rate-of-change of x for day i is defined as:

x'_i = \frac{x_i - x_{i-1}}{x_{i-1}}. (6.3)

This is the same definition as is used in technical analysis.

Absolute Rate-of-Change

The absolute rate-of-change of x for day i is defined as:

x'_i = \frac{|x_i - x_{i-1}|}{x_{i-1}}. (6.4)

The motivation for this function is similar to that of the absolute median differential described above.

%b

%b is a technical analysis indicator derived from Bollinger Bands, described in [1], that gives a measure of how high or low the current value is in relation to the recent past. Slightly more formally, %b of a time series tells us where the current value is in relation to two bands k standard deviations above or below a moving average. Formally, the value of %b of x at day i is:

x'_i = \frac{x_i - (\bar{x}_{i-n,i} - k\sigma_{i-n,i})}{(\bar{x}_{i-n,i} + k\sigma_{i-n,i}) - (\bar{x}_{i-n,i} - k\sigma_{i-n,i})}, (6.5)

which may be simplified to:

x'_i = \frac{x_i - \bar{x}_{i-n,i} + k\sigma_{i-n,i}}{2k\sigma_{i-n,i}}, (6.6)

where:

\bar{x}_{i-n,i} is the average of the last n days up to and including day i,
\sigma_{i-n,i} is the population standard deviation, calculated over the last n days,
k is a constant, and
\bar{x}_{i-n,i} + k\sigma_{i-n,i} and \bar{x}_{i-n,i} - k\sigma_{i-n,i} are the upper and lower Bollinger Bands, respectively.

Bollinger Bandwidth

The Bollinger bandwidth of a time series is a measure of volatility, also described in [1]. Using the notation defined for %b, it is the distance between two bands k standard deviations above or below a moving average, normalized by the moving average:

x'_i = \frac{(\bar{x}_{i-n,i} + k\sigma_{i-n,i}) - (\bar{x}_{i-n,i} - k\sigma_{i-n,i})}{\bar{x}_{i-n,i}}, (6.7)

which simplifies to:

x'_i = \frac{2k\sigma_{i-n,i}}{\bar{x}_{i-n,i}}. (6.8)

6.2.4 Correlation Tests

There exist several statistical tests designed to measure the level of correlation between two sets of data. As these tests measure correlation in different ways, the interpretation of each result will differ. A dataset that produces a high correlation score using one method may produce a low score using another method. Restricting the measure of correlation to a single figure therefore risks ignoring interesting results, in particular when the characteristics of the underlying data are unknown. For this reason two different commonly used statistical measures of correlation are described below, as well as methods for computing the p-value and confidence interval of each measure.

Pearson Product-Moment Correlation Coefficient r

The PPMCC, described in [24], is a measure of the linear correlation between two sets of data. For time series x and y of size n, the correlation coefficient r is given by:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}. (6.9)

The coefficient r will tend to 1 when there is a strong positive correlation between x and y, and to -1 when there is a strong negative correlation. To get a feeling for how the size of r relates to the data, notice that each term (x_i - \bar{x})(y_i - \bar{y}) will grow large precisely when both x_i and y_i deviate significantly from their respective means \bar{x} and \bar{y}.
Also, notice that when x and y are uncorrelated the sign of each term of the summation will be random, and the summation will therefore tend to zero.
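For concreteness, the median differential (Equation 6.1), rate-of-change (Equation 6.3), %b (Equation 6.6) and Pearson's r (Equation 6.9) might be implemented as below. This is an illustrative sketch, not the project's code; the alignment convention (output starting at the first day with a full window) is an assumption, and the window \{x_{i-n}, \ldots, x_i\} is taken to contain n + 1 values, mirroring the set notation above.

```python
import math

def median_differential(x, n):
    """x'_i = x_i - median(x_{i-n}..x_i)  (Equation 6.1)."""
    out = []
    for i in range(n, len(x)):
        window = sorted(x[i - n:i + 1])
        m = len(window)
        med = (window[m // 2] if m % 2 else
               (window[m // 2 - 1] + window[m // 2]) / 2)
        out.append(x[i] - med)
    return out

def rate_of_change(x):
    """x'_i = (x_i - x_{i-1}) / x_{i-1}  (Equation 6.3)."""
    return [(x[i] - x[i - 1]) / x[i - 1] for i in range(1, len(x))]

def percent_b(x, n, k=2):
    """%b (Equation 6.6); assumes a non-constant window (sigma > 0)."""
    out = []
    for i in range(n, len(x)):
        window = x[i - n:i + 1]
        mean = sum(window) / len(window)
        sigma = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
        out.append((x[i] - mean + k * sigma) / (2 * k * sigma))
    return out

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient (Equation 6.9)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den
```

In the project these derivative functions are applied to both Twitter and market raw statistics before the correlation tests are run.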

Computing the p-value of Pearson's r, Standard Method

For the standard method of computing the p-value of r when the sample size is n, let

T = r \sqrt{\frac{n - 2}{1 - r^2}}. (6.10)

If the data x and y are normally distributed, T will follow a Student's t-distribution with n - 2 degrees of freedom, and the p-value can easily be computed using this distribution [1]. This standard p-value will be referred to as p_r. However, if the data is not normally distributed the method may give misleading results. The effects of assuming normal data, and reasons for believing that the data is non-normal in this particular case, are discussed in Section 8.5.

Computing the p-value of Pearson's r, Bootstrap Method

When calculating, for instance, p-values or confidence intervals, you are in effect making a statement about the probability distribution of your data. One may, as in the calculation of p_r described above, make some implicit or explicit assumption about the distribution of the data. If one has good reason to believe that the assumption is correct, or at least not too wrong, this can be a good idea. A second alternative is to use a non-parametric method known as bootstrapping. Bootstrapping does away with making assumptions about the underlying distribution by instead utilizing the one fact one can state with certainty: whatever the distribution of the observed data, the data will consist of observations from that distribution.

The following bootstrap-based method for calculating the p-value of r is suggested in [1]. To distinguish this value from that of the standard version, it will be designated p_b. The method begins with drawing bootstrap samples from the original sets of data. Let \{x, y\} be two time series of size n. A bootstrap sample \{x^*, y^*\} can be constructed from \{x, y\} by randomly choosing n observation pairs \{x_i, y_i\}, 1 \le i \le n, from \{x, y\} with replacement and with equal probability of choosing each i.
By creating n_boot such bootstrap samples {X*, Y*} = {{x*_1, y*_1}, {x*_2, y*_2}, ..., {x*_n_boot, y*_n_boot}}, the p-value can then be estimated in the following way:

1. Perform the correlation test for Pearson's r as described above for each of the n_boot bootstrap samples, creating a set of correlation coefficients r* = {r*_1, r*_2, ..., r*_n_boot} where each r*_i corresponds to a bootstrap sample {x*_i, y*_i}.

2. If the correlation is positive, the p-value is equal to 2 times the proportion of r* that is less than zero. Adopting the convention that false = 0 and true = 1, this can be expressed as:

p_b = 2 Σ_{i=1}^{n_boot} (r*_i < 0) / n_boot. (6.11)

If the correlation is negative, the proportion of r* greater than zero is used instead:

p_b = 2 Σ_{i=1}^{n_boot} (r*_i > 0) / n_boot. (6.12)
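As an illustration, Equations 6.11 and 6.12 can be implemented in a few lines of Python. This is a minimal sketch; the function names are mine and do not correspond to the thesis code:

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def bootstrap_p(x, y, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value p_b for Pearson's r (Eqs. 6.11/6.12)."""
    rng = random.Random(seed)
    n = len(x)
    r = pearson_r(x, y)
    r_star = []
    for _ in range(n_boot):
        # Resample observation *pairs* with replacement, keeping x_i with y_i.
        idx = [rng.randrange(n) for _ in range(n)]
        r_star.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    if r >= 0:
        return 2 * sum(ri < 0 for ri in r_star) / n_boot
    return 2 * sum(ri > 0 for ri in r_star) / n_boot
```

Note that the pairs are resampled together; resampling x and y independently would destroy exactly the association being tested.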

The method has been tested using data known to be normal; in such cases the bootstrap method described here produces results close to those of the standard method described above. The general method of bootstrapping is described in more detail in [17].

Confidence Interval for Pearson's r

The 100(1 − α)% confidence interval for Pearson's r correlation coefficient is also calculated using the bootstrap samples described above. The process involves the following steps:

1. Construct the set of bootstrap-based correlation coefficients r* as described above.
2. Sort r* in ascending order.
3. Let i_lower be the nearest integer to n_boot · α/2.
4. Let i_upper be the nearest integer to n_boot · (1 − α/2).
5. Finally, let CI_{r,α} = [r*_{i_lower}, r*_{i_upper}] be the 100(1 − α)% confidence interval for Pearson's r.

Kendall Rank Correlation Coefficient τ

The KRCC, also described in [24], is based on the concept of concordance and discordance. A pair of observations {x_i, y_i} and {x_j, y_j} is said to be concordant if

(x_i > x_j ∧ y_i > y_j) ∨ (x_i < x_j ∧ y_i < y_j) (6.13)

and discordant otherwise. Between the datasets x and y of size n there are ½·n(n − 1) such pairs. Let c be the number of these pairs that are concordant and d the number of pairs that are discordant. Then the correlation coefficient τ is given by:

τ = (c − d) / (½·n(n − 1)) (6.14)

As with the PPMCC, τ will lie between 1 (strong positive correlation) and −1 (strong negative correlation). Informally, the measure τ can be interpreted as follows: if x_i is greater than x_j, how sure can one be that y_i will be greater than y_j?

The p-value of Kendall's τ

The p-value for τ is computed using permutations of the data {x, y} in a way that does not require the data to be normally distributed. This value will be termed p_τ to distinguish it from the p-values p_r and p_b defined for Pearson's r.
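Equation 6.14 and the permutation test for p_τ can be sketched as follows. This is a straightforward O(n²) illustration with names of my choosing, not the thesis implementation:

```python
import random

def kendall_tau(x, y):
    """Kendall's tau from concordant/discordant pair counts (Eq. 6.14)."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1  # concordant pair
            elif s < 0:
                d += 1  # discordant pair
    return (c - d) / (n * (n - 1) / 2)

def permutation_p(x, y, n_perm=2000, seed=0):
    """Permutation p-value p_tau: shuffling y breaks any real pairing,
    so the shuffled taus show what |tau| looks like under independence."""
    rng = random.Random(seed)
    observed = abs(kendall_tau(x, y))
    y = list(y)  # work on a copy so the caller's data is untouched
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y)
        if abs(kendall_tau(x, y)) >= observed:
            hits += 1
    return hits / n_perm
```

For the data sizes used in this project a tie-aware O(n log n) implementation would be preferable, but the quadratic version keeps the definition visible.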
Confidence Interval for Kendall's τ

The 100(1 − α)% confidence interval for the τ correlation coefficient, CI_{τ,α}, is calculated using a bootstrap method similar to that of the confidence interval for Pearson's r described above.
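The percentile construction is the same regardless of which correlation statistic is bootstrapped, so one sketch covers both CI_r and CI_τ. The index arithmetic below is shifted to 0-based indexing; the function name and signature are illustrative only:

```python
import random

def bootstrap_ci(x, y, stat, alpha=0.01, n_boot=2000, seed=0):
    """Percentile bootstrap 100(1 - alpha)% CI for any correlation
    statistic `stat`, e.g. Pearson's r or Kendall's tau."""
    rng = random.Random(seed)
    n = len(x)
    r_star = []
    for _ in range(n_boot):
        # Resample observation pairs with replacement, as for p_b.
        idx = [rng.randrange(n) for _ in range(n)]
        r_star.append(stat([x[i] for i in idx], [y[i] for i in idx]))
    r_star.sort()
    lower = round(n_boot * alpha / 2)
    upper = round(n_boot * (1 - alpha / 2)) - 1  # -1: 0-based indexing
    return r_star[lower], r_star[upper]
```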

Heuristics for Evaluating Correlations

An automated process is used to perform a large number of correlation tests and to identify the most interesting relationships. This section will begin with an explanation of why this is a necessary approach.

Suppose that there are n raw Twitter statistics and m raw market statistics, concerning s different subjects (i.e., stocks). Suppose also that there are p derivative functions, so that the total number of Twitter and market indicators for each subject becomes n(p + 1) and m(p + 1), respectively. This means that the total number of possible unique correlations between the sets of indicators is s · n(p + 1) · m(p + 1). However, this project is not only concerned with finding mere correlations, but also with finding instances where Twitter data is a leading indicator of the market data. Therefore it is also of interest to test correlation levels with market indicators lagged by, say, one or two days compared to the Twitter indicator. This brings the number of possible correlations up to 3 · s · n(p + 1) · m(p + 1).

Further, it is likely that the relationship between Twitter data and market data is not consistent over time. In other words, it may be considered unlikely that there exists a Twitter indicator which consistently predicts market behavior. Rather, one would expect that while a Twitter indicator may sometimes lead a market indicator, it will at other times (and perhaps more usually) coincide with or lag behind the market indicator. Simply put, by only considering the full time series one risks missing some potentially important correlations. Therefore it may be worthwhile to examine subdivisions of the gathered data in addition to the full time series in order to better detect phenomena which are only exhibited temporarily. The data is therefore partitioned in two different ways: by calendar month, and biweekly.
The biweekly partitions cover two full workweeks, with the possible exception of the first and last periods. As the available data in this project covers five full months and roughly eleven two-week intervals, there are a total of t time partitions to consider. The total number of correlations to examine is then:

3 · s · n(p + 1) · m(p + 1) · t = 8,816,472. (6.15)

To perform and evaluate all of these correlations manually is clearly not feasible. An alternative approach of only evaluating the correlations that are deemed most likely to produce interesting results would still require a significant effort, and would also carry a risk of overlooking many interesting results. The approach taken here is instead to programmatically perform and evaluate all the possible correlations, and then plot a number of the most promising results for manual evaluation. In short, the following steps are performed:

1. Raw statistics in the form of CSV files are read from disk.
2. Indicators are created by applying each derivative function to each raw statistic.
3. The data is split by calendar month and biweekly, creating a set of shorter time series in addition to the full dataset.
4. Lagged versions of the market indicators are created, using a lag factor of one and two.

5. A correlation test is performed for each possible combination of a Twitter indicator and a lagged or unlagged market indicator. The resulting correlation coefficients and p-values for each test are stored.
6. The most promising results are singled out, and a graph comparing the two indicators is produced.

The process of determining what constitutes a promising result takes into account the absolute value of the correlation coefficient, per-indicator-type limits, and minimum and maximum numbers of results to output per subject and time period.

Meta Analyses

The method described above, where hundreds of thousands of correlation tests are performed, produces a large body of data that may be used for meta analyses. In particular it allows for an evaluation of the relative effectiveness of the various proposed indicators. The following terminology will be used in the text below:

Definition. ρ shall refer to a set of results of correlation tests, where each result ρ ∈ ρ is defined by the following constituents:

ρ.r: The linear correlation coefficient r, ranging from −1 to 1.
ρ.p_r: The p-value of ρ.r.
ρ.twitter_indicator: The name of the Twitter indicator used in the test.
ρ.market_indicator: The name of the market indicator used in the test.

In the tests described below ρ contains 635,040 unique correlations. The correlations were carried out using monthly subdivisions of data gathered between the beginning of October 2010 and the end of February 2011, covering all fourteen stock symbols described above. The derivatives used were rate-of-change, absolute rate-of-change, median differential, absolute median differential and %b.

Further, the concept of indicator sets will be useful in the following text:

Definition. θ_x shall refer to the set of indicators related to or derived from the raw statistic x. θ_x can in most cases be thought of as containing any indicator whose name contains x. In the examples below this holds in all cases except the set θ_unfiltered.
Comparison of Market Indicators

Recall that six raw market statistics (open, close, high, low, adjusted close and volume) for each stock symbol have formed the basis for the market indicators used in this project. This section will describe two methods of comparing how well each of these statistics correlates, in an aggregate sense, with the gathered Twitter data.

The first method, referred to as the standard method later in the text, is based on counting the number of times an indicator derived from a particular market statistic is involved in a correlation where the result exceeds some given limit. Formally, using the terminology described above, let

Θ = {θ_open, θ_close, θ_high, θ_low, θ_adjusted_close, θ_volume} (6.16)

be the sets of indicators derived from each market statistic. Then, given the set of tests ρ described above, define a set of scores s_x, x ∈ {open, close, high, low, adjusted_close, volume}, as the number of correlations ρ ∈ ρ which satisfy

ρ.market_indicator ∈ θ_x ∧ |ρ.r| ≥ l (6.17)

for some limit l < 1.

However, depending on the characteristics of the set ρ, this method may give a misleading result. Imagine that the distribution of high-value correlations varies substantially between subjects (e.g., stock A correlates best with θ_open while stock B correlates best with θ_adjusted_close). Imagine further that the high-value correlations stem disproportionately from a small number of subjects. Then the result of the method described above will disproportionately reflect the characteristics of these particular subjects, at the expense of others. To address this concern the following method, later referred to as the alternate method, is proposed: For each subdivision (e.g., each month) of each subject, let ρ' be the n results with the highest absolute value of the correlation coefficient. Then, let s'_x be the number of correlations ρ ∈ ρ' that satisfy ρ.market_indicator ∈ θ_x. Finally, let the score s_x of the indicator set θ_x be the sum of all such scores s'_x over all subdivisions and subjects. Unlike the first method, the alternate method gives equal weight to each subject.

Comparison of Filtering Levels

Two basic filtering levels were defined in Section 3.5.1: no URLs and no URLs or RTs. We are interested in comparing these methods to each other, and to the option of not filtering the message set at all. Similarly to the preceding subsection, let:

Θ = {θ_unfiltered, θ_no_urls, θ_no_urls_or_rts} (6.18)

be the sets of indicators derived from the unfiltered message sets, and from each of the two filtering levels. The set Θ may then be evaluated using the standard and alternate methods described above.
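The standard and alternate scoring methods can be sketched as follows. The result records here are invented purely for illustration and are not actual thesis results:

```python
from collections import Counter

# Hypothetical result records; each rho in the thesis carries more fields,
# but only (subject, subdivision, market statistic, r) matter here.
results = [
    ("AAPL", "Oct", "volume", 0.82), ("AAPL", "Oct", "high", 0.71),
    ("AAPL", "Nov", "volume", 0.90), ("GOOG", "Oct", "open", 0.75),
    ("GOOG", "Oct", "volume", 0.40), ("GOOG", "Nov", "low", -0.78),
]

def standard_score(results, limit):
    """Count correlations with |r| >= limit, per market statistic (Eq. 6.17)."""
    score = Counter()
    for subject, month, stat, r in results:
        if abs(r) >= limit:
            score[stat] += 1
    return score

def alternate_score(results, n=1):
    """Per (subject, month) cell, keep only the n strongest |r|, then count.

    This gives each subject equal weight no matter how strongly it correlates.
    """
    by_cell = {}
    for rec in results:
        by_cell.setdefault((rec[0], rec[1]), []).append(rec)
    score = Counter()
    for recs in by_cell.values():
        recs.sort(key=lambda rec: abs(rec[3]), reverse=True)
        for _, _, stat, _ in recs[:n]:
            score[stat] += 1
    return score
```

In this toy data the standard method rewards AAPL's many strong volume correlations twice, while the alternate method caps each subject-month at one vote.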
Estimation of the Predictive Power of Twitter Indicators

In addition to presenting examples where a Twitter indicator appears to lead a market indicator, and calculating correlation levels for such lagged relationships, it is possible to use a large number of correlation tests to calculate an aggregate measure of how well Twitter indicators correlate with market indicators at some point in the future. Given sets of correlations ρ_0, ρ_1, ..., ρ_n, where the correlation coefficients and p-values of ρ_i (i ∈ {0, 1, ..., n}) are determined by lagging the market indicator by i days compared to the Twitter indicator, let the score for lag i be the number of correlations ρ ∈ ρ_i which satisfy:

ρ.p_r < α (6.19)

where α is some chosen significance level. One can expect that this score will be highest for the direct correlations (lag 0) and then fall off towards some attractor, representing the number of correlations that will be significant at the level α by mere fluke, as i grows large. The rate at which this falloff occurs will then provide an indirect hint of the predictive power of the Twitter indicators over the market indicators.
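The falloff idea can be illustrated on synthetic data, where a market series is constructed to echo a Twitter series one day later. The data and names below are hypothetical and have nothing to do with the project's actual series:

```python
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) *
                  sum((b - my) ** 2 for b in y)) ** 0.5

def lagged_r(twitter, market, lag):
    """Correlate each day's Twitter value with the market value `lag` days later."""
    if lag == 0:
        return pearson_r(twitter, market)
    return pearson_r(twitter[:-lag], market[lag:])

# Synthetic example: the market echoes the Twitter series one day later,
# so the lag-1 correlation should dominate and other lags should be weak.
rng = random.Random(0)
twitter = [rng.gauss(0, 1) for _ in range(200)]
market = [0.0] + [t + 0.3 * rng.gauss(0, 1) for t in twitter[:-1]]
falloff = [round(lagged_r(twitter, market, lag), 2) for lag in range(4)]
```

In the thesis setting the per-lag score is not a single r but a count of tests with p_r below α; the shape of the curve over lags is what carries the information in both cases.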

If the falloff is very rapid (e.g., if the scores for lags 1 and 2 are as low or nearly as low as for lags that lie near the attractor), this suggests that the values of the Twitter indicators today are largely unrelated to the values of the market indicators one or two days from now. On the other hand, if the falloff is slower (e.g., if the scores for lags 1 or 2 are nearly as high as for lag 0, or at least significantly higher than for later lags), this suggests that the values of the Twitter indicators provide some hints of the future values of the market indicators.

Evaluation of Sentiment Analysis Approaches

Recall that two related sets of raw statistics were defined based on sentiment scores. The difference between the two sets is that one (pos_val, pos_adj_val, neg_val and neg_adj_val) is based on the positive and negative sentiment score (0 to 2) of each tweet, while the other set (pos_cnt, pos_prop, neg_cnt and neg_prop, as well as the statistics mix_cnt, mix_prop, neu_cnt and neu_prop, which will not be considered here) is based solely on the number of tweets that are considered positive or negative. The counting method can alternatively be thought of as setting all sentiment scores of 2 to 1. The difference between the related statistics (i.e., pos_cnt/pos_val, neg_cnt/neg_val, pos_prop/pos_adj_val and neg_prop/neg_adj_val) will typically be small. This is because the only difference will come as a result of the relatively small number of tweets that match one of the specialized lexica described earlier. However, if there exists a significant difference between the two sets in the correlation levels achieved using the indicators based on each set, this would provide an indirect indication of the relative effectiveness of the two approaches to sentiment analysis.
Formally, let

Θ = {θ_pos_val, θ_pos_adj_val, θ_neg_val, θ_neg_adj_val} (6.20)

be the sets of all indicators derived from the scoring method and

Θ̂ = {θ_pos_cnt, θ_pos_prop, θ_neg_cnt, θ_neg_prop} (6.21)

be the sets of all the complementary indicators derived from the counting method. Then, in the same way as in the last section, define a set of scores s_x, x ∈ {pos_val, pos_adj_val, neg_val, neg_adj_val, pos_cnt, pos_prop, neg_cnt, neg_prop}, as the number of correlations ρ ∈ ρ which satisfy

ρ.twitter_indicator ∈ θ_x ∧ |ρ.r| ≥ l (6.22)

for some limit l < 1.

6.3 Results

Even for conservative limits the automated method for finding interesting correlations described above will produce thousands of outputs. This means that, as a practical matter, this results section must be restricted to presenting a selection of the results rather than the full result set. Such examples are given in Section 6.3.1, which presents correlations between Twitter indicators and trade volume, and Section 6.3.2, which presents correlations between Twitter indicators and stock value. Each section presents examples of direct correlations as well as examples where the Twitter indicator at times acts as a leading indicator of the market indicator. For this

reason, summary statistical results are presented using both lagged and unlagged data at the end of each subsection. Section 6.3.3 presents the results of the meta analyses, and gives a clearer picture of the relative effectiveness of the various proposed indicators.

6.3.1 Stock Market Correlations Related to Trade Volume

The following section will give several examples where Twitter indicators related to a stock symbol correlate with the daily trade volume of the stock. For each correlation the following statistical measures, described in detail in Section 6.2.4, have been computed:

Linear correlation r: Will be close to 1 for perfect positive correlation and close to −1 for perfect negative correlation.
p_r: The standard p-value for r; assumes a normal distribution.
p_b: The bootstrap p-value for r; makes no assumption of distribution.
CI_{r,0.01}: A 99% confidence interval for r.
Rank correlation τ: Like r, will range from −1 to 1.
p_τ: The p-value for τ.
CI_{τ,0.01}: A 99% confidence interval for τ.

Some of these numeric results are described in the running text, while the full set of results is given in a table at the end of this section.

Correlation Examples

Figures 6.1, 6.2 and 6.3 show the correlation between daily Twitter message volume and trade volume for Apple, Google and Intel, respectively. In each case tweets containing URLs have been filtered out, and in the case of Google also retweeted messages. The correlation coefficients are r = 0.8 and τ = 0.4 in the case of Apple, r = 0.75 and τ = 0.27 in the case of Google, and r = 0.63 and τ = 0.38 in the case of Intel. Figure 6.4, showing message volume mentioning Cisco's stock symbol compared to trade volume, provides one of relatively few examples where the unfiltered message volume provides a larger correlation value compared to the corresponding filtered indicators.
Figures 6.5, 6.6 and 6.7 demonstrate how indicators not directly derived from message volume, in this case the positive value statistic, can also correlate well with trade volume. Notice that in this particular case the positive value statistic in effect acts as a filtered version of message volume, and one can therefore not automatically draw the conclusion that trade volume is associated with an increase in positive sentiment. Figures 6.6 and 6.7 also demonstrate how the median differential derivative function can improve the linear correlation level by removing a degree of trend in both datasets. The linear correlation grows from r = 0.55 for the data in the first figure to r = 0.65 in the second. However, the operation reduces the rank correlation τ, which goes from 0.21 in the first example to 0.17 in the second.

Figure 6.1: Relationship between Twitter indicator Total count, no URLs and market indicator Volume for Apple.

Figure 6.2: Relationship between Twitter indicator Total count, no URLs or RTs and market indicator Volume for Google.

Figure 6.3: Relationship between Twitter indicator Total count, no URLs and market indicator Volume for Intel.

Figure 6.4: Relationship between Twitter indicator Total count and market indicator Volume for Cisco.

Figure 6.5: Relationship between Twitter indicator Median differential of positive value and market indicator Median differential of volume for Amgen.

Figure 6.6: Relationship between Twitter indicator Positive value, no URLs or RTs and market indicator Volume for Qualcomm.

Figure 6.7: Relationship between Twitter indicator Median differential of positive value, no URLs or RTs and market indicator Median differential of volume for Qualcomm.

Several of the examples show instances where a large spike in the Twitter indicator one day is followed by a large spike in the market indicator the next day. This is reflected in correlation levels that remain high even when the Twitter indicator is compared to the market indicator of the following day. In the case of Qualcomm, for example, the unlagged correlation is r = 0.55 in Figure 6.6, and the correlation shrinks only slightly to r = 0.51 when the Twitter indicator is lagged by one day.

Summary Statistics

Tables 6.1 and 6.2 present summary statistics for each of the examples given above. Table 6.1 shows the values of the statistics using unlagged data. The second set of results, shown in Table 6.2, were computed with the market indicator lagged by one day compared to the Twitter indicator. Simply put, this table shows how well today's Twitter data correlates with tomorrow's market. The measures are described in Section 6.2.4. One million bootstrap samples have been used for each of the bootstrap statistics.

Table 6.1: Summary statistics for correlations related to trade volume, using unlagged data.

Table 6.2: Summary statistics for correlations related to trade volume, with the market indicator lagged by one day compared to the Twitter indicator.

6.3.2 Stock Market Correlations Related to Stock Value

The following sections provide examples where Twitter indicators correlate with the market value of stocks. The same summary statistics as in the previous subsection have been computed for each example, and are presented at the end of the section for both unlagged and lagged data.

Correlation Examples

Figures 6.8 and 6.9 show the relationship between Apple's stock value and aggregate Twitter sentiment. The first figure shows a positive correlation (r = 0.47 and τ = 0.32) between the rate of change of the daily closing price and the median differential of the proportion of tweets mentioning AAPL that are deemed positive, with URLs and retweets filtered out. In the second figure the negative tweet proportion substitutes for the positive. The correlation also turns negative (r = −0.53 and τ = −0.37), as there is a clear tendency for the Twitter indicator to go down as the market indicator goes up, and vice versa. (Notice that it is not automatically the case that if there is a positive correlation between a market indicator and positive Twitter sentiment, there will also be a negative correlation between the market indicator and negative sentiment. This is because tweets are not necessarily regarded as either positive or negative.)

Figure 6.8: Relationship between Twitter indicator Median differential of positive proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple.

Figure 6.9: Relationship between Twitter indicator Median differential of negative proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple.

Figures 6.10 and 6.11 show the same relationships but with the time frame restricted to November 2010 for readability. During this period the correlations grow stronger compared to the full time frame: the positive-sentiment correlation to r = 0.77 and τ = 0.5, and the negative-sentiment correlation to r = −0.68 and τ = −0.55.

Figure 6.10: Relationship during November 2010 between Twitter indicator Median differential of positive proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple.

Figure 6.11: Relationship during November 2010 between Twitter indicator Median differential of negative proportion, no URLs or RTs and market indicator Rate-of-change of close for Apple.

Figure 6.12 shows the relationship for Qualcomm between the positive value Twitter indicator and the closing price of the stock. Notice in particular that all three major spikes in the Twitter indicator are succeeded by a spike in the closing price of the stock on the following day. A noteworthy property of this particular example is that the linear correlation is actually better when the Twitter indicator for each day is compared to the market indicator of the following day; for unlagged data we have r = 0.47, and r = 0.6 for the lagged indicator.

Figure 6.12: Relationship between Twitter indicator Positive value, no URLs or RTs and market indicator Median differential of close for Qualcomm.

Summary Statistics

As with the previous set of results, a number of summary statistics have been computed for each example. These are shown in Table 6.3 (correlations computed using unlagged data) and Table 6.4 (correlations computed with the market indicator lagged by one day compared to the Twitter indicator). As with the previous set of results, one million bootstrap samples were used for all bootstrap statistics.

Table 6.3: Summary statistics for correlations related to stock value, using unlagged data.

Table 6.4: Summary statistics for correlations related to stock value, with the market indicator lagged by one day compared to the Twitter indicator.

6.3.3 Results of Meta Analyses

Comparison of Market Indicators

The tests using the first method described above were performed using three different correlation limits: 0.7, 0.8 and 0.9. In all test cases a majority (53%-67%) of the correlations are related to the daily trade volume. Among the other market statistics (opening price, closing price, adjusted closing price, daily high and daily low) the results were distributed fairly equally, with the exceptions of daily high (which generally produced more high-value correlations than the others) and daily low (which generally scored lowest). The proportion of results related to each market statistic for all three tests can be seen in Figure 6.13.

Figure 6.13: Proportion of high-value correlations related to each market statistic using three different limits.

For the alternate method the n = 10 most significant correlations of each subject and month were used. Though the proportion of correlations related to trade volume was somewhat lower (40%), the results generally agree with those of the first method, with daily high coming in second and daily low last. The results are shown in Figure 6.14.

Figure 6.14: Proportion of high-value correlations related to each market statistic using the alternate method.

The complete numeric results for both the standard and alternate methods are shown in Table 6.5.

Table 6.5: Number of high-value correlations related to each market statistic using three different limits with the standard method, as well as the alternate method.

Comparison of Filtering Levels

As in the previous subsection, the tests using the first method were performed using three different correlation limits: 0.7, 0.8 and 0.9. The results can be seen in Figure 6.15. The results of the alternate method, again using the n = 10 most significant correlations, can be seen in Figure 6.16. The complete numeric results are listed in Table 6.6.

Figure 6.15: Proportion of high-value correlations related to each filtering level using three different limits.

Figure 6.16: Proportion of high-value correlations related to each filtering level using the alternate method.

Table 6.6: Number of high-value correlations related to each filtering level using three different limits with the standard method, as well as the alternate method.

Estimation of the Predictive Power of Twitter Indicators

The test was performed using data lagged by 0-7 days, using the same base set of correlations that was used in the other tests described in this section and a significance level of α = 0.001. The results are shown in Figure 6.17, and also numerically in Table 6.7.

Figure 6.17: Number of significant correlations at the 0.1% significance level for 0-7 days of lag.

As expected, the number of significant correlations is highest for the unlagged data, followed by a falloff period (in this case, lags 1 and 2) and ending in what appears to be an equilibrium state of roughly two thousand correlations for higher lags.

Table 6.7: Number of significant correlations at the 0.1% significance level for 0-7 days of lag.

Evaluation of Sentiment Analysis Approaches

As in the comparisons of market indicators and filtering levels above, the tests were performed using correlation limits 0.7, 0.8 and 0.9. The results are listed in Table 6.8.

Table 6.8: Number of high-value correlations related to each sentiment-related indicator set using three different limits with the standard method, as well as the alternate method.

Each figure in the table represents the number of correlation tests where an indicator from the given indicator set was used and the absolute value of the correlation coefficient exceeded the given limit. For comparison, each indicator set is featured in a total of 45,360 correlation tests. In all cases but one, the count-based indicator sets generated as many or more correlations above the limits than did the score-based indicator sets.

6.4 Conclusions and Discussion

The results presented above show that there exist clear correlations between Twitter indicators and market indicators. The correlations are most prevalent between message volume

and trade volume, but are also evident with market value. In several cases a large swing in a market indicator is preceded by a large swing in a Twitter indicator.

On the Interpretation of Statistical Results

The p_r-value for the correlation between the median differential of Apple's trading volume and the median differential of the hype statistic without URLs, shown in Figure 6.4, is roughly one third of the mass of the electron measured in kilograms, i.e., on the order of 3·10⁻³¹. While this figure should not be taken too literally, since it is based on an assumption of normally distributed data, it does rather conclusively rule out the possibility that the correlation is solely the result of a statistical fluke. Nor do the bootstrapped p-values for r and the p-values for τ, neither of which assumes normality, leave much room for the possibility of statistical flukes, either in this particular example or the others.

Choudbury [13] provides rules of thumb for interpreting different values of the linear correlation r. A value of r of 0.3 to 0.5 or −0.3 to −0.5 can be said to signify a moderate linear relationship, while an r of 0.5 to 1 or −0.5 to −1 can be said to signify a strong relationship. Given these metrics, one can state that the relationships shown in Figure 6.1, as well as the relationships shown in Figures A.1, A.2, A.4 and A.9 in Appendix A, are strong, based on the fact that the 99% confidence interval for r is in each case fully contained in the range 0.5 to 1 or −0.5 to −1. By that same token, one can claim that most of the other examples shown above exhibit correlation levels that are at least moderate. However, one should remember that any reduction of a complex relationship into a single numeric result is bound to discard large amounts of information.
This is perhaps best illustrated by Anscombe's quartet, described in [2] and shown in Figure 6.18. The quartet consists of four datasets with very similar summary statistics: the mean and variance of x and y, the linear regression line and, most importantly in this context, the linear correlation r are in each of the four cases equal to within two decimal points or more. In spite of this, it becomes obvious when looking at the plotted data that each dataset describes a different underlying phenomenon. In particular, it is worth keeping in mind that r is sensitive to outliers in two different senses. First, in the sense that if the outliers of two statistics correlate well, r will be high even if the correlation is fairly weak on normal days. Second, in the sense that r will be low if there is a mismatch in outliers, even if the correlation is very strong at other times. Kendall's τ does not suffer from this sensitivity. Further, while statistics such as r and τ may be useful tools, and while a statement such as "the curves look really similar" is obviously ill-suited for making any kind of scientific statement, it is entirely possible to have near-perfect correlation (both in the r-sense and the τ-sense) without this relationship contributing any sort of useful information. It would for example be entirely trivial to define a derivative function that produces perfect correlations all the time: simply have the function provide linearly increasing output for any input. A less extreme example, used in this project, is the derivative function Bollinger bandwidth. This function tended to produce many high-value correlations by smoothing out everything but the extreme swings, thus yielding high r values but not much information that was not already present in other indicators. It is partly because of this information-destroying property of some derivative functions that no statistical evaluation of their relative efficiency was carried out.
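The first sense of outlier sensitivity described above is easy to demonstrate. The following sketch (not from the thesis; it uses synthetic data and assumes NumPy and SciPy are available) shows how a single matched outlier inflates Pearson's r while leaving Kendall's τ, which only looks at pairwise orderings, nearly unchanged:

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

rng = np.random.default_rng(0)

# Two essentially unrelated daily series ("normal days").
x = rng.normal(size=100)
y = rng.normal(size=100)

r_plain = pearsonr(x, y)[0]
tau_plain = kendalltau(x, y)[0]

# Append one matched extreme spike to both series.
x_spiked = np.append(x, 25.0)
y_spiked = np.append(y, 25.0)

r_spiked = pearsonr(x_spiked, y_spiked)[0]
tau_spiked = kendalltau(x_spiked, y_spiked)[0]

# r is dragged towards 1 by the single matched outlier, while tau
# barely moves.
print(f"r:   {r_plain:+.2f} -> {r_spiked:+.2f}")
print(f"tau: {tau_plain:+.2f} -> {tau_spiked:+.2f}")
```

The mirror image of the experiment (a spike in only one of the series) correspondingly drags r down without much affecting τ.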
As a practical matter, this raises the question of whether it is possible to construct a method

of better separating high-r or high-τ correlations that contain useful information from ones that do not, or whether it would perhaps be better to replace these measures of correlation entirely. One measure that might prove useful is what one might call directional correlation. Simply put: if x_i is greater than x_{i-1}, what is the probability that y_i will also be greater than y_{i-1}?² This measure would have the benefit of a clear-cut interpretation, would most likely not be as sensitive as Kendall's τ to de-trending derivative functions, and would also be entirely distribution-agnostic. The measure itself, on the other hand, would have a well-known (binomial) distribution, eliminating the need for computationally intensive bootstrap methods.

² Notice that this bears some resemblance to the definition of Kendall's τ given earlier. The resemblance is, however, mostly superficial.

[Figure 6.18: Anscombe's quartet. An example of four very different datasets with very similar summary statistics.]

Evaluation of the Bootstrap Statistics

One example of the usefulness of the bootstrap method is given by the correlation between Cisco trade volume and total message volume shown in Figure 6.4. The linear correlation is very high (r = 0.84), due in large part to two major spikes where the Twitter indicator coincides with the market indicator. This is reflected by the low end of the bootstrapped 99% confidence interval: 0.3. Contrast this with, for example, Figure 6.1, which compares Apple's trade volume with a filtered measure of message volume. Even though the linear correlation is lower in this

case (r = 0.8), the low end of the confidence interval is significantly higher (0.56). This is because the correlation in this case is more consistently present throughout the sampled period, and the result is therefore less dependent on a couple of major spikes. The standard method of constructing confidence intervals for r (which also assumes normally distributed inputs) does not have this useful property.

Meta Analyses

A general note regarding the interpretation of the results of the meta analyses: The availability of hundreds of thousands of results of correlation tests is certainly an advantage when performing tests of this nature. If each correlation test were fully independent of every other, one could likely attribute a high degree of statistical significance even to results with a small effect size. However, one should keep in mind that this is not the case here. If, for example, one finds that a Twitter indicator x correlates well with some market indicator y for a given subject and time period, one would not be surprised to find that a slightly more filtered indicator x' also correlates well with y. The exact ways in which this interdependence may affect the results are difficult to assess. Partly because of this difficulty, formal tests of statistical significance have not been performed.

Comparison of Market Indicators

The results presented above agree with what may be the most reasonable a priori assumption: that the clearest and most significant correlations between social media data and market data will involve trade volume. After all, any major movement (up or down) is likely to generate online chatter, as will general market nervousness (which may generate a high trade volume even in the absence of exceptional price shifts). The fact that four tests using two different methods all paint the same basic picture strongly suggests that this assumption is indeed in line with the underlying situation.
A somewhat more interesting question is whether any conclusions can be drawn regarding the remaining market statistics. The results suggest that correlations with the daily high are the most frequent and correlations with the daily low the rarest, but unlike the case with trade volume there is no (immediately apparent) causal relationship that might explain why this would be the case. Even in the absence of such an explanation, however, the result does not necessarily need to be a fluke; there may be some statistical property of the daily high that happens to agree better with the Twitter indicators than do the other statistics. To say for sure, more study would be needed.

The need for the alternate method stems from observation: in reviewing the results of the correlation analysis it became clear that some subjects caused significantly more high-value correlations than others. The dataset is therefore at risk of (though not known to suffer from) the type of problem that the alternate method is meant to address. The results of the alternate method largely confirm the results of the first method, even though they suggest that there may exist some significant differences between subjects. This suggests that the data may not, after all, suffer from the aforementioned type of problem.

Comparison of Filtering Levels

The results of the comparison of the different filtering levels show an advantage for filtering out messages containing URLs compared to using no filtering, and an advantage for filtering

both URLs and retweets over just filtering URLs. The differences were largest for the highest correlation limit (r > 0.9) using the standard method, but quite modest for the other tests. It should be noted, however, that this does not mean that the three methods produce roughly the same results. This is because these tests only count how many correlations fulfill certain criteria, but do not look at which specific correlations those are. The results do however suggest, as has been speculated earlier, that the blanket removal of all tweets containing URLs may at times be overly harsh, and strip the dataset of useful information. An investigation of more targeted filtering levels may therefore be a worthwhile effort.

Estimation of the Predictive Power of Twitter Indicators

The results show that the first three values (unlagged: 15,646; lagged by one day: 11,767; lagged by two days: 5,366) stand out quite clearly from the remaining five results, which cluster around two thousand significant correlations (more precisely, these five values have a mean of 2,099.6 and a standard deviation of 482.8) with no clear trend of increase or decrease. This suggests that the Twitter indicators may have some predictive power over market indicators up to two days into the future. However, even the five remaining values exhibit many more significant correlations than one would expect assuming uncorrelated, normally distributed random data; if that were the case, one would expect numbers around 635. An enthusiastic observer might take this as an implication that Twitter indicators contain information foreshadowing market movements nearly two weeks³ into the future. I believe, however, that interpreting the results for lags greater than a couple of days in this way would stretch the hypothesis from plausible to slightly magical, and that the true explanation is some sort of statistical anomaly.
Candidate causes for this include the assumption of normality used in the calculation of the p-values combined with the apparent non-normality of the underlying data⁴, and the possibly correlation-exaggerating effect of certain derivative functions.

Evaluation of Sentiment Analysis Approaches

To reiterate, the purpose of the experiment was to investigate which of two classes of sentiment-based indicators produces the most high-value correlations with market data. The first method is based on a positive or negative sentiment score in the range of 0 to 2, while the other method simply counts the number of positive and negative tweets. The results show a slight advantage in favor of the somewhat more naive method of counting, suggesting that in the present setting that method is to be preferred. However, one should not take this as a test of the general validity of weighted-lexica scoring. The reasons for this are the relatively small effect size observed, coupled with the statistical arguments concerning dependent variables mentioned above, as well as the fact that this test only covers a single subject matter and only uses two quite small specialized lexica.

³ The data contains only weekdays, so an eight-day sample will span ten to twelve calendar days, depending on the starting day.

⁴ Discussed in Section 8.5.


Chapter 7

Predicting Market Movements

As shown in Chapter 6, the gathered Twitter data exhibits clear correlations with market data, and at times appears to foreshadow market movements. However, detecting such foreshadowings without the benefit of hindsight (in effect, predicting market behavior) is an entirely different proposition. One problem, briefly touched on in the previous chapter, must be addressed if one is to attempt to automate such predictions. It can be stated as the following conjecture: no social media indicator can reasonably be expected to correlate with market data in a way that is fully consistent over time. In other words, while the social media chatter may sometimes swell up before the market has time to respond, at other times the chatter will come at the same time as, or after, the market movement. An example of this is shown in Figure 6.2 in the previous chapter, which exhibits swings in the Twitter indicator preceding swings in the market indicator, as well as a coinciding swing in both indicators. The following chapter will suggest and evaluate an approach for addressing this particular problem.

7.1 Problem Statement

Assume that we have a social media indicator which has been found to correlate with a market indicator, as has been previously described. Consider then the hypothetical data shown in Figure 7.1, illustrating three basic event types:

1. A major market swing precedes a major swing in Twitter data.
2. A major swing in Twitter data precedes a major market swing.
3. A major market swing coincides with a major swing in Twitter data.

This section will focus on how events of type 2 can be detected automatically as the data becomes available over time. The problem can be stated as follows:

Problem Statement 7.1: Define an indicator which detects precisely those events where there is an exceptional swing in the social media indicator, without a coinciding or immediately preceding exceptional swing in the corresponding market indicator.

[Figure 7.1: Example data showing three event types: market leads Twitter reaction, Twitter leads market reaction, and both react simultaneously.]

7.2 Method

The approach taken here can be described as attenuating the Twitter indicator based on the data found in the market indicator. This is done in such a way that the event types that we are not interested in are dampened. The attenuated indicator can then be scanned for any remaining exceptional events. The process can be broken down into the following steps:

1. Define an attenuation factor based on the market data.
2. Define an attenuation method. This method should, based on the attenuation factor and the Twitter indicator, produce a new indicator with events of type 1 and 3 filtered out but events of type 2 remaining.
3. Define what constitutes an exceptional event in the new indicator produced by the attenuation method. In other words, under what circumstances should the new indicator be interpreted as foreshadowing a market swing?

The attenuation factor is defined in terms of a sliding median and a sliding standard deviation. Formally, using the notation for vector time series described in Section 6.2.2, for day i the attenuation factor a_i is defined as:

    a_i = max(b_i, b_{i-1}),    (7.1)

where:

    b_i = |x_i - x̃_{i-1-n,i-1}| / σ_{i-1-n,i-1} + 1,

x_i is the value of the market indicator x at day i, x̃_{i-1-n,i-1} is the median of the market indicator for the n days preceding day i, and σ_{i-1-n,i-1} is the standard deviation of the market indicator for the n days preceding day i. The attenuation factor is taken as the maximum of the two intermediate factors b_i and b_{i-1} in order to deal with situations where the market indicator flares up and then subsides before the Twitter indicator has time to react (e.g., the first event shown in Figure 7.1).
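The attenuation factor of Equation 7.1 can be sketched as follows. This is an illustration assuming NumPy, not the thesis's own implementation; the window size n, the example series and the function name are all invented here:

```python
import numpy as np

def attenuation_factors(x, n):
    """a_i = max(b_i, b_{i-1}), with b_i = |x_i - median| / std + 1,
    where the median and std are taken over the n days preceding day i.
    Assumes the sliding windows are not constant (std > 0)."""
    x = np.asarray(x, dtype=float)
    b = np.full(len(x), np.nan)
    for i in range(n, len(x)):
        window = x[i - n:i]          # the n days preceding day i
        b[i] = abs(x[i] - np.median(window)) / np.std(window) + 1
    a = np.full(len(x), np.nan)
    for i in range(n + 1, len(x)):
        a[i] = max(b[i], b[i - 1])   # catches flare-ups that quickly subside
    return a

# Quiet days yield small factors; the spike day (index 8) and the day
# after yield factors two orders of magnitude larger.
series = [10, 10.5, 10, 10.2, 10.1, 10.3, 10, 10.4, 50, 10.2, 10.1]
factors = attenuation_factors(series, 5)
```

Because the spike's influence carries over to the following day via max(b_i, b_{i-1}), a Twitter swing that trails a market flare-up by one day is still attenuated.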

The attenuated value y'_i of the Twitter indicator y at day i is then:

    y'_i = (y_i - ỹ_{i-1-n,i-1}) / a_i + ỹ_{i-1-n,i-1},    (7.2)

where y_i is the value of the Twitter indicator at day i and ỹ_{i-1-n,i-1} is the median of the indicator for the n days preceding day i. The definition of y'_i means that y will be compressed towards its sliding median by a factor a_i, which will grow large in the event of an exceptional event in the market indicator x. Further, the definition of a_i does not depend on the relative orders of magnitude of x and y. In other words, it does not matter if the market indicator is measured in the thousands and the Twitter indicator is measured in fractions of one, or vice versa.

Finally, in order to define what constitutes an exceptional event in the attenuated Twitter indicator y', the indicator %b is used, which gives a measure of how high or low the current value is in relation to the recent past. It was defined as a derivative function in Section 6.2.3, but in order to avoid confusion over differently named variables the definition is restated here:

    %b_i = (y_i - (ȳ_{i-n,i} - kσ_{i-n,i})) / ((ȳ_{i-n,i} + kσ_{i-n,i}) - (ȳ_{i-n,i} - kσ_{i-n,i}))
         = (y_i - (ȳ_{i-n,i} - kσ_{i-n,i})) / (2kσ_{i-n,i}),    (7.3)

where ȳ_{i-n,i} is the average of the last n days up to and including day i, σ_{i-n,i} is the population standard deviation, calculated over the last n days, k is a constant, and ȳ_{i-n,i} + kσ_{i-n,i} and ȳ_{i-n,i} - kσ_{i-n,i} are the upper and lower Bollinger Bands, respectively.

Equation 7.3 means in effect that if %b_i > 1, y_i is more than k standard deviations above the average of the last n days. Conversely, %b_i < 0 means that y_i is more than k standard deviations below the n-day average. For some suitable value of k, this can be said to signify that y_i is exceptionally high or exceptionally low.

7.3 Results

In order to provide a better understanding of how the method described above operates, Section 7.3.1 demonstrates its application to hypothetical Twitter and market data.
Section 7.3.2 tests the method using indicators derived from real Twitter and market data.

7.3.1 Example Using Hypothetical Data

Figure 7.2 shows time series of hypothetical Twitter and market data, and the results of applying the method described above to them. The topmost plot contains the input data. It shows the three basic event types described in Section 7.1, with indicators moving in the negative as well as the positive direction. The middle plot shows the effect of applying the attenuation method to the Twitter data. The last plot shows the %b prediction indicator. The indicator is only greater than 1 or less than 0 in exactly those instances where there is a major swing in the Twitter indicator that does not coincide with, and is not immediately preceded by, a major swing in the market indicator. This is fully in line with the intention of the method described above.
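The behaviour just described can be reproduced with a short end-to-end sketch. The code below is a simplified illustration, not the thesis implementation: the window size n, band width k, random seed and spike amplitudes are all invented. It builds a market series and a Twitter series with a coinciding spike (event type 3) and a Twitter-only spike (event type 2), applies Equations 7.1-7.3, and flags only the Twitter-only spike:

```python
import numpy as np

n, k, days = 10, 2.0, 80
rng = np.random.default_rng(1)
market  = 1 + 0.05 * rng.standard_normal(days)
twitter = 1 + 0.05 * rng.standard_normal(days)
market[30] += 50; twitter[30] += 10   # coinciding spikes: event type 3
twitter[60] += 10                     # Twitter-only spike: event type 2

med = lambda v, i: np.median(v[i - n:i])  # median of n days preceding i
std = lambda v, i: np.std(v[i - n:i])     # std of n days preceding i

# Equations 7.1 and 7.2: attenuate the Twitter series by market activity.
att = twitter.copy()
for i in range(n + 1, days):
    b = lambda j: abs(market[j] - med(market, j)) / std(market, j) + 1
    a = max(b(i), b(i - 1))                                        # Eq. 7.1
    att[i] = (twitter[i] - med(twitter, i)) / a + med(twitter, i)  # Eq. 7.2

# Equation 7.3: %b over the last n days up to and including day i.
alerts = []
for i in range(2 * n, days):
    w = att[i - n + 1:i + 1]
    pct_b = (att[i] - (w.mean() - k * w.std())) / (2 * k * w.std())
    if pct_b > 1:
        alerts.append(i)

# The coinciding spike at day 30 is attenuated away; the Twitter-only
# spike at day 60 pushes %b above 1 and is flagged.
print(alerts)
```

Note that the day-30 Twitter swing survives only as a tiny residual after division by the large attenuation factor, so it stays well inside the Bollinger Bands of the attenuated series.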

[Figure 7.2: The prediction method applied to example data. Panels: hypothetical Twitter data and hypothetical market data; attenuated hypothetical Twitter data; %b.]

7.3.2 Examples Using Real-World Data

The following section shows the results of applying the prediction method described above to several real-world examples. The first example, seen in Figure 7.3, shows the Twitter indicator positive value, with messages containing URLs and retweets filtered out, compared to the median differential of the closing price of Qualcomm. This is the same example as previously seen in Section 6.3.2. The example shows three instances where there is a major surge in the Twitter indicator preceding a corresponding surge in the market indicator, and in each instance the prediction indicator %b exceeds the boundary of 1, as intended. In the second instance %b remains over the boundary for two days, as this particular surge is spread over two days before peaking.

The second example, shown in Figure 7.4, shows the result of applying the method to the rate-of-change of a filtered measure of daily message volume compared to the rate-of-change of the daily closing price of the DirecTV stock. Again the %b indicator exceeds 1 in exactly the location where a swing in the Twitter indicator occurs just before a swing in the market indicator.

The final example, spread across Figures 7.5 and 7.6 for readability, compares a filtered version of message volume against the daily high of Cisco stock value. Each figure shows one instance where a swing in the Twitter indicator precedes a swing in the market indicator, and in each instance the %b indicator exceeds 1, as intended.

[Figure 7.3: Prediction indicator for Qualcomm. Panels: positive value, no URLs or RTs; %b; median differential of close. November through March.]

[Figure 7.4: Prediction indicator for DirecTV. Panels: rate of change of total count, no URLs or RTs; %b; rate of change of high. Mid-October through November.]

[Figure 7.5: Prediction indicator for Cisco, first time period. Panels: total count, no URLs or RTs; %b; median differential of high. November through December.]

[Figure 7.6: Prediction indicator for Cisco, second time period. Panels: total count, no URLs or RTs; %b; median differential of high. Mid-January through late February.]

7.4 Conclusions and Discussion

The results above show that the suggested method can be used to predict a certain class of swings in market data based on preceding swings in Twitter data, while filtering out swings in Twitter data that do not have any predictive power over the market data. This has been demonstrated using both synthetic and real-world data. However, it is of course easy to conceive of situations where the proposed method would not be effective. It is therefore worth keeping in mind that while this method may be effective in the type of situations it was designed to handle, it does not purport to be the end-all of predicting market swings based on social media data.

Evaluating the Prediction Method

One can imagine a number of different prediction indicators that differ from the method described here, either in some small detail or more fundamentally. The following section will therefore describe some of the considerations that have gone into the design of the method, as well as some of its properties that are likely to be desirable for any indicator with the same purpose.

To begin with, the method is to some extent self-adaptive. In the event of sustained turbulence in one of the input indicators, the method will quickly come to regard this as the new normal without any intervention, rather than regarding each swing as an exceptional event. Likewise, the method adapts to gradual changes in the characteristics of the input data (e.g., increased message volume over time). Further, the attenuation is defined in a way that disregards the direction of movement. This means that a sudden plunge in the market indicator will attenuate the Twitter indicator in roughly the same way as a sudden surge of the same magnitude.
This is a useful property because it matches how Twitter indicators frequently (though far from always) relate to market data; for example, both a surge and a plunge in a stock's price will likely cause a surge in message volume. As has been noted above, the definition of the attenuation factor also means that the relative orders of magnitude of the input data are irrelevant.

The decision to use a sliding median rather than a sliding mean for the computation of the attenuation factor (Equation 7.1) and the values of the attenuated Twitter indicator (Equation 7.2) is motivated by two factors. First, as is discussed in Section 8.5, both the market data and the Twitter data follow fat-tailed distributions. This means that the largest swings would be given undue weight if the sample mean were used, in particular for short window sizes. Second, the distributions of both message volume and trade volume (and possibly other statistics as well) are not symmetric about the mean; large swings are much more likely to be in the upward direction. This means that the sample mean will in many cases overestimate what constitutes a normal value of these statistics.
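The motivation above is easy to illustrate numerically. In the sketch below (synthetic numbers, not from the thesis data), a single large upward spike in a ten-day window drags the sliding mean far above the typical level, while the sliding median stays representative of a normal day:

```python
import numpy as np

# Ten days of message volume with one fat-tail spike on the last day.
window = np.array([100, 95, 110, 105, 98, 102, 97, 103, 99, 2000])

print("sliding mean:  ", window.mean())      # 290.9, far above a normal day
print("sliding median:", np.median(window))  # 101.0, still a typical value
```

Used as the "normal value" in Equations 7.1 and 7.2, the inflated mean would make subsequent ordinary days look like downward swings; the median does not have this problem.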


Chapter 8

Conclusions and Discussion

In addition to the conclusions and discussions presented at the end of each of Chapters 4-7, the following chapter will address matters relating to more than one of the previous chapters, as well as matters relating more generally to the topic of social media analysis. Possible improvements to the previously presented methods will also be discussed. The chapter will also summarize the major results of the project.

8.1 Summary of Results

The following list summarizes the primary results of this project:

- A new method for estimating writer sentiment in short messages, using tiered lexica, was proposed and implemented in Chapter 3.
- A method for predicting the results of televised competitions determined by public voting, before the announcement of the official result, was described in Chapter 4. The method was found to perform well when applied to the Idol television show.
- Chapter 5 demonstrated several examples where the publication of major news stories concerning a certain topic resulted in a large increase in message volume concerning that topic.
- Chapter 6 demonstrated several significant correlations between Twitter data and NASDAQ market data, or derivatives thereof, with further examples provided in Appendix A. The examples include several instances where a major swing in a Twitter indicator precedes a major swing in a market indicator.
- Chapter 6 also described meta-analyses based on a large number of correlation tests. These include an indirect measure of the predictive power of Twitter indicators with respect to market indicators. The results suggest that the Twitter data may contain information that can be used to predict market movements one or two days into the future.
- A method for automatically detecting certain instances where a movement in a Twitter indicator foreshadows a movement in a market indicator was proposed, implemented and tested in Chapter 7.

8.2 Can Twitter Indicators Predict Market Behavior?

Chapter 6 showed, among other things, examples where a movement in a Twitter indicator clearly precedes a movement in a corresponding market indicator, and Chapter 7 demonstrated a method of automatically detecting such instances. Chapter 6 also provided an aggregate statistic of correlation levels for lagged data, suggesting that some Twitter indicators may provide a degree of predictive power over market indicators. This raises two interesting, related questions. The first is: can the Twitter indicators ultimately be said to predict market movements? In the limited sense of the word prediction that is used in Chapter 7 the answer is clearly yes, as that chapter demonstrates. However, one should be careful not to attribute magical properties to these relationships; recall that there will always exist some underlying event or mechanism that drives the movements of both the Twitter and market indicators.

This leads to the related question of whether or not these predictions reveal information that has business value and would not otherwise be available. Or, in the language of the Efficient-Market Hypothesis described in Section 2.2.1: is the market informationally efficient with respect to the information set θ_twitter? One should assume, in the absence of hard evidence to the contrary, that most of the swings that would be predicted by the method proposed in Chapter 7 would also be predicted by a good business analyst. This includes events such as the release of business statements. However, it is plausible that the method could also be used to provide early warning for black-swan events, in particular if the event in question is exogenous to the information stream typically used to make business decisions.
8.3 Some Potential Dataset Problems in Social Media Analysis

Sections 8.3.1 to 8.3.3 below discuss some potential dataset problems which could in principle affect any attempt at social media analysis.

8.3.1 Demographic Tilting

The basic idea of social media analysis is to use what is written by one population (in this case Twitter users) to draw conclusions about some other population (in this case stock market participants and Swedish television viewers). It may therefore be of some importance to know to what extent the former models the latter. As is noted in Section 2.1.2, there are reasons to believe that there are significant statistical discrepancies between the set of Twitter users and the population as a whole. However, there are also reasons to believe that this particular problem is of somewhat less concern than the others described in this section. If the goal is to model a process which continuously produces new datapoints to compare against, demographic discrepancies can be adjusted for. If the goal is to forecast the outcome of a single event, such as a presidential election, this is somewhat more difficult. Further, demographics are unlikely to change rapidly in the middle of a sampling period for a well-established forum such as Twitter. This type of problem is therefore unlikely to pop up as a big surprise in the middle of a testing period.
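The kind of adjustment alluded to above can be as simple as post-stratification: reweighting per-group results so that the sample's demographic mix matches the target population's. All group shares and sentiment values below are invented for illustration, not taken from the thesis data:

```python
# Hypothetical age-group shares in the Twitter sample vs. the target
# population, and a hypothetical mean sentiment score per group.
twitter_share = {"18-29": 0.55, "30-49": 0.35, "50+": 0.10}
target_share  = {"18-29": 0.25, "30-49": 0.40, "50+": 0.35}
sentiment     = {"18-29": 0.60, "30-49": 0.50, "50+": 0.30}

raw      = sum(twitter_share[g] * sentiment[g] for g in sentiment)  # 0.535
adjusted = sum(target_share[g]  * sentiment[g] for g in sentiment)  # 0.455

# Because younger users are over-represented and more positive here,
# the unadjusted estimate overstates population-wide sentiment.
print(f"raw: {raw:.3f}  adjusted: {adjusted:.3f}")
```

The catch, as noted above, is that this requires reasonably reliable demographic information about both populations, which a single-event forecast may not have time to gather.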

8.3.2 Tilting Due to Automatic Postings

During the course of the project there were instances where what appeared to be large numbers of automatic posts made by applications had a significant impact on certain topics. This was especially prevalent in a gathered corpus of tweets related to consumer electronics¹. Examples include a game for the iPhone which appears to automatically post updates concerning the player's progress. A second example, described in Section 5.3.5, consisted of hundreds of tweets posted over several days containing the Intel stock symbol, but which were unrelated to Intel. These tweets did not contain any URLs, and would hence not have been filtered out by either of the filtering levels used in this project.

8.3.3 Tilting Due to Public Relations Efforts and Targeted Attacks

Posting overtly or covertly promotional content to Twitter for the advancement of some product, service or political candidate is a rather obvious use-case for social media. Section 5.3.5 shows the risk of such efforts tilting the results of social media analysis. In this particular case, there were hundreds of messages promoting a market analysis service. The tweets, though near-identical in content, were nevertheless spread across multiple users in, one suspects, an effort to disguise the fact that this was in effect spam.

[33] documents an effort to affect the outcome of a United States Senate race by generating Twitter buzz. The authors describe a practice that they term a "Twitter bomb", comparable to the practice of Google bombing to inflate the search ranking of a particular web-page. The described practice is thought to be designed to circumvent spam detection tools. One should also be aware of the risk that, as the use of social media analysis grows, there may also be attempts made at subverting the results of such analyses.
Without proper safeguards, such an attack would be trivial to carry out for anyone with even a rough knowledge of a particular analysis effort.

¹ These were not ultimately used in any experiment.

8.4 Summary Evaluations of Methods

The following section evaluates the choice of search terms, the sentiment analysis and the filtering levels used in this project.

Choice of Search Terms

The dataset used in Chapters 5 to 7 uses only messages containing the stock symbol of each company of interest. The clear alternatives to this choice of search terms would be to also include the common name of each company, or to use the common name exclusively. It is likely that the choice made played a part in making the correlations with stock market movements as clear as they were, as people tweeting stock symbols are likely to have a keen interest in, or help shape, market movements. In particular, this applies to companies where there is a clear public interest component, such as Google and Apple, where a large portion of the chatter using the companies' common names will likely be unrelated to market movements.

On the other hand, this is a choice that discards large amounts of information. If the set of all tweets contains information that may be used to explain or predict market movements

in excess of what a good business analyst would be capable of, this information must be exogenous to the information stream that such an analyst would typically use. This will not generally be the case for tweets containing stock symbols. Therefore, it is likely that such information would instead be found in tweets containing the common name of each company, or even in related search terms such as the names of commonly used products produced by the company. Such a widening of the search terms would, however, make a dataset that is already in many ways ill-behaved even harder to handle, by making the public interest component, established in Chapter 5, even stronger. It is entirely possible that this would weaken, or eliminate entirely, the direct correlations found in Chapter 6.

Sentiment Analysis

Results such as the relationship between sentiment levels in tweets mentioning Apple's stock symbol and the daily closing price of Apple stock, shown in Figures 6.8 to 6.11, suggest that the approach to sentiment analysis used in this work is effective to some degree. The fact that there is both a positive correlation between positive sentiment and stock value, and a negative correlation between negative sentiment and stock value, is especially encouraging. An informal review of the tiered-lexica approach also suggests that this particular aspect contributes positively to the performance of the analysis. The approach is, however, somewhat simplistic and could likely be improved further.

Filtering Levels

Section 8.3 above describes several potential dataset problems that this type of social media analysis is susceptible to. These problems may be addressed by applying various levels of filtering. The filtering levels used in this project were shown in Chapter 6 to produce somewhat better results than the unfiltered indicators. In particular, the practice of excluding tweets containing URLs is effective in weeding out promotional content.
However, this filtering will also exclude a lot of legitimate content and may be a bit more heavy-handed than one could wish. In order to better address the concerns described above it may prove fruitful to expand the filtering by one or more of the following methods:

- Applying per-user limits, as suggested in Section 5.4.
- Only considering tweets containing unique text, or even messages containing mostly unique text.
- Utilizing social-graph information, for example excluding tweets from recently created users or from users who are not followed by any other users.

8.5 On the Tailedness of Markets and Social Media Data

For the statistical analysis of data there is sometimes a simplifying assumption made that the underlying data follows the normal, bell curve, distribution. One prime example of this

is the Black-Scholes model, described in [7], which is a set of partial differential equations widely used to mathematically describe markets and investment risk. In [31], the mathematician Benoit Mandelbrot argues forcefully that this assumption of normality is not only wrong but also dangerous, and that markets instead follow a distribution with fat tails. The distinguishing feature of a fat-tailed distribution compared to a normal distribution is that the former will exhibit many more events that deviate from the mean by more than a few standard deviations. Among other evidence, Mandelbrot points to the Dow Jones exhibiting many more daily changes over 5σ since 1915 than the normal distribution could possibly account for. This leads, in effect, to an underestimation of risk (technically, kurtosis risk).

A simple example can illustrate that the Twitter data examined here is also unlikely to be normally distributed. Looking at the message volume for CSCO, the four top-most observations deviate from the mean by 8.0σ (a one-in-one-million-billion event under the assumption of normality), 5.8σ (one-in-142-million), 3.9σ (one-in-263), and 3.4σ (one-in-152), all within a time frame of just 163 data points. This suggests that methods which rely on an assumption of normality may give misleading results. An effort has therefore been made in this work to either avoid such methods completely, or to at least complement the results with a comparable measure that does not assume normality. This applies, in particular, to the bootstrap-based methods described earlier.

8.6 Some Final Thoughts

For all the time that has been spent on this project, for every aspect that has been evaluated from this angle and that, for every use-case that has been considered and for every statistical test that has been applied to test those use-cases, there is a feeling that the surface has hardly been scratched.
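The back-of-the-envelope tail comparison in Section 8.5 can be reproduced with the standard normal survival function. The sketch below is illustrative only, not code from the project; it counts large upward deviations in a sample and gives the odds the normal distribution would assign a single such deviation:

```python
import math
from statistics import mean, stdev

def normal_tail_odds(k):
    """Return 1/p where p = P(Z > k) for a standard normal Z
    (one-sided upper tail, computed via the complementary error function)."""
    p = 0.5 * math.erfc(k / math.sqrt(2))
    return 1.0 / p

def count_exceedances(xs, k):
    """Count observations lying more than k sample standard deviations
    above the sample mean."""
    m, s = mean(xs), stdev(xs)
    return sum(1 for x in xs if (x - m) / s > k)

# Under normality a >3-sigma spike is roughly a one-in-741 event per
# observation, so several such spikes in ~160 daily volume counts is
# strong evidence against the normal assumption.
```

Comparing `count_exceedances` against `normal_tail_odds` for a few thresholds is essentially the argument made for the CSCO volume series above.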
This project has certainly not been starved for interesting problems to investigate. With almost 100 million tweets posted each day by millions of users expressing themselves on a myriad of subjects, this is hardly surprising. To find that there were no correlations between the flow of messages and some real-life phenomena would have been a more remarkable result than any presented in this thesis. This chapter therefore ends with a few thoughts concerning subjects that for practical reasons were left out of the main scope of the project.

8.6.1 Twitter's Reaction Time

One of the aspects I would have liked to investigate further is Twitter's reaction time. For the final episode of the Idol experiment described in Chapter 4 I had set up a script that allowed me to update the message count for each contestant continuously. A few minutes into the show the previously eliminated contestants made a joint performance, and their tweet counts immediately started rising. After this performance the rise halted and most of the subsequent tweets were about the two finalists. This phenomenon of Twitter responding quickly to real-time events is also illustrated in Figure 8.1. Given this, one wonders, for example, what the results would be if the stock market correlations were tested using finer time-slices.
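As a starting point for such an experiment, message counts could be aggregated into finer bins than the daily ones used in the main experiments, for example five-minute slices. A minimal sketch, assuming tweets arrive as Unix timestamps (this is an illustration, not part of the project's pipeline):

```python
from collections import Counter

def bin_counts(timestamps, slice_seconds=300):
    """Aggregate Unix timestamps into message counts per time slice.

    Returns a dict mapping slice start time -> number of messages, so
    that volume indicators can be rebuilt at finer granularity than
    one observation per trading day.
    """
    counts = Counter((int(ts) // slice_seconds) * slice_seconds
                     for ts in timestamps)
    return dict(counts)
```

The resulting series could then be correlated against intraday market data in the same way as the daily indicators.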

Figure 8.1: Episode of the web-comic xkcd, illustrating Twitter's reaction time. Used with permission; original available at

8.6.2 Fuzzy Correlations

It is somewhat difficult to imagine that there would be a law, or set of laws, of nature dictating the relationship between Twitter message flow and market movements (or other phenomena of interest, for that matter). The set of circumstances which causes Twitter to predict one market movement will be different at other times, instead causing the market movement to drive the Twitter message flow. Correlations may be present for some periods of time, and be drowned out by noise at others. And so on.

This problem is touched on in several of the preceding chapters. Chapter 7 addresses one aspect of the problem in the definition of the prediction indicator. Some of the derivative functions proposed in Chapter 6 act to de-trend time series, which is useful when two series correlate in small-scale movements but diverge in long-term trends. Chapter 5 describes a public-interest component, which may cause large swings in Twitter data without a corresponding movement in market data. One problem with the measures of correlation used in this project, however, is that they implicitly assume just such a stable relationship. On day i, the Twitter data is only ever compared to the market data for day i (or, in the case of lagged correlations, to market data for day i + lag factor). One wonders, then, whether the correlation measures could be adapted to account in a meaningful way for this variability. Recall that the correlation coefficient r was defined as:

    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}    (8.1)

This measure can be adapted to what one might call fuzzy correlations by, in the numerator, using the value of the Twitter indicator x either from today, yesterday or tomorrow, according to which contributes most to the value of r:

    r = \frac{\sum_{i=1}^{n} \max\big( (x_{i-1} - \bar{x})(y_i - \bar{y}),\; (x_i - \bar{x})(y_i - \bar{y}),\; (x_{i+1} - \bar{x})(y_i - \bar{y}) \big)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}    (8.2)

(for positive correlations; the case for negative correlations may be defined by instead taking the minimum of the available values). This measure takes into account that Twitter may alternate between being a lagging and a leading indicator. Whether this particular measure can contribute much to the understanding of the relationships between social media and real-world indicators remains to be seen, but further study of the variability of these relationships is no doubt warranted.

8.6.3 Buying on Rumor and Selling on News

A conjecture that is sometimes stated as "buy the rumor, sell the news" says that prices tend to go up in anticipation of an event (e.g., Nintendo is expected to announce their new video game console on such-and-such a date; the "rumor"), and fall when the event actually occurs (e.g., Nintendo unveils the new console; the "news"). The company Recorded Future recently asserted that they had, for the first time, proven this rule of thumb [28]. This suggests a particularly interesting use-case for social media analysis, as social media chatter could potentially be used to gauge various aspects of the rumor. For instance, is general sentiment concerning the rumor positive or negative? Is the chatter inflated when compared to recent, similar events, perhaps suggesting an expectation bubble? Again, the stock market provides an excellent test bed for this use-case, since there is ample market data against which one may test hypotheses.
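The fuzzy correlation of Equation (8.2) is straightforward to implement. The following sketch handles the positive-correlation case, with the convention (an assumption, since the definition leaves the boundaries open) that only the available shifted terms are considered at the endpoints of the series:

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def fuzzy_corr(x, y):
    """Fuzzy correlation of Equation (8.2): for each day i the numerator
    uses whichever of x[i-1], x[i], x[i+1] contributes most (positive-
    correlation case; at the series boundaries only the available shifts
    are considered)."""
    n = len(x)
    xbar, ybar = _mean(x), _mean(y)
    num = 0.0
    for i in range(n):
        candidates = [(x[i] - xbar) * (y[i] - ybar)]
        if i > 0:
            candidates.append((x[i - 1] - xbar) * (y[i] - ybar))
        if i < n - 1:
            candidates.append((x[i + 1] - xbar) * (y[i] - ybar))
        num += max(candidates)
    den = math.sqrt(sum((xi - xbar) ** 2 for xi in x)
                    * sum((yi - ybar) ** 2 for yi in y))
    return num / den
```

Note that, unlike Pearson's r, this quantity is not bounded above by 1, since each term of the numerator is maximized individually.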


Chapter 9

Acknowledgments

9.1 Thanks!

I would like to thank my advisor at Umeå University, Martin Berglund, in particular for the many comments that helped raise the quality of this report. I would also like to thank the people at Nomura Sweden, where much of the work for this project was done, for their continued interest in the project. In particular I would like to thank my Nomura advisor Daniel Brändström for, among other things, his help in dealing with the interesting but at times bewildering Q programming language, and for his comments on this report. And, last but not least, I would like to thank Mari Hansson for her support throughout the project, and for putting up with me during times when the project consumed more time than can possibly be good for you.

9.2 Open Source Acknowledgments

The following pieces of open source software have, in addition to available basic libraries, been utilized in the completion of this thesis project:

- The Ruby Twitter Gem by John Nunemaker, used to simplify communication with the Twitter API.
- The GRIDXY Matlab function by Jos van der Geest, used for adding horizontal and vertical grid lines to plots.

The project has also benefited from a great number of guides, code snippets and general tips posted online. For this, a general nod of gratitude is directed to everyone who makes a habit of contributing to the public good by asking and answering questions online.


References

[1] Mikael Andersson. Laboration 3: Icke-parametrisk korrelations- och regressionsanalys. Last accessed on March 23, 2011.

[2] Francis John Anscombe. Graphs in statistical analysis. The American Statistician, 27(1):17–21, February 1973.

[3] Sitaram Asur and Bernardo A. Huberman. Predicting the future with social media. arXiv:1003.5699v1 [cs.CY], March 2010.

[4] Nicholas Barberis, Andrei Shleifer, and Robert Vishny. A model of investor sentiment. Journal of Financial Economics, 49:307–343, 1998.

[5] Nicholas Barberis and Richard Thaler. A survey of behavioral finance, 2003. In Handbook of the Economics of Finance.

[6] Meredith Beechey, David Gruen, and James Vickery. The efficient market hypothesis: A survey. Economic Research Department, Reserve Bank of Australia, January 2000.

[7] Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. The Journal of Political Economy, 81(3):637–654, 1973.

[8] Johan Bollen, Huina Mao, and Xiao-Jun Zeng. Twitter mood predicts the stock market. arXiv:1010.3003v1 [cs.CE], October 2010.

[9] Johan Bollen, Alberto Pepe, and Huina Mao. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. arXiv:0911.1583v1 [cs.CY], November 2009.

[10] John A. Bollinger. Bollinger on Bollinger Bands. McGraw-Hill Professional Publishing, July 2001.

[11] Martin Bryant. Investment fund set to use Twitter to judge emotion in the market. December 2010. Last accessed on January 21, 2011.

[12] Hampus Brynolf. Twittercensus. /2/Twittercensus.pdf. By Intellecta Corporate. Last accessed on March 18, 2011.

[13] Amit Choudhury. Statistical correlation. Experiment Resources: experiment-resources.com/statistical-correlation.html, 2009. Last accessed on March 23, 2011.

[14] Aron Culotta. Detecting influenza outbreaks by analyzing Twitter messages. arXiv:1007.4748v1 [cs.IR], July 2010.

[15] Xiaowen Ding, Bing Liu, and Philip S. Yu. A holistic lexicon-based approach to opinion mining. Proceedings of the International Conference on Web Search and Web Data Mining, February 2008.

[16] Ahmet Duran and Gunduz Caginalp. Overreaction diamonds: Precursors and aftershocks for significant price changes. Quantitative Finance, 7(3), June 2007.

[17] B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1):54–77, 1986.

[18] Eugene F. Fama. Efficient capital markets: A review of theory and empirical work. Journal of Finance, 25(2):383–417, 1970.

[19] Susannah Fox, Kathryn Zickuhr, and Aaron Smith. Twitter and status updating, fall 2009. Pew Internet & American Life Project, October 2009.

[20] Kenneth R. French and James M. Poterba. Investor diversification and international equity markets. The American Economic Review, 81(2), May 1991.

[21] Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena. Large-scale sentiment analysis for news and blogs. Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2007.

[22] Sandra Gonzalez-Bailon, Rafael E. Banchs, and Andreas Kaltenbrunner. Emotional reactions and the pulse of public opinion: Measuring the impact of political events on the sentiment of online discussions. arXiv:1009.4019v1 [cs.CY], September 2010.

[23] Alan Greenspan. The challenge of central banking in a democratic society. December 1996. Remarks by Chairman Alan Greenspan at the Annual Dinner and Francis Boyer Lecture of The American Enterprise Institute for Public Policy Research, Washington, D.C. Last accessed on January 18, 2011.

[24] Michiel Hazewinkel, editor. Encyclopaedia of Mathematics. Springer, 2002.

[25] Michael C. Jensen. Some anomalous evidence regarding market efficiency. Journal of Financial Economics, 6(2/3):95–101, 1978.

[26] Jack Jordan. Hedge fund will track Twitter to predict stock moves. December 2010. Last accessed on January 21, 2011.

[27] Amanda Lenhart and Susannah Fox. Twitter and status updating. Pew Internet & American Life Project, February 2009.

[28] Mats Lewan. Ny svensk söktjänst har järnkoll på tiden [New Swedish search service keeps close track of time]. tidningen/article ece, April 2011. Last accessed on April 19, 2011.

[29] Burton Malkiel. A Random Walk Down Wall Street. W. W. Norton & Company, Inc.

[30] Burton G. Malkiel. The efficient market hypothesis and its critics. Journal of Economic Perspectives, 17, 2003.

[31] Benoit B. Mandelbrot and Richard L. Hudson. The (Mis)behaviour of Markets: A Fractal View of Risk, Ruin and Reward. Profile Books, 2008.

[32] Prem Melville, Wojciech Gryc, and Richard D. Lawrence. Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, New York, NY, USA, 2009. ACM.

[33] Panagiotis Takis Metaxas and Eni Mustafaraj. From obscurity to prominence in minutes: Political speech and real-time search. Web Science Conference, April 2010.

[34] David Murphy. Twitter: On track for 200 million users by year's end. pcmag.com/article2/,2817, ,.asp. Last accessed on November 25, 2010.

[35] Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. Proceedings of the International AAAI Conference on Weblogs and Social Media, Washington, D.C., May 2010.

[36] Cheol-Ho Park and Scott H. Irwin. What do we know about the profitability of technical analysis? Journal of Economic Surveys, 21(4), 2007.

[37] Twitter. dev.twitter.com: API documentation. Last accessed on November 25, 2010.

[38] Tom Webster. Twitter usage in America: 2010. The Edison Research/Arbitron Internet and Multimedia Study, 2010.

[39] Michael Wiegand, Alexandra Balahur, Benjamin Roth, Dietrich Klakow, and Andrés Montoyo. A survey on the role of negation in sentiment analysis. Proceedings of the NeSp-NLP Workshop, 2010.

[40] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.


Appendix A

Further Examples of Twitter/Market Correlations

The following appendix provides examples of correlations between Twitter and market indicators in addition to those described in Section 6.3. The examples are shown in Figures A.1–A.10, with summary statistics for both lagged and unlagged data shown in Tables A.1 and A.2.

Figure A.1: Relationship between Twitter indicator Median differential of hype, no URLs and market indicator Median differential of volume for Apple.

Figure A.2: Relationship between Twitter indicator Bollinger bandwidth of positive value, no URLs and market indicator Bollinger bandwidth of volume for Apple.

Figure A.3: Relationship between Twitter indicator Positive value and market indicator Volume for Amgen.

Figure A.4: Relationship during December 2010 between Twitter indicator Total count and market indicator High for Amazon.

Figure A.5: Relationship between Twitter indicator Rate-of-change of total count, no URLs and market indicator Rate-of-change of volume for Amazon.

Figure A.6: Relationship between Twitter indicator Absolute median differential of total count and market indicator Absolute median differential of volume for Cisco.

Figure A.7: Relationship between Twitter indicator Median differential of total count, no URLs or RTs and market indicator Median differential of volume for Google.

Figure A.8: Relationship between Twitter indicator Absolute median differential of total count, no URLs or RTs and market indicator Absolute median differential of volume for Google.

Figure A.9: Relationship during January 2011 between Twitter indicator Median differential of total count, no URLs and market indicator Median differential of volume for Microsoft.

Figure A.10: Relationship between Twitter indicator Total count, no URLs and market indicator Volume for Oracle.

Table A.1: Summary statistics for Figures A.1 to A.10 using unlagged indicators (columns: Fig., r, p_r, p_b, CI_{r,0.1}, τ, p_τ, CI_{τ,0.1}).

Table A.2: Summary statistics for Figures A.1 to A.10, with the market indicator lagged by one day compared to the Twitter indicator (same columns).


More information

CHANCE ENCOUNTERS. Making Sense of Hypothesis Tests. Howard Fincher. Learning Development Tutor. Upgrade Study Advice Service

CHANCE ENCOUNTERS. Making Sense of Hypothesis Tests. Howard Fincher. Learning Development Tutor. Upgrade Study Advice Service CHANCE ENCOUNTERS Making Sense of Hypothesis Tests Howard Fincher Learning Development Tutor Upgrade Study Advice Service Oxford Brookes University Howard Fincher 2008 PREFACE This guide has a restricted

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

How Does Social Sentiment Affect the Stock Market?

How Does Social Sentiment Affect the Stock Market? The predictive characteristic of the social sentiment on the stock market: Twitter and the stock trend Moreno Lo Giudice University of Twente P.O. Box 217, 7500AE Enschede The Netherlands ABSTRACT The

More information

Data Driven Success. Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA)

Data Driven Success. Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA) Data Driven Success Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA) In business, data is everything. Regardless of the products or services you sell or the systems you support,

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

How to Win the Stock Market Game

How to Win the Stock Market Game How to Win the Stock Market Game 1 Developing Short-Term Stock Trading Strategies by Vladimir Daragan PART 1 Table of Contents 1. Introduction 2. Comparison of trading strategies 3. Return per trade 4.

More information

Short-Term Forecasting in Retail Energy Markets

Short-Term Forecasting in Retail Energy Markets Itron White Paper Energy Forecasting Short-Term Forecasting in Retail Energy Markets Frank A. Monforte, Ph.D Director, Itron Forecasting 2006, Itron Inc. All rights reserved. 1 Introduction 4 Forecasting

More information

Development of Trading Strategies

Development of Trading Strategies Development of Trading Strategies I have been a Chaos Hunter user since it first came out. My interest in CH was to try to learn new relationships about various data series. I was hoping to find a predictive

More information

11 Option. Payoffs and Option Strategies. Answers to Questions and Problems

11 Option. Payoffs and Option Strategies. Answers to Questions and Problems 11 Option Payoffs and Option Strategies Answers to Questions and Problems 1. Consider a call option with an exercise price of $80 and a cost of $5. Graph the profits and losses at expiration for various

More information

The Influence of Sentimental Analysis on Corporate Event Study

The Influence of Sentimental Analysis on Corporate Event Study Volume-4, Issue-4, August-2014, ISSN No.: 2250-0758 International Journal of Engineering and Management Research Available at: www.ijemr.net Page Number: 10-16 The Influence of Sentimental Analysis on

More information

JamiQ Social Media Monitoring Software

JamiQ Social Media Monitoring Software JamiQ Social Media Monitoring Software JamiQ's multilingual social media monitoring software helps businesses listen, measure, and gain insights from conversations taking place online. JamiQ makes cutting-edge

More information

ARE STOCK PRICES PREDICTABLE? by Peter Tryfos York University

ARE STOCK PRICES PREDICTABLE? by Peter Tryfos York University ARE STOCK PRICES PREDICTABLE? by Peter Tryfos York University For some years now, the question of whether the history of a stock's price is relevant, useful or pro table in forecasting the future price

More information

Cleaned Data. Recommendations

Cleaned Data. Recommendations Call Center Data Analysis Megaputer Case Study in Text Mining Merete Hvalshagen www.megaputer.com Megaputer Intelligence, Inc. 120 West Seventh Street, Suite 10 Bloomington, IN 47404, USA +1 812-0-0110

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

More information

The Independence Referendum: Predicting the Outcome 1. David N.F. Bell

The Independence Referendum: Predicting the Outcome 1. David N.F. Bell The Independence Referendum: Predicting the Outcome 1 David N.F. Bell Division of Economics Stirling Management School, University of Stirling, IZA, Bonn and ESRC Centre for Population Change 1 Thanks

More information

The impact of social media is pervasive. It has

The impact of social media is pervasive. It has Infosys Labs Briefings VOL 12 NO 1 2014 Social Enablement of Online Trading Platforms By Sivaram V. Thangam, Swaminathan Natarajan and Venugopal Subbarao Socially connected retail stock traders make better

More information

COLLECTIVE INTELLIGENCE: A NEW APPROACH TO STOCK PRICE FORECASTING

COLLECTIVE INTELLIGENCE: A NEW APPROACH TO STOCK PRICE FORECASTING COLLECTIVE INTELLIGENCE: A NEW APPROACH TO STOCK PRICE FORECASTING CRAIG A. KAPLAN* iq Company (www.iqco.com) Abstract A group that makes better decisions than its individual members is considered to exhibit

More information

Equity forecast: Predicting long term stock price movement using machine learning

Equity forecast: Predicting long term stock price movement using machine learning Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK [email protected] Abstract Long

More information

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

White Paper. How Streaming Data Analytics Enables Real-Time Decisions White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream

More information

Tweets Miner for Stock Market Analysis

Tweets Miner for Stock Market Analysis Tweets Miner for Stock Market Analysis Bohdan Pavlyshenko Electronics department, Ivan Franko Lviv National University,Ukraine, Drahomanov Str. 50, Lviv, 79005, Ukraine, e-mail: [email protected]

More information

Forecasting Accuracy and Line Changes in the NFL and College Football Betting Markets

Forecasting Accuracy and Line Changes in the NFL and College Football Betting Markets Forecasting Accuracy and Line Changes in the NFL and College Football Betting Markets Steven Xu Faculty Advisor: Professor Benjamin Anderson Colgate University Economics Department April 2013 [Abstract]

More information

A comparison between different volatility models. Daniel Amsköld

A comparison between different volatility models. Daniel Amsköld A comparison between different volatility models Daniel Amsköld 211 6 14 I II Abstract The main purpose of this master thesis is to evaluate and compare different volatility models. The evaluation is based

More information

Online Ensembles for Financial Trading

Online Ensembles for Financial Trading Online Ensembles for Financial Trading Jorge Barbosa 1 and Luis Torgo 2 1 MADSAD/FEP, University of Porto, R. Dr. Roberto Frias, 4200-464 Porto, Portugal [email protected] 2 LIACC-FEP, University of

More information

ALGORITHMIC TRADING USING MACHINE LEARNING TECH-

ALGORITHMIC TRADING USING MACHINE LEARNING TECH- ALGORITHMIC TRADING USING MACHINE LEARNING TECH- NIQUES: FINAL REPORT Chenxu Shao, Zheming Zheng Department of Management Science and Engineering December 12, 2013 ABSTRACT In this report, we present an

More information

Deposit Identification Utility and Visualization Tool

Deposit Identification Utility and Visualization Tool Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in

More information

What is the Cost of an Unseen Ad?

What is the Cost of an Unseen Ad? What is the Cost of an Unseen Ad? Steven Millman, comscore ZhiWei Tan, comscore Abstract In order to determine the relationship between viewability and campaign lift, viewable and non-viewable ads - as

More information

User s Guide Microsoft Social Engagement 2015 Update 1

User s Guide Microsoft Social Engagement 2015 Update 1 User s Guide Microsoft Social Engagement 2015 Update 1 Version 2.0 1 This document is provided "as-is". Information and views expressed in this document, including URL and other Internet Web site references,

More information

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach Unlocking The Value of the Deep Web Harvesting Big Data that Google Doesn t Reach Introduction Every day, untold millions search the web with Google, Bing and other search engines. The volumes truly are

More information

Implied Volatility Skews in the Foreign Exchange Market. Empirical Evidence from JPY and GBP: 1997-2002

Implied Volatility Skews in the Foreign Exchange Market. Empirical Evidence from JPY and GBP: 1997-2002 Implied Volatility Skews in the Foreign Exchange Market Empirical Evidence from JPY and GBP: 1997-2002 The Leonard N. Stern School of Business Glucksman Institute for Research in Securities Markets Faculty

More information

Using News Articles to Predict Stock Price Movements

Using News Articles to Predict Stock Price Movements Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 [email protected] 21, June 15,

More information

The process of gathering and analyzing Twitter data to predict stock returns EC115. Economics

The process of gathering and analyzing Twitter data to predict stock returns EC115. Economics The process of gathering and analyzing Twitter data to predict stock returns EC115 Economics Purpose Many Americans save for retirement through plans such as 401k s and IRA s and these retirement plans

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

INCORPORATION OF LIQUIDITY RISKS INTO EQUITY PORTFOLIO RISK ESTIMATES. Dan dibartolomeo September 2010

INCORPORATION OF LIQUIDITY RISKS INTO EQUITY PORTFOLIO RISK ESTIMATES. Dan dibartolomeo September 2010 INCORPORATION OF LIQUIDITY RISKS INTO EQUITY PORTFOLIO RISK ESTIMATES Dan dibartolomeo September 2010 GOALS FOR THIS TALK Assert that liquidity of a stock is properly measured as the expected price change,

More information

Stock Price Prediction Using Sentiment Detection of Twitter

Stock Price Prediction Using Sentiment Detection of Twitter Stock Price Prediction Using Sentiment Detection of Twitter C. Lee Fanzilli March 18, 2015 Abstract If Amazon can predict what books we want to read, Netflix can predict what movies we want to watch, and

More information

Text Mining - Scope and Applications

Text Mining - Scope and Applications Journal of Computer Science and Applications. ISSN 2231-1270 Volume 5, Number 2 (2013), pp. 51-55 International Research Publication House http://www.irphouse.com Text Mining - Scope and Applications Miss

More information

Tai Kam Fong, Jackie. Master of Science in E-Commerce Technology

Tai Kam Fong, Jackie. Master of Science in E-Commerce Technology Trend Following Algorithms in Automated Stock Market Trading by Tai Kam Fong, Jackie Master of Science in E-Commerce Technology 2011 Faculty of Science and Technology University of Macau Trend Following

More information

SOPS: Stock Prediction using Web Sentiment

SOPS: Stock Prediction using Web Sentiment SOPS: Stock Prediction using Web Sentiment Vivek Sehgal and Charles Song Department of Computer Science University of Maryland College Park, Maryland, USA {viveks, csfalcon}@cs.umd.edu Abstract Recently,

More information

EVIDENCE IN FAVOR OF MARKET EFFICIENCY

EVIDENCE IN FAVOR OF MARKET EFFICIENCY Appendix to Chapter 7 Evidence on the Efficient Market Hypothesis Early evidence on the efficient market hypothesis was quite favorable to it. In recent years, however, deeper analysis of the evidence

More information

Earnings Announcement and Abnormal Return of S&P 500 Companies. Luke Qiu Washington University in St. Louis Economics Department Honors Thesis

Earnings Announcement and Abnormal Return of S&P 500 Companies. Luke Qiu Washington University in St. Louis Economics Department Honors Thesis Earnings Announcement and Abnormal Return of S&P 500 Companies Luke Qiu Washington University in St. Louis Economics Department Honors Thesis March 18, 2014 Abstract In this paper, I investigate the extent

More information

Measure Social Media like a Pro: Social Media Analytics Uncovered SOCIAL MEDIA LIKE SHARE. Powered by

Measure Social Media like a Pro: Social Media Analytics Uncovered SOCIAL MEDIA LIKE SHARE. Powered by 1 Measure Social Media like a Pro: Social Media Analytics Uncovered # SOCIAL MEDIA LIKE # SHARE Powered by 2 Social media analytics were a big deal in 2013, but this year they are set to be even more crucial.

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority

More information

Integrated Company Analysis

Integrated Company Analysis Using Integrated Company Analysis Version 2.0 Zacks Investment Research, Inc. 2000 Manual Last Updated: 8/11/00 Contents Overview 3 Introduction...3 Guided Tour 4 Getting Started in ICA...4 Parts of ICA

More information