Correlating Financial Time Series with Micro-Blogging Activity

Eduardo J. Ruiz, Vagelis Hristidis
Department of Computer Science & Engineering
University of California at Riverside
Riverside, California, USA
{eruiz009,vagelis}@cs.ucr.edu

Carlos Castillo, Aristides Gionis, Alejandro Jaimes
Yahoo! Research Barcelona
Barcelona, Spain
{chato,gionis,ajaimes}@yahoo-inc.com

ABSTRACT
We study the problem of correlating micro-blogging activity with stock-market events, defined as changes in the price and traded volume of stocks. Specifically, we collect messages related to a number of companies, and we search for correlations between stock-market events for those companies and features extracted from the microblogging messages. The features we extract can be categorized in two groups. Features in the first group measure the overall activity in the micro-blogging platform, such as number of posts, number of re-posts, and so on. Features in the second group measure properties of an induced interaction graph, for instance, the number of connected components, statistics on the degree distribution, and other graph-based properties. We present detailed experimental results measuring the correlation of the stock market events with these features, using Twitter as a data source. Our results show that the most correlated features are the number of connected components and the number of nodes of the interaction graph. The correlation is stronger with the traded volume than with the price of the stock. However, by using a simulator we show that even relatively small correlations between price and micro-blogging features can be exploited to drive a stock trading strategy that outperforms other baseline strategies.
Categories and Subject Descriptors
H.3.4 [Information Systems Applications-Systems and Software]: Information networks; J.4 [Social and Behavioral Sciences]: Economics

General Terms
Algorithms, Experimentation

Keywords
Social Networks, Financial Time Series, Micro-Blogging

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WSDM'12, February 8-12, 2012, Seattle, Washington, USA. Copyright 2012 ACM 978-1-4503-0747-5/12/02...$10.00.

1. INTRODUCTION
As the volume of data from online social networks increases, scientists are trying to find ways to understand and extract knowledge from this data. In this paper we study how the activity in a popular micro-blogging platform (Twitter) is correlated to time series from the financial domain, specifically stock prices and traded volume. We compute a large number of features extracted from postings ("tweets") related to certain publicly-traded companies. Our goal is to find out which of these features are more correlated with changes in the stock of the companies.

We start by carefully creating filters to select the relevant tweets for a company. We study various filtering approaches such as using the stock symbol, the company name or variations of the two. We also evaluate the effects of expanding this set of tweets by including closely related tweets.

Next, in order to enrich the feature-extraction mechanism, we represent the tweets during a time interval as an interaction graph, an example of which is shown in Figure 1. The nodes in this graph are tweets, users, URLs and hash-tags. The edges express relationships among the nodes, such as authorship, re-tweeting and referencing.
On these graphs, which we call constrained subgraphs, we define a large number of features, divided in two groups: activity-based and graph-based features. Activity-based features measure quantities such as the number of hashtags, the number of tweets, and so on. Graph-based features capture the link structure of the graph. We then study how these features are correlated with the price and traded-volume time series of stocks.

Our first key result is that the traded volume for a stock is correlated with the number of connected components in its constrained subgraph, as well as with the number of tweets in the graph. Intuitively, we expect the traded volume to be correlated with the number of tweets. Surprisingly, it is slightly more correlated with the number of connected components. On the other hand, the stock price is not strongly correlated with any of the features we extracted: it is only slightly correlated with the number of connected components, and even less with the number of nodes in the constrained subgraph. We found that other graph-based features, such as PageRank and degree, are effective for larger constrained graphs built around groups of stocks (e.g., financial indexes).

Clearly, finding a correlation with price change has wider implications than finding a correlation with traded volume. Therefore, we test how the slight correlation of the price with the micro-blogging features can be applied to guide a stock trader. Specifically, we create a stock trading simulation, and compare various trading strategies. The second key result of this paper is that by using the Twitter constrained subgraph features of the previous days, we can develop a trading strategy that is successful when compared against several baselines.
Figure 1: Example of a constrained subgraph for one day and one stock (YHOO). Tweets are shown in red, users in green, and URLs in blue. The light gray nodes are the similarity nodes.

Our main contributions can be summarized as follows:
- We compare alternative filtering methods to create graphs of postings about a company during a time interval (Section 2). We also present a suite of features that can be computed from these graphs (Section 3).
- We study the correlation of the proposed features with the time series of stock price and traded volume. We also show how these correlations can be stronger or weaker depending on financial indicators of companies, for instance, on their current level of debt (Section 4).
- We perform a study on the application of the correlation patterns found to guide a stock trading strategy. We show that it can lead to a strategy that is competitive when compared to other automatic trading strategies (Section 5).

Roadmap. In Section 2 we discuss the data used in our analysis and the preprocessing steps we performed in order to compute the features. A detailed description of the features we use is given in Section 3. In Section 4 we present correlation results between the proposed features for a company, and the financial time series for its stock, in terms of volume traded or change in price. In Section 5 we discuss how the correlations with price change can be used to develop a trading strategy via simulation. Finally, Section 6 outlines related work, while Section 7 presents our conclusions.

2. DATA PROCESSING
We start our presentation by describing the data used for our analysis, and the processing done in order to compute the features.

2.1 Data acquisition and pre-processing
Stock market data: We obtained stock data from Yahoo! Finance (http://finance.yahoo.com/) for 150 (randomly selected) companies in the S&P 500 index for the first half of 2010. For each stock we recorded the daily closing price and daily traded volume.
Then, we transformed the price series into its daily relative change, i.e., if the series for price is $p_i$, we used $(p_i - p_{i-1})/p_{i-1}$. In the case of traded volume, we normalized by dividing the volume of each day by the mean traded volume observed for that company during the entire half of the year.

Twitter data: We set filters to obtain all the relevant tweets from the first half of 2010. By convention, Twitter discussions about a stock usually include the stock symbol prefixed by a dollar sign (e.g., $MSFT for Microsoft Corp.). We use a series of regular expressions that find the name of the company, including the company ticker name and hash-tags associated with the company. The expressions were checked manually, looking at the output tweets, to remove those that extracted many spurious tweets. For example, the filter expression for Yahoo is: #YHOO $YHOO #Yahoo. To refine this expression we randomly selected 30 tweets from each company, and re-wrote the extraction rules for those sets in which fewer than 50% of the tweets were related to the company. To be acceptable, tweets should be related to the company, e.g., mentioning its products or its financial situation. When we determined that a rule-based approach was not feasible, we removed the company from our dataset.

For instance, consider the companies with tickers YHOO, AAPL and APOL, for which the extraction rules had to be rewritten. The short name for Yahoo is used in many tweets that are related to the news service provided by the same company (Yahoo! News). In the second case, Apple is a common noun and is also used widely for spamming purposes ("Win a free iPad" scams). The last company, Apollo, is also the name of a deity in Greek mythology and it appears in many contexts that are unrelated to the stock.

2.2 Graph representation
We represent each collection of filtered tweets as a graph containing different entities and the relationships among these entities. Figure 2 shows the graph schema, which is also described in Table 1.
The nodes in this graph are: the tweets themselves, the users who tweet or who are mentioned in the tweets, and the hashtags and URLs included in the tweets. The relations in this graph are: re-tweets (between two tweets), authorship (between a tweet and its author), hash-tag presence (between a hash-tag and the tweets that contain it), URL presence (between a URL and the tweets that contain it), etc.

Figure 2: Graph Schema.
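To make the schema more concrete, the following is a minimal sketch of how such a typed interaction graph could be held in memory, together with a count of its connected components (the graph feature highlighted in the introduction). The class and field names (`Node`, `InteractionGraph`, `add_edge`, `num_components`) and the toy data are illustrative assumptions and are not taken from the paper, which computes its features with Map-Reduce.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str   # "tweet", "user", "url" or "hashtag"
    ident: str  # tweet id, user id, URL or hash-tag text


class InteractionGraph:
    """Undirected view of the tweet/user/URL/hash-tag graph (illustrative only)."""

    def __init__(self):
        self.adj = defaultdict(set)  # node -> set of neighbouring nodes
        self.edges = []              # (relation, node_a, node_b) triples

    def add_edge(self, relation, a, b):
        # relation is a label such as "created", "cited" or "re-tweeted"
        self.edges.append((relation, a, b))
        self.adj[a].add(b)
        self.adj[b].add(a)

    def num_components(self):
        """Count connected components with an iterative depth-first search."""
        seen, count = set(), 0
        for start in self.adj:
            if start in seen:
                continue
            count += 1
            seen.add(start)
            stack = [start]
            while stack:
                node = stack.pop()
                for neighbour in self.adj[node]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        return count


# Toy data (hypothetical): two tweets citing the same URL, plus one unrelated tweet.
g = InteractionGraph()
t1, t2, t3 = Node("tweet", "t1"), Node("tweet", "t2"), Node("tweet", "t3")
shared_url = Node("url", "http://example.com/a")
g.add_edge("created", t1, Node("user", "alice"))
g.add_edge("created", t2, Node("user", "bob"))
g.add_edge("cited", t1, shared_url)
g.add_edge("cited", t2, shared_url)
g.add_edge("created", t3, Node("user", "carol"))

print(len(g.adj), len(g.edges), g.num_components())
```

In this toy graph the two tweets that share a URL fall into one component and the unrelated tweet forms a second one, which illustrates why the component count can differ from the raw tweet count.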
Table 1: Schemas.

| Nodes | Schema | Description |
|---|---|---|
| Tweet | (TweetId, Text, Company, Time) | A microblog posting |
| User | (UserId, Name, #Followers, #Friends, Location, Time) | A user that posts a tweet or is mentioned |
| Url | (Url, ExpandedUrl, Time) | A URL included in a tweet |
| Hashtag | (Hashtag, Time) | An annotation used in one tweet |

| Edges | Schema | Description |
|---|---|---|
| Annotated | (TweetId, Hashtag, Timestamp) | Relates a tweet with one hash-tag |
| Re-tweeted | (RTId, TweetId, Time) | Represents the re-tweet action |
| Mentioned | (TweetId, UserId, Time) | An explicit mention of another user |
| Cited | (TweetId, Url, Time) | Connects a URL with tweets including it |
| Created | (TweetId, UserId, Time) | Connects a tweet with its author |

    HASHTAG     2010-01-28  AAPL  #mkt
    TWEET       2010-03-12  AAPL  1XX7XXXXX08
    TWEET       2010-03-12  AAPL  1XXX1XXXX11
    USER        2010-05-16  AAPL  1XX6XXX83
    USER        2010-05-16  AAPL  1XX1XXX2
    URL         2010-06-28  AAPL  http://bit.ly/bxxus
    URL         2010-06-28  AAPL  http://bit.ly/bxxl3
    USRMENTION  2010-06-15  AAPL  CNNMoney

Figure 3: Example nodes (node type, timestamp, stock symbol, node identity) on the constrained graph of a company.

Additionally, nodes and edges have timestamps at a granularity of one day, corresponding to the granularity of our stock-market time series. Tweets are timestamped with the day they were posted. The rest of the nodes are timestamped with the day they were used for the first time in any tweet (e.g., a user is timestamped with the day of their first tweet). As every edge is incident on a tweet, we use the timestamp of the tweet for the edge. For re-tweet edges we use the timestamp of the earliest tweet.

Figure 3 shows sample entries of events extracted for the company Apple. Each entry corresponds to a node in the constrained graph. For instance, the first line means the hash-tag #mkt was used on Jan 28 by some tweet related to Apple. The last line states that the Twitter account CNNMoney was mentioned in some tweet related to Apple on June 15.

We are now ready to define the concept of data graph.
Definition [Data Graph] The data graph $G = (V,E)$ is a graph whose nodes and edges conform to the schemas in Table 1.

Some statistics on our data graph are shown in Table 2. We are interested in subgraphs constrained to a particular time interval and/or a particular company. A constrained subgraph, such as the one depicted in Figure 1, is a subgraph $G_{t_1,t_2}$ of $G$ that only contains nodes with timestamps in the time interval $[t_1,t_2]$, and is about a given company $c$. Our definition of constrained subgraph is the following.

Definition [Constrained Subgraph] Let $G$ be a data graph. The constrained subgraph $G_{t_1,t_2} = (V,E)$ contains the nodes $V$ of $G$ that are either tweets with timestamps in the interval $[t_1,t_2]$, or non-tweet nodes connected through an edge to the selected tweet nodes. All the edges $E$ in $G$ whose end-nodes are in $V$ are added to $G_{t_1,t_2}$.

Table 2: Data graph statistics for the normal and the expanded graph (which is described in Section 4.4).

| | Normal | Expanded |
|---|---|---|
| Tweets | 176 K | 26.8 M |
| Nodes (Tweets+Users+URLs+Hashtags) | 640 K | 98.9 M |
| Edges | 493 K | 76.7 M |
| Compressed Size | 48 MB | 1.4 GB |

2.3 Graph post-processing
Most of the information that we include for each node and edge is straightforward to obtain from the Twitter stream. However, there are some data processing aspects that require special handling:

Mapping user names to IDs: The Twitter stream relates the tweets with internal user identifiers, while user mentions are expressed as user names. To match them, we use the Twitter API to resolve the user-id and user-name reference.

URL shortening: A tweet is constrained to 140 characters, so most URLs are shortened using a URL shortening service such as http://bit.ly/. The problem here is that a single URL can be referred to by several different short URLs. We solve this by calling the interface of the URL shortening services to get the original URLs.

Re-tweets: In most cases the original tweet of a re-tweet is referenced (by tweet-id). However, we found many cases where the reference to the original tweet is not present.
To resolve those cases, instead of using just explicitly referenced re-tweets we augment the graph by adding a new similarity node (see Figure 1) that links all similar tweets. We define two tweets to be similar if the Jaccard similarity between the bags of words of both tweets is greater than some value α. We set α = 0.8 in our experiments, which is a conservative setting, meaning that tweets having this level of similarity are almost always re-tweets or minor variations of each other.

3. FEATURES
We extract two groups of features from the constrained subgraphs: activity features and graph features. Both are listed in Table 3. Activity features simply count the number of nodes of a particular type, such as number of tweets, number of users, number of hashtags, etc. Graph features measure properties of the link structure of the graph. For scalability, feature computation is done using Map-Reduce (http://hadoop.apache.org/).

Feature normalization and seasonality: Most of the feature values are normalized to lie in the interval [0,1]. For example, if we consider all the constrained subgraphs within a k-day interval, we can normalize the number of tweets on such a subgraph by dividing by the maximum value across all such subgraphs. The same normalization strategy can be used for users and re-tweets. Other features, like the number of URLs, hash-tags, etc., are normalized using the number of tweets for the full day.

It is important to consider the effect of seasonality in this graph. The number of tweets is increasing (Twitter's user base grew during our observation period) and has a weekly seasonal effect. We normalize the feature values with a time-dependent normalization
factor that considers seasonality. This factor is proportional to the total number of messages on each day.

Table 3: Features.

| Activity features | Description |
|---|---|
| RTID | number of re-tweets in $G_{t_1,t_2}$ |
| RTU | number of different users that have re-tweeted in $G_{t_1,t_2}$ |
| TGEO | number of tweets with geo-location in $G_{t_1,t_2}$ |
| TID | number of tweets in $G_{t_1,t_2}$ |
| TUSM | number of tweets that mention any user in $G_{t_1,t_2}$ |
| UFRN | average number of friends for users that posted in $G_{t_1,t_2}$ |
| THTG | number of hash-tags used in all the tweets in $G_{t_1,t_2}$ |
| TURL | number of tweets with URLs in $G_{t_1,t_2}$ |
| UFLW | average number of followers for users that posted in $G_{t_1,t_2}$ |
| UID | number of different users that posted a tweet in $G_{t_1,t_2}$ |

| Graph features | Description |
|---|---|
| NUM_NODES | number of nodes of $G_{t_1,t_2}$ |
| NUM_EDGES | number of edges of $G_{t_1,t_2}$ |
| NUM_CMP | number of connected components of $G_{t_1,t_2}$ |
| MAX_DIST | maximum diameter for any component of $G_{t_1,t_2}$ |
| PAGERANK | statistics on the PageRank distribution for $G_{t_1,t_2}$ (AVG, STDV, QUARTILES, SKEWNESS, KURTOSIS) |
| COMPONENT | statistics on the connected component distribution for $G_{t_1,t_2}$ (same as above) |
| DEGREE | statistics on the node degree distribution for $G_{t_1,t_2}$ (same as above) |

4. TIME SERIES CORRELATION
In this section, we start by looking for correlations between the proposed features for a company, and the financial time series for its stock, in terms of volume traded or change in price. Next, we consider how this correlation changes under (i) an analysis isolating different types of companies, (ii) an analysis aggregating companies into an index, and (iii) changes to the filtering strategy.

4.1 Correlation with volume and price
We use the cross-correlation coefficient (CCF) to estimate how variables are related at different time lags. The CCF value at lag τ between two series X, Y measures the correlation of the first series with respect to the second series shifted by an amount τ.
This can be computed as

$$R(\tau) = \frac{\sum_i (X(i) - \mu_X)(Y(i-\tau) - \mu_Y)}{\sqrt{\sum_i (X(i) - \mu_X)^2}\,\sqrt{\sum_i (Y(i-\tau) - \mu_Y)^2}}$$

If we find a correlation at a negative lag, this means that the input features could be used to predict the outcome series.

Tables 4 and 5 report the average cross-correlation values for traded volume and price respectively, for the 50 companies with most tweets in the observation period, at different lags. We only report the top 5 features for each case, i.e., those having the highest correlation at lag 0. Interestingly, the top features are similar in both lists.

Table 4 shows that the number of components (NUM-CMP) of the constrained subgraph is the feature that has the best correlation with traded volume. Other good features for this objective are the number of tweets, the number of different users and the total number of nodes on each graph. We also see that there is a positive correlation at lag 1, meaning that these features have some predictive power on the value on the next day. On the other hand, Table 5 shows that the price change is not strongly correlated with any of the proposed features.

Table 4: Average correlation of traded volume and features.

| Feature | Lag -3 | Lag -2 | Lag -1 | Lag 0 | Lag 1 | Lag 2 | Lag 3 |
|---|---|---|---|---|---|---|---|
| NUM-CMP | 0.09 | 0.11 | 0.21 | 0.52 | 0.33 | 0.16 | 0.10 |
| TID | 0.09 | 0.10 | 0.19 | 0.49 | 0.31 | 0.15 | 0.09 |
| UID | 0.09 | 0.11 | 0.21 | 0.49 | 0.31 | 0.15 | 0.10 |
| NUM-NODES | 0.09 | 0.10 | 0.20 | 0.49 | 0.31 | 0.15 | 0.09 |
| NUM-EDGES | 0.09 | 0.09 | 0.18 | 0.45 | 0.29 | 0.14 | 0.09 |

Table 5: Average correlation of price and features.

| Feature | Lag -3 | Lag -2 | Lag -1 | Lag 0 | Lag 1 | Lag 2 | Lag 3 |
|---|---|---|---|---|---|---|---|
| NUM-CMP | 0.08 | 0.09 | 0.10 | 0.13 | 0.07 | 0.07 | 0.07 |
| NUM-NODES | 0.07 | 0.09 | 0.10 | 0.11 | 0.08 | 0.07 | 0.07 |
| TID | 0.06 | 0.08 | 0.07 | 0.10 | 0.07 | 0.08 | 0.08 |
| UID | 0.07 | 0.08 | 0.08 | 0.10 | 0.07 | 0.08 | 0.07 |
| NUM-EDGES | 0.07 | 0.08 | 0.09 | 0.10 | 0.08 | 0.07 | 0.06 |

4.2 Separating companies by type
Figure 4 shows the cross-correlation coefficient (CCF) values for two selected companies (A.I.G. and Teradyne, Inc.) in our data set. In Figure 4(a) we see a strong correlation of the stock volume with the four best features of Table 4.
On the other hand, Figure 4(b) does not show this correlation. The next question then is to find out which factors affect the correlation between micro-blogging activity and the companies' stock. We obtained a series of financial indicators for each company from Yahoo! Finance. For each such indicator, we separated the 50 companies into 3 quantiles. The average correlation of NUM-CMP with traded volume for each group is shown in Table 6, for the five financial indicators that exhibit the largest variance across their three groups. The bounds are the cut-off points of the quantiles.

The table shows that the correlation is stronger for companies with low debt, regardless of whether their financial indicators are healthy or not. This could be related to stocks that are expected to surge or that may be candidates for short selling. The users' tweets also correlate better with the stocks for
companies having high beta and low float, again suggesting that Twitter activity is better correlated with traded volume for companies whose finances fluctuate a lot.

Figure 4: Correlations for two different companies. (a) A.I.G. (AIG); (b) Teradyne, Inc. (TER).

Table 6: Average correlation of traded volumes for different companies according to several financial indicators. Financial indicators are discretized in 3 quantiles (low, medium, high) according to the bounds shown.

| Indicator | Bounds | Low | Medium | High |
|---|---|---|---|---|
| Current Ratio (mrq) | 1.34, 2.39, 9.41 | 0.42 | 0.62 | 0.52 |
| Gross Profit (ttm) | $2B, $9B, $103B | 0.59 | 0.54 | 0.42 |
| Enterprise Value/EBITDA (ttm) | 6.22, 11.78, 20.21 | 0.54 | 0.43 | 0.59 |
| PEG Ratio (5 yr expected) | 1.04, 1.51, 35.34 | 0.51 | 0.44 | 0.61 |
| Float | $272MM, $914MM, $10B | 0.61 | 0.46 | 0.48 |
| Beta | 0.98, 1.34, 3.95 | 0.47 | 0.51 | 0.58 |

4.3 Aggregating companies in an index
In Sections 4.1 and 4.2 we built single-stock constrained subgraphs, which are often too small to reliably compute graph features like PageRank. In this section, we consider a stock index $I$ consisting of the $n = 20$ biggest (in terms of market capitalization) companies $c_1, \ldots, c_n$ in our dataset, and build index-based constrained subgraphs. We can define the index change for each date $d$ as follows:

$$\mathrm{Idx}(I,d) = \sum_{c \in I} \mathrm{priceChange}(c,d) \cdot \mathrm{weight}(c)$$

where $\mathrm{priceChange}(c,d)$ is the difference between the open and close price for company $c$ on date $d$, and the weight is the importance (market capitalization) of each company. In particular, as usually done in financial indexes, we define the importance for each company as:

$$\mathrm{weight}(c) = \frac{\mathrm{MarketCap}(c)}{\max_{c' \in I} \mathrm{MarketCap}(c')}$$

We also define the index trade volume for a particular date as:

$$\mathrm{VolumeIdx}(I,d) = \sum_{c \in I} \mathrm{volumeTraded}(c,d) \cdot \mathrm{weight}(c)$$

The index data graph considers the tweets that were posted in the first half of 2010. The graph has 108,702 nodes and 209,714 edges. We repeat the correlation experiments of Section 4.1. The results are shown in Tables 7 and 8.
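As a rough illustration of this experiment, the sketch below aggregates per-company series into a market-cap-weighted index, following the definitions of Idx and VolumeIdx, and then evaluates the cross-correlation coefficient of Section 4.1 at lags -3 to 3. The function names (`index_series`, `ccf`), the company names and all numbers are made up for the example. One assumption is stated in the code: the means are computed over the overlapping samples only, which the text does not specify.

```python
from math import sqrt


def index_series(per_company, market_cap):
    """Weighted sum over companies: sum_c per_company[c][d] * weight(c),
    with weight(c) = market_cap[c] / (largest market cap in the index)."""
    largest = max(market_cap[c] for c in per_company)
    n_days = min(len(series) for series in per_company.values())
    return [
        sum(per_company[c][d] * market_cap[c] / largest for c in per_company)
        for d in range(n_days)
    ]


def ccf(x, y, lag):
    """R(lag): correlation between x(i) and y(i - lag) over the overlapping days.
    Assumption: the means are taken over the overlapping samples only."""
    pairs = [(x[i], y[i - lag]) for i in range(len(x)) if 0 <= i - lag < len(y)]
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    den = sqrt(sum((a - mean_x) ** 2 for a, _ in pairs)) * sqrt(
        sum((b - mean_y) ** 2 for _, b in pairs)
    )
    return num / den if den else 0.0


# Made-up normalized volumes for two hypothetical companies over six days.
volume = {
    "AAA": [1.0, 2.0, 1.0, 3.0, 1.0, 2.0],
    "BBB": [2.0, 2.0, 4.0, 2.0, 6.0, 2.0],
}
caps = {"AAA": 50.0, "BBB": 100.0}
volume_idx = index_series(volume, caps)

# A made-up feature series that equals the index one day ahead of time.
feature = volume_idx[1:] + [0.0]
lags = {lag: ccf(volume_idx, feature, lag) for lag in range(-3, 4)}
print(volume_idx)
print(lags)
```

Because the formula pairs x(i) with y(i - lag), this toy feature is perfectly correlated with the index at lag 1 when the index is passed as the first series. Swapping the two arguments flips the sign of the lag at which the peak appears.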
The key difference from Tables 4 and 5 is that in the larger index constrained graphs, graph centrality measures like PAGERANK and DEGREE get more reliable estimations and
are shown to be more strongly correlated to both price and traded volume. Another interesting observation is that the trading volume is less correlated (Table 7) than in the case of individual stocks (Table 4). We have observed that increases in the activity of some companies are often compensated by the inactivity of others, leading to more stable constrained graphs. In particular, we measured that the variance of the number of connected components, which was the best feature, is lower compared to the average variance for individual companies.

Table 7: Correlation of traded volume and features, for a synthetic index of top 20 companies.

| Feature | Lag -3 | Lag -2 | Lag -1 | Lag 0 | Lag 1 | Lag 2 | Lag 3 |
|---|---|---|---|---|---|---|---|
| NUM-CMP | 0.00 | 0.12 | 0.20 | 0.24 | 0.15 | 0.26 | 0.21 |
| P.RANK-AVG | 0.03 | 0.15 | 0.20 | 0.24 | 0.12 | 0.16 | 0.14 |
| TID | -0.01 | 0.14 | 0.17 | 0.23 | 0.19 | 0.27 | 0.21 |
| UID | -0.03 | 0.11 | 0.15 | 0.22 | 0.20 | 0.26 | 0.23 |
| NUM-EDGES | 0.02 | 0.12 | 0.14 | 0.22 | 0.19 | 0.24 | 0.20 |

Table 8: Correlation of price change and features, for a synthetic index of top 20 companies.

| Feature | Lag -3 | Lag -2 | Lag -1 | Lag 0 | Lag 1 | Lag 2 | Lag 3 |
|---|---|---|---|---|---|---|---|
| DEG.-STD | 0.08 | 0.05 | 0.10 | 0.12 | 0.10 | 0.07 | -0.04 |
| DEG.-SKW | 0.04 | 0.02 | 0.07 | 0.11 | 0.06 | 0.03 | 0.02 |
| P.RANK-SKW | 0.02 | 0.03 | 0.08 | 0.10 | 0.06 | 0.02 | 0.03 |
| DEG.-KURT | 0.02 | -0.01 | 0.05 | 0.10 | 0.08 | 0.05 | -0.00 |
| P.RANK-STD | 0.08 | -0.01 | 0.12 | 0.09 | 0.04 | -0.03 | 0.05 |

4.4 Modifying the filtering strategy
In Section 2.1 we presented one strategy for filtering the stream. The rationale behind this strategy is that we only focus on tweets related to the financial domain. However, we may end up filtering out some messages that are related to a company but do not mention it explicitly. We study looser filtering strategies, and obtain negative results: indeed, we degrade the quality of the correlations.

We consider the following strategies:
1. Restricted Graph: Presented in Section 2.1.
2. Expanded Graph: We consider all tweets that contain the ticker preceded by the $ or # character, or contain the full name of the company, or the short name version after removing common suffixes (e.g., inc or corp), or the short name as a hash-tag.
For instance, for Yahoo! the new expression is: #YHOO $YHOO #Yahoo Yahoo Yahoo Inc.
3. RestExp: This combines the previous two strategies. We add to the restricted graph the tweets of the expanded graph that are reachable from the nodes of the restricted graph through a path (e.g., through a common author or a re-tweet).

Again, we visually inspected a small sample to check whether the rules matched tweets related to the company. If a rule was very generic (ambiguous) we removed it. For example, for the company APOL we removed the rules that use APOLO as short name.

The expanded graph size is shown in Table 2; it has about 150 times more nodes and edges than the normal graph for the same period of time. The restricted graph is more precise than the expanded graph, but has lower recall. The expanded graph is noisier, as it may contain many tweets that are related to spam, common conversation (e.g., "I want to eat an apple") or tweets related to non-financial discussions (e.g., complaints about customer service).

Figure 5: Microblogging activity for different strategies, correlated with (a) traded volume and (b) price change.

Figure 5 compares the volume and price correlation with the number of components (NUM-CMP), for the different filtering strategies. We observe that the restricted graph strategy has the best correlation, despite its smaller size. By expanding the graph in this way, we add more noise than useful tweets.

5. SIMULATION
In this section we study whether the correlation with price change can be used to develop a trading strategy in the stock market. We simulate daily trading [5, 22, 25] of stocks and try to predict the final price on each day of the simulation. We compare various trading strategies, including simple regression models, augmented regression, random and fixed selection.

5.1 Strategies
We model an automated investor who buys and sells stock. The behavior model for this investor is the following:
1. The investor starts with an initial capital endowment $C_0$.
2. In the morning of every day $t$, she buys $K$ different stocks using all of the available capital $C_t$. The investor uses various algorithms to select which stocks to buy and how many shares to buy from each of them. The companies in our simulated stock market are the same random selection from the S&P 500 described in Section 2.
3. The investor holds the stocks all day long.
4. She sells all the stocks at the closing time of day $t$. The amount she obtains will be her new capital $C_{t+1}$ and will be used again in step 2. This process finishes on the last day of the simulation.
5. We compare the final capital against the initial investment. We plot the percentage of money won or lost each day against the original investment.

This simple simulation does not consider external effects such as a deficit of stocks, or whether the stocks can be sold at the final price. Our aim is to determine if the proposed Twitter features have the potential of improving over other baseline strategies.

The stock selection algorithms evaluated are the following:

Random: the investor selects $K$ stocks at random each morning. To diversify the investment, the amount of money invested in each stock is $C_t/K$ (uniformly shared).

Fixed: the investor picks $K$ stocks using a particular financial indicator (market capitalization, company size, total debt) and buys
from the same companies every day. To diversify the investment, the amount of money invested in each stock is $C_t/K$ (uniformly shared).

Auto Regression: the investor buys the $K$ stocks whose price changes will be larger, predicted using an auto-regression (AR(s)) model. This model predicts the price $x_t$ of a company at time $t$ using a linear combination of the price in the previous $s$ days:

$$x_t = c + a_1 x_{t-1} + a_2 x_{t-2} + \ldots + a_m x_{t-m},$$

where the $a_i$ are the parameters of the model and $c$ is a constant. Parameters are learned with simple linear regression on provided training data of $L$ samples. To diversify the investment we have two options: the first one is the uniform split which we already discussed, and the second one weighs each stock using the predicted price change. This last strategy is consistent with a very simple heuristic used for the bin packing problem, where we prefer those items with a high price-weight ratio. In our case:

$$\mathrm{weight} = \frac{\text{price difference}}{\text{open price}}$$

Twitter-Augmented Regression: the investor buys the best $K$ stocks that are predicted using a vector auto-regressive (VAR(s)) model. This model considers, in addition to the price of the previous days, a Twitter feature (e.g., number of components) as observed in the previous days. The model predicts the price of a stock at time $t$ using a linear combination of the price in the previous $s$ days and the values of the augmenting series (Twitter feature) on the same dates:

$$x_t = c + a_1 x_{t-1} + a_2 x_{t-2} + \ldots + a_m x_{t-m} + b_1 y_{t-1} + b_2 y_{t-2} + \ldots + b_m y_{t-m}.$$

Training details are similar to the ones discussed for the AR(s) model. The strategies to diversify are also the same (uniform and weighted).

Figure 6: Simulation results for different trading strategies.

5.2 Results
We simulate a series of investments between March 1, 2010 and June 30, 2010. We use the data from January 1, 2010 to February 28, 2010 as training data for the regression models. We keep a window of $L = 35$ training examples and use the previous $s = 5$ days to train.
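The following sketch shows one way the regression step and one trading day of the simulation could be wired together. It is written under several assumptions that are not taken from the paper: the helper names (`solve`, `design_row`, `fit_var`, `predict_next`, `pick_top_k`, `trade_one_day`) are hypothetical, the least-squares fit solves the normal equations directly in plain Python, an intercept is included, and the series, tickers and prices are synthetic. Dropping the `y` columns from `design_row` would give the plain AR variant.

```python
def solve(a, b):
    """Solve a @ z = b by Gauss-Jordan elimination with partial pivoting.
    Assumes the system is non-singular (no safeguard in this sketch)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design_row(x, y, t, s):
    """Regressors for time t: [1, x_{t-1}, ..., x_{t-s}, y_{t-1}, ..., y_{t-s}]."""
    lags = range(1, s + 1)
    return [1.0] + [x[t - j] for j in lags] + [y[t - j] for j in lags]


def fit_var(x, y, s):
    """Least-squares fit via the normal equations (X^T X) z = X^T targets."""
    rows = [design_row(x, y, t, s) for t in range(s, len(x))]
    targets = x[s:]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * target for r, target in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict_next(x, y, coef, s):
    """One-step-ahead forecast of x from the last s values of x and y."""
    return sum(c * v for c, v in zip(coef, design_row(x, y, len(x), s)))


def pick_top_k(predicted_change, k):
    """Tickers with the k largest predicted price changes."""
    return sorted(predicted_change, key=predicted_change.get, reverse=True)[:k]


def trade_one_day(capital, picks, open_price, close_price):
    """Uniform share: invest capital / K in each pick at the open, sell at the close."""
    share = capital / len(picks)
    return sum(share * close_price[c] / open_price[c] for c in picks)


# Synthetic series generated from known coefficients (c = 1, a_1 = 0.5, b_1 = 2).
y = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0, 2.0, 4.0, 1.0, 0.0]
x = [1.0]
for t in range(1, len(y)):
    x.append(1.0 + 0.5 * x[t - 1] + 2.0 * y[t - 1])

coef = fit_var(x, y, s=1)
forecast = predict_next(x, y, coef, s=1)

# Hypothetical predicted changes and prices for three made-up tickers.
picks = pick_top_k({"AAA": 0.01, "BBB": -0.02, "CCC": 0.03}, k=2)
capital = trade_one_day(
    10000.0, picks, {"AAA": 10.0, "CCC": 20.0}, {"AAA": 11.0, "CCC": 20.0}
)
print(coef, forecast, picks, capital)
```

Since the synthetic series is generated without noise, the fit recovers the generating coefficients up to rounding. With real prices, a rolling window of L samples would be refit each day before calling `pick_top_k`.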
Each model is trained using the Ordinary Least Squares (OLS) method. The initial capital is $C_0 = 10{,}000$ dollars. The investor buys stock from $K = 10$ different companies every day in every strategy. For the Twitter-Augmented Regression we use the following features: TID, UID, NUM-CMP, NUM-NODES, NUM-EDGES. We use all the graphs that were described in Section 4.4. We also try the weighted and uniform share options for both the AR and VAR models.

Figure 6 shows the simulation results for the discussed period. We show the behavior of a sample of all trading techniques discussed: specifically, we show the best approach for each category (e.g., for Twitter-Augmented Regression we only show the behavior of the NUM-CMP feature). Our two baselines for the rest of the discussion are the random strategy and the Auto-Regression strategy. The average loss for the random strategy is 5.52%, and the losses for the AR models are 8.9% (Uniform) and 13.08% (Weighted). Only one of the fixed models (Profit Margin) behaves better than the default random strategy (-3.8%). All the rest of the models that improve on the baseline are VAR models. The best one uses the number of components on the RestExp graph with a uniform share, for a 0.32% gain. The models obtained with the restricted graph and number of components average a 2.4% loss, which is still better than the random model.

Figure 6 also includes the Dow Jones Index Average (DJA) for the same period. As we can see, the behavior of all the strategies is consistent with this index's behavior. Our proposed strategy is the only one that manages to obtain a profit during this period in which
the Dow Jones fell 4.2% (the Nasdaq (NDQ) also dropped by a similar 4.7%). The best feature is again the number of components.

6. RELATED WORK
Microblogging data: In recent years several studies (e.g. [17, 19, 16]) have analyzed Twitter data to describe the different types of users, their behavior, the content of the tweets and the way that they are related to trends. In particular, [21] describe the relationship between tweets and trends in traditional news media, as well as query volume on major search engines. Our work is informed by this general knowledge about Twitter, but we focus our attention on a particular domain instead of attempting a general study.

The relations among users, entities and topics in Twitter have been described by a graph and exploited in previous work. For instance, [31] starts by identifying similar users based on their favorite topics and their social connections. Then, a modified version of PageRank is used to find the most influential authors on the Twitter graph. Yamaguchi et al. [32] extend the ObjectRank algorithm [1] to consider different types of vertices. These works only consider a fixed point in time and do not consider the changes in the graph structure over time.

News articles and the stock market: The literature relating news stories with financial events is vast. Here we outline some recent works on the subject. Hayo and Kutan [15] present a pure economic prediction model to study the effect of other markets on the Russian market. A variable in this model defines whether the news was positive or negative in the past. Although the news classification is manual, this study shows the importance of news for market behavior. Lavrenko et al. [23] present a model to predict the behavior of the stock of a company using news stories related to the company. The system builds a language model for positive and negative stories and predicts the future behavior by checking the language model of the news that appeared in the previous hours.
Our work does not pretend to be a prediction model: we measure the correlation of the behavior of Twitter with the changes in the stock market. Schumaker and Chen [26] present a system that learns the importance of a news item for the performance of a stock. Again, compared with this work, we do not try to make a prediction but to find an explanation of the change. Moreover, we use microblogging posts instead of news stories. DeChoudhury et al. [7] show that discussions in blog posts are also correlated with the direction of the stock market. Yi [33] presents a study to approximate the daily closing value of a stock using data from Twitter. That work also discusses several features that can be used in the prediction and presents a model that improves on a simple moving average, reducing the error. Sprenger and Welpe [27] and Bollen et al. [3] also show the relation between the stock market, or particular stocks, and the sentiment of tweets, and how it can be used to improve prediction. None of these works considers graph features.

Time series regression from Web data: The use of web data for predicting the behavior of a real time series is related to our work. Ginsberg et al. [12] present a method to approximate the cases of influenza in the US using the query log of a search engine. Corley et al. [6] make a similar prediction using blog content. Other work [8] uses search logs to predict the job market. Goel et al. [13] argue that while these prediction models are correct, they are not really competitive, nor do they add any information, when compared with models that use domain knowledge. Their application to the prediction of music, video game and movie hits shows that other, better-known and simpler features are good enough. Gayo-Avello et al. [11] have similar objections, and in addition show evidence of differences between the distribution of demographic characteristics of Twitter users from a country and that of the general population of that country.
These differences are substantial and make it impossible to obtain a uniform random sample of, e.g., citizens voting in an election. Furthermore, Gayo-Avello [10] warns against jumping to conclusions too quickly when analyzing social media data, reminding us that being large does not by itself make such collections statistically representative of the population as a whole.

Data filtering and spam removal: The filtering of data can have an important effect on the performance of our method. This problem can be divided into two parts: finding related tweets and removing those that are spam. Hagedorn et al. [14] propose a supervised multi-classifier that can determine whether an RSS feed is related to a particular company. We could use this kind of strategy to find better filters. Finin et al. [9] propose crowdsourcing strategies to annotate the entities that appear in tweets. This knowledge could be used later to train Named Entity Recognizers adapted to the particular characteristics of Twitter. Our work could be extended by leveraging features used in the elimination of spam from tweets. Wang [30] and Benevenuto et al. [2] present features that can be used to detect spammers, and show how spammers differ from relevant users. Castillo et al. [4] go further, as they study the veracity of tweets for particular events. We think that our work can be improved by using this knowledge in the filtering phase.

(Dynamic) graph features: The graph-based features we use are a subset of those presented in previous work [18, 24]. We can extend our work by including more graph features. For instance, Kumar et al. [20] study the evolution of social networks in the blogosphere. That work shows that there are changes in the structure that are related to changes in the real world. Other works [29, 28] propose algorithms to mine patterns in massive graphs. In particular, [28] shows how the structure of the graph changes over time.

7. CONCLUSIONS

We presented a framework to extract messages from Twitter about company stocks, and to represent that information through graphs capturing different aspects of the conversation around those stocks. We then used these time-constrained graphs to evaluate a wide range of features in terms of their degree of correlation with changes in stock price and traded volume. We show that the number of connected components of the constrained subgraph is generally the best feature in terms of correlation, especially in relation to traded volume. Graph centrality features like PageRank and average degree become effective for bigger graphs, which can be obtained for multi-company indexes. Finally, we used simulation to show that these features are useful for improving a trading strategy in the stock market.

8. ACKNOWLEDGMENTS

Eduardo Ruiz and Vagelis Hristidis were supported in part by National Science Foundation grants OISE-0730065, IIS-0811922, IIS-0952347 and CNS-1126619. Carlos Castillo, Aristides Gionis and Alejandro Jaimes were partially supported by the Torres Quevedo Program of the Spanish Ministry of Science and Innovation, co-funded by the European Social Fund, and by the Spanish Centre for the Development of Industrial Technology under the CENIT program, project CEN-20101037, Social Media (http://www.cenitsocialmedia.es/).
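
To make the main finding of the conclusions concrete, the following is a minimal sketch of how the number of connected components of a daily interaction graph could be computed and then correlated with traded volume. It is an illustration under stated assumptions, not the procedure used in the paper: the helper names (count_components, pearson), the toy graphs, and the volume figures are all hypothetical, and the graph construction, normalization and lag analysis of the earlier sections are not reproduced.

```python
from math import sqrt


def count_components(nodes, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return len({find(n) for n in nodes})


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: one small interaction graph per trading day.
# Nodes stand for tweets (t*) and users (u*); edges stand for authorship.
daily_graphs = [
    (["t1", "u1", "t2", "u2"], [("t1", "u1"), ("t2", "u2")]),
    (["t1", "u1", "t2", "u2", "t3"], [("t1", "u1")]),
    (["t1", "u1", "t2"], [("t1", "u1"), ("u1", "t2")]),
    (["t1", "u1", "t2", "u2", "t3", "u3"], [("t1", "u1"), ("t2", "u2")]),
]
# Hypothetical traded volume for the same four days (made-up numbers).
volume = [1.0e6, 1.9e6, 0.6e6, 2.1e6]

# NUM-CMP feature: one value per day.
num_cmp = [count_components(nodes, edges) for nodes, edges in daily_graphs]
r = pearson(num_cmp, volume)
print("NUM-CMP per day:", num_cmp)
print("correlation with volume:", round(r, 3))
```

Union-find is used here only because it keeps the sketch dependency-free; any graph library that reports connected components would serve the same purpose.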
9. REFERENCES

[1] A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank: authority-based keyword search in databases. In Proceedings of the 13th international conference on Very Large Data Bases, pages 564-575, 2004.
[2] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting Spammers on Twitter. Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), 2010.
[3] J. Bollen, H. Mao, and X.-J. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, abs/1010.3003, 2010.
[4] C. Castillo, M. Mendoza, and B. Poblete. Information Credibility on Twitter. In Proceedings of World Wide Web Conference (WWW), 2011.
[5] A.-S. Chen, M. T. Leung, and H. Daouk. Application of neural networks to an emerging financial market: forecasting and trading the taiwan stock index. Computers & Operations Research, 30(6):901-923, 2003.
[6] C. Corley, A. R. Mikler, K. P. Singh, and D. J. Cook. Monitoring influenza trends through mining social media. In BIOCOMP, pages 340-346, 2009.
[7] M. DeChoudhury, H. Sundaram, A. John, and D. D. Seligmann. Can blog communication dynamics be correlated with stock market activity? In Proceedings of the 20th ACM conference on Hypertext and Hypermedia, 2008.
[8] M. Ettredge, J. Gerdes, and G. Karuga. Using web-based search data to predict macroeconomic statistics. Communications of the ACM, 48:87-92, 2005.
[9] T. Finin, W. Murnane, A. Karandikar, N. Keller, J. Martineau, and M. Dredze. Annotating named entities in twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 80-88, 2010.
[10] D. Gayo-Avello. A warning against converting social media into the next literary digest. Communications of the ACM, 2011.
[11] D. Gayo-Avello, P. T. Metaxas, and E. Mustafaraj. Limits of electoral predictions using twitter. In International AAAI Conference on Weblogs and Social Media (posters), 2011.
[12] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012-1014, 2009.
[13] S. Goel, J. M. Hofman, S. Lahaie, D. M. Pennock, and D. J. Watts. What can search predict? In Proceedings of World Wide Web Conference (WWW), 2010.
[14] B. A. Hagedorn, M. Ciaramita, and J. Atserias. World knowledge in broad-coverage information filtering. In Proceedings of the 30th annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 801-802, 2007.
[15] B. Hayo and A. M. Kutan. The impact of news, oil prices, and global market developments on russian financial markets. The Economics of Transition, 13(2):373-393, 2005.
[16] B. A. Huberman, D. M. Romero, and F. Wu. Social networks that matter: Twitter under the microscope. First Monday, 14(1), 2009.
[17] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56-65, 2007.
[18] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins. The web as a graph: measurements, models, and methods. In Proceedings of the 5th annual international conference on Computing and combinatorics, pages 1-17, 1999.
[19] B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In Proceedings of the First Workshop on Online Social Networks, pages 19-24, 2008.
[20] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. Structure and evolution of blogspace. Communications of the ACM, 47:35-39, December 2004.
[21] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591-600, 2010.
[22] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. Language models for financial news recommendation. In Proceedings of the 9th international conference on Information and knowledge management, pages 389-396, 2000.
[23] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. Mining of concurrent text and time series. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Text Mining, pages 37-44, 2000.
[24] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.
[25] T. Preis and H. E. Stanley. Trend switching processes in financial markets. In Econophysics Approaches to Large-Scale Business Data and Financial Crisis, pages 3-26. 2010.
[26] R. P. Schumaker and H. Chen. Textual analysis of stock market prediction using breaking financial news: The azfin text system. ACM Transactions on Information Systems, 27:12:1-12:19, 2009.
[27] T. O. Sprenger and I. M. Welpe. Tweets and trades: The information content of stock microblogs. Work in progress in Social Science Research Network.
[28] J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. Graphscope: parameter-free mining of large time-evolving graphs. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 687-696, 2007.
[29] H. Tong, S. Papadimitriou, J. Sun, P. S. Yu, and C. Faloutsos. Colibri: fast mining of large static and dynamic graphs. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 686-694, 2008.
[30] A. H. Wang. Don't follow me - spam detection in twitter. In SECRYPT '10, pages 142-151, 2010.
[31] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pages 261-270, 2010.
[32] Y. Yamaguchi, T. Takahashi, T. Amagasa, and H. Kitagawa. Turank: Twitter user ranking based on user-tweet graph analysis. In Web Information Systems Engineering-WISE 2010, pages 240-253. 2010.
[33] A. Yi. Stock market prediction based on public attentions: a social web mining approach. Master's thesis, University of Edinburgh, 2009.