POLITECNICO DI MILANO
Scuola di Ingegneria dell'Informazione
POLO TERRITORIALE DI COMO
Master of Science in Computer Engineering

A CRF-based approach to find stock price correlation with company-related Twitter sentiment

Master Graduation Thesis by: Ekaterina Shabunina
Supervisor: Prof. Marco Brambilla
Academic Year 2012/13
This is an example of how powerful a Twitter post can be: on 23 April 2013 at 1:07 pm, the hacked Twitter account of the Associated Press posted a false tweet saying "Two explosions in the White House and Barack Obama is injured", causing a flash crash on the stock market as automated trading systems sold $134 billion worth of stocks.
Shabunina Ekaterina - Politecnico di Milano - Como Campus - 24/7/2013
Background

Twitter
- 554,750,000 registered users, of which 288 million are active monthly, posting on average 5 million tweets a day at an estimated rate of 9,100 tweets per second;
- User profiles are public by default;
- Users come from all over the world, with varied age, nationality, household income, profession and hobby distributions;
- Cashtags: clickable ticker symbols with a dollar-sign prefix (for example, $goog), which take a user to the search results about the company's finances and stock.

Sentiment Analysis (multi-class)
- Determining the attitude expressed in a tweet with respect to the company under study (in the context of this thesis).
- Hard on Twitter micro-blogs, which are only 140 symbols long.
- Even harder in the financial domain, which employs a very specific jargon and slang, with particular abbreviations and symbols, and in which many words carry different meanings and are associated with distinct emotions.

Stock Markets
Players in financial stock markets perform two categories of analysis to decide whether to buy or sell a given security:
- Technical (an attempt to apply mathematical models);
- Fundamental (based on studying the value of a company in terms of its capacity to generate cash in the future).
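As an illustration of how cashtags can be used to select company-related tweets, the following minimal sketch extracts ticker symbols from tweet text. The regular expression (a dollar sign followed by one to six letters) and the function names are assumptions made for this example, not Twitter's official cashtag grammar or the filtering used in the thesis.

```python
import re

# Assumed pattern: "$" followed by 1-6 letters, not glued to other word characters.
CASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,6})(?![A-Za-z0-9])")


def extract_cashtags(tweet: str) -> list[str]:
    """Return the upper-cased ticker symbols that appear as cashtags in a tweet."""
    return [symbol.upper() for symbol in CASHTAG_RE.findall(tweet)]


def mentions_company(tweet: str, ticker: str) -> bool:
    """True if the tweet contains the cashtag of the given ticker (hypothetical helper)."""
    return ticker.upper() in extract_cashtags(tweet)
```

Amounts such as "$100" are not matched because the pattern requires letters after the dollar sign.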
Tools & Methods
Architectural overview of the proposed approach (diagram):
- Data Pre-Processing: crawling Twitter with the Twitter Search API; filtering
- Data Processing: manual labeling; POS tagging
- Training the model: templates; CRF++ tool
- Twitter data labeling
- Regression analysis: stock market data; Minitab 16 tool
Results: Classifier's Performance

Template descriptions:
- Simple: the current word and its POS tag;
- Previous: the previous and current words and their POS tags;
- Prev+Next: the previous, current and next words and their POS tags;
- Prevprev+Nextnext: the two previous words, the current word and the two next words, and their POS tags;
- Word_combinations: the Prevprev+Nextnext template features plus the word-pair combinations: word before previous / previous, previous / current, current / next, and next / word after next.

[Chart: Classifier's accuracy, averaged over 10 folds, for Microsoft Inc. and Google Inc. with the various templates.]

For both companies, Microsoft Inc. and Google Inc., the best performance was obtained with the Word_combinations template. It was therefore chosen to label the tweets of the following one-month period, from 25 April to 24 May 2013, producing the daily volume of Twitter sentiments needed for the next task: finding their correlation with the stock values of these companies.

[Chart: Effect of the training parameters on the classifier's performance, Google Inc. dataset.]
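The five templates described above could be written as CRF++ feature template files along the following lines. This is a sketch under assumptions: the training file is taken to have the word in column 0 and the POS tag in column 1, the feature identifiers (U00, U01, ...) are numbered arbitrarily, and the helper `crfpp_template` is hypothetical. The `%x[row,col]` macro is CRF++'s notation for "the token at relative position `row`, column `col`". The template files actually used in the thesis are not shown in the slides.

```python
def crfpp_template(window, combos=()):
    """Build CRF++ unigram template lines.

    window: relative token positions whose word and POS tag become features.
    combos: pairs of relative positions whose words are combined into one feature.
    """
    lines = []
    idx = 0
    for offset in window:
        for col in (0, 1):  # assumed layout: column 0 = word, column 1 = POS tag
            lines.append(f"U{idx:02d}:%x[{offset},{col}]")
            idx += 1
    for a, b in combos:
        lines.append(f"U{idx:02d}:%x[{a},0]/%x[{b},0]")
        idx += 1
    return "\n".join(lines)


TEMPLATES = {
    "Simple": crfpp_template([0]),
    "Previous": crfpp_template([-1, 0]),
    "Prev+Next": crfpp_template([-1, 0, 1]),
    "Prevprev+Nextnext": crfpp_template([-2, -1, 0, 1, 2]),
    "Word_combinations": crfpp_template(
        [-2, -1, 0, 1, 2],
        combos=[(-2, -1), (-1, 0), (0, 1), (1, 2)],
    ),
}
```

Each string in `TEMPLATES` can be saved to a file and passed to CRF++ as the template argument during training.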
Results: Classifier's Performance
Performance measures of the resulting classification models for each company, selected out of the 10 folds.
[Table: Classification models performance for Microsoft Inc.]
[Table: Classification models performance for Google Inc.]
Results: Adherences
The initial results are summarized in the charts below for Microsoft Inc. and Google Inc.
[Chart: Sentiments and Closing price, Microsoft Inc. Left axis: Number of Tweets; right axis: Closing Price (USD); horizontal axis: dates from 17/4 to 22/5. Series: Total, Positive, Negative, Neutral, Closing Price.]
[Chart: Sentiments and Closing price, Google Inc. Same axes and series.]
Results: Adherences
These two charts plot the variation of the stock price (relative to the closing price of the previous day), the number of tweets classified as positive, the number of tweets classified as negative, and the net number of positive tweets, i.e. the number of positive tweets minus the number of negative tweets.
[Chart: Sentiments, Net Positive and Price change, Microsoft Inc. Left axis: Number of Tweets; right axis: Closing Price Variation (%); horizontal axis: dates from 17/4 to 22/5. Series: Net Positive, Positive, Negative, Price Change (%).]
[Chart: Sentiments, Net Positive and Price change, Google Inc. Same axes and series.]
The net positive value visibly adheres to the daily stock performance in the Google Inc. case, possibly due to the larger sample size.
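The two derived quantities used in these charts can be computed as in the short sketch below. The function names and the list-based data layout (one value per trading day, in chronological order) are assumptions made for illustration.

```python
def net_positive(positive, negative):
    """Net positive tweets per day: positive count minus negative count."""
    return [p - n for p, n in zip(positive, negative)]


def price_change_pct(closing_prices):
    """Daily price change in percent, relative to the previous day's close.

    The first day has no predecessor, so the result is one element shorter
    than the input.
    """
    return [
        (today - previous) / previous * 100
        for previous, today in zip(closing_prices, closing_prices[1:])
    ]
```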
Results: Adherences
These charts compare the accumulated value of the net positive tweets with the stock closing price for every studied day, to cope with the inertial effect exhibited by stock markets.
[Chart: Accumulated Net Positive and the Closing price, Microsoft Inc. Left axis: Number of Tweets; right axis: Closing Price; horizontal axis: dates from 17/4 to 22/5. Series: Net Positive Accumulated, Pos. Accumulated, Neg. Accumulated, Closing Price.]
[Chart: Accumulated Net Positive and the Closing price, Google Inc. Same axes and series.]
In both cases the plots follow similar patterns, although the closing price is more volatile, which indicates a good adherence of the classification model to the observed performance of the stock price.
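The accumulated series is a running sum of the daily values. A minimal sketch, assuming the daily net positive counts are available as a chronologically ordered list (the function name is hypothetical):

```python
from itertools import accumulate


def accumulated(daily_values):
    """Running total of a daily series, e.g. net positive tweets per day."""
    return list(accumulate(daily_values))
```

For example, daily net positive values of 7, -2 and 5 give accumulated values of 7, 5 and 10.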
Results: Adherences
Comparison of the observed traded volume and the total number of tweets for each trading day:
[Chart: Total number of tweets and traded volume, Microsoft Inc. Left axis: Number of Tweets; right axis: Traded Volume; horizontal axis: dates from 17/4 to 22/5. Series: Total, Volume.]
[Chart: Total number of tweets and traded volume, Google Inc. Same axes and series.]
This comparison showed the closest fit of all those presented, for both companies. It confirms the expectation that notable events that affect the number of trades in a day affect, in a similar manner, the volume of mentions of that company across social networks.
Results: Regression Analysis
Net positive tweets as the independent explanatory variable for the daily change, in percent, of the stock closing price (Minitab reports "Regression for Price Change (%) vs Net Positive"; Y: Price Change (%), X: Net Positive).

Microsoft Inc.
- Fitted line plot for the cubic model: Y = ,191 - ,99 X + ,14 X**2 - , X**3
- P = ,64: the relationship between Price Change (%) and Net Positive is not statistically significant (p > 0,05).
- R-sq (adj) = 15,83%: 15,83% of the variation in Price Change (%) can be accounted for by the regression model.

Google Inc.
- Fitted line plot for the linear model: Y = -,4668 + ,81 X
- P = ,1: the relationship between Price Change (%) and Net Positive is statistically significant (p < 0,05).
- R-sq (adj) = 32,37%: 32,37% of the variation in Price Change (%) can be accounted for by the regression model.
- The positive correlation (r = 0,59) indicates that when Net Positive increases, Price Change (%) also tends to increase.

If the model fits the data well, the fitted equation can be used to predict Price Change (%) for a value of Net Positive. A statistically significant relationship does not imply that X causes Y.

[Diagnostic reports: Residuals vs Fitted Values plots for Microsoft Inc. and Google Inc.]
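The correlation coefficient r quoted in the report is the Pearson correlation between the two daily series. The sketch below shows the standard formula in plain Python; the function name and the list inputs are assumptions for this example, and the thesis itself obtained the value from Minitab.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)
```

For a simple linear regression, the unadjusted R-squared equals the square of r, so r = 0,59 corresponds to roughly 35% of the variation before the adjustment for sample size.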
Results: Regression Analysis
Accumulated net positive value versus the closing price behavior (Minitab reports "Regression for Closing Price vs Net Positive Accumulated"; Y: Closing Price, X: Net Positive Accumulated).

Microsoft Inc.
- Fitted line plot for the quadratic model: Y = 28,81 +,5878 X -,1 X**2
- P = 0,000: the relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).
- R-sq (adj) = 9,72%: 9,72% of the variation in Closing Price can be accounted for by the regression model.

Google Inc.
- Fitted line plot for the cubic model: Y = 77,7 +,6772 X +,25 X**2 -, X**3
- P = 0,000: the relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).
- R-sq (adj) = 97,56%: 97,56% of the variation in Closing Price can be accounted for by the regression model.

If the model fits the data well, the fitted equation can be used to predict Closing Price for a value of Net Positive Accumulated. A statistically significant relationship does not imply that X causes Y.

[Diagnostic reports: Residuals vs Fitted Values plots for Microsoft Inc. and Google Inc.]
Results: Regression Analysis
Traded volume change according to the total number of tweets for a given day (Minitab reports "Regression for Volume vs Total"; Y: Volume, X: Total).

Microsoft Inc.
- Fitted line plot for the linear model: Y = 2932518 + 1766 X
- P = ,1: the relationship between Volume and Total is statistically significant (p < 0,05).
- R-sq (adj) = 31,21%: 31,21% of the variation in Volume can be accounted for by the regression model.
- The positive correlation (r = 0,58) indicates that when Total increases, Volume also tends to increase.

Google Inc.
- Fitted line plot for the linear model: Y = 126474 + 1922 X
- P = 0,000: the relationship between Volume and Total is statistically significant (p < 0,05).
- R-sq (adj) = 59,3%: 59,3% of the variation in Volume can be accounted for by the regression model.
- The positive correlation (r = 0,78) indicates that when Total increases, Volume also tends to increase.

If the model fits the data well, the fitted equation can be used to predict Volume for a value of Total. A statistically significant relationship does not imply that X causes Y.

[Diagnostic reports: Residuals vs Fitted Values plots for Microsoft Inc. and Google Inc.]
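A linear model of the form Y = a + b X and its adjusted R-squared can be reproduced with ordinary least squares. The following is a minimal sketch in plain Python, not the Minitab procedure itself; the function names are hypothetical and the adjustment uses the standard formula for a single predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def adjusted_r_squared(xs, ys, intercept, slope):
    """Adjusted R-squared for a model with one predictor (needs n > 2)."""
    n = len(xs)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_residual / ss_total
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)
```

Here `xs` would hold the total number of tweets per trading day and `ys` the traded volume for the same days.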
Conclusions
- A multi-class sentiment classification model was built with Conditional Random Fields, achieving good performance, especially given the complexity of the financial domain:
  - 81.67% accuracy for Microsoft Inc.
  - 8.8% accuracy for Google Inc.
- Interesting patterns and adherences were revealed between the company-related Twitter stream sentiments and stock values; for the accumulated net positive versus the stock's closing price:
  - 97.56% explanatory capacity for Google Inc.
  - 9.72% explanatory capacity for Microsoft Inc.
- The visible correlations between the company-related sentiments and the stock values also attest to the quality of the classification models built for the companies in the experiment.
Thank you!
Master of Science in Computer Engineering, Graduation Thesis
Student: Ekaterina Shabunina
Supervisor: Prof. Marco Brambilla
Appendix: Conditional Random Fields
Conditional Random Fields (CRFs) are a framework for building probabilistic models to segment and label sequence data.

DEFINITION: Let X be a random variable over data sequences to be labeled and Y a random variable over the corresponding label sequences. Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables $Y_v$ obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v),$$

where $w \sim v$ means that w and v are neighbors in G.

The joint distribution over the label sequence Y given X has the form:

$$p(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big),$$

where x is the data sequence, y is the label sequence, v is a vertex from the vertex set V, e is an edge from the edge set E over V, $f_k$ is a Boolean edge feature, $g_k$ is a Boolean vertex feature, k indexes the features, $\lambda_k$ and $\mu_k$ are the parameters to be estimated, $y|_e$ is the set of components of y defined by edge e, and $y|_v$ is the set of components of y defined by vertex v.

Let $Y_0 = \text{start}$ and $Y_{n+1} = \text{stop}$ be special start and stop states. For each position i in the observation sequence x, define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$, where $\mathcal{Y}$ is the set of labels, by:

$$M_i(y', y \mid x) = \exp\big(\Lambda_i(y', y \mid x)\big),$$
$$\Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, Y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, Y|_{v_i} = y, x),$$

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$.

For conditional distributions, CRFs use the observation-dependent normalization factor $Z(x)$ over all state sequences. It is the (start, stop) entry of the product of these matrices:

$$Z(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\text{start},\,\text{stop}}.$$

Then the conditional probability of a label sequence y is written as:

$$p(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)},$$

where $y_0 = \text{start}$ and $y_{n+1} = \text{stop}$.
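The matrix form above can be illustrated with a small, self-contained sketch. It builds the matrices $M_i(x)$ from a user-supplied log-potential (standing in for the weighted feature sums), computes $Z(x)$ as the (start, stop) entry of their product, and evaluates the conditional probability of a label sequence. Everything here is an assumption made for illustration: the function names, the toy potential, and the choice to set entries for impossible transitions (for example, entering "start" mid-sequence) to zero so that $Z(x)$ sums only over valid label sequences. It is not the CRF++ implementation.

```python
from math import exp

START, STOP = "start", "stop"


def allowed(i, y_prev, y, n):
    """start -> label at i = 1, label -> label inside, label -> stop at i = n + 1."""
    prev_ok = (y_prev == START) if i == 1 else (y_prev not in (START, STOP))
    cur_ok = (y == STOP) if i == n + 1 else (y not in (START, STOP))
    return prev_ok and cur_ok


def matrices(x, labels, log_potential):
    """Return the state list and the matrices M_1 ... M_{n+1} as dicts keyed by (y', y)."""
    states = [START, *labels, STOP]
    n = len(x)
    mats = [
        {
            (a, b): exp(log_potential(i, a, b, x)) if allowed(i, a, b, n) else 0.0
            for a in states
            for b in states
        }
        for i in range(1, n + 2)
    ]
    return states, mats


def partition(states, mats):
    """Z(x): the (start, stop) entry of the product M_1 ... M_{n+1}."""
    # Propagate the row vector that selects "start" through every matrix.
    alpha = {s: 1.0 if s == START else 0.0 for s in states}
    for m in mats:
        alpha = {b: sum(alpha[a] * m[(a, b)] for a in states) for b in states}
    return alpha[STOP]


def conditional_probability(y, states, mats):
    """p(y | x): product of the matrix entries along the path, divided by Z(x)."""
    path = [START, *y, STOP]
    numerator = 1.0
    for m, a, b in zip(mats, path, path[1:]):
        numerator *= m[(a, b)]
    return numerator / partition(states, mats)


def toy_potential(i, y_prev, y, x):
    """Hypothetical weighted feature sum: one vertex feature, one edge feature."""
    score = 0.0
    if i <= len(x) and x[i - 1] == "up" and y == "positive":
        score += 1.0  # vertex feature: the word "up" labeled positive
    if y_prev == y:
        score += 0.5  # edge feature: the label repeats
    return score
```

Summing `conditional_probability` over every possible label sequence of the right length gives 1, which is a quick sanity check that the normalization is consistent.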
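Appendix: Regression Analysis
Regression analysis is a statistical process for estimating the relationships among variables: a study that seeks to provide an equation relating two (or more) variables, in the following form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon,$$

where $x_1, x_2, \dots, x_k$ are called factors (or independent variables) and $\varepsilon$ is called the error.

To determine whether the regression is significant, the ANOVA (analysis of variance) methodology is applied to the linear regression, starting from the set of hypotheses:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0, \qquad H_1: \beta_j \neq 0 \text{ for at least one } j.$$

With n observations $y_i$, fitted values $\hat{y}_i$ and mean $\bar{y}$, the total variation is:

$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$

the residual variation is:

$$SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

and finally the variation explained by the regression model is:

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = SS_T - SS_E.$$

From these definitions, it becomes possible to calculate the observed F value (based on the F-Snedecor distribution) as:

$$F_0 = \frac{SS_R / k}{SS_E / (n - k - 1)},$$

which should be compared to the critical F value $F_{\alpha,\, k,\, n-k-1}$, where $\alpha$ is the chance of misinterpretation (1 minus the desired confidence level). If $F_0 > F_{\alpha,\, k,\, n-k-1}$, $H_0$ should be rejected, which implies that the linear regression is statistically significant.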
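For a single explanatory variable, the ANOVA quantities and the observed F value can be computed as in the sketch below. It uses the standard sums-of-squares decomposition for simple linear regression (k = 1); the function name and the returned dictionary keys are assumptions made for this example. The critical value is not computed here and would be taken from an F-distribution table or a statistics package.

```python
def regression_anova(xs, ys):
    """Sums of squares and observed F value for the fit y = b0 + b1 * x."""
    n = len(xs)
    k = 1  # number of explanatory variables in a simple linear regression
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]

    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)

    f_observed = (ss_regression / k) / (ss_residual / (n - k - 1))
    return {
        "SS_T": ss_total,
        "SS_E": ss_residual,
        "SS_R": ss_regression,
        "F_0": f_observed,
    }
```

The null hypothesis is rejected when `F_0` exceeds the critical value with k and n - k - 1 degrees of freedom at the chosen significance level.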