A CRF-based approach to find stock price correlation with company-related Twitter sentiment
1 POLITECNICO DI MILANO
Scuola di Ingegneria dell'Informazione
POLO TERRITORIALE DI COMO
Master of Science in Computer Engineering
A CRF-based approach to find stock price correlation with company-related Twitter sentiment
Master Graduation Thesis by: Ekaterina Shabunina
Supervisor: Prof. Marco Brambilla
Academic Year 2012/13
2 This is an example of how powerful a Twitter post can be: on 23 April 2013 at 1:07 pm, the hacked Twitter account of the Associated Press posted a false tweet saying "Two explosions in the White House and Barack Obama is injured", causing a flash crash on the stock market as auto-trading computer systems running on autopilot sold $134 billion worth of stocks.
Shabunina Ekaterina - Politecnico di Milano - Como Campus - 24/07/2013
3 Twitter Background
- 554,750,000 registered users, of which 288 million are monthly active, posting on average 5 million tweets a day at an estimated rate of 9,100 tweets per second;
- User profiles are public by default;
- Users come from all over the world, with varied distributions of age, nationality, household income, profession, and hobbies;
- Cashtags: clickable ticker symbols with a dollar sign prefix (for example, $goog), which take a user to the search results about the company's finances and stock.
Sentiment Analysis (multi-class)
- Determining the attitude expressed in a tweet with respect to the company under study (in the context of this thesis).
- Hard on Twitter micro-blogs, which are only 140 characters long.
- Even harder in the financial domain, which employs a very specific jargon and slang, with particular abbreviations and symbols, and in which many words carry different meanings and are associated with distinct emotions.
Stock Markets
Players in financial stock markets perform two categories of analysis to determine whether to buy or sell a given security:
- Technical (an attempt to apply mathematical models);
- Fundamental (based on the study of the value of a company according to its capacity to generate cash in the future).
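The cashtag convention described on this slide can be illustrated with a small filter. This is a minimal sketch, not the thesis's crawler: the regular expression (a dollar sign followed by one to six letters) and the function name `extract_cashtags` are assumptions made for illustration.

```python
import re

# Assumed pattern: "$" followed by 1-6 letters, not preceded by a word character.
# Amounts such as "$100" are skipped because digits are not accepted after "$".
CASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_])\$([A-Za-z]{1,6})\b")


def extract_cashtags(tweet: str) -> list[str]:
    """Return the upper-cased ticker symbols mentioned as cashtags in a tweet."""
    return [match.group(1).upper() for match in CASHTAG_RE.finditer(tweet)]


if __name__ == "__main__":
    print(extract_cashtags("Bought $GOOG and $msft today, paid $100"))
```

A filter of this kind could be used to keep only tweets that mention the company under study before labeling.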
4 Tools & Methods
[Figure: Architectural overview of the proposed approach. Stages shown: Data Pre-Processing (crawling Twitter with the Twitter Search API, filtering); Data Processing (manual labeling, POS tagging); Training the model (templates, CRF++ tool); Twitter data labeling; Regression analysis (stock market data, Minitab 16 tool).]
5 Results: Classifier's Performance
Templates description:
- Simple: current word and its POS tag;
- Previous: previous and current words and their POS tags;
- Prev+Next: previous, current and next words and their POS tags;
- Prevprev+Nextnext: two previous words, current word and two next words, and their POS tags;
- Word_combinations: the Prevprev+Nextnext template features plus the combinations word before previous / previous word, previous / current, current / next, and next / next after next.
[Figure: Classifier's accuracy, averaged over 10 folds, for Microsoft Inc. and Google Inc. with various templates.]
For both companies, Microsoft Inc. and Google Inc., the best performance was obtained with the Word_combinations template. It was therefore chosen for labeling the following one-month period, from 25 April until 24 May 2013, to produce the daily volume of Twitter sentiments needed for the next task: finding its correlation with the stock values of these companies.
[Figure: Effect of the training parameters on the classifier's performance, on the Google Inc. dataset.]
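The five templates listed on this slide differ only in how much context around the current token they expose. The sketch below illustrates this in Python. It is not CRF++ template syntax and not the thesis's code: the feature names, the `_PAD_` token and the Penn Treebank style tags in the usage example are assumptions.

```python
# Context offsets (relative to the current token) exposed by each template.
WINDOWS = {
    "Simple": (0,),
    "Previous": (-1, 0),
    "Prev+Next": (-1, 0, 1),
    "Prevprev+Nextnext": (-2, -1, 0, 1, 2),
    "Word_combinations": (-2, -1, 0, 1, 2),
}

# Word pairs added on top of the window by the Word_combinations template.
BIGRAMS = ((-2, -1), (-1, 0), (0, 1), (1, 2))


def at(seq, i, pad="_PAD_"):
    """Return seq[i], or a padding token when i falls outside the tweet."""
    return seq[i] if 0 <= i < len(seq) else pad


def token_features(words, tags, i, template):
    """Build the feature dictionary for token i under the given template."""
    feats = {}
    for off in WINDOWS[template]:
        feats[f"w[{off}]"] = at(words, i + off)
        feats[f"pos[{off}]"] = at(tags, i + off)
    if template == "Word_combinations":
        for a, b in BIGRAMS:
            feats[f"w[{a}]/w[{b}]"] = at(words, i + a) + "/" + at(words, i + b)
    return feats


if __name__ == "__main__":
    words = ["$goog", "beats", "estimates"]
    tags = ["NNP", "VBZ", "NNS"]  # hypothetical POS tags
    print(token_features(words, tags, 1, "Word_combinations"))
```

Under these assumptions, Word_combinations yields 14 features per token (5 words, 5 POS tags, 4 word pairs), against 2 for Simple.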
6 Results: Classifier's Performance
Performance measures of the resulting classification models for each company, selected out of the 10 folds.
[Table: Classification models performance for Microsoft Inc.]
[Table: Classification models performance for Google Inc.]
7 Results: Adherences
The initial results are summarized in the charts below for Microsoft Inc. and Google Inc.
[Figure: Sentiments and Closing price, Microsoft Inc. Axes: Number of Tweets, Closing Price (USD). Series: Total, Positive, Negative, Neutral, Closing Price.]
[Figure: Sentiments and Closing price, Google Inc. Axes: Number of Tweets, Closing Price (USD). Series: Total, Positive, Negative, Neutral, Closing Price.]
8 Results: Adherences
These two charts plot the variation of the stock price (relative to the closing price of the previous day), the number of positive-classified tweets, the number of negative-classified tweets, and the net number of positive tweets, i.e. the number of positive-classified tweets minus the number of negative-classified tweets.
[Figure: Sentiments, Net Positive and Price change, Microsoft Inc. Axes: Number of Tweets, Closing Price Variation (%); weekly date ticks from 17/4 to 22/5. Series: Net Positive, Positive, Negative, Price Change (%).]
[Figure: Sentiments, Net Positive and Price change, Google Inc. Axes: Number of Tweets, Closing Price Variation (%); weekly date ticks from 17/4 to 22/5. Series: Net Positive, Positive, Negative, Price Change (%).]
The net positive value visibly adheres to the daily stock performance in the Google Inc. case, possibly owing to its larger sample size.
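The two derived series on this slide can be written down directly from their definitions. The sketch below is illustrative only: the function names are hypothetical and the numbers in the usage example are made up, not taken from the thesis data.

```python
def net_positive(positive, negative):
    """Daily net positive: positive-classified minus negative-classified tweets."""
    return [p - n for p, n in zip(positive, negative)]


def price_change_pct(closes):
    """Daily price change in percent, relative to the previous day's close.

    The first day has no previous close, so the result is one element shorter.
    """
    return [(today - prev) / prev * 100.0 for prev, today in zip(closes, closes[1:])]


if __name__ == "__main__":
    # Made-up counts and prices, for illustration only.
    print(net_positive([10, 4, 9], [3, 6, 2]))
    print(price_change_pct([100.0, 102.0, 99.96]))
```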
9 Results: Adherences
These charts compare the accumulated value of the net positive tweets with the stock closing price for every studied day, to cope with the inertial effect exhibited by stock markets.
[Figure: Accumulated Net Positive and the Closing price, Microsoft Inc. Axes: Number of Tweets, Closing Price (USD). Series: Net Positive Accumulated, Pos. Accumulated, Neg. Accumulated, Closing Price.]
[Figure: Accumulated Net Positive and the Closing price, Google Inc. Axes: Number of Tweets, Closing Price (USD). Series: Net Positive Accumulated, Pos. Accumulated, Neg. Accumulated, Closing Price.]
In both cases the plots follow similar patterns, even though the closing price shows higher volatility, which denotes a good adherence of the classification model to the observed performance of the stock price.
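The accumulated series is a running sum of the daily values. A minimal sketch using the Python standard library follows; the function name and the example values are made up for illustration.

```python
from itertools import accumulate


def accumulated(daily_values):
    """Running total of a daily series, e.g. the daily net positive counts."""
    return list(accumulate(daily_values))


if __name__ == "__main__":
    # Made-up daily net positive values, for illustration only.
    print(accumulated([7, -2, 5]))
```

Accumulating smooths out day-to-day swings, which is why the slide compares it with the closing price level rather than with the daily price change.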
10 Results: Adherences
The comparison of the observed traded volume and the total number of tweets for each trading day:
[Figure: Total number of tweets and traded volume, Microsoft Inc. Axes: Number of Tweets, Traded Volume; weekly date ticks from 17/4 to 22/5. Series: Total, Volume.]
[Figure: Total number of tweets and traded volume, Google Inc. Axes: Number of Tweets, Traded Volume; weekly date ticks from 17/4 to 22/5. Series: Total, Volume.]
This comparison showed the closest fit among all those presented, for both companies. It supports the expectation that notable events that affect the number of trades in a day affect in a similar manner the volume of mentions of that company across social networks.
11 Results: Regression Analysis
Net positive tweets as the independent explanatory variable for the daily change, in percent, of the stock closing price.
(Minitab reports: Regression for Price Change (%) vs Net Positive; Y: Price Change (%), X: Net Positive.)
Microsoft Inc.
- Fitted line plot for cubic model: Y = ,191 - ,99 X + ,14 X**2 - , X**3
- P = ,64: the relationship between Price Change (%) and Net Positive is not statistically significant (p > 0,05).
- R-sq (adj) = 15,83%: 15,83% of the variation in Price Change (%) can be accounted for by the regression model.
Google Inc.
- Fitted line plot for linear model: Y = -,4668 + ,81 X
- P = ,1: the relationship between Price Change (%) and Net Positive is statistically significant (p < 0,05).
- R-sq (adj) = 32,37%: 32,37% of the variation in Price Change (%) can be accounted for by the regression model.
- The positive correlation (r = 0,59) indicates that when Net Positive increases, Price Change (%) also tends to increase.
[Figure: Residuals vs Fitted Values diagnostic plots for both companies.]
A statistically significant relationship does not imply that X causes Y.
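For the linear model, the quantities reported on this slide (fitted line, correlation r, adjusted R-sq) can be computed from first principles. The sketch below uses ordinary least squares for one explanatory variable. It is an assumption-labeled illustration, not a reproduction of Minitab 16's procedure, and the function names are hypothetical.

```python
from math import sqrt


def linear_fit(x, y):
    """Least-squares line y = intercept + slope * x, plus Pearson's r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


def adjusted_r2(r, n, k=1):
    """Adjusted R-squared for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r * r) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    # Made-up data lying exactly on y = 1 + 2x, for illustration only.
    intercept, slope, r = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
    print(intercept, slope, r, adjusted_r2(r, n=4))
```

For a single explanatory variable, R-sq equals r squared, so the adjusted value is always slightly below r squared unless the fit is perfect.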
12 Results: Regression Analysis
Accumulated net positive value versus the closing price behavior.
(Minitab reports: Regression for Closing Price vs Net Positive Accumulated; Y: Closing Price, X: Net Positive Accumulated.)
Microsoft Inc.
- Fitted line plot for quadratic model: Y = 28,81 + ,5878 X - ,1 X**2
- The relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).
- R-sq (adj) = 9,72%: 9,72% of the variation in Closing Price can be accounted for by the regression model.
Google Inc.
- Fitted line plot for cubic model: Y = 77,7 + ,6772 X + ,25 X**2 - , X**3
- The relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05).
- R-sq (adj) = 97,56%: 97,56% of the variation in Closing Price can be accounted for by the regression model.
[Figure: Residuals vs Fitted Values diagnostic plots for both companies.]
A statistically significant relationship does not imply that X causes Y.
13 Results: Regression Analysis
Traded volume change according to the total number of tweets for a given day.
(Minitab reports: Regression for Volume vs Total; Y: Volume, X: Total.)
Microsoft Inc.
- Fitted line plot for linear model.
- P = ,1: the relationship between Volume and Total is statistically significant (p < 0,05).
- R-sq (adj) = 31,21%: 31,21% of the variation in Volume can be accounted for by the regression model.
- The positive correlation (r = 0,58) indicates that when Total increases, Volume also tends to increase.
Google Inc.
- Fitted line plot for linear model.
- The relationship between Volume and Total is statistically significant (p < 0,05).
- R-sq (adj) = 59,3%: 59,3% of the variation in Volume can be accounted for by the regression model.
- The positive correlation (r = 0,78) indicates that when Total increases, Volume also tends to increase.
[Figure: Residuals vs Fitted Values diagnostic plots for both companies.]
A statistically significant relationship does not imply that X causes Y.
14 Conclusions
- A multi-class sentiment classification model was built with Conditional Random Fields, achieving good performance, especially given the complex financial domain: 81.67% accuracy for Microsoft Inc.; 8.8% accuracy for Google Inc.
- Interesting patterns and adherences were revealed between the company-related Twitter stream sentiments and the stock values. For the accumulated net positive versus the stock's closing price: 97.56% explanatory capacity for Google Inc.; 9.72% explanatory capacity for Microsoft Inc.
- The visible correlations between the company-related sentiments and the stock values also attest to the quality of the classification models built for the companies in the experiment.
15 Thank you!
Master of Science in Computer Engineering, Graduation Thesis
Student: Ekaterina Shabunina
Supervisor: Prof. Marco Brambilla
16 Appendix: Conditional Random Fields
A framework for building probabilistic models to segment and label sequence data.
DEFINITION: Let X be a random variable over data sequences to be labeled and Y a random variable over the corresponding label sequences. Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.
The joint distribution over the label sequence Y given X is expressed in terms of the following quantities: x is the data sequence, y is the label sequence, v is a vertex from the vertex set V, e is an edge from the edge set E over V, f_k is a Boolean vertex feature, g_k is a Boolean edge feature, k is the feature index, λ_k and µ_k are the parameters to be estimated, y|_e is the set of components of y defined by edge e, and y|_v is the set of components of y defined by vertex v.
Let Y_0 = start and Y_{n+1} = stop be special start and stop states. For each position i in the observation sequence x, a |Y| × |Y| matrix random variable M_i(x) = [M_i(y', y | x)] is defined, where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i.
For the conditional distributions, CRFs use the observation-dependent normalization factor Z(x) over all state sequences, which is the (start, stop) entry of the product of these matrices. The conditional probability of a label sequence y is then written in terms of the matrices M_i(x), with y_0 = start and y_{n+1} = stop.
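The formulas on this slide were images and are not preserved in the transcription. As a hedged reconstruction, the block below gives the standard linear-chain form from Lafferty, McCallum and Pereira (2001), rewritten with this slide's naming (f_k for vertex features, g_k for edge features). Pairing λ_k with f_k and µ_k with g_k is an assumption, since the slide text does not state which parameter weights which feature.

```latex
% Distribution of the label sequence given the data, up to normalization
\[
p_\theta(y \mid x) \propto \exp\Big(
    \sum_{v \in V,\, k} \lambda_k\, f_k(v, y|_v, x)
  + \sum_{e \in E,\, k} \mu_k\, g_k(e, y|_e, x) \Big)
\]

% Entries of the matrix M_i(x) at position i
\[
M_i(y', y \mid x) = \exp\Big(
    \sum_k \lambda_k\, f_k(v_i, Y|_{v_i} = y, x)
  + \sum_k \mu_k\, g_k(e_i, Y|_{e_i} = (y', y), x) \Big)
\]

% Normalization factor: (start, stop) entry of the matrix product
\[
Z_\theta(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\mathrm{start},\,\mathrm{stop}}
\]

% Conditional probability of a label sequence, with y_0 = start, y_{n+1} = stop
\[
p_\theta(y \mid x) =
  \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}
       {\Big( \prod_{i=1}^{n+1} M_i(x) \Big)_{\mathrm{start},\,\mathrm{stop}}}
\]
```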
17 Appendix: Regression Analysis
Regression analysis is a statistical process for estimating the relationships among variables: a study that seeks to provide an equation relating two (or more) variables, of the form y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε, where x1, x2, ..., xk are called factors (or independent variables) and ε is called the error.
To determine whether the regression is significant, the ANOVA (analysis of variance) methodology is applied to the linear regression. Starting from a set of assumptions, the total variance, the residual variance and, finally, the regression model variance are defined. From these definitions it becomes possible to calculate the observed F-value (based on the Fisher-Snedecor F distribution), which should be compared to the critical F value, where α is the chance of misinterpretation (1 minus the desired confidence level). If the observed F-value exceeds the critical one, the null hypothesis should be rejected, which implies that the linear regression is statistically significant.
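The variance decomposition described on this slide can be sketched for the simplest case, a linear regression with one factor. This is an illustrative assumption-based example: the slide's own formulas are not preserved, so the standard sums of squares (total, residual, regression) are used, and the function name and data are hypothetical.

```python
def regression_anova(x, y):
    """ANOVA table entries for a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]

    sst = sum((b - mean_y) ** 2 for b in y)             # total sum of squares
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))  # residual sum of squares
    ssr = sst - sse                                     # regression sum of squares

    df_regression, df_residual = 1, n - 2
    f_value = (ssr / df_regression) / (sse / df_residual)
    return {"SST": sst, "SSR": ssr, "SSE": sse, "F": f_value}


if __name__ == "__main__":
    # Made-up data, for illustration only.
    print(regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```

The observed F returned here would then be compared with the critical value of the F distribution for (1, n - 2) degrees of freedom at the chosen α, taken from a table or a statistics library.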
More informationChapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
More informationStock Market Forecasting Using Machine Learning Algorithms
Stock Market Forecasting Using Machine Learning Algorithms Shunrong Shen, Haomiao Jiang Department of Electrical Engineering Stanford University {conank,hjiang36}@stanford.edu Tongda Zhang Department of
More informationA Primer on Forecasting Business Performance
A Primer on Forecasting Business Performance There are two common approaches to forecasting: qualitative and quantitative. Qualitative forecasting methods are important when historical data is not available.
More informationUnit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
More informationHomework 8 Solutions
Math 17, Section 2 Spring 2011 Homework 8 Solutions Assignment Chapter 7: 7.36, 7.40 Chapter 8: 8.14, 8.16, 8.28, 8.36 (a-d), 8.38, 8.62 Chapter 9: 9.4, 9.14 Chapter 7 7.36] a) A scatterplot is given below.
More informationTwitter sentiment vs. Stock price!
Twitter sentiment vs. Stock price! Background! On April 24 th 2013, the Twitter account belonging to Associated Press was hacked. Fake posts about the Whitehouse being bombed and the President being injured
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationAnswer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade
Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements
More informationDetecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo
Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable
More informationThe Viability of StockTwits and Google Trends to Predict the Stock Market. By Chris Loughlin and Erik Harnisch
The Viability of StockTwits and Google Trends to Predict the Stock Market By Chris Loughlin and Erik Harnisch Spring 2013 Introduction Investors are always looking to gain an edge on the rest of the market.
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationMath 1314 Lesson 8 Business Applications: Break Even Analysis, Equilibrium Quantity/Price
Math 1314 Lesson 8 Business Applications: Break Even Analysis, Equilibrium Quantity/Price Three functions of importance in business are cost functions, revenue functions and profit functions. Cost functions
More informationAnalysis of Variance. MINITAB User s Guide 2 3-1
3 Analysis of Variance Analysis of Variance Overview, 3-2 One-Way Analysis of Variance, 3-5 Two-Way Analysis of Variance, 3-11 Analysis of Means, 3-13 Overview of Balanced ANOVA and GLM, 3-18 Balanced
More informationA Quantitative Approach to Commercial Damages. Applying Statistics to the Measurement of Lost Profits + Website
Brochure More information from http://www.researchandmarkets.com/reports/2212877/ A Quantitative Approach to Commercial Damages. Applying Statistics to the Measurement of Lost Profits + Website Description:
More informationE6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationSolution Let us regress percentage of games versus total payroll.
Assignment 3, MATH 2560, Due November 16th Question 1: all graphs and calculations have to be done using the computer The following table gives the 1999 payroll (rounded to the nearest million dolars)
More informationAP Physics 1 and 2 Lab Investigations
AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks
More informationSimple Regression Theory II 2010 Samuel L. Baker
SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the
More informationOUTLIER ANALYSIS. Data Mining 1
OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,
More informationMBA 611 STATISTICS AND QUANTITATIVE METHODS
MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain
More informationSocial Market Analytics, Inc.
S-Factors : Definition, Use, and Significance Social Market Analytics, Inc. Harness the Power of Social Media Intelligence January 2014 P a g e 2 Introduction Social Market Analytics, Inc., (SMA) produces
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationPoint Biserial Correlation Tests
Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable
More information2013 MBA Jump Start Program. Statistics Module Part 3
2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just
More information17. SIMPLE LINEAR REGRESSION II
17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.
More informationFINANCIAL ENGINEERING CLUB TRADING 201
FINANCIAL ENGINEERING CLUB TRADING 201 STOCK PRICING It s all about volatility Volatility is the measure of how much a stock moves The implied volatility (IV) of a stock represents a 1 standard deviation
More informationChapter 4 and 5 solutions
Chapter 4 and 5 solutions 4.4. Three different washing solutions are being compared to study their effectiveness in retarding bacteria growth in five gallon milk containers. The analysis is done in a laboratory,
More informationPart II Management Accounting Decision-Making Tools
Part II Management Accounting Decision-Making Tools Chapter 7 Chapter 8 Chapter 9 Cost-Volume-Profit Analysis Comprehensive Business Budgeting Incremental Analysis and Decision-making Costs Chapter 10
More informationGamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
More informationMASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
More informationBusiness Valuation Review
Business Valuation Review Regression Analysis in Valuation Engagements By: George B. Hawkins, ASA, CFA Introduction Business valuation is as much as art as it is science. Sage advice, however, quantitative
More informationHOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14
HOW TO USE MINITAB: DESIGN OF EXPERIMENTS 1 Noelle M. Richard 08/27/14 CONTENTS 1. Terminology 2. Factorial Designs When to Use? (preliminary experiments) Full Factorial Design General Full Factorial Design
More informationSentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies
Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts
More informationPremaster Statistics Tutorial 4 Full solutions
Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for
More informationIntroduction. example of a AA curve appears at the end of this presentation.
1 Introduction The High Quality Market (HQM) Corporate Bond Yield Curve for the Pension Protection Act (PPA) uses a methodology developed at Treasury to construct yield curves from extended regressions
More informationData Mining and Visualization
Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research
More information1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number
1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression
More informationData Mining on Social Networks. Dionysios Sotiropoulos Ph.D.
Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital
More information430 Statistics and Financial Mathematics for Business
Prescription: 430 Statistics and Financial Mathematics for Business Elective prescription Level 4 Credit 20 Version 2 Aim Students will be able to summarise, analyse, interpret and present data, make predictions
More informationTeaching Business Statistics through Problem Solving
Teaching Business Statistics through Problem Solving David M. Levine, Baruch College, CUNY with David F. Stephan, Two Bridges Instructional Technology CONTACT: davidlevine@davidlevinestatistics.com Typical
More informationKeywords social media, internet, data, sentiment analysis, opinion mining, business
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction
More informationPolynomial Neural Network Discovery Client User Guide
Polynomial Neural Network Discovery Client User Guide Version 1.3 Table of contents Table of contents...2 1. Introduction...3 1.1 Overview...3 1.2 PNN algorithm principles...3 1.3 Additional criteria...3
More informationSouth Carolina College- and Career-Ready (SCCCR) Probability and Statistics
South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)
More informationJava Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
More informationGetting Started with Minitab 17
2014 by Minitab Inc. All rights reserved. Minitab, Quality. Analysis. Results. and the Minitab logo are registered trademarks of Minitab, Inc., in the United States and other countries. Additional trademarks
More informationAn Introduction to Data Mining
An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
More informationCS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis
CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction
More informationUsing JMP Version 4 for Time Series Analysis Bill Gjertsen, SAS, Cary, NC
Using JMP Version 4 for Time Series Analysis Bill Gjertsen, SAS, Cary, NC Abstract Three examples of time series will be illustrated. One is the classical airline passenger demand data with definite seasonal
More informationThis unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
More informationMULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
Final Exam Review MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) A researcher for an airline interviews all of the passengers on five randomly
More informationTutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
More information4. Multiple Regression in Practice
30 Multiple Regression in Practice 4. Multiple Regression in Practice The preceding chapters have helped define the broad principles on which regression analysis is based. What features one should look
More informationSection 14 Simple Linear Regression: Introduction to Least Squares Regression
Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship
More informationData Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition
Brochure More information from http://www.researchandmarkets.com/reports/2170926/ Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd
More informationHow to Win the Stock Market Game
How to Win the Stock Market Game 1 Developing Short-Term Stock Trading Strategies by Vladimir Daragan PART 1 Table of Contents 1. Introduction 2. Comparison of trading strategies 3. Return per trade 4.
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationA Martingale System Theorem for Stock Investments
A Martingale System Theorem for Stock Investments Robert J. Vanderbei April 26, 1999 DIMACS New Market Models Workshop 1 Beginning Middle End Controversial Remarks Outline DIMACS New Market Models Workshop
More informationScatter Plots with Error Bars
Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each
More informationCONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19
PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations
More information