A CRF-based approach to find stock price correlation with company-related Twitter sentiment




POLITECNICO DI MILANO
Scuola di Ingegneria dell'Informazione
POLO TERRITORIALE DI COMO
Master of Science in Computer Engineering

A CRF-based approach to find stock price correlation with company-related Twitter sentiment

Master Graduation Thesis by: Ekaterina Shabunina
Supervisor: Prof. Marco Brambilla
Academic Year 2012/13

An example of how powerful a Twitter post can be: on 23 April 2013 at 1:07 pm, the hacked Twitter account of the Associated Press posted a false tweet saying "Two explosions in the White House and Barack Obama is injured", causing a flash crash on the stock market as automated trading systems sold $134 billion worth of stocks.

Shabunina Ekaterina - Politecnico di Milano - Como Campus - 24/7/2013

Background

Twitter: 554,750,000 registered users, of which 288 million are active monthly, posting on average 5 million tweets a day at an estimated rate of 9,100 tweets per second. Profiles are public by default. Users come from all over the world, with varied distributions of age, nationality, household income, profession, and hobbies.

Cashtags: clickable ticker symbols with a dollar-sign prefix (for example, $goog) that take the user to search results about the company's finances and stock.

Sentiment analysis (multi-class): determining the attitude expressed in a tweet toward the company under study (in the context of this thesis). This is hard over Twitter micro-blogs that are only 140 characters long, and even harder in the financial domain, which employs very specific jargon and slang, with particular abbreviations and symbols, and in which many words carry different meanings and are associated with distinct emotions.

Stock markets: players in financial stock markets perform two categories of analysis to decide whether to buy or sell a given security: technical analysis (an attempt to apply mathematical models) and fundamental analysis (a study of the value of a company based on its capacity to generate cash in the future).
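The cashtag convention described above can be illustrated with a short Python sketch. The regular expression and the function name `extract_cashtags` are assumptions made for this illustration (a dollar sign followed by one to six letters); they are not taken from the thesis or from Twitter's own tokenizer.

```python
import re
from typing import List

# Assumed pattern: "$" followed by 1-6 letters, not glued to a preceding word.
# Real ticker rules are more varied; this is only an illustration.
CASHTAG = re.compile(r"(?<!\w)\$([A-Za-z]{1,6})\b")

def extract_cashtags(tweet: str) -> List[str]:
    """Return the ticker symbols mentioned as cashtags, upper-cased."""
    return [symbol.upper() for symbol in CASHTAG.findall(tweet)]
```

Amounts such as "$100" are ignored because the pattern only accepts letters after the dollar sign.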

Tools & Methods

[Diagram: Architectural overview of the proposed approach.]
Data pre-processing: crawling Twitter with the Twitter Search API; filtering.
Data processing: manual labeling; POS tagging; training the model (templates, CRF++ tool); Twitter data labeling.
Regression analysis: stock market data; Minitab 16 tool.

Results: Classifier's Performance

Templates description:
Simple: the current word and its POS tag;
Previous: the previous and current words and their POS tags;
Prev+Next: the previous, current and next words and their POS tags;
Prevprev+Nextnext: the two previous words, the current word and the two next words, and their POS tags;
Word_combinations: the Prevprev+Nextnext template features plus the word combinations: word before previous / previous, previous / current, current / next, and next / word after next.

[Table: Classifier's accuracy, averaged over 10 folds, for Microsoft Inc. and Google Inc. with various templates.]

For both companies, Microsoft Inc. and Google Inc., the best performance was obtained with the Word_combinations template. It was therefore used to label the following one-month period, from 25 April to 24 May 2013, producing the daily volume of Twitter sentiments needed for the next task: finding its correlation with the stock values of these companies.

[Table: Effect of training parameters on the classifier's performance, Google Inc. dataset.]
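As a rough illustration of what the Word_combinations template covers, the sketch below builds the corresponding feature set for one token position in Python. It is not CRF++ template syntax; the function name `window_features`, the feature keys, and the `<PAD>` marker for positions outside the tweet are assumptions made for this example.

```python
from typing import Dict, List, Tuple

PAD = ("<PAD>", "<PAD>")  # assumed filler for positions outside the tweet

def window_features(tokens: List[Tuple[str, str]], i: int) -> Dict[str, str]:
    """Features for position i, given (word, POS tag) pairs.

    Covers the Prevprev+Nextnext window (offsets -2..+2) plus the four
    adjacent word pairs listed for the Word_combinations template.
    """
    def tok(j: int) -> Tuple[str, str]:
        return tokens[j] if 0 <= j < len(tokens) else PAD

    feats: Dict[str, str] = {}
    for off in range(-2, 3):
        word, pos = tok(i + off)
        feats[f"w[{off}]"] = word
        feats[f"pos[{off}]"] = pos
    # Word combinations: (-2,-1), (-1,0), (0,+1), (+1,+2)
    for a in range(-2, 2):
        feats[f"w[{a}]/w[{a + 1}]"] = f"{tok(i + a)[0]}/{tok(i + a + 1)[0]}"
    return feats
```

Dropping offsets from the loops yields the smaller templates: offset 0 alone corresponds to Simple, offsets -1..0 to Previous, and so on.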

Results: Classifier's Performance

Performance measures of the resulting classification models for each company, selected out of the 10 folds.

[Tables: Classification model performance for Microsoft Inc.; classification model performance for Google Inc.]

Results: Adherences

The initial results are summarized in the charts below for Microsoft Inc. and Google Inc.

[Charts: "Sentiments and Closing price, Microsoft Inc." and "Sentiments and Closing price, Google Inc." Axes: number of tweets and closing price (USD) by date, 17/4 to 22/5. Series: Total, Positive, Negative, Neutral, Closing Price.]

Results: Adherences

These two charts plot the variation of the stock price (compared with the previous day's closing price), the number of positive-classified tweets, the number of negative-classified tweets, and the net number of positive tweets, i.e. the number of positive-classified tweets minus the number of negative-classified tweets.

[Charts: "Sentiments, Net Positive and Price change, Microsoft Inc." and "Sentiments, Net Positive and Price change, Google Inc." Axes: number of tweets and closing price variation (%) by date, 17/4 to 22/5. Series: Net Positive, Positive, Negative, Price Change (%).]

The adherence between the net positive value and the stock's daily performance is visible in the Google Inc. case, possibly owing to its larger sample size.
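The two derived series described on this slide can be sketched in a few lines of Python. The function names are hypothetical and the sketch only restates the definitions given above (net positive = positive minus negative; price change relative to the previous day's close); it does not reproduce the thesis data.

```python
from typing import List

def net_positive(positive: List[int], negative: List[int]) -> List[int]:
    """Daily net positive count: positive-classified minus negative-classified tweets."""
    return [p - n for p, n in zip(positive, negative)]

def price_change_pct(closes: List[float]) -> List[float]:
    """Daily closing-price variation in percent, relative to the previous close.

    The result has one element fewer than `closes`, since the first day
    has no previous close to compare against.
    """
    return [(cur - prev) / prev * 100.0 for prev, cur in zip(closes, closes[1:])]
```

When pairing the two series for regression, the first day's net positive value has to be dropped so that both lists cover the same dates.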

Results: Adherences

These charts compare the accumulated value of the net positive tweets with the stock closing price for every studied day, to account for the inertia exhibited by stock markets.

[Charts: "Accumulated Net Positive and the Closing price, Microsoft Inc." and "Accumulated Net Positive and the Closing price, Google Inc." Axes: number of tweets and closing price (USD) by date, 17/4 to 22/5. Series: Net Positive Accumulated, Pos. Accumulated, Neg. Accumulated, Closing Price.]

In both cases the plots follow similar patterns, even though the closing price is more volatile, denoting good adherence of the classification model to the observed performance of the stock price.
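The accumulated series is a running sum of the daily values. A minimal Python sketch, with the hypothetical helper name `accumulated`:

```python
from itertools import accumulate
from typing import List

def accumulated(daily_values: List[int]) -> List[int]:
    """Running total of a daily series, e.g. the daily net positive counts."""
    return list(accumulate(daily_values))
```

The same helper applies to the positive and negative series plotted alongside the net positive one.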

Results: Adherences

Comparison of the observed traded volume and the total number of tweets for each trading day:

[Charts: "Total number of tweets and traded volume, Microsoft Inc." and "Total number of tweets and traded volume, Google Inc." Axes: number of tweets and traded volume by date, 17/4 to 22/5. Series: Total, Volume.]

Of all the comparisons presented, this one showed the closest fit for both companies. It supports the expectation that notable events affecting the number of trades in a day similarly affect the volume of mentions of that company across social networks.

Results: Regression Analysis

Net positive tweets as the independent explanatory variable for the daily change, in percent, of the stock closing price (Minitab regression reports, Price Change (%) vs Net Positive):

Microsoft Inc.: cubic model, Y = ,191 - ,99 X + ,14 X**2 - , X**3. P = ,64: the relationship between Price Change (%) and Net Positive is not statistically significant (p > 0,05). R-sq (adj) = 15,83%: 15,83% of the variation in Price Change (%) can be accounted for by the regression model.

Google Inc.: linear model, Y = -,4668 + ,81 X. P = ,1: the relationship between Price Change (%) and Net Positive is statistically significant (p < 0,05). R-sq (adj) = 32,37%: 32,37% of the variation in Price Change (%) can be accounted for by the regression model. The positive correlation (r = 0,59) indicates that when Net Positive increases, Price Change (%) also tends to increase.

A statistically significant relationship does not imply that X causes Y.

[Each report also includes a residuals vs fitted values plot and a fitted line plot.]
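The linear-model quantities quoted in these reports (fitted line, correlation r, adjusted R-squared) can be computed with a short pure-Python sketch. The thesis used Minitab 16; the function `simple_linear_fit` below is a hypothetical stand-in covering only the single-predictor linear case, and it does not compute the p-value, which requires the t or F distribution.

```python
from math import sqrt
from typing import Dict, List

def simple_linear_fit(x: List[float], y: List[float]) -> Dict[str, float]:
    """Least-squares line y = intercept + slope * x, with r and adjusted R-squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    # Adjusted R-squared for one predictor: penalizes the small sample size.
    r_sq_adj = 1.0 - (1.0 - r * r) * (n - 1) / (n - 2)
    return {"slope": slope, "intercept": intercept, "r": r, "r_sq_adj": r_sq_adj}
```

Here `x` would be the daily net positive counts and `y` the daily price change in percent; at least three points with non-constant values are needed for the formulas to be defined.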

Results: Regression Analysis

Accumulated net positive value versus the closing price behavior (Minitab regression reports, Closing Price vs Net Positive Accumulated):

Microsoft Inc.: quadratic model, Y = 28,81 + ,5878 X - ,1 X**2. P = 0,000: the relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05). R-sq (adj) = 9,72%: 9,72% of the variation in Closing Price can be accounted for by the regression model.

Google Inc.: cubic model, Y = 77,7 + ,6772 X + ,25 X**2 - , X**3. P = 0,000: the relationship between Closing Price and Net Positive Accumulated is statistically significant (p < 0,05). R-sq (adj) = 97,56%: 97,56% of the variation in Closing Price can be accounted for by the regression model.

A statistically significant relationship does not imply that X causes Y.

[Each report also includes a residuals vs fitted values plot and a fitted line plot.]

Results: Regression Analysis

Traded volume change according to the total number of tweets for a given day (Minitab regression reports, Volume vs Total):

Microsoft Inc.: linear model, Y = 2932518 + 1766 X. P = ,1: the relationship between Volume and Total is statistically significant (p < 0,05). R-sq (adj) = 31,21%: 31,21% of the variation in Volume can be accounted for by the regression model. The positive correlation (r = 0,58) indicates that when Total increases, Volume also tends to increase.

Google Inc.: linear model, Y = 126474 + 1922 X. P = 0,000: the relationship between Volume and Total is statistically significant (p < 0,05). R-sq (adj) = 59,3%: 59,3% of the variation in Volume can be accounted for by the regression model. The positive correlation (r = 0,78) indicates that when Total increases, Volume also tends to increase.

A statistically significant relationship does not imply that X causes Y.

[Each report also includes a residuals vs fitted values plot and a fitted line plot.]

Conclusions

A multi-class sentiment classification model was built with Conditional Random Fields, achieving good performance, especially given the complexity of the financial domain:
81.67% accuracy for Microsoft Inc.
8.8% accuracy for Google Inc.

Interesting patterns and adherences were revealed between the company-related Twitter stream sentiments and stock values. For the accumulated net positive versus the stock's closing price:
97.56% explanatory capacity for Google Inc.
9.72% explanatory capacity for Microsoft Inc.

The visible correlations between the company-related sentiments and the stock values also attest to the quality of the classification models built for the companies in the experiment.

Thank you!

Master of Science in Computer Engineering Graduation Thesis
Student: Ekaterina Shabunina
Supervisor: Prof. Marco Brambilla

Appendix: Conditional Random Fields

Conditional Random Fields (CRFs) are a framework for building probabilistic models to segment and label sequence data.

DEFINITION: Let X be a random variable over data sequences to be labeled and Y a random variable over the corresponding label sequences. Let G = (V, E) be a graph such that $Y = (Y_v)_{v \in V}$, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables $Y_v$ obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v),$$

where $w \sim v$ means that w and v are neighbors in G.

The joint distribution over the label sequence Y given X has the form:

$$p_\theta(y \mid x) \propto \exp\Big( \sum_{v \in V,\, k} \lambda_k f_k(v, y|_v, x) + \sum_{e \in E,\, k} \mu_k g_k(e, y|_e, x) \Big),$$

where x is the data sequence, y is the label sequence, v is a vertex from the vertex set V, e is an edge from the edge set E over V, $f_k$ is a Boolean vertex feature, $g_k$ is a Boolean edge feature, k indexes the features, $\lambda_k$ and $\mu_k$ are parameters to be estimated, $y|_e$ is the set of components of y defined by edge e, and $y|_v$ is the set of components of y defined by vertex v.

Let $Y_0 = \text{start}$ and $Y_{n+1} = \text{stop}$ be special start and stop states. For each position i in the observation sequence x, define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$, with $\mathcal{Y}$ the label alphabet, by:

$$M_i(y', y \mid x) = \exp\Big( \sum_k \lambda_k f_k(v_i, Y|_{v_i} = y, x) + \sum_k \mu_k g_k(e_i, Y|_{e_i} = (y', y), x) \Big),$$

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$.

For conditional distributions, CRFs use the observation-dependent normalization factor Z(x) over all state sequences; it is the (start, stop) entry of the product of these matrices:

$$Z(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\text{start},\,\text{stop}}.$$

The conditional probability of a label sequence y is then written as:

$$p(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)},$$

where $y_0 = \text{start}$ and $y_{n+1} = \text{stop}$.
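The matrix form described on this slide can be checked numerically on a tiny chain. The Python sketch below is an illustration under stated assumptions: the `score` callback stands in for the weighted feature sum inside the exponential, and the matrices are stored sparsely, keeping only the start row for the first position and the stop column for the last, which yields the same (start, stop) entry of the product. All names are hypothetical and unrelated to the CRF++ implementation.

```python
from itertools import product
from math import exp
from typing import Callable, Dict, List, Sequence, Tuple

START, STOP = "start", "stop"
Matrix = Dict[Tuple[str, str], float]

def transition_matrices(x: Sequence[str], labels: List[str],
                        score: Callable[[int, str, str, Sequence[str]], float]) -> List[Matrix]:
    """M_i[(y_prev, y)] = exp(score(i, y_prev, y, x)) for i = 1..n+1."""
    n = len(x)
    mats = []
    for i in range(1, n + 2):
        prevs = [START] if i == 1 else labels
        curs = [STOP] if i == n + 1 else labels
        mats.append({(a, b): exp(score(i, a, b, x)) for a in prevs for b in curs})
    return mats

def partition(mats: List[Matrix]) -> float:
    """Z(x): the (start, stop) entry of the product M_1 ... M_{n+1}."""
    alpha = {START: 1.0}  # row vector selecting the start state
    for m in mats:
        nxt: Dict[str, float] = {}
        for (a, b), value in m.items():
            if a in alpha:
                nxt[b] = nxt.get(b, 0.0) + alpha[a] * value
        alpha = nxt
    return alpha[STOP]

def sequence_probability(y: Sequence[str], mats: List[Matrix]) -> float:
    """p(y | x): product of the matrix entries along the path, divided by Z(x)."""
    path = [START] + list(y) + [STOP]
    numerator = 1.0
    for m, a, b in zip(mats, path, path[1:]):
        numerator *= m[(a, b)]
    return numerator / partition(mats)
```

Summing `sequence_probability` over every possible label sequence of the right length gives 1, which is a quick way to confirm that Z(x) normalizes the distribution.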

Appendix: Regression Analysis

Regression analysis is a statistical process for estimating the relationships among variables: a study that seeks an equation relating two (or more) variables, in the following form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon,$$

where $x_1, x_2, \dots, x_k$ are called factors (or independent variables) and $\varepsilon$ is called the error.

To determine whether the regression is significant, the ANOVA (analysis of variance) methodology is applied to the linear regression, with n observations and k factors. Starting from the set of assumptions (hypotheses):

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0, \qquad H_1: \beta_j \neq 0 \text{ for at least one } j.$$

The total variance:

$$s_T^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

And the residual variance:

$$s_R^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}$$

And finally the regression model variance:

$$s_M^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{k}$$

From these definitions it becomes possible to calculate the observed F value (based on the Fisher-Snedecor F distribution) as:

$$F = \frac{s_M^2}{s_R^2},$$

which should be compared to the critical F value $F_{\alpha;\, k,\, n-k-1}$, where $\alpha$ is the chance of misinterpretation (1 minus the desired confidence level). If $F > F_{\alpha;\, k,\, n-k-1}$, $H_0$ should be rejected, which implies that the linear regression is statistically significant.
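The F test outlined on this slide can be sketched in Python as follows. The function name and argument layout are assumptions made for this illustration; the fitted values are expected to come from a least-squares fit with k factors, and the critical value has to be looked up in an F table or obtained from a statistics package, since the standard library has no F distribution.

```python
from typing import Sequence

def regression_f_statistic(y: Sequence[float], y_hat: Sequence[float], k: int) -> float:
    """Observed F value: regression model variance divided by residual variance.

    y     -- observed responses (n values)
    y_hat -- fitted values from a least-squares model with k factors
    """
    n = len(y)
    mean_y = sum(y) / n
    ss_model = sum((f - mean_y) ** 2 for f in y_hat)
    ss_resid = sum((obs - f) ** 2 for obs, f in zip(y, y_hat))
    var_model = ss_model / k            # k degrees of freedom
    var_resid = ss_resid / (n - k - 1)  # n - k - 1 degrees of freedom
    return var_model / var_resid

def is_significant(f_observed: float, f_critical: float) -> bool:
    """Reject the null hypothesis when the observed F exceeds the critical F."""
    return f_observed > f_critical
```

For a single-factor model, k is 1 and the residual degrees of freedom are n - 2.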