CRF to find stock price correlation with company-related Twitter sentiment


POLITECNICO DI MILANO
Scuola di Ingegneria dell'Informazione
POLO TERRITORIALE DI COMO
Master of Science in Computer Engineering

CRF to find stock price correlation with company-related Twitter sentiment

Supervisor: Prof. Marco Brambilla
Master Graduation Thesis by: Ekaterina Shabunina
Student Id. Number:
Academic Year 2012/13


Abstract

This thesis proposes an innovative approach to sentiment classification that applies the Conditional Random Fields probabilistic model to a company-related Twitter data stream. The approach is evaluated through a statistical correlation analysis, performed with regression methods, against the variation of the companies' securities prices. Interesting correspondences and trends were found between the stock market values and the classified results, which were labeled with the tailored classification model; the model itself showed solid performance indicators.

Sommario (Summary)

This thesis proposes an innovative approach to sentiment classification through the use of the Conditional Random Fields probabilistic model, applied to a data stream extracted from Twitter, with the objective of finding a correlation with the variation of the stock prices of two companies by means of a statistical regression analysis. Interesting results were obtained concerning the correspondences and trends between the sentiments, labeled with the classification model that was built (which has reliable performance measures), and the stock market values.

Contents

Abstract
Sommario
List of tables
List of figures
Chapter 1 Introduction
    Main Findings
    Thesis report structure
Chapter 2 Background
    Sentiment Analysis
    Social Networking
    Twitter functionalities
    Conditional Random Fields
    Brief Overview of Stock Markets
    Brief Overview of Regression Analysis Methods
    Conclusions of the chapter
Chapter 3 Literature review
    From classification to stock prediction with Twitter sentiment
    Classification
    Classification of Twitter micro-blogs
    Classification of sentiments within Twitter messages
    Earlier works of stock prediction with UGC sentiments
    Conclusions of the chapter
Chapter 4 Methods
    Data pre-processing
    Features selection and templates creation
    Conditional random fields tool
    Results evaluation metrics
    Regression analysis tool
    Technical details and Libraries
Chapter 5 Experimental evaluation
    Experimental setup
    Twitter datasets
    Cross-validation
    Sentiment classification
    Stock prices datasets
Chapter 6 Results
    Classifier's Performance
    Discussion of the classifier's performance
    Classification Results and Model Adherence
    Classification Results and Stock Price Behavior
    Regression Analysis
    Discussion of the Results
Chapter 7 Conclusions
    Contributions and conclusions
    Future work
References

List of tables

Table 1   Microsoft Inc. daily Twitter sentiment distribution
Table 2   Google Inc. daily Twitter sentiment distribution
Table 3   Volumes of sentiments for the companies Microsoft and Google in the time period from the 25th of April until the 24th of May
Table 4   Classifier's accuracy, average over 10 folds, for Microsoft Inc. and Google Inc. with various templates
Table 5   Training parameters effect on the classifier's performance, on the Google Inc. dataset
Table 6   Classifier's performance measures for the Microsoft Inc. dataset, average over 10 folds
Table 7   Classifier's performance measures for the Google Inc. dataset, average over 10 folds
Table 8   Classification models performance for Microsoft Inc.
Table 9   Classification models performance for Google Inc.
Table 10  Regression results

List of figures

Figure 1   Dow Jones "flash crash" after the hacked tweet from the AP
Figure 2   Architectural overview of the proposed approach
Figure 3   Example of a post with Twitter line breaks
Figure 4   Minitab 16 main screen
Figure 5   Minitab 16 Regression assistant
Figure 6   Regression analysis setup
Figure 7   Minitab 16 Results window
Figure 8   Minitab 16 Diagnostic report
Figure 9   Residual analysis for Price change and Net positive, Microsoft Inc.
Figure 10  Regression for Net positive and Price change, Microsoft Inc.
Figure 11  Residual analysis for Closing price and Accumulated net positive, Microsoft Inc.
Figure 12  Regression for Closing price and Accumulated net positive, Microsoft Inc.
Figure 13  Residual analysis for Traded volume and Total number of tweets, Microsoft Inc.
Figure 14  Regression for Traded volume and Total number of tweets, Microsoft Inc.
Figure 15  Regression for Traded volume and Total number of tweets (without an outlier of 21/05/2013), Microsoft Inc.
Figure 16  Residual analysis for Price change and Net positive, Google Inc.
Figure 17  Regression for Price change and Net positive, Microsoft Inc.
Figure 18  Residual analysis for Closing price and Accumulated net positive, Google Inc.
Figure 19  Regression for Closing price and Accumulated net positive, Google Inc.
Figure 20  Residual analysis for Closing price and Accumulated net positive, Linear regression, Google Inc.
Figure 21  Linear regression for Closing price and Accumulated net positive, Google Inc.
Figure 22  Residual analysis for Traded volume and Total number of tweets, Google Inc.
Figure 23  Regression for Traded volume and Total number of tweets, Google Inc.

Chapter 1 Introduction

The analysis of sentiments within Twitter is today a highly relevant task in many research directions and across a broad array of applications. It is driven by the steady growth of Twitter's popularity: more and more users of the micro-blogging platform post their feelings and thoughts, reacting to world events in nearly real time, which makes Twitter an unmatched source of data for opinion mining.

This thesis focuses on companies' stock-related chatter, extracting tweets that contain company ticker tags, or cashtags as they are called on Twitter. The aim of the study is to find a correlation between the sentiments in Twitter micro-blogs related to IT sector companies and the movements of those companies' stock values. Such a correlation was assumed likely to exist because investors, traders and most of the other players in the financial markets are human beings, and their choice of actions on the stock market is affected by emotions and events. Moreover, the content of company-related Twitter messages makes it clear that investors and traders also use Twitter as a platform to communicate, to share their knowledge or their actions on the stock market, and even to influence the masses to trade in their favor.

A good example of the powerful relationship between social media and the stock markets took place during the development of the present work: the "flash crash" of the 23rd of April, 2013, in which $134 billion worth of stocks were automatically sold off almost instantly in reaction to a tweet from the hacked Twitter account of the Associated Press, which reported fake news about explosions at the White House and about the USA president, Barack Obama, being injured [47].

The novelty of the present work lies in using the Conditional Random Fields probabilistic model for the task of multiclass sentiment classification over extensive manually labeled datasets, with reasonable performance. This is notable because Twitter micro-blogs are very complex material for sentiment classification, even for a human being, and companies' stock-related tweets belong to a special financial domain that employs very specific jargon and slang, with particular abbreviations and symbols. Moreover, company-related tweets tend to be even shorter than the 140-character limit of a Twitter micro-blog, often stating only the action on the stock market that the investor has performed or intends to perform, such as "short $GOOG", and therefore give the classifier very little information from which to build an accurate model. Nevertheless, the regression analysis performed in search of a correlation between the stock market values and the company-related Twitter micro-blogs, classified with the CRF model into 3 classes (positive, negative and neutral), has revealed interesting patterns and correspondences.

Main Findings

As a scholarly work, this thesis seeks to contribute on several fronts.

First, a large gap was found in the literature concerning the use of the Conditional Random Fields probabilistic model for sentiment classification. The model is applied successfully here, as the Results chapter shows: it reaches a satisfactory accuracy even though only word combinations and Part-of-Speech tags are used as features.

Secondly, interesting correlations were found between the sentiments of the company-related Twitter stream and the stock values using statistical regression analysis, an approach that was also largely absent from the studied literature.

Finally, because Twitter restricts the redistribution of its content, and all the useful applications that could export Twitter datasets have been discontinued, a further contribution of the present work is the pair of 38-day datasets (from the 17th of April, 2013, until the 24th of May, 2013) created for two companies by crawling Twitter weekly with its Search API. For Google Inc., 4,540 tweets out of the 17,214 in the dataset were manually labeled in relation to the stock of the company; for Microsoft Inc., 2,212 tweets out of the 8,070 in the whole dataset were manually labeled.

Thesis report structure

The remainder of the document is organized as follows. Chapter 2 gives a general overview of the methodologies and concepts used in the present thesis. Chapter 3 introduces existing publications in the chosen subject area. Chapter 4 describes the methods and tools used to construct the novel approach to stock prediction with Twitter sentiment. Chapter 5 introduces the experimental setup and evaluation, with a detailed description of the datasets created. Chapter 6 presents and discusses the results of the experiments. Finally, Chapter 7 summarizes the conclusions of this thesis and outlines possible directions for future work.

Chapter 2 Background

This chapter provides the background knowledge needed to support the thesis subject area.

2.1 Sentiment Analysis

With the emergence of the Web 2.0 phenomenon, characterized by a shift from a content-based to a user-centered approach to the web, many new types of Web applications have appeared in recent years. Their common feature is that the user is no longer simply a consumer of content but also actively publishes information, the so-called User Generated Content (UGC). Thanks to the popularity of Web 2.0 websites such as social networks, micro-blogs, blogs and video sharing websites, the amount of User Generated Content is expanding rapidly. It provides a large volume of up-to-date information about user preferences, interests, opinions and views on world events and on the products and services of particular companies. This makes UGC a unique source of data for sentiment analysis: the automatic detection in texts of emotive lexicon and of the author's emotional attitude towards the objects mentioned in the text.

Sentiment analysis is performed using natural language processing methods, statistics, and/or machine learning methods. The machine learning approach, adopted in the present thesis, uses supervised algorithms such as Support Vector Machines, Naïve Bayes and Conditional Random Fields, along with some natural language processing techniques, to determine the tonality of a text. This approach requires a dataset of manually pre-labeled texts to train the classifier. Each document or sentence (depending on the chosen level of sentiment analysis) is converted into a Bag-of-Words, a simplified representation of the document or sentence itself. Each word (unigram) or group of words (n-gram) is then used as a feature to determine the text polarity.
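The Bag-of-Words conversion described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and lower-casing, which is a simplification chosen for illustration and not necessarily the pre-processing used in this thesis; the function names `ngrams` and `bag_of_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count every 1..max_n-gram of a text.

    Assumption: tokens are obtained by lower-casing and splitting on
    whitespace; a real pipeline would use a proper tokenizer.
    """
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


# Each key of the resulting counter is one candidate feature.
features = bag_of_ngrams("short $GOOG today short $GOOG")
```

Word order inside each n-gram is preserved, but the position of the n-gram in the text is discarded, which is what makes the representation a "bag".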
Another common approach to sentiment analysis uses publicly available linguistic resources for the lexical representation of affective information, such as WordNet-Affect [27], SentiWordNet [28] and SenticNet [29], to construct a knowledge-based system. Unlike the machine learning approach, a knowledge-based system does not need a pre-labeled dataset, which is expensive in terms of labor and time.

There are 3 typical levels of sentiment analysis: document, sentence and feature/word level. The most fine-grained and the hardest to perform is the feature level, which aims to state not only the global polarity of a document (document level) but also the author's sentiment towards each specific characteristic of the topic mentioned in the text. Sentence-level sentiment analysis can be divided into two subtasks: identifying whether a sentence is subjective (reporting someone's opinions, feelings or views) or objective (stating factual information about the world), and classifying the subjective sentence to identify its tonality [10]. It is common to use the output of one level as the input for the higher levels [12]. As an example of document classification, in one of the earliest works in this area, Pang et al. [30] classify movie reviews by overall sentiment as positive or negative.

With regard to the vocabulary and grammar used, opinions in subjective sentences can be stated explicitly (e.g. "What a wonderful car!") or implicitly (e.g. "I've just bought a new car but it already broke down") [10] [12]. Moreover, an explicit sentence can report a direct opinion (e.g. "The drink X is good") or a comparison (e.g. "The drink X is better than drink Y") [10] [11].

The most common task in sentiment analysis is polarity classification: identifying whether the opinion expressed in a given text is positive, negative or neutral [8], as for example in the work by Pak et al., 2009 [70]. The simplest polarity classification is the binary classification of sentiment into positive or negative categories, an example of which is the mood tracking tool OpinionFinder, described and used in the work of Bollen et al., 2011 [59].
Another way of modeling human mood is a multi-point scale in which each word is associated with a number from -10 (most negative) to +10 (most positive). A different way of classifying sentiment is discussed in Bollen et al., 2011 [9], where a well-vetted psychometric instrument, the Google-Profile of Mood States (GPOMS), was applied to measure mood in terms of 6 dimensions: Calm, Alert, Sure, Vital, Kind, and Happy.

Just as there are various techniques and scales for sentiment classification, there are also numerous knowledge domains in which sentiment is applied to predict real-world outcomes, such as using movie reviews to forecast box-office revenues (Meador et al., 2009 [69]; Asur et al., 2012 [72]), using soccer game results to predict stock market movements (Edmans et al., 2005 [62]; Chang et al., 2012 [63]), or using a corpus of daily Twitter posts to predict the possibility that the H1N1 (Swine Flu) virus will become a pandemic (Ritterman et al., 2009 [13]).

In conclusion, the extraction of opinions and their sentiment analysis is a challenging task because of the large amount of noise in the data, the domain dependence of language and the lack of annotated resources. Even humans often disagree [19]. Moreover, the shorter the text, the harder it is to analyze.
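The word-level scale from -10 to +10 mentioned in this section can be sketched as a tiny knowledge-based classifier. The lexicon below is invented purely for illustration (it is not taken from WordNet-Affect, SentiWordNet or SenticNet), and summing word scores is only one simple way to aggregate them.

```python
# Hypothetical mini-lexicon: word -> score on the -10..+10 scale.
# These entries and values are made up for this example.
LEXICON = {"wonderful": 8, "good": 5, "better": 4, "broke": -6, "bad": -5}


def polarity(text, lexicon=LEXICON):
    """Classify a text as positive, negative or neutral.

    Words missing from the lexicon contribute a score of 0.
    """
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


explicit = polarity("What a wonderful car!")
implicit = polarity("I've just bought a new car but it already broke down")
terse = polarity("short $GOOG")
```

The last call hints at the domain problem raised in the Introduction: a general-purpose lexicon has no entry for trading jargon such as "short", so a stock-related tweet that a trader would read as negative is scored as neutral.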

Social Networking

"Facebook is now part of most people's web lives, Twitter is where a lot of people are reading the breaking news and if you want to be entertained then just dial into YouTube" [24]. Social media are a popular source for data mining because of their massive scale, real-time updates and coverage of current trends. The statistics below illustrate the ever-growing reach of social networking and its suitability as a data source for the experiment of the present thesis.

According to the latest social research, 67% of all internet users use social networks, of whom 16% are registered on Twitter [21]. The age stratification of social media use has remained the same since 2005: young adults aged 18 to 29 dominate, and usage decreases as age increases, according to the latest statistics from the Pew Research Center and Docstoc [22] [23]. Two main factors drive the social web in 2013, as stated by [24]: the number of people accessing the internet via a mobile phone has increased by 60.3% in the last 2 years, and the fastest growing age bracket on Twitter shows a 79% growth rate. Urban residents are the most active on social media, at 70% of that population [23]. Education level is distributed almost equally: 66% of people with a high school education or less use social media, against 65% of those with a college degree [26].

Twitter [21] is one of the most popular social networking websites, after Facebook [14], in terms of overall registered users. It has 554,750,000 registered users [34] [15], of which 288 million are active monthly [25], a growth in active users of 714% since July 2009 [37]. As of May 2013 it holds 10th place in Alexa's traffic rank [16] [33] and 13th place in Alexa's Top 500 Global Sites on the web [32], and it is ranked the 12th most popular website in the USA, the country that accounts for 25% of Twitter's visitors [20]. It is also second, after Facebook, in The Top Moz 500 Domains in terms of linking root domains and external links [31].

Launched on the 15th of July, 2006, Twitter has since been the fastest growing social network in the world by active users, according to a Global Web Index study [25], with 44% growth from June 2012 to March 2013. On average, 500 million tweets are posted per day (according to Twitter CEO Dick Costolo in October 2012 [35]), at an estimated rate of 9,100 tweets per second [34].

In contrast to Facebook, most Twitter users have a profile that is public by default [36]. This gives any research or analysis access to millions of everyday tweets from all over the world, written by users of varied age, nationality, household income, profession and hobbies, and including other content such as photos, links and videos. "Even with other very successful social media sites, no one is better at conversation than Twitter," says David Berkowitz, vice president of emerging media at 360i [39]. During big events, such as Super Bowl XLVII [39], the London Summer Olympics [38] and the U.S. presidential elections [40], Twitter becomes the platform for millions of people to share quick reactions and participate in a massive, public conversation, proving its reach and utility [39].

Based on the statistics above, Twitter was chosen as an appropriate source of up-to-date, real-time, large-scale information for the stock prediction experiment performed in the present thesis. The next subchapter gives more details on Twitter's functionalities, to provide an insight into the kind of information that was dealt with.

Twitter functionalities

Twitter is a complex network of users with diverse interests and intentions. Its audience varies greatly, from regular users to celebrities, countries' presidents and political figures, company representatives, and even cardinals [48]! Every month hundreds of millions of people post facts about their lives, share their opinions on all possible topics including stock-related ones, express feelings, give feedback on products and services, report news or thoughts about current global events, comment on other people's posts and simply chat with friends using direct messages: the @ symbol followed by the username.

Because posts are restricted to 140 characters and the service can be accessed not only from the official website [21] but also by SMS (short message service) and various external applications, users publish their thoughts and news much more rapidly. This makes the social network real-time, sometimes even faster than news channels, and applicable to many useful purposes.

Researchers are studying the use of Twitter as a source of data for building a flu epidemic tracking algorithm; for studying the health-related behavior of the population; for finding suicidal cases or cases of domestic violence before they happen; and for building a tool to look up rates of diseases in different neighborhoods and restaurants that had cases of food poisoning [50].

Mills et al., 2009 [45] explore the pros and cons of Twitter as a 21st century emergency communication system: according to Twitter.com, Twitter usage noticeably spikes during disasters and other large events. People use Twitter during emergencies at levels far above average, especially during natural disasters such as wildfires, ice storms, earthquakes and hurricanes. In March 2011, a magnitude 8.9 earthquake struck Japan, followed by a tsunami, and Twitter proved to be the easiest and most reliable way of keeping in touch with relatives, as well as of providing emergency numbers, information and news to those in the stricken areas, when all the telephone lines collapsed [43].

Twitter is becoming a primary source of real-time breaking news. First-hand reports, photos and even videos of the Boston Marathon bombing of April 15th, 2013, appeared on Twitter within 10 minutes, while TV and cable news channels carried no news for at least 15 minutes [51]. After arresting the second suspect of the Boston Marathon bombing, the Boston police announced it on Twitter [44]. Seal Team Six killing Osama bin Laden broke on Twitter. The uprisings of the Arab spring were first covered via Twitter. So was the death of former British Prime Minister Margaret Thatcher. More and more, it seems the first word we get about major events comes from the micro-blogging service [51].

Barack Obama, twice elected president of the USA, has used Web 2.0 (especially Twitter and Facebook) as a central platform of his presidential campaigns since 2008 [49]. Celebrities use Twitter to build their fan networks: for example, Justin Bieber, the Canadian pop singer, has more followers (as of the 8th of June, 2013 [53], when his was the most followed account on Twitter) than the entire population of Canada [52].

One of the most recent examples of the power of Twitter posts occurred during the research for the present thesis and is extremely relevant to its subject. On the 23rd of April, 2013, at 1:07 pm, the hacked Twitter account of the Associated Press posted a false tweet, "Two explosions in the White House and Barack Obama is injured", causing a flash crash on the stock market as automated trading systems sold $134 billion worth of stocks [47]. The market quickly recovered, but those two minutes of damage were enough of a warning signal that tweets from these news organizations can make a huge impact around the world [46]. Most important for the research conducted in the present thesis, Twitter posts do have a serious impact on the stock market, as can be seen in Figure 1.

Figure 1 Dow Jones "flash crash" after the hacked tweet from the AP. Source:

Since Twitter is a social network in which people share their thoughts about different topics, it is a perfect source of people's opinions for sentiment analysis. A useful feature of Twitter is that users tend to use hashtags, a word or phrase after a number sign (for example, #google), to specify the topic to which their tweet refers. Especially useful for the present thesis are cashtags: because Twitter has become an increasingly popular platform for exchanging trading ideas and other stock-related information, on the 30th of July, 2012 [41] Twitter introduced clickable ticker symbols with a dollar sign prefix (for example, $goog), which take the user to the search results about the company's finance and stock. Another important advantage of Twitter over other social networks is that it exposes APIs to retrieve and filter tweets.

However, the very properties listed above also make analysis and research on Twitter datasets difficult. Because people use Twitter to report what is going on right now, this immediacy, the "rush to tell the world what's happening" [42], is reflected in a style of writing that is impulsive, rapid, not thought through and full of abbreviations. The statistics above also show that the audience is composed mainly of young adults, so slang and informal language are common. Finally, the 140-character limit forces people to apply tricks such as URL shortening. Together these factors make sentiment classification on Twitter data extremely hard, but apparently worth the effort.
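As an illustration of how cashtags and hashtags can be picked out of a tweet, the following sketch uses two regular expressions. The patterns are simplifying assumptions (for instance, a cashtag is taken to be a dollar sign followed by 1 to 6 letters) and do not claim to reproduce Twitter's own matching rules; `extract_tags` is a hypothetical helper name.

```python
import re

# Assumed patterns, for illustration only:
#   cashtag = "$" + 1..6 letters, not preceded by a word character or "$"
#   hashtag = "#" + word characters, not preceded by a word character or "#"
CASHTAG = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,6})\b")
HASHTAG = re.compile(r"(?<![A-Za-z0-9_#])#(\w+)")


def extract_tags(tweet):
    """Return the cashtags (upper-cased) and hashtags (lower-cased) of a tweet."""
    return {
        "cashtags": [m.upper() for m in CASHTAG.findall(tweet)],
        "hashtags": [m.lower() for m in HASHTAG.findall(tweet)],
    }


tags = extract_tags("Short $GOOG, long $msft before #earnings. Paid $134 for lunch")
```

Requiring letters after the dollar sign keeps plain prices such as "$134" from being mistaken for ticker symbols.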

Conditional Random Fields

The task of classification, the assignment of a class $y \in \mathcal{Y}$ to an observation $x \in \mathcal{X}$, can be approached with probability theory by specifying a probability distribution and selecting the most likely class $y$ for a given observation $x$. The approach chosen for the experiments of the present thesis is Conditional Random Fields (Lafferty et al., 2001 [54]), a framework for building probabilistic models to segment and label sequence data. The name Conditional Random Field denotes the modeling of the labeling, $Y = y$, as a network of inter-dependent random variables (a random field), while conditioning on another set of random variables: the context, $X = x$ [56].

As defined by Lafferty et al., 2001 [54], let $X$ be a random variable over data sequences to be labeled and $Y$ a random variable over the corresponding label sequences, where all components $Y_i$ of $Y$ are assumed to range over a finite label alphabet $\mathcal{Y}$. Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field if, when conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v),$$

where $w \sim v$ means that $w$ and $v$ are neighbors in $G$. The joint distribution over the label sequence $Y$ given $X$ has the form:

$$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big) \qquad (1)$$

where $x$ is the data sequence, $y$ is the label sequence, $v$ is a vertex from the vertex set $V$, $e$ is an edge from the edge set $E$ over $V$, $f_k$ is a Boolean edge feature, $g_k$ is a Boolean vertex feature, $k$ indexes the features, $\lambda_k$ and $\mu_k$ are the parameters to be estimated, $y|_e$ is the set of components of $y$ defined by edge $e$, and $y|_v$ is the set of components of $y$ defined by vertex $v$.

Let $Y_0 = \text{start}$ and $Y_{n+1} = \text{stop}$ be special start and stop states. For each position $i$ in the observation sequence $x$, the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$ is defined by

$$M_i(y', y \mid x) = \exp\big(\Lambda_i(y', y \mid x)\big), \qquad \Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, Y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, Y|_{v_i} = y, x) \qquad (2)$$

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$. CRFs use an observation-dependent normalization factor over all state sequences, $Z_\theta(x)$, for the conditional distributions; it is the (start, stop) entry of the product of these matrices:

$$Z_\theta(x) = \big(M_1(x)\, M_2(x) \cdots M_{n+1}(x)\big)_{\text{start},\,\text{stop}} \qquad (3)$$

The conditional probability of a label sequence $y$ is then written as:

$$p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{\big(\prod_{i=1}^{n+1} M_i(x)\big)_{\text{start},\,\text{stop}}} \qquad (4)$$

where $y_0 = \text{start}$ and $y_{n+1} = \text{stop}$.

CRFs have a structure similar to that of the Conditional Markov Model (CMM) and consequently share the benefits of the CMM over generative models such as Hidden Markov Models (HMMs); but where the CMM uses a directed graph, CRFs use an undirected graph. Lafferty et al., 2001 [54] state that the critical difference between CRFs and Maximum Entropy Markov Models (MEMMs, or CMMs) is that a MEMM uses per-state exponential models for the conditional probabilities of the next state given the current state, whereas a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore the label bias problem does not arise for CRFs, because the weights of different states can be traded off against each other, which can lead to accuracy improvements (Lafferty et al., 2001 [54]).

Conditional random fields offer a unique combination of properties, all of which are particularly important for opinion mining: (i) arbitrary overlapping dependencies on the observation sequence are allowed, making CRFs much more expressive; (ii) the transition probability between labels may depend not only on the current observation but also on past and future observations, if available; (iii) the features do not need to specify a state or observation completely (different levels of granularity are possible), implying that less training data is necessary to estimate the model.

The results presented by Lafferty et al., 2001 [54] demonstrate that, even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and that they resolve the label bias problem, which affects the performance of MEMMs. Likewise, in real-world part-of-speech tagging experiments, Lafferty et al., 2001 [54] confirm the advantage of CRFs over MEMMs. Motivated by these earlier findings, CRF is the approach proposed in the present thesis for the sentiment classification of Twitter micro-blogs.
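The matrix form of the normalization factor and of the conditional probability of a label sequence described in this section can be made concrete with a toy linear-chain example. Everything numeric below is an assumption made for illustration: the label set, the edge weights `EDGE_W`, the vertex weights `VERTEX_W` and the two-word input are invented, and the sketch does not reproduce the CRF tool or the feature templates used in this thesis.

```python
import math
from itertools import product

LABELS = ["positive", "negative", "neutral"]
STATES = ["start"] + LABELS + ["stop"]
IDX = {s: k for k, s in enumerate(STATES)}

# Hypothetical weights: (previous label, label) and (label, word) features.
EDGE_W = {("positive", "positive"): 0.5, ("negative", "negative"): 0.5}
VERTEX_W = {("positive", "long"): 1.2, ("negative", "short"): 1.5,
            ("neutral", "$goog"): 0.3}


def log_potential(prev, cur, word):
    """Weighted sum of the edge and vertex features active at one position."""
    return EDGE_W.get((prev, cur), 0.0) + VERTEX_W.get((cur, word), 0.0)


def transition_matrices(words):
    """Build M_1..M_{n+1}; entry [y'][y] is exp(potential), 0 if not allowed."""
    n, size = len(words), len(STATES)
    mats = []
    for i in range(1, n + 2):
        m = [[0.0] * size for _ in range(size)]
        prevs = ["start"] if i == 1 else LABELS
        curs = ["stop"] if i == n + 1 else LABELS
        word = words[i - 1] if i <= n else None
        for p, c in product(prevs, curs):
            m[IDX[p]][IDX[c]] = math.exp(log_potential(p, c, word))
        mats.append(m)
    return mats


def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    size = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]


def partition(mats):
    """Normalization factor: the (start, stop) entry of M_1 * ... * M_{n+1}."""
    prod = mats[0]
    for m in mats[1:]:
        prod = matmul(prod, m)
    return prod[IDX["start"]][IDX["stop"]]


def sequence_probability(words, labels):
    """Conditional probability of one label sequence given the words."""
    mats = transition_matrices(words)
    path = ["start"] + list(labels) + ["stop"]
    numer = 1.0
    for i, m in enumerate(mats):
        numer *= m[IDX[path[i]]][IDX[path[i + 1]]]
    return numer / partition(mats)


words = ["short", "$goog"]
probs = {seq: sequence_probability(words, seq)
         for seq in product(LABELS, repeat=len(words))}
best = max(probs, key=probs.get)
```

Because the normalization is computed once over whole sequences rather than per state, the probabilities of all nine candidate labelings sum to one, which is the single exponential model over the entire label sequence that distinguishes CRFs from MEMMs.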

24 Brief Overview of Stock Markets The subject of the analysis carried through the Twitter sentiment classification in this project is referent to the stocks of publicly listed technology companies. It is relevant, in these terms, to briefly shed light on the key concepts and mechanisms involving the stocks and stock markets in general. The first initial point that will sustain this discussion is that a stock of company represents the smallest fraction of ownership of it that one can detain. In other words, buying (or selling) a stock is the same as buying (or selling) a share of ownership of the underlying company. The price variation of such securities, in this sense, is nothing but the variation of the perceived value of the company itself. Probably the most sensitive point of this discussion refers to how the market derives the value to a given company. The value of the shares (or stocks) will be nothing but the simple division of the total company value by the number of shares. Despite the numerous varieties of analysis that are performed by the players in the financial markets in order to determine whether to buy or sell a given security (and at which prices to do so), they can be generally divided into two main categories: Technical and Fundamentalist. While the first one is basically an attempt of applying mathematical models (such as calculation hypothetical resistance levels based on moving averages of the trading prices, for instance), the second one is based on the study of intrinsic value of the company whose shares are under consideration. This is why such investors are also referred to as value investors. The key concept to be understood about the fundamentalist analysis is that the value of a company is based on its capacity of generating cash in the future. Buying a company share is, in this sense, the same as buying an expected future cash flow. 
That is why the perception of how the company will perform in the long run is more relevant than how it did so in the past. If one believes that a given company will be more profitable and more able to generate cash for its shareholders in the future, this person (according to the fundamentalist analysis) will be willing to pay a higher price to become the owner of a fraction of this company, through buying its stocks. The relationship in-between the general ascertained sentiment related to a given company, in this sense, may be a good proxy of its future profitability. Suppose, for instance,

that every single mention of this company is positive, with its customers thrilled about its products and talking about them openly. It is not hard to assume that this company has a better chance of performing well in the future than another in the opposite situation. In this sense, the sentiment analysis of tweets referring to a given company may well be a good indicator of its future profitability and could therefore be correlated with its stock performance, a relationship that is put to the test in the present work.

Brief Overview of Regression Analysis Methods

The first definition that must be presented is that of regression itself. As defined by Ramos, 2010 [80], a regression analysis is a study that seeks to provide an equation relating two (or more) variables, in the following form:

$$y = f(x_1, x_2, \ldots, x_k) + \varepsilon$$

where $x_1, x_2, \ldots, x_k$ are called factors (or independent variables) and $\varepsilon$ is called the error.

More important for this study than the details of how to calculate the regression itself is understanding whether or not the regression is significant. For that, Ramos (2010) presents the ANOVA (analysis of variance) methodology applied to the linear regression $y = \beta_0 + \beta_1 x + \varepsilon$, starting from the pair of hypotheses:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

Once more, the key objective of the present work is not so much to find which equation best fits the proposed relationship as to understand whether or not there is a statistically significant regression. To do so, following the methodology presented by the same author, the total variance is calculated through the formula:

$$S_T^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

where $y_i$ are the $n$ observed values and $\bar{y}$ is their mean.

And the residual variance can be estimated by the formula:

$$S_R^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$

where $\hat{y}_i$ is the value fitted by the regression for the $i$-th of the $n$ observations $y_i$. And finally the value called the regression model variance is estimated by the formula:

$$S_{Reg}^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{1}$$

with $\bar{y}$ the mean of the observed values and one degree of freedom for the single factor of the linear model.

From these definitions, it becomes possible, also according to Ramos, 2010 [80], to calculate the observed F-value (based on the F-Snedecor distribution) as:

$$F = \frac{S_{Reg}^2}{S_R^2}$$

which should be compared to the critical F-value $F_{\alpha;\,1,\,n-2}$, where $\alpha$ is the chance of misinterpretation (1 minus the desired confidence level). If $F > F_{\alpha;\,1,\,n-2}$, the hypothesis $H_0$ of no linear relationship should be rejected, and it is therefore implied that the linear regression is statistically significant.

Using similar processes for polynomial regressions (quadratic and cubic, for instance), it is also possible to determine whether there is a statistically significant improvement when comparing the new regressions to their lower-order predecessors. In this study, Minitab performs all steps of this analysis, but the logic is the same as the one presented above, modifying the F-Snedecor test parameters and the calculation formulas for $S_{Reg}^2$ and $S_R^2$ according to the new degrees of freedom.
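The significance test described above can be illustrated with a minimal Python sketch. It is not the Minitab procedure used in this study: the function name, the sample values and the critical value 7.71 (the tabulated F-Snedecor value for α = 0.05 with 1 and 4 degrees of freedom) are assumptions chosen only for illustration.

```python
def linear_regression_anova(x, y):
    """Fit y = b0 + b1 * x by least squares and return the ANOVA quantities."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                     # slope
    b0 = mean_y - b1 * mean_x          # intercept
    fitted = [b0 + b1 * xi for xi in x]

    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)

    var_residual = ss_residual / (n - 2)   # residual variance, n - 2 d.o.f.
    var_regression = ss_regression / 1     # regression variance, 1 d.o.f.
    return {
        "slope": b1,
        "intercept": b0,
        "ss_total": ss_total,
        "ss_residual": ss_residual,
        "ss_regression": ss_regression,
        "f_value": var_regression / var_residual,
    }


# Hypothetical data: a daily sentiment score (x) and a price variation (y).
sentiment = [1, 2, 3, 4, 5, 6]
price_change = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

result = linear_regression_anova(sentiment, price_change)
F_CRITICAL = 7.71  # assumed table value for alpha = 0.05 with (1, 4) d.o.f.
print("F =", round(result["f_value"], 2))
print("significant:", result["f_value"] > F_CRITICAL)
```

The critical value is passed in as a constant so that the sketch depends only on the standard library; a statistics package would normally supply it for the chosen α and degrees of freedom.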

Conclusions of the chapter

This chapter introduced the key aspects of the area of the present thesis: the machine learning technique used (Conditional Random Fields), the motivation behind the choice of data source for the experiments, the application area of the sentiment analysis (the stock market), and the concept of regression analysis used in the present thesis to find a correlation between Twitter sentiments and stock market values.

As for the conclusions drawn, it was decided in the present thesis to perform sentiment classification over a manually labeled company-related Twitter stream, training the model with Conditional Random Fields, which, as far as the research for this work could establish, have not yet been used for this task in previous works on stock market movement prediction. Since tweets are limited to 140 characters, it is common to assume that each tweet contains only one subject and only one sentiment; therefore, the sentiment classification is performed at the sentence level.

Chapter 3 Literature review

This chapter provides the literature review on the subject area of the present thesis work.

3.1 From classification to stock prediction with Twitter sentiment

Classification

Nowadays, a great deal of free and easily reachable information is accessible online at any time from all over the world. To organize this information better for users, researchers have been looking into the problem of automatic text classification, which studies how to learn automatically to make accurate predictions based on past observations. Supervised learning, the most common approach to text classification, is the machine learning task of deducing a function from correctly labeled training data. This data is used to train the learning algorithm, which creates models that can then be used to label (classify) similar data. Formally, given a set of input items X = {x1, x2, ..., xn}, a set of labels (classes) Y = {y1, y2, ..., yn} and training data T = { (xi, yi) | yi is the label/class for xi }, a classifier is a mapping from X to Y, f(T, x) = y.

The literature describes a large number of classification algorithms, based on a variety of ideas from mathematical logic, statistics, artificial intelligence and neural networks. The application of machine learning methods to retrieve information from texts is currently a very popular task. One area of research concentrates on classifying documents according to their topic, which is useful, for example, for organizing articles, news or reviews. Another, related to the subject of the present thesis, focuses on determining the benefits of using the sentiment extracted from texts. One possible application of this extracted information is the analysis of customer feedback on a company's products.
Another, which is the subject of the present thesis, uses the classification of sentiments within messages extracted from the micro-blogging platform Twitter to find a correlation with the stock market, with a view to the possible use of such knowledge in predicting stock market prices.
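To make the mapping f(T, x) = y concrete, the following minimal sketch trains a multinomial Naive Bayes classifier with add-one smoothing on a handful of labeled texts. It is only an illustration of supervised text classification in general: the training sentences and function names are hypothetical, and Naive Bayes is used here for brevity, not because it is the model adopted in this thesis (which uses Conditional Random Fields).

```python
import math
from collections import Counter, defaultdict


def train(training_data):
    """Build a model from T, a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in training_data:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label y with the highest log-probability for the input x."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training data T.
T = [
    ("love the new phone great camera", "positive"),
    ("great update works well", "positive"),
    ("battery is terrible very disappointed", "negative"),
    ("terrible service never again", "negative"),
]
model = train(T)
print(classify(model, "great camera"))
print(classify(model, "terrible battery"))
```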

Classification of Twitter micro-blogs

As micro-blogs grow in popularity, so does the use of social networks as datasets for social data mining studies. Twitter [21], one of the most popular micro-blogging services, has been widely used as the corpus for various classification tasks in numerous works over the past years.

One example of a classification task performed over a Twitter dataset is the use of an additional information domain on the Web, such as Wikipedia, personal or company websites or newspaper articles, to reduce information scarcity, one of the main problems of tweets, which are at most 140 characters long.

Another interesting application of the classification of Twitter micro-blogs is described in the work by Ritterman et al., 2009 [13], where the authors use the Hubdub online prediction market to model public belief about the possibility that the H1N1 (Swine Flu) virus will become a pandemic. The authors demonstrate that adding to the bigram model features describing the historical context of the current day's feature counts has a remarkably beneficial effect on forecast accuracy, improving on the baseline error by 20%. In their work the Support Vector Machine algorithm was chosen to carry out the regression for its ability to train rapidly and to handle a large feature vector.

One popular direction of recent research is the use of Twitter datasets for the task of sentiment classification.

Classification of sentiments within Twitter messages

A characteristic peculiarity of Twitter micro-blogs is the sentiment hidden in users' opinions about all kinds of things, expressed in 140-character messages. Twitter users express their feelings about personal life events and global news, share emotions of all kinds, comment, share and seek information, and build communities and friendships. The ability to use these sentiments would bring many advantages to ordinary users and especially to companies and organizations, which could extract real knowledge about their products and services and use it for future developments and adjustments driven by user feedback.

Many areas of application have been found for the sentiments extracted from UGC, such as the increasingly common practice of political forecasting, stock market prediction, or modeling public mood in relation to different seasonal holidays. Sentiment classification over large-scale, free-of-charge UGC data is also used for scientific studies. For example, the paper by Pak et al., 2009 [70] presents a linguistic analysis performed over a POS-tagged Twitter sentiment corpus of positive, negative and neutral posts, revealing interesting phenomena regarding the subjectivity and objectivity, positivity and negativity of texts in relation to the parts of speech used by the authors of the micro-blogs. Objective texts were found to have a tendency to contain more common and proper nouns (NPS, NP, NNS), while authors of subjective texts more often use personal pronouns (PP, PP$). Authors of subjective texts usually describe themselves (first person) or address the audience (second person) (VBP), while verbs in objective texts are usually in the third person (VBZ). As for tense, subjective texts tend to use the simple past tense (VBD) instead of the past participle (VBN).
The base form of verbs (VB) is also used often in subjective texts, which is explained by the frequent use of modal verbs (MD). It was also found that superlative adjectives (JJS) are used more often for expressing emotions and opinions, while comparative adjectives (JJR) are used for stating facts and providing information. Adverbs (RB) are mostly used in subjective texts to give an emotional color to a verb. As for polarity, the positive set has a prevailing number of the possessive wh-pronoun "whose" (WH$), which is unexpected. Another indicator of a positive text is superlative adverbs (RBS), such as "most" and "best". Positive texts are also characterized by the use of the possessive ending (POS). In contrast to the positive set, the negative set more often contains verbs in the past tense (VBN, VBD), because many authors express negative sentiments about a loss or disappointment.

Another useful finding of the work by Pak et al., 2009 [70] is that, in their experiments, bigrams gave the best performance compared with unigrams and trigrams.
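For reference, unigrams, bigrams and trigrams are simply sequences of one, two or three consecutive tokens. The minimal sketch below shows how such features can be extracted; the whitespace tokenization and the sample sentence are assumptions for illustration and do not reproduce the pre-processing of Pak et al.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "not a good day".split()  # hypothetical tweet, split on whitespace
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

In this example the bigram ("not", "a") keeps the negation next to its context, which a unigram model loses; this is one common explanation for bigrams performing well on short texts.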

Earlier works of stock prediction with UGC sentiments

There have been a number of studies of the effect of UGC sentiments on the financial markets, most of which report notable results and frequently outperform their baselines. Below is a survey of the variety of earlier works and findings in the subject area of the present thesis.

The research by Edmans et al., 2005 [62] used the results of international soccer games as a primary mood variable to investigate the effect of investor sentiment on asset prices, and found a significant market decline after soccer losses. This loss effect is stronger in small stocks and in more important games, and is robust to methodological changes. A loss effect was also observed after international cricket, rugby and basketball games. On average, the effect is smaller in magnitude for these other sports than for soccer, but is still economically and statistically significant. Surprisingly, Edmans et al., 2005 [62] found no evidence of a corresponding effect after wins for any of the sports; they interpret the loss effect as the result of a change in investors' mood.

A more recent work by Chang et al., 2012 [63] confirmed the results of the study by Edmans et al., 2005 [62], extending it to firm-level analysis to find that negative results of National Football League (NFL) games lead to lower next-day returns for the stocks of locally headquartered Nasdaq firms.

Several recent works employ different scales for the task of classifying the sentiment of public mood for the purpose of stock market prediction. For example, Gilbert et al., 2010 [58] derive an Anxiety Index metric, which estimates anxiety, worry and fear on the LiveJournal blogging website.
Using a Granger causal framework, they find that an increase of one standard deviation in the Anxiety Index corresponds to 0.4% lower S&P 500 returns; they also confirm this result via Monte Carlo simulation: anxiety slows a market climb and accelerates a drop.

Zhang et al., 2009 [60] measured collective hope, worry and fear on each day and analyzed the correlation between these indices and the stock market indicators, finding that the percentage of emotional tweets was significantly negatively correlated with the Dow Jones, NASDAQ and S&P 500, but displayed a significant positive correlation with the VIX.

A work reporting 86.7% accuracy in predicting the movement of the closing values of the Dow Jones Industrial Average (DJIA), conducted by Bollen et al., 2011 [59],

analyzed daily Twitter feeds with two mood tracking tools: OpinionFinder, which measures positive vs. negative mood, and Google-Profile of Mood States (GPOMS), which measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. In particular, variations along the public mood dimensions of Calm and, to some degree, Happy, as measured by GPOMS, seem to have a predictive effect, but general happiness as measured by the OpinionFinder tool does not.

A step further with respect to Bollen et al., 2011 [59] is the work by Vu et al., 2012 [66], in which the authors predict the stock market with features from Twitter messages at the company level, which is finer-grained than the whole Dow Jones Industrial Average index predicted by Bollen et al., 2011 [59]. In this more recent work, the authors built a model combining the following features: positive and negative sentiment, consumer confidence in the product with respect to a bullish (positive price outlook) or bearish (negative price outlook) lexicon, and the stock market movement of the three previous days. These features were then employed in a Decision Tree (C4.5) classifier using cross-fold validation, yielding accuracies of 82.93%, 80.49%, 75.61% and 75.00% in predicting the daily up and down changes of Apple (AAPL), Google (GOOG), Microsoft (MSFT) and Amazon (AMZN) stocks respectively in a sample of 41 market days. To remove noise, the authors built a named entity recognition system for Twitter data using the CRF++ tool, with an overall Precision of 90.02%, Recall of 78.08% and F-score of 83.60%. To detect the sentiment in Twitter posts, they employed an online sentiment classifier called Twitter Sentiment Tool (TST).
To determine whether consumers have market confidence in the company, they used the CMU POS (part-of-speech) Tagger, a state-of-the-art tagger for Twitter data, to extract adjectives, nouns, adverbs and verbs and relate them to "bullish" and "bearish" as anchor words, scoring each word with the Semantic Orientation (SO) algorithm, which measures the association between two words. One very important and useful conclusion of the work by Vu et al., 2012 [66] is that when a company has frequent positive and negative sentiment, these features appear to function as a strong predictor of stock market movement.

A novel method of stock prediction, applied not to Twitter data but to the financial message board Yahoo! Finance, was presented by Sehgal et al., 2007 [73]. To solve the problem of data irrelevance, the authors proposed a new measure, TrustValue, which assigns trust to each message based on its author. This method rewards those authors who write relevant information

and whose sentiments closely follow stock performance, taking into account not only the direction of the stock price but also the magnitude of its change. Each message was converted into a vector of words and author names, and the value of each entry was calculated using the TF-IDF (Term Frequency Inverse Document Frequency) formula. The sentiment classification was performed by experimenting with 3 classifiers of the Weka toolkit, namely Naive Bayes, Decision Trees and Bagging (bootstrap aggregation). All of the features, including sentiment and TrustValue, were used to train the classifiers. The resulting model made predictions with accuracy ranging from 63% for ExxonMobil up to 81% for Apple. With the addition of the TrustValue feature, accuracy increased (by 9% for Apple, for example), showing that this feature helps to remove many irrelevant or noisy sentiments.

Another work taking advantage of statistical measures of the messages was conducted by Ruiz et al., 2012 [74]. The authors of this publication extract messages about company stocks from Twitter and represent that information through graphs capturing different aspects of the conversation around those stocks. These time-constrained graphs are used to evaluate a wide range of features in terms of their degree of correlation with changes in stock price and traded volume. The main finding of this work is that the best feature in terms of correlation, especially in relation to traded volume, is the number of connected components of the constrained subgraph. The stock price, on the other hand, is not strongly correlated with any of the extracted features: it is only slightly correlated with the number of connected components and even less with the number of nodes in the constrained subgraph.
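The number of connected components of a graph can be counted with a standard union-find structure, as in the minimal sketch below. The node names and edges are hypothetical, and the construction of the time-constrained subgraph itself, as done by Ruiz et al., is not reproduced here.

```python
def count_connected_components(nodes, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    return len({find(node) for node in nodes})


# Hypothetical conversation graph: tweets (t*), users (u*) and a stock symbol.
nodes = ["t1", "t2", "t3", "u1", "u2", "$AAPL"]
edges = [("u1", "t1"), ("t1", "$AAPL"), ("u2", "t2")]
print(count_connected_components(nodes, edges))
```

Here the tweet t3 has no edges and therefore forms a component of its own, alongside the two linked groups.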
However, the authors show, by simulating the daily trading of stocks, that even relatively small correlations between price and micro-blogging features can be exploited to drive a stock trading strategy that outperforms other baseline strategies. This publication also makes the interesting observation that an increase in the activity around some companies is often compensated by inactivity around others. The features extracted from Twitter posts in the study by Ruiz et al., 2012 [74] can be categorized into two groups: features of the first group measure the overall activity on the micro-blogging platform, such as the number of posts, the number of re-posts, the number of users and so on; features of the second group measure properties of the link structure of the graph, for instance the number of connected components, statistics on the degree distribution, and other graph-based properties.

Alternatively, the Master thesis by Maggioni et al., 2012 [64] proposes an ad-hoc system that combines a semantic sentiment analysis technique with an adaptive predictive algorithm specifically designed to take advantage of both the sentiment and the past performance measures of the chosen financial instrument. The algorithm was tested by simulating an

investment in the selected financial instrument, which yielded a gain of 14.7% of the initial capital, compared with a return of just 2.8% achieved by the instrument itself. The authors perform their experiments on datasets of automotive-sector micro-blogs from Twitter, trying to predict the stock movements of the Dow Jones Automobiles & Parts Titans 30 index (DJTATO). For the task of sentiment classification this study used an external tool, namely SentiEngine. Each post was assigned to a specific category by analyzing its content, and then the polarity of the post was evaluated by assigning it a one-dimensional binary value. One contribution of this work is a brand model, designed with a top-down approach, whose purpose is to map all the important concepts that may be the subject of conversation on Twitter, in order to better understand the sentiment within each single post. Another important contribution of the study by Maggioni et al., 2012 [64] is the model identification approach used to model the impact of sentiment on stock markets: an algorithm that infers hidden relations from the data in such a way as to maximize the profits that could be made by trading the chosen financial instrument. The model is identified over a limited time frame and used to predict only the subsequent return; this process is then repeated an arbitrary number of times by shifting the time frame forward, and the algorithm selects the most appropriate model using various quality metrics. The results of testing this methodology on the automotive sector outperform the benchmark. In their preliminary work, Maggioni et al., 2012 [64] also tried to evaluate the importance of Twitter as a news medium and came to the conclusion that information copied from other sites to Twitter has a negligible delay of 2 to 4 minutes.
Wolfram et al., 2012 [67] is another recent Master of Science work on stock market prediction with Twitter data, in which the authors built regression models, using the Support Vector Regression algorithm, of several NASDAQ stocks, namely Google Inc. (GOOG), Apple Inc. (AAPL), First Solar, Inc. (FSLR) and Intel Corporation (INTC). On average, these models were able to predict future prices 15 minutes ahead of the actual prices with accuracy close to a strong baseline. Different linguistic textual feature representations, such as the bag-of-words model, were built from the raw Twitter dataset provided by the University of Edinburgh, School of Informatics, which contained Twitter posts ranging from November 11th 2009 to February 1st. For the stock datasets, the authors of Wolfram et al., 2012 [67] collected intra-day stock quotes during a period of two weeks, from July 19 to July 30. Using intra-day minute stock quotes instead of the usual end-of-day quotes made it possible to predict real-valued prices a short period of time into the future, instead of just for the next trading day.

A lot of effort in the study by Wolfram et al., 2012 [67] was devoted to filtering the raw Twitter data to reduce its dimensions. First of all, around one third of the tweets were removed because they were not in English. Then, for each of the 4 stocks of the experiment, a list of no more than 5 keywords best describing the company and its products and services was created. The query expansion web service Google Sets was then used to discover an additional 43 keyword terms. Google Finance was employed for the list of stock market symbols and financial terms related to these companies. The final query terms were concatenated manually from all of these keyword lists.

For each experiment, the Mean Squared Error (MSE) of a strong baseline was calculated, where the baseline is the Simple Moving Average (SMA) of a series of stock prices in the testing set. After the first experiments with a simple bag-of-words feature vector space, the authors added further improvements, such as removing the most common stop words and stemming every token. The SMA was then added as a feature, improving the results but still not outperforming the baseline. By plotting some keywords against the stock price, the authors found that keywords with a very high frequency count that related directly to the company, including currently popular products and services, showed considerable correlation with price changes. Following this finding, weighted query terms were generated and the tf-idf algorithm was employed to calculate the weights of each post, with the aim of reducing the dataset by selecting only posts that adequately match the weighted query terms. To measure how closely a document matched the weighted query terms, the authors applied the cosine similarity measure, which reduced the initial datasets. In the end, the results of the final experiments approached the baseline fairly well in all cases.
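The filtering step described above, weighting terms with tf-idf and comparing each post with the query by cosine similarity, can be sketched as follows. This is a simplified illustration and not the implementation of Wolfram et al.: the query is treated as one more document, tokens are split on whitespace, and the posts, the query and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical posts and query terms.
posts = ["apple iphone sales rise", "lunch was great today"]
query = "apple iphone"

vectors = tf_idf_vectors(posts + [query])
query_vector = vectors[-1]
for post, vector in zip(posts, vectors):
    print(post, "->", round(cosine_similarity(vector, query_vector), 3))
```

Posts whose similarity to the query falls below a chosen threshold would then be discarded; the threshold itself is a tuning decision that this sketch leaves open.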
To evaluate the possibility of making a profit from the predictions of the generated models, and to better understand the meaning of the attained results and the impact of the MSE values, Wolfram et al., 2012 [67] created a Virtual Stock Trading Engine, which imitated a day trading agent following general rules so as to form a fairly realistic simulation environment. The tests with this engine produced significant profits. The experiments conducted by Wolfram et al., 2012 [67] showed that the best feature vector, with the lowest error measure, was obtained from feature selection methods with the addition of a new feature constructed from the average stock price over the last 60 ticks of minute stock quotes. The results show that the predictions were very close to the strong baseline in terms of MSE for Apple, and similarly for Google. It was also found that

predicting the future price can be achieved at short distances of 15 minutes into the future, but accuracy becomes unstable as the forecast distance increases (30 minutes).

Another attempt to correlate users' opinions with the stock market was performed not on a Twitter dataset but on an online financial message board: Liu et al., 2006 [57] focused on identifying the most historically accurate posters and analyzing their sentiments using a mixture-of-experts framework in a completely automated fashion. One conclusion of this work is that there is significant information in the sentiment tags that can be used to create real-time, profitable, implementable strategies and to make accurate predictions about the returns on investments, which supports the topicality of the present thesis.

The mood of investors tends to affect their evaluations of future prospects and hence their trading behavior in financial markets (Chang et al., 2012 [63]). All of these earlier publications suggest that stock market movements can be predicted by analyzing the effects of different variables or the global sentiment published on micro-blogging services such as Twitter.

Yi et al., 2009 [61] mainly explore feature filtering and selection methods on datasets built from Newswire, blogs and Twitter, in order to understand the effect of different types of features with respect to their correlation with stock prices and the performance of the learnt models. They performed experiments on individual stocks, including Google Inc., Microsoft Corporation, HSBC Holding plc. and Yahoo!, and on stock market indices, including the NASDAQ Composite, FTSE 100 and S&P 500. The simplest counting method failed to correlate directly with stock prices, but models built on more complex features that bring in cross-related concepts are shown to be effective in prediction. As a result, all models outperform the baselines except for the S&P 500 index. Google Inc.
won by a large margin, and the others performed slightly better than the baselines.

The study conducted by Meador et al., 2009 [69] reached a conclusion contrary to that of all the other publications reviewed in this chapter. The authors focused on two topics: recently released movies and stock market performance. For the prediction of movie popularity based on Twitter reviews, there was an explicit correlation with the actual box-office returns, which measure the revenue a movie accumulates during a set period of time. For stock market prediction, however, the authors inferred that, overall, Twitter does not seem to be an effective means of predicting most stocks, at least in a window as small as five weeks. The results of the stock market prediction experiments demonstrated very little

correlation, and only the stock volume data appeared to generate any significant correlations with the Twitter data.

On forecasting box-office revenues for movies with Twitter chatter, a study more recent than Meador et al., 2009 [69] was performed by Asur et al., 2010 [72]. The authors used a quantifiable measure on tweets, namely the tweet-rate, defined as the number of tweets referring to a particular movie per hour, to outperform the Hollywood Stock Exchange information market, the gold standard in the industry. Adding the sentiments as an additional variable to the regression equation on the movie stock prices improved the prediction to 0.92 when used with the average tweet-rate, and to 0.94 with the tweet-rate time series. For the sentiment analysis, this work used the LingPipe linguistic analysis package, specifically DynamicLMClassifier, which is a language model classifier. Workers from Amazon Mechanical Turk were used to obtain labeled training data for the classifier. The classifier was trained using an n-gram model, where n was chosen as 8. When tested on the training set with cross-validation, it reached an accuracy of 98% in classifying tweets into positive, negative and neutral. Interestingly, though, it was found that even though sentiments did provide improvements, they were not as important as the tweet-rate measure.

Another approach to correlating the sentiment of texts with the stock market is the simple contextual framework by Choudhury et al., 2008 [65], intended to model and analyze communication dynamics among people in Engadget, a blogosphere community that discusses technology gadgets, in search of their impact on the stocks of technology companies. The authors assume that the stock movement on a certain weekday can be correlated with the communication dynamics of the past week.
Communication dynamics are described by several contextual properties of communication, identified for the four companies in the experiment, namely Apple, Microsoft, Google and Nokia: the number of posts, the number of comments, the length and response time of comments, the strength of comments and the different information roles that people can take on (early responders / late trailers, loyals / outliers). A Support Vector Machine regression (SVR) framework was used to predict the stock movement. This method shows a 22% mean prediction error for the magnitude compared to the actual stock movement and outperforms the baseline methods; the SVR method also predicts the direction of the trend movement better than the baseline methods, with an error of 13.41%.

An existing, successful example of an application of stock prediction over Twitter data is TweetTrader.net (Sprenger et al., 2011 [68]), a website and forum that uses crowd knowledge to aggregate the information contained in stock-related tweets. It uses the Twitter Search API to

provide a Livestream of all tweets related to S&P 500 stocks. Users have a choice between all tweets or a subset of selected indices or industries. The service provides three main features to its users. The first is the automatic classification of stock-related tweets into 3 categories (buy, hold or sell) using the multinomial Naïve Bayesian classifier of the Weka machine learning package. The second is a real-time update of the automated classifier with manually voted user sentiment, since any user can vote each tweet of the Livestream as buy or sell with special buttons. The third is the Stock Game: for every correct prediction a point is awarded to the user; TweetTrader.net keeps track of these predictions, evaluates them in real time and shows a ranking of all participating users, allowing players to monitor their predictions and follow the best investment advisers in the Twittersphere. There is also a Scoreboard section, which presents the stock-related tweets and statistical information on the chosen company's stock.

All of these and many more successful examples of the search for stock market correlations with UGC justify the unremitting interest in this research area and the topicality of the subject of the present thesis.

Conclusions of the chapter

This chapter introduced various existing studies related to the subject area of the present thesis. The earlier publications on stock market prediction with UGC present valuable results using different features and methods, but evidently none has been successful enough to give a significant, stable and reliable advantage in investment by predicting the price of a security. This has been interpreted in this thesis as a justification of the topicality of the chosen application area for sentiment classification and as a motivation to contribute to it.

Following the linguistic analysis of Twitter data by Pak et al., 2009 [70], it was decided to use a POS tagger to possibly improve the accuracy of the classifier's results. The approach to finding a correlation between company stock value movement and company-related Twitter sentiment in the present thesis is similar to that of Vu et al., 2012 [66]: it is assumed that Twitter users' sentiment related to the stock of a company influences the next day's market price movement. And since a major portion of the publications found that stock movement has an explicit correlation with negative sentiment and almost none with positive sentiment, this is the primary hypothesis of the present study as well.

Chapter 4 Methods
This chapter describes the tools and methods chosen for the approach to sentiment classification and stock market prediction in the present thesis. The whole process can be divided into 5 major parts: data pre-processing, data processing, training, Twitter data labeling, and the regression analysis of Twitter sentiment and stock market values. Figure 2 shows the architectural overview of the proposed approach to finding a correlation between sentiments extracted from Twitter and stock market movement. The parts of this process are discussed in detail in the following subchapters.

Figure 2 Architectural overview of the proposed approach

Data pre-processing
Raw Twitter data contains a large amount of noise: irrelevant retweets, advertisements and marketing, all of which have to be taken care of. This makes data pre-processing one of the most important steps in the sentiment analysis process. Data pre-processing tasks were performed with a suitably adjusted Java implementation by Silvia Quarteroni, on top of which modules more specific to the subject matter of the present thesis were added.
During the first steps of the research, a novel feature of the Twitter platform was noticed: since the 13th of March 2013, multiple line breaks have been permitted in Twitter messages [3] [75]. These line breaks destroy the tab-separated data format required for later work and had to be eliminated before any further pre-processing was performed. Figure 3 shows an example of a Twitter post with line breaks.
Figure 3 Example of a post with Twitter line breaks. Source:
One of the first important tasks is to filter out all non-English data, which was performed with the 'language-detection' Java library [79]. Tweets have to be in English only because the present thesis work is conducted in English. Even though all requests to the Twitter Search API were made with a parameter restricting data collection to English, some non-English data was still retrieved, which proves that this task is not redundant. Below is an example of an acquired non-English tweet:
Ya tenemos datos de MSFT Q3 EPS $0.72 vs. $0.68 Est.; Q3 Revs. $20.49B vs. $20.50B Est. $MSFT QUOTE: andresllorente Thu, 18 Apr :10:

Retweets (RTs) were eliminated, since they contaminate data samples with heavy repetition, mostly with marketing objectives, commonly inviting Twitter users to follow a provided link or containing an appeal promoting a company. All retweets were eliminated except for those with extra text before the RT symbol, which provides additional useful knowledge, namely the opinion of the person who retweeted the message. Three examples of repeatedly retweeted messages with a company stock cashtag ($msft), posted with an obviously advertising motive:
9 Things Windows Phones Can Do That The iphone Can't $MSFT $AAPL a_azizshah Sat, 25 May :44:
VIDEO REVIEW: Nokia's Newest Lumia Windows Phone $NOK $MSFT $VZ agustinilt Thu, 23 May :17:
Microsoft is expected to announce its next generation Xbox today. Follow our Xbox live blog: $MSFT arieljara77 Tue, 21 May :49:
Further tweet cleaning included: filtering out tweets with disagreeing annotations; replacing punctuation marks with white spaces, except for punctuation marks preceded by a numerical symbol, so as not to destroy numerical representations of stock value movements; deleting duplicate tweet ids; replacing multiple spaces with a single one; and decoding HTML.
Each tweet was tokenized at word level with the CMU ARK Scala twokenizer, first created by O'Connor et al., 2010 [77] and modified specifically for POS tagging with the CMU ARK Twitter POS Tagger. Tweets were POS-tagged with the CMU ARK Twitter Part-of-Speech Tagger, a fast and robust Java-based Part-of-Speech tagger for Twitter created by CMU ARK, as described in Gimpel et al., 2011 [76]. Tweets are represented as an array of strings using the IOB2 notation, in which the first word of every tweet carries a B-label signifying the beginning of the chunk (tweet), every following word of that tweet carries an I-label signifying the inside of the chunk, and tokens which do not belong to any chunk carry O-labels.

Before further processing of the Twitter data, the original tweet timestamps were normalized to the American Eastern Time zone, to be comparable with NASDAQ stock market time.
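The cleaning steps of this subchapter can be illustrated with a short sketch. The thesis implementation is in Java and is not reproduced here; the Python code below is only an approximation under stated assumptions. The function names are hypothetical, the decision to keep the `$`, `#` and `@` symbols (so that cashtags, hashtags and mentions survive) is an assumption, and "extra text before the RT symbol" is approximated by checking whether the tweet starts with `RT`.

```python
import html
import re


def is_plain_retweet(text):
    """True when the tweet starts with the RT marker, i.e. nothing was added before it."""
    return re.match(r"\s*RT\b", text) is not None


def clean_tweet(text):
    """Apply the cleaning steps to the text of one tweet."""
    # Decode HTML entities such as &amp; back into plain characters.
    text = html.unescape(text)
    # Collapse line breaks and tabs so that one tweet stays on one tab-separated row.
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Replace punctuation with a space unless it directly follows a digit,
    # so that figures such as 0.72 or 20.49B are preserved.
    # Assumption: $, # and @ are kept to preserve cashtags, hashtags and mentions.
    text = re.sub(r"(?<!\d)[^\w\s$#@]", " ", text)
    # Replace multiple spaces with a single one.
    return re.sub(r" {2,}", " ", text).strip()


def drop_duplicate_ids(rows):
    """Keep the first (tweet_id, text) pair seen for every tweet id."""
    seen = set()
    unique = []
    for tweet_id, text in rows:
        if tweet_id not in seen:
            seen.add(tweet_id)
            unique.append((tweet_id, text))
    return unique
```

A retweet filter would then keep a tweet when `is_plain_retweet` returns `False`, and `clean_tweet` would be applied to every remaining message before tokenization.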

Features selection and templates creation
It was chosen to use an external general-purpose tool, namely CRF++, which requires predefined templates for the training phase, that is, a file describing which features are used in training and testing. Therefore, after all the irrelevant messages had been filtered out, the remaining data cleaned up, and each message tokenized, POS-tagged and represented as an array of strings, the next step was to choose the features giving the best classifier performance.
In this thesis both unigram and bigram templates were created and used in the experiments conducted in the search for the best classifier performance. The difference between them is that with a bigram template a combination of the current output token and the previous output token (that is, the labels assigned to the current and the previous word) is automatically generated, so that two consecutive lines are considered at a time. In practice, a bigram template differs from a unigram template, whose feature lines start with the U symbol, only by an additional B symbol at the end of the file. The templates were created by varying different combinations of word and POS tag features. The ones reporting the best accuracy are presented in the Results chapter of the present thesis work. Below is an example of a template file:
# Unigram
#WORD
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
#POS tag
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
# Bigram (previous + current)
B

With this template, at any token that CRF++ examines it takes into account the current word, the two previous words, the two next words, the current word's Part-of-Speech tag, the POS tags of the two previous words and the POS tags of the two next words. The U character stands for unigram features; to convert a unigram template to a bigram one, it is necessary to add a B character at the end of the file, as in the above example. The numbers after U are arbitrary and serve as unique identifiers to distinguish the relative positions of tokens. For a bag-of-words feature, which neglects the relative positions of the features, the identifiers are omitted. Lines starting with the # symbol and empty lines are discarded as comments.
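To make the template macros concrete, the following Python sketch expands a template line at a given token position. It is not CRF++ code: the function name `expand` is hypothetical, and the padding strings used outside the tweet boundaries (`_B-1`, `_B+1`, ...) are an assumption modelled on CRF++ conventions. In `%x[row,col]`, `row` is the position relative to the current token and `col` is the column index (0 for the word, 1 for the POS tag).

```python
import re

# Matches one macro of the form %x[row,col].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line, tokens, position):
    """Expand every %x[row,col] macro of one template line at a token position.

    tokens is a list of rows; each row is a list of columns (word, POS tag, ...).
    """
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if row < 0:
            # Before the first token of the tweet (assumed padding string).
            return f"_B{row}"
        if row >= len(tokens):
            # After the last token of the tweet (assumed padding string).
            return f"_B+{row - len(tokens) + 1}"
        return tokens[row][col]

    return MACRO.sub(lookup, template_line)


# Hypothetical three-token tweet with made-up POS tags.
tweet = [["Microsoft", "^"], ["beats", "V"], ["estimates", "N"]]
print(expand("U02:%x[0,0]", tweet, 1))
print(expand("U06:%x[-1,0]/%x[0,0]", tweet, 1))
```

The second call shows how a combined feature such as `%x[-1,0]/%x[0,0]` yields a single string built from the previous and the current word.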

Conditional random fields tool
For the training and testing of data to perform the task of sentiment classification, this thesis work used an external CRF++ tool, called from the Java code. CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data [55].
For the CRF++ tool to work properly, both the training and the testing data have to be in a proper format: they must consist of multiple tokens, each token represented on one line and consisting of multiple but a fixed number of columns, where each column is separated from the others by a tab. In the present work, a token is a word, and a sequence of tokens is a tweet. To define the boundary between tweets, an empty line is inserted. The semantics of the columns is as follows:
1. word
2. POS tag
3. Author of the tweet
4-9. Date of the tweet: 4. day of the week 5. number of the day of the month 6. month 7. year 8. time 9. time zone
10. manually assigned label of the tweet (the true label, which is going to be trained by CRF++) in the IOB format.
The date of the tweet was necessary for the further evaluations and wasn't included in the templates as a feature. To specify which features are to be considered in the model for training and testing, a feature template file is written, as described in the previous subchapter.
The CRF++ tool also allows the training parameters to be adjusted. The major parameters are as follows:
-a CRF-L2 or CRF-L1: selects the regularization algorithm. L2 is the default setting and performs a bit better than L1, while with L1 the number of non-zero features is significantly smaller.

-c float: changes the hyper-parameter for the CRFs. This parameter trades the balance between overfitting and underfitting and can significantly influence the results.
-f NUM: sets the cut-off threshold for the features, where 1 is the default value. CRF++ uses only the features that occur no less than NUM times in the given training data, which is useful in the case of large-scale datasets.
As for the testing options:
-v: verbose level 1 or 2, where 0 is the default value. With the increase of the level, additional information is provided by CRF++.
-n: yields N-best results sorted by the conditional probability of the CRF, using a combination of forward Viterbi and backward A* search.
Which of these parameters had an effect on the performance is described in the Results chapter of the present report.
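The data format and the options described in this subchapter can be sketched as follows. This is a Python illustration, not the Java code of the thesis. The helper names are hypothetical, the example rows carry only three of the ten columns (word, POS tag, label), and the label strings such as `B-positive` are made up for the example. The command lines follow the usual `crf_learn template train model` and `crf_test -m model test` usage of CRF++; the commands are only built here, not executed.

```python
def to_crfpp(tweets):
    """Serialize tweets into the CRF++ layout.

    One token per line, tab-separated columns, and an empty line between tweets.
    """
    blocks = ["\n".join("\t".join(columns) for columns in tweet) for tweet in tweets]
    return "\n\n".join(blocks) + "\n"


def crf_learn_command(template, train_file, model_file,
                      algorithm="CRF-L2", c=1.0, f=1):
    """Build the training command with the -a, -c and -f options."""
    return ["crf_learn", "-a", algorithm, "-c", str(c), "-f", str(f),
            template, train_file, model_file]


def crf_test_command(model_file, test_file, verbose=0, nbest=None):
    """Build the testing command with the optional -v and -n options."""
    command = ["crf_test", "-m", model_file]
    if verbose:
        command.append(f"-v{verbose}")
    if nbest:
        command += ["-n", str(nbest)]
    return command + [test_file]


# Two hypothetical tweets reduced to (word, POS tag, label) columns.
example = [
    [("MSFT", "^", "B-positive"), ("up", "R", "I-positive")],
    [("meh", "!", "B-neutral")],
]
print(to_crfpp(example))
print(" ".join(crf_learn_command("template", "train.data", "model", c=3, f=2)))
```

The resulting argument lists could be passed to `subprocess.run` on a machine where CRF++ is installed.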

Results evaluation metrics
The performance of the CRF++ tool for sentiment classification was quantified using:
Accuracy: the degree of closeness of measurements of a quantity to that quantity's actual (true) value [17]; in other words, how well the classifier agrees with the human judgments ("ground truth").
Recall: the fraction of relevant instances that are retrieved [18], that is, the ratio of correctly classified instances to the full set of instances which should have the label in question. Recall is a measure of completeness or quantity.
Precision: the fraction of retrieved instances that are relevant [18], that is, the ratio of correctly labeled tweet instances to the number of instances assigned a particular label. Precision can be seen as a measure of exactness or quality of the classifier.
F-measure: a measure of a test's accuracy, computed as the harmonic mean of precision and recall.
For the calculation of the above measures, the tweets were grouped into four categories: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). True positives are the tweets that belong to a class i and were in fact assigned by the classifier to that class i, while the tweets which do not belong to class i but were wrongly assigned to it by the classifier are false positives. True negatives are tweets which are not of class i and weren't assigned by the classifier to class i, while false negatives are tweets which weren't assigned to class i by the classifier but which actually belong to this class i.

Since in the present thesis tweets are classified into 3 classes (positive, negative, neutral), all the evaluation measures were calculated per class, and the total measures were then obtained as the average over the classes. All the above classifier performance measures were calculated with a Java class written for this specific purpose.
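The per-class measures and their average can be sketched as below. The thesis computed them with a dedicated Java class that is not shown in the text; this Python version is only an illustration with hypothetical function names and made-up labels. Accuracy is computed one-vs-rest for each class, and the F-measure is taken as the balanced harmonic mean of precision and recall.

```python
def class_metrics(gold, predicted, label):
    """Return (accuracy, precision, recall, F-measure) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


def average_metrics(gold, predicted, labels):
    """Average every measure over the given classes."""
    rows = [class_metrics(gold, predicted, label) for label in labels]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Made-up gold and predicted labels for six tweets.
gold = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
predicted = ["positive", "neutral", "negative", "neutral", "neutral", "positive"]
print(class_metrics(gold, predicted, "positive"))
print(average_metrics(gold, predicted, ["positive", "negative", "neutral"]))
```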

Regression analysis tool
To perform the statistical regression analysis, this thesis employed the software Minitab 16 [81], used during its free evaluation period. Minitab 16 is the most recent version of the renowned statistical package, first developed at the Pennsylvania State University in 1972. This newer version comes with a powerful embedded regression analytics tool, which guides the user through the complete regression analysis and provides user-friendly results and conclusions about the dataset. The main screen of Minitab 16 is shown below:
Figure 4 Minitab 16 main screen
The step-by-step use of the Minitab 16 assistant tool is as follows:

Figure 5 Minitab 16 Regression assistant
In the setup window (Figure 6), the user must choose the variables to be studied for correlation, the type of regression to be performed (the option "Choose for me" performs all the tests and returns the best result possible, based on the statistical significance of the improvement of the regression at higher orders), and the level of confidence (alpha-level) used to determine whether a regression exists:
Figure 6 Regression analysis setup

The results are shown in the following window:
Figure 7 Minitab 16 Results window
This window presents the level of significance of the regression (in this case, the regression is statistically significant, since the p value is smaller than the established threshold of 5%), the R² level (which indicates the amount of the observed variation that can be explained by the fitted equation), the level of correlation r (which shows whether the variables are positively or negatively correlated), the most significant fitted equation (in this case, the linear equation Y= *X) and some extra relevant comments.

Minitab also provides the residual plots for the regression:
Figure 8 Minitab 16 Diagnostic report
These plots can be used to validate the dataset used and the results obtained.
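For readers without access to Minitab, the quantities reported in the results window for a linear fit (fitted equation, correlation r and R²) can be reproduced with ordinary least squares. The Python sketch below is an illustration only and does not replicate the Minitab assistant: the function name is hypothetical, the numbers are made up, and the significance test (p value) is omitted.

```python
def linear_fit(x, y):
    """Fit Y = intercept + slope * X by least squares.

    Returns (intercept, slope, r, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The sign of r tells whether the variables are positively or negatively correlated.
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r, r * r


# Made-up observations, not thesis data.
intercept, slope, r, r_squared = linear_fit([1, 2, 3], [2, 2, 5])
print(f"Y = {intercept:.2f} + {slope:.2f}*X, r = {r:.3f}, R2 = {r_squared:.2f}")
```

For a simple linear regression R² equals the square of r, which is why both values can be derived from the same sums.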

Technical details and Libraries
The entire project was developed in the Java programming language using IntelliJ IDEA. Along with the Java standard libraries, the following libraries have been used:
com.cybozu.labs.langdetect: a language detection Java library [79], used for filtering out non-English tweets;
Edu.cmu.cs.lti.ark.tweetnlp.ark-tweet-nlp (Java library): CMU ARK Twitter Part-of-Speech Tagger [76];
Edu.cmu.cs.lti.ark.tweetnlp.ark-tweet-nlp (Scala): CMU ARK Twitter Twokenizer [77];
org.apache.commons.lang3: for HTML decoding.

Chapter 5 Experimental evaluation
This chapter introduces the experimental setup required for the experiments, as well as the requirements and construction of the datasets.
5.1 Experimental setup
All experiments are performed on a 4GB RAM, Genuine Intel(R) Core(TM)2 DUO CPU 2.53 GHz, Windows Vista, 32-bit machine.
5.2 Twitter datasets
Since no company-related Twitter datasets were publicly available, one of the most demanding and time-consuming tasks of this thesis was to collect and manually label the datasets necessary for the experiments. To create the company-related Twitter datasets, a piece of software was written in the Java programming language, which collected the necessary tweets through invocations of the Twitter Search API. Due to the limitations of the Search API, which returns at most 1500 tweets per request and none older than a week, the experiment datasets contain tweets from a period of 5 weeks (38 days), collected weekly. For each company of the experiment, several separate requests were made to the Search API using the company's hashtags and cashtags. The collected messages were then looked through manually to identify the most useful dataset for each company individually, and were further manually labeled as positive, negative or neutral. It is also worth noting that all requests to the Twitter Search API were performed with the language parameter set to English, to simplify the manual labeling and the evaluation of the experiments.
Each raw tweet of a dataset was retrieved in a tab-separated format, with the following fields in the sequence presented below:
Object number
Text (max 140 characters)
Tweet ID

Username (max 20 characters)
Date (e.g. Mon, 22 Apr :41: )
The companies for the experiment were chosen with the idea of experimenting on companies with varying distribution of data over time and different popularity of discussion of their products and services on the Web, while still choosing companies popular enough to yield sufficiently large datasets. The initial scope of this research intended to use 5 companies in the IT sector to test the prediction capacity of the developed model: Apple Inc. (AAPL), Google Inc. (GOOG), Microsoft Inc. (MSFT), Dell Inc. (DELL) and Facebook Inc. (FB).
However, once data gathering started, a clear anomaly appeared regarding Dell Inc. (DELL): the tweets indicated a possible acquisition offer for the company's shares by a financial investor (Blackstone). This caused a great disturbance in the number and nature of tweets, so that, for the purposes of this study, Dell Inc. (DELL) was removed from the sample, aiming at more meaningful results and analysis with the remaining companies. Below is a sample of tweets contaminating the Dell dataset:
Blackstone abandons $25bn move for #Dell #money tinatinjohnson Fri, 19 Apr :20:05
The bidding war over control of #Dell appears to be collapsing & now #Icahn are bailing: $DELL FoxBusiness Fri, 19 Apr :52:
Facebook was discarded from the experiment because a large share of its Twitter messages is irrelevant to the company itself and consists of social network related chatter, which contaminates the dataset and is very difficult to filter out. This is because the stock market ticker of Facebook Inc. coincides with the common abbreviation of the social network, fb.
Below is a sample of the tweets contaminating the Facebook dataset:
The #WalkingDead and #DowntonAbbey: Season 3 #fb sfsignal Tue, 23 Apr :53:
Missed a lot of this season of the Voice but the 2 girls singing right now are awesome. Who agrees? #Thevoice #fb #blakeshelton hhanks Tue, 23 Apr :53:

Diet, exercise together yield better results, study says #fb #in #healthybody #vi #shake passionexplorer Tue, 23 Apr :53:
A very extensive dataset for each day makes it impractical to manually label the minimum data period needed to build an accurate classification model. The Apple Inc. stock ticker (AAPL), as stated in the Most Talked About section of TweetTrader.net [6] (a stock micro-blogging forum that leverages the wisdom of crowds to aggregate the information contained in stock-related tweets, as stated in Sprenger et al., 2011 [7]), had at the time of writing the strongest popularity in stock-related Twitter discussions. This made its dataset overwhelming, and it therefore had to be discarded from the experiments due to time constraints.
After all research on the data, 2 companies were chosen for the experiments on the model developed in this thesis: Google Inc. (GOOG) and Microsoft Inc. (MSFT), which show diverse activity distribution on Twitter and distinct overall popularity. The raw data retrieved contained 17,214 tweets from the 17th of April, 2013, until the 24th of May, 2013, for Google Inc., and 8,070 tweets for the same time period for Microsoft Inc.
The datasets were manually labeled by two Politecnico di Milano students, one from the Computer Engineering faculty and another from the Management Engineering faculty with internship experience in a financial market related area, which gave each of them an advantage in properly understanding stock-related tweets about IT sector companies. Labeling by two independent students with different knowledge areas, both related to this thesis, brings more accuracy to the manually selected labels. The tweets are classified as either positive, negative or neutral, where neutral stands for tweets with no specific polarity or with a mixed sentiment (compound or multidimensional sentiment).
Because sentiment analysis is a difficult task for an automated system, just as for a human, due to a certain level of subjectivity that is always present, as shown in the study of Wilson et al., 2005 [1], it was decided to rely on the patterns created in the work by Maggioni et al., 2012 [2] for manual sentiment classification:
Tweets reporting uncertainty should be categorized as negative, as uncertainty is expected to have a negative impact on stock market prices.
Tweets about commercials, ads and new products should be categorized as positive, because usually a user shares these contents only if he enjoys them.
Tweets reporting news should be categorized with the same sentiment as the news.

Tweets reporting generic or ambiguous news should be categorized as neutral.
For each dataset it was chosen to perform the manual labeling task over the time period from the 17th of April until the 24th of April, 2013. This period contained 2,212 tweets for Microsoft, out of which 778 tweets were manually classified as positive, 337 as negative and 1,096 as neutral; and 4,540 tweets for Google, out of which 1,350 were classified as positive, 805 as negative and 2,377 as neutral. Table 1 shows the day-by-day distribution of sentiment for Microsoft Inc., and Table 2 for Google Inc.
Microsoft Inc.: Total amount, Positive, Negative, Neutral (rows: Wed, Thu, Fri, Mon, Tue, Wed)
Table 1 Microsoft Inc. daily Twitter sentiment distribution
Google Inc.: Total amount, Positive, Negative, Neutral (rows: Wed, Thu, Fri, Mon, Tue, Wed)
Table 2 Google Inc. daily Twitter sentiment distribution
Since the model was trained on real, non-normalized tweet datasets, the sentiment distribution and the overall amounts of tweets are uneven between days, and the datasets for Microsoft and Google are of different sizes for the same time period.

As can be noticed in Tables 1-2, the 20th and 21st of April are skipped. This is because the Nasdaq Stock Market's trading hours are Monday through Friday; therefore, the tweets from Saturdays and Sundays are added to those of the Monday working hours, since that is the day they will possibly influence. Moreover, the tweets counted for a day are those from 16:00 of the previous day until 16:00 of the current day in the American Eastern Time zone (New York), because the Nasdaq Stock Market closes at 16:00 and it is assumed that all tweets posted after the closing hour of the stock market are to be correlated with the next day's stock market performance. During the time period of the Twitter and stock market datasets there weren't any holidays on which the Nasdaq Stock Market wasn't working normal trading hours [78].
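The assignment of a tweet to a trading day, as described above, can be sketched in Python as follows. This is an illustration under stated assumptions and not the Java code of the thesis: the function name is hypothetical, a tweet posted at exactly 16:00 is assumed to count for the next day, holidays are ignored (none occurred in the period), and a fixed UTC-4 offset is used because the whole data period (17 April to 24 May 2013) falls within US daylight saving time. A general solution would use a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

# Eastern Daylight Time as a fixed offset; valid for the whole data period only.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))
MARKET_CLOSE_HOUR = 16


def trading_day(utc_timestamp):
    """Return the trading day (a date) to which a tweet is attributed."""
    local = utc_timestamp.astimezone(EASTERN_DAYLIGHT)
    day = local.date()
    # Tweets posted after the market close count for the next day.
    if local.hour >= MARKET_CLOSE_HOUR:
        day += timedelta(days=1)
    # Saturday (5) and Sunday (6) are moved forward to Monday.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day


# Friday 19 Apr 2013, 17:00 New York time, is attributed to Monday 22 Apr 2013.
print(trading_day(datetime(2013, 4, 19, 21, 0, tzinfo=timezone.utc)))
```

Counting the tweets per returned date and per sentiment label then gives the daily volumes used in the regression analysis.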

Cross-validation
Due to the limited size of the labeled corpus (2212 tweets in the dataset for Microsoft and 4540 tweets in the dataset for Google), and considering the over-fitting problem, the n-fold cross-validation method was chosen to estimate the accuracy of the sentiment classification approach. Empirically, n = 10 was chosen as the value giving the best performance. N-fold cross-validation is performed by dividing the randomized data into n equal parts, out of which one part (1/n of the data) is used as the test dataset and the rest as the training dataset. This cross-validation process is then repeated n times, with each of the n data parts used as the test dataset exactly once. The advantage of this approach is that all observations are used for both training and testing, and each observation is used for testing exactly once.
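The n-fold procedure can be sketched as below. This Python generator is only an illustration with a hypothetical name and a fixed random seed chosen for reproducibility; in the setting of the thesis the items would be whole tweets, not single tokens, and when the number of items is not divisible by n the parts differ in size by at most one.

```python
import random


def n_fold_splits(items, n=10, seed=0):
    """Yield (train, test) pairs so that every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into n parts of (almost) equal size.
    folds = [shuffled[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# 25 placeholder items stand in for labeled tweets.
for train, test in n_fold_splits(range(25), n=10):
    print(len(train), len(test))
```

Training and evaluating a model on each pair and averaging the resulting measures gives the "average over 10-folds" figures reported in the Results chapter.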

Sentiment classification
Sentiment classification was performed over the manually labeled datasets described in subchapter 5.2, using the CRF++ tool described in the previous chapter (4.3), which was trained using the predefined feature templates introduced in subchapter 4.2. Then, from the results of the experiments with various alterations of parameters and features, which are described in the Results chapter of the present thesis, the best classification model was chosen and used to classify the tweets of the rest of the time period, from the 25th of April until the 24th of May, 2013. Table 3 presents the obtained volumes of sentiments for the companies in the experiment.

DATE; Microsoft Inc.: Total, Positive, Negative, Neutral; Google Inc.: Total, Positive, Negative, Neutral (one row per trading day from Thu, 25 Apr to Fri, 24 May)
Table 3 Volumes of sentiments for the companies Microsoft and Google in the time period from 25th of April until 24th of May

Stock prices datasets
To build the regression model for predicting stock market values from company-targeted Twitter sentiment, it was necessary to create datasets of stock quotes of the companies in the experiment (Microsoft and Google) for the same time period as the Twitter datasets, from the 17th of April, 2013, until the 25th of May, 2013. The stock prices refer to the closing prices of the underlying shares, obtained through the Bloomberg service, considered by many the most reliable source of information concerning the financial markets.

Chapter 6 Results
This chapter presents the various experiments performed in the present thesis work and the results obtained, with the corresponding discussion.
6.1 Classifier's Performance
The primary experiments were conducted on classifying the Twitter micro-blogs according to the sentiment within them, using the CRF++ tool and different templates in the search for the best classifier accuracy.
First of all, a number of experiments were performed using simple unigrams and bigrams to compare their performance. As in most of the previous works on the classification task, it was found that bigrams significantly increase the accuracy over unigrams. For example, in an experiment conducted for Microsoft Inc. using a unigram template considering the previous and current words with the POS tag of the current word, a 72% average accuracy over 10 folds was obtained, while for the bigram template with the same features the average accuracy was 81%, an increase of 9 percentage points.
Next, numerous experiments were conducted with bigram templates with varying features. Below is the list of bigram templates used:
Simple: current word and its POS tag;
Previous: previous and current words and their POS tags;
Prev+Next: previous, current and next words and their POS tags;
Prevprev+Nextnext: two previous words, current word and two next words and their POS tags;
Word_combinations: includes the Prevprev+Nextnext template features and the combinations word before previous word/previous word, previous/current, current/next and next/next after next words, as can be seen below:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]
U06:%x[-1,0]/%x[0,0]
U07:%x[0,0]/%x[1,0]
U08:%x[1,0]/%x[2,0]
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
B
Word_combinations + -f 2: the template described above with the training parameter f, for CRF++ to use only features which appeared at least 2 times in the training data.
Word_combinations + -f 4: the template described above with the training parameter f, for CRF++ to use only features which appeared at least 4 times in the training data.
Word_combinations + -c 3: the template described above with the training hyper-parameter c.
Word_combinations + -c 8: the template described above with the training hyper-parameter c.

Table 4 shows the average accuracy over 10 folds obtained in the experiments with the templates introduced above:
Template             Microsoft Inc.   Google Inc.
Simple               80%              78.92%
Previous             80.35%           79.35%
Prev+Next            80.56%           79.16%
Prevprev+Nextnext    80.64%           78.89%
Word_combinations    80.79%           79.67%
Table 4 Classifier's accuracy, average over 10-folds, for Microsoft Inc. and Google Inc. with various templates.
As for the training parameters tested in the search for an increase in the classifier's performance, Table 5 presents some of the results:
Template w/param          Google Inc.
Word_combinations + -f    %
Word_combinations + -f    %
Word_combinations + -c    %
Word_combinations + -c    %
Table 5 Training parameters effect on the classifier's performance, on Google Inc. dataset
Varying the training or testing parameters and variables presented in subchapter 4.3 of the present thesis work, as visible in the examples of Table 5, not only didn't increase the classification accuracy, but in most cases introduced a significant decrease. This shows that for the experiments carried out in this study these parameters weren't relevant, either because of the size of the data sample or because of the frequency of the features.

From Table 4 it can be observed that the accuracy didn't vary much when the features were altered, though for both companies, Microsoft Inc. and Google Inc., the best performance was obtained using the Word_combinations template. This template was therefore chosen for labeling the following month-long time period, from the 25th of April until the 24th of May, 2013, to produce the daily volume of Twitter sentiments necessary for the next task of finding its correlation with the stock values of these companies, as described in the following subchapters. The results of the evaluation of the classifier's performance, including accuracy, precision, recall and F-measure calculated as the average over 10 folds, can be observed in Table 6 for Microsoft Inc. and Table 7 for Google Inc.
Microsoft Inc.   Accuracy   Precision   Recall   F-measure
Total            80.79%     67.57%      63.31%   65.36%
Positive         78.45%     68.98%      72.03%   70.46%
Negative         86.17%     58.66%      36.22%   44.65%
Neutral          77.75%     75.06%      81.66%   78.19%
Table 6 Classifier's performance measures for Microsoft Inc. dataset, average over 10-folds.
Google Inc.      Accuracy   Precision   Recall   F-measure
Total            79.67%     70.92%      63.07%   66.77%
Positive         78.66%     65.95%      60%      62.82%
Negative         87.03%     77.11%      45.10%   56.86%
Neutral          73.20%     %           84.14%   76.24%
Table 7 Classifier's performance measures for Google Inc. dataset, average over 10-folds.
Tables 8 and 9 present the performance measures, namely accuracy, precision, recall and F-measure, of the classification models for each company of the

experiment, selected as the best-performing ones out of the 10 folds introduced in Tables 6 and 7.
Microsoft Inc.   Accuracy   Precision   Recall   F-measure
Total            81.67%     72.28%      63.47%   67.59%
Positive         77.72%     73.49%      70.93%   72.18%
Negative         88.15%     71.42%      32.25%   44.44%
Neutral          79.14%     71.92%      87.23%   78.84%
Table 8 Classification models performance for Microsoft Inc.
Google Inc.      Accuracy   Precision   Recall   F-measure
Total            80.08%     70.87%      63.57%   67.02%
Positive         79.31%     66.31%      60.21%   63.12%
Negative         87.23%     75.52%      46.25%   57.37%
Neutral          73.70%     70.77%      84.25%   76.92%
Table 9 Classification models performance for Google Inc.
Discussion of the classifier's performance
Since the training sets for both companies were unbalanced, not of a large scale, and manually labeled with inherent subjectivity, the achieved results can be considered acceptable. In the present work only word combinations and Part-of-Speech tags were used as features. These reached surprisingly good results, considering that short Twitter messages present a very complex task for sentiment classification due to their unusual abbreviations and lexicon; the task becomes especially complex for micro-blogs related to stock market

chatter about concrete companies, as in the present study, where the lexicon becomes even more compact and domain-specific for each company. Even for a human being the task of sentiment classification is considered a complex one: according to research, humans disagree 21% of the time [19], so the manual labeling itself could have contained disagreements. It is a common assumption that automatic sentiment analysis will never be as accurate as humans, because a classifier doesn't pick up, for example, sarcasm; but even if the classifier were 100% accurate, humans would agree with it only 79% of the time [19].
The tweets of both datasets were labeled with sentiment in relation to the positiveness or negativeness for the company's stock price, and perhaps not always in relation to the global sentiment of the concrete tweet. For example, the tweet below was manually labeled as Negative, obviously because "Apparently $GOOG has a problem" is a negative sentiment in relation to Google Inc. The classifier, however, labeled it as Positive, perhaps due to the positive first part of the tweet, "This is a nice one", which could have been counted as a Positive sentiment if the person wanted Google's stock price to go down.
This is a nice one. Apparently $GOOG has a problem. As seen on thinkberg Wed, 17 Apr :17:
Possible directions for increasing the classifier's performance are presented in the Future work part of the Conclusions chapter.

Classification Results and Model Adherence

Before proceeding, a crucial disclaimer must be made at this stage of the work: correlation does not necessarily imply causality. For this project, this means that observing a high number of positive tweets may not be (and very likely is not) the reason why the stock performance on the studied day was positive. On the contrary, it may simply mean that the stock variation caused a positive reaction among Twitter users, who in turn posted positive content about the company.

In other words, this study is not an attempt to create a tool for anticipating stock market movements, but simply an attempt to validate an alternative classification method, based on the relationship between what is tweeted and the stock market movements. The temporal granularity used in this project (results grouped by trading day) makes it infeasible to use this algorithm for prediction as it stands, which reinforces the purely academic character of the verification applied next. This does not mean, however, that the presented results cannot serve as a starting point for modified versions of the developed tools that seek to predict stock market movements, which makes this work potentially useful for commercial applications.

Classification Results and Stock Price Behavior

The analysis of the achieved results begins with a comparison of the plotted classification results and the behavior of the stock prices. The preliminary objective is to identify visually whether a similar trend is present in the number of sentiment-classified tweets and in the movements of the stocks' closing prices and traded volumes.
Common sense suggests a positive correlation between the number of positive-classified tweets and an upward movement of the stock price, as well as between the number of negative-classified tweets and a downward trend in the closing price of the shares. Accordingly, one of the analyses compares the stock price with the net count of positive-classified tweets, i.e. the number of positive-classified tweets minus the number of negative-classified tweets. This score should show a fair level of correlation with the closing price of the stock over time.
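The two quantities compared in this analysis, the net positive score and the daily percent change of the closing price, can be sketched as follows. This is a minimal illustration: the function names and all sample numbers are hypothetical and are not taken from the datasets of this study.

```python
def net_positive(positive, negative):
    """Net count: positive-classified tweets minus negative-classified tweets."""
    return positive - negative


def daily_pct_change(closes):
    """Relative change of each closing price versus the previous trading day."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]


# Hypothetical example values, NOT data from the thesis.
counts = [(120, 40), (90, 110), (150, 60)]  # (positive, negative) tweets per day
closes = [100.0, 102.0, 99.96]              # closing prices in USD

net_scores = [net_positive(p, n) for p, n in counts]
changes = daily_pct_change(closes)
```

Note that the first trading day has no previous close, so the list of changes is one element shorter than the list of closing prices.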

A second question is how this comparison should be carried out. The analysis may be conducted on a daily basis, trying to understand whether the daily percent variation of the closing price can be related to the net positive count of that single day. However, this methodology has limitations because of the inertia of the stock markets: the positive sentiment identified in the tweets of a given day may affect the stock value on the following days, rather than being restricted to an immediate effect on the price variation. For example, a company may announce a highly relevant positive fact that takes stock analysts some time to analyze and interpret properly, so that its impact appears not on the same day but over the course of the following ones.

It is therefore reasonable to suggest a cumulative score as the predictor for the stock values. To cover this issue, a cumulative score variable is defined here: the total number of positive-classified tweets minus the total number of negative-classified tweets over the current and all previous days under analysis.

Finally, it is relevant to assess whether there is any correlation between the total number of tweets on a given day and the total traded volume of the given security. It is expected that the higher the number of tweets on a certain day, the higher the traded volume of the studied stock.

Microsoft Inc.

For Microsoft, the obtained results are summarized in the chart below:

[Chart: number of tweets (Total, Positive, Negative, Neutral) and Closing Price (USD) per trading day]
Chart 1 Sentiments and Closing price, Microsoft Inc.
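The cumulative score variable defined in this section is a running total of the daily net positive counts. A minimal sketch, with a hypothetical function name and made-up daily values, could look like this:

```python
from itertools import accumulate


def cumulative_net_positive(daily_net):
    """Running total of the daily net positive score (current plus all previous days)."""
    return list(accumulate(daily_net))


# Hypothetical daily net positive scores, NOT data from the thesis.
daily_net = [80, -20, 90, -30]
cumulative = cumulative_net_positive(daily_net)
```

A day with more negative than positive tweets lowers the running total without resetting it, which is how the score carries sentiment over to the following days.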

As explained, the chart above does not provide enough visibility for a proper analysis of the obtained results against the actual variation of the stock price. Chart 2, presented next, plots the variation of the stock price (compared with the closing price of the previous day), the number of positive-classified tweets, the number of negative-classified tweets and the net number of positive tweets, as explained:

[Chart: number of tweets (Net Positive, Positive, Negative) and closing price variation (Price Change, %) per trading day]
Chart 2 Sentiments, Net Positive and Price change, Microsoft Inc.

As an initial analysis, the classification results seem to fit the change in the closing price worse than what was observable in chart 1, showing a visible positive correlation, with lower volatility, only over the period covered by the manually labeled sentiments. This may be due to the relatively small data sample used for training the classification model, or to any of the other difficulties of the sentiment classification task discussed earlier in this thesis.

To understand the cumulative relationship between the sentiments people expressed via Twitter and the price change of the underlying security, a chart was also produced comparing the accumulated value of the net positive tweets (as explained in the previous section) with the stock closing price, for every studied day:

[Chart: accumulated number of tweets (Net Positive Accumulated, Neg. Accumulated, Pos. Accumulated) and Closing Price (USD) per trading day]
Chart 3 Accumulated Net Positive and the Closing price, Microsoft Inc.

In this chart, the correlation between the results obtained through the sentiment classification and the observed stock performance is clearly visible. The plots follow similar patterns, even if the closing price shows higher volatility.

Lastly, the traded volume for Microsoft was plotted along with the total number of tweets for each given day. Although this comparison says nothing about the quality of the classification procedure, it is important for understanding the relationship between social networking activity and real-world phenomena.

[Chart: total number of tweets (Total) and traded volume (Volume) per trading day]
Chart 4 Total number of tweets and traded volume, Microsoft Inc.

In what is probably the best fit achieved in the whole Microsoft experiment, the traded volume showed a strong (visual) correlation with the number of tweets about the stock. This implies that larger movements on the stock market are also reflected in a larger number of posts about the security on the social networks.

Google Inc.

For Google, the initial results obtained are presented in the chart below:

[Chart: number of tweets (Total, Positive, Negative, Neutral) and Closing Price (USD) per trading day]
Chart 5 Sentiments and Closing Price, Google Inc.

Once more, a chart comparing the number of positive-classified tweets, negative-classified tweets and net positive tweets with the daily price change of the stock was created, for a more comprehensive understanding:

[Chart: number of tweets (Net Positive, Positive, Negative) and closing price variation (Price Change, %) per trading day]
Chart 6 Sentiments, Net Positive and Price change, Google Inc.

Google's results are somewhat clearer than the ones achieved for Microsoft: there is a visible adherence between the net positive value and the daily performance of the stock. This may be due to the larger sample (number of tweets studied daily, or number of tweets used for the classifier training set) in this second case, or simply to a higher share of tweets discussing topics related directly to the stock performance itself rather than to the long-term results of the company.

As was done with the Microsoft results, a chart comparing the accumulated results was also produced:

[Chart: accumulated number of tweets (Net Positive Accumulated, Neg. Accumulated, Pos. Accumulated) and Closing Price (USD) per trading day]
Chart 7 Accumulated Net Positive and the Closing price, Google Inc.

As expected, the results presented here once more show a good adherence of the classification model to the observed performance of the stock price. The higher volatility of the price was observed again, but the similar trend was also evident.

As for the comparison of the observed volume and the total number of tweets for each trading day, the results are plotted below:

[Chart: total number of tweets (Total) and traded volume (Volume) per trading day]
Chart 8 Total number of tweets and traded volume, Google Inc.

As observed with Microsoft, this comparison was probably the one with the closest fit among all those presented for the Google experiment. This validates the expectation that notable facts which may affect the number of trades in a day also affect, in a similar manner, the volume of mentions of that company across the social networks.

Regression Analysis

To further understand the adherence of the classification results to the real-world observations, a statistical regression analysis was performed in this thesis to evaluate the correlation between the observed behavior of the stock price and the classification results, using the same pairs of variables as in the previous sections. The objective of this section is to go beyond the simple visual analysis of the plotted data and to provide some statistical certainty for the conclusions inferred.

Microsoft Inc.

As in the previous section, the same sets of variables are tested in the regression, using the Minitab 16 software. The first test concerns the significance of the net positive tweets as the independent explanatory variable for the daily change, in percent, of the stock closing price. Looking first at the residual analysis below, it is possible to infer that the dataset is suitable, with no observations showing large residuals.

[Minitab diagnostic report: Regression for Price Change (%) vs Net Positive. Plot: Residuals vs Fitted Values]
Figure 9 Residual analysis for Price change and Net positive, Microsoft Inc.
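The idea behind a residuals-versus-fitted-values check can be sketched in a few lines. The example below fits a simple linear model by ordinary least squares and flags observations whose residual exceeds twice the residual standard error. This threshold is a common rule of thumb chosen here as an assumption; it is not necessarily the criterion Minitab applies, and the function names and data are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def flag_large_residuals(xs, ys, threshold=2.0):
    """Indices whose |residual| exceeds threshold * residual standard error."""
    intercept, slope = linear_fit(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard error with n - 2 degrees of freedom (needs n > 2).
    s = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * s]


# Hypothetical data: a straight line y = 2x + 1 with one distorted point at index 5.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 30
flagged = flag_large_residuals(xs, ys)
```

In this made-up series only the distorted observation is flagged; with real data the flagged points would then be inspected for singular events, as is done for the outliers discussed in this section.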
The regression results, on the other hand, matched what one could expect from chart 2: the relationship between the net positive value and the stock price variation is not statistically significant at the 5% significance threshold.

[Minitab summary report: Regression for Price Change (%) vs Net Positive. Y: Price Change (%); X: Net Positive. P = 0,064: the relationship is not statistically significant (p > 0,05). Fitted line plot for cubic model; R-sq (adj) = 15,83%]
Figure 10 Regression for Net positive and Price change, Microsoft Inc.

Nonetheless, the fitted equation was a cubic one, shown in the figure above.

As for the accumulated net positive value versus the closing price behavior, the results were remarkably different. Although the residual analysis showed 2 observations with somewhat large residuals, the points were better distributed along the fitted-value axis, which speaks in favor of the quality of the regression (Ramos, 2010 [80]). The outliers may correspond to unusual price variations caused by singular events in the sample.

[Minitab diagnostic report: Regression for Closing Price vs Net Positive Accumulated. Plot: Residuals vs Fitted Values]
Figure 11 Residual analysis for Closing price and Accumulated net positive, Microsoft Inc.

For this study, the analysis produced a quadratic equation with excellent statistical significance and R² value. This means that the accumulated net positive value correlates well with the observed closing price, and that the fitted equation explains a large portion (over 90%) of the observed variation.

[Minitab summary report: Regression for Closing Price vs Net Positive Accumulated. Y: Closing Price; X: Net Positive Accumulated. P = 0,000: the relationship is statistically significant (p < 0,05). Fitted line plot for quadratic model (intercept 28,81); R-sq (adj) = 90,72%]
Figure 12 Regression for Closing price and Accumulated net positive, Microsoft Inc.

Finally, for the traded volume as a function of the number of tweets on a given day, Minitab provided the following results:

[Minitab diagnostic report: Regression for Volume vs Total. Plot: Residuals vs Fitted Values]
Figure 13 Residual analysis for Traded volume and Total number of tweets, Microsoft Inc.

An observation must be made about the outliers identified in the analysis: the point at the bottom, with far less volume than would be expected for its number of tweets, corresponds to the 21st of May, the date on which Microsoft presented its new video-game console, Xbox One. Even if this event could affect the traded volume of the stock, it certainly also attracted the attention of people who do not trade stocks at all, creating this unusual observation.

[Minitab summary report: Regression for Volume vs Total. Y: Volume; X: Total. P = 0,001: the relationship is statistically significant (p < 0,05). Fitted line plot for linear model; R-sq (adj) = 31,21%; correlation r = 0,58 (positive: when Total increases, Volume also tends to increase)]
Figure 14 Regression for Traded volume and Total number of tweets, Microsoft Inc.

Even with this point included, the statistical significance of the relationship between the two variables was confirmed, and a linear equation relating them was obtained. The R² value was lower than expected, although it reached a higher value (over 47%) when the 21/05/2013 outlier was removed from the dataset, as can be seen below:

[Minitab summary report: Regression for Volume_1 vs Total_1. Y: Volume_1; X: Total_1. P = 0,000: the relationship is statistically significant (p < 0,05). Fitted line plot for linear model; R-sq (adj) = 47,31%; correlation r = 0,70 (positive: when Total_1 increases, Volume_1 also tends to increase)]
Figure 15 Regression for Traded volume and Total number of tweets (without an outlier of 21/05/2013), Microsoft Inc.

Google Inc.

Following a procedure analogous to the one applied to Microsoft Inc., the first regression analysis carried out for Google used the daily net positive value as the explanatory variable for the daily price change of the Google stock.

[Minitab diagnostic report: Regression for Price Change (%) vs Net Positive. Plot: Residuals vs Fitted Values]
Figure 16 Residual analysis for Price change and Net positive, Google Inc.

The residual plot exhibits 2 outliers, but in general the regression results, as shown below, were more satisfactory than the ones in the equivalent Microsoft experiment:

[Minitab summary report: Regression for Price Change (%) vs Net Positive. Y: Price Change (%); X: Net Positive. P = 0,001: the relationship is statistically significant (p < 0,05). Fitted line plot for linear model; R-sq (adj) = 32,37%; correlation r = 0,59 (positive: when Net Positive increases, Price Change (%) also tends to increase)]
Figure 17 Regression for Price change and Net positive, Google Inc.

The obtained p-value indicates a significant linear regression between the two variables, with a positive correlation, as expected. However, the R² value of 32,37% indicates a low explanatory capacity of the model, which was able to explain only about one third of the observed variation in the Y variable (change in price).

As for the accumulated net positive value as an explanatory variable for the closing price, once more the results were remarkably good.

[Minitab diagnostic report: Regression for Closing Price vs Net Positive Accumulated. Plot: Residuals vs Fitted Values]
Figure 18 Residual analysis for Closing price and Accumulated net positive, Google Inc.

A single outlier and a well-distributed residual plot indicate a good set of observations for performing the regression analysis.

[Minitab summary report: Regression for Closing Price vs Net Positive Accumulated. Y: Closing Price; X: Net Positive Accumulated. P = 0,000: the relationship is statistically significant (p < 0,05). Fitted line plot for cubic model (intercept 770,7; linear coefficient 0,06772); R-sq (adj) = 97,56%]
Figure 19 Regression for Closing price and Accumulated net positive, Google Inc.

The cubic equation relating the Closing Price to the Net Positive Accumulated has a near-perfect adherence to the observations, with a p-value reported as 0,000 and an R² value of 97,56%. What appears in the upper-right part of the plotted results is, however, unexpected: the accumulated net positive grows while the price moves down. To address this issue, a linear regression was also carried out. It produced a slightly worse residual plot (in the sense that the observations were less spread apart), but still a high level of significance and a good R² of about 76%.
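The trade-off between the cubic and the linear model can be illustrated with the adjusted R², which penalizes every additional polynomial term. The sketch below is a plain least-squares polynomial fit via the normal equations, written only for illustration: it is not the procedure Minitab uses internally, the function names are hypothetical, and the data are made up. It assumes more distinct x values than polynomial coefficients.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [b0, b1, ...] via the normal equations."""
    n = degree + 1
    # Augmented matrix (X^T X | X^T y) of the normal equations.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - tail) / a[i][i]
    return coeffs


def adjusted_r2(xs, ys, coeffs):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p predictors."""
    n, p = len(xs), len(coeffs) - 1
    mean_y = sum(ys) / n
    fitted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical curved data (y = 1 + 2x + 3x^2), NOT data from the thesis.
xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
linear = polyfit(xs, ys, 1)
quadratic = polyfit(xs, ys, 2)
```

A higher-degree model always fits the sample at least as well, but it can bend in ways that have no meaningful interpretation outside the observed range, which is the kind of behavior noted for the upper-right part of the cubic fit.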

[Minitab diagnostic report: Regression for Closing Price vs Net Positive Accumulated. Plot: Residuals vs Fitted Values]
Figure 20 Residual analysis for Closing price and Accumulated net positive, Linear regression, Google Inc.

[Minitab summary report: Regression for Closing Price vs Net Positive Accumulated. Y: Closing Price; X: Net Positive Accumulated. P = 0,000: the relationship is statistically significant (p < 0,05). Fitted line plot for linear model: Y = 797,8 + 0,04034 X; R-sq (adj) = 75,89%; correlation r = 0,88 (positive: when Net Positive Accumulated increases, Closing Price also tends to increase)]
Figure 21 Linear regression for Closing price and Accumulated net positive, Google Inc.

As for the regression analysis relating the total number of tweets on each given day to the total traded volume on that day, the results provided by Minitab 16 can be found below:

[Minitab diagnostic report: Regression for Volume vs Total. Plot: Residuals vs Fitted Values]
Figure 22 Residual analysis for Traded volume and Total number of tweets, Google Inc.

The single outlier found corresponds to the trading day of 19/04, which followed the earnings release of Google Inc. The only conclusion that can be drawn from this outlier is that the number of tweets on that day was disproportionately small compared with the increase in traded volume.

[Minitab summary report: Regression for Volume vs Total. Y: Volume; X: Total. P = 0,000: the relationship is statistically significant (p < 0,05). Fitted line plot for linear model; R-sq (adj) = 59,03%; correlation r = 0,78 (positive: when Total increases, Volume also tends to increase)]
Figure 23 Regression for Traded volume and Total number of tweets, Google Inc.

The linear equation obtained showed a very satisfactory p-value and an explanatory capacity (R²) of around 60%. The r value also indicates a positive correlation between the total number of tweets and the traded volume, which confirms the previous findings.
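For a linear regression with a single explanatory variable, the unadjusted R² equals the square of the Pearson correlation coefficient r, so the reported r = 0,78 corresponds to an R² of roughly 0,61, in line with the value of around 60% discussed above. The sketch below shows this relationship. The function names are illustrative, and the sample size n is left as a parameter because the number of trading days is not restated in this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def adjusted_r2_linear(r, n):
    """Adjusted R^2 of a one-predictor linear regression, from r and sample size n."""
    return 1 - (1 - r * r) * (n - 1) / (n - 2)


# r as reported in Figure 23; squaring it gives the unadjusted R^2.
r_reported = 0.78
r2_unadjusted = r_reported ** 2
```

The adjusted value is always slightly below the unadjusted one for a finite sample, which is why the R-sq (adj) printed in the report is a little lower than r².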

Discussion of the Results

The first point to analyze is how the experiment differed between the two companies. Microsoft had a more limited number of tweets per day, yielding a more restricted training set for the classification algorithm. As a result, one would expect a lower degree of adherence between the classified tweets and the stock prices for Microsoft than for the Google experiment. Although some initial evidence for this proposition could already be seen in the graphical analysis stage, it became clearer, and was confirmed, once the numerical statistical analysis was performed. The results are summarized below in Table 10:

Analysis                              Measure    Microsoft    Google
Change (%) v.s. Net Positive          Equation   Cubic        Linear
                                      P-Value    6.40%        0.10%
                                      R sq                    32,37%
Closing Price v.s. Ac. Net Positive   Equation   Quadratic    Cubic
                                      P-Value    0.00%        0.00%
                                      R sq                    97,56%
Volume v.s. Total Tweets              Equation   Linear       Linear
                                      P-Value    0.10%        0.00%
                                      R sq                    59,03%
Table 10 Regression results

As becomes evident, the larger sample used in the Google experiment is likely to have produced a more reliable classification mechanism, at least when judged by the adherence of the results to the stock performance variables.

It is also important to mention that the first proposed correlation analysis, which tried to find a relationship between the number of net positive tweets and the daily price change of the underlying stock, did not perform as well as the next one. This indicates that the proposition presented in the first pages of this section, which argued for the existence of some inertia in the movement of stock prices, could be important for this experiment.

It is equally relevant to mention that this experiment turned out to have an important limitation: generally speaking, both stocks performed positively over the period studied.
In this sense, there could have been different results if the stocks chosen

94 94 would happen to have performed poorly during the tweet collection phase. However, for some small periods the stock prices did present declines, movements which were also noticeable in smaller increases (or even decreases) in the accumulated net positive values. Finally, the correlation in-between the traded volume and the number of tweets appears to hold in most cases, except when a very specific case affects more the financial markets than the social media users in general (as earnings release) or the other way around (as the announcement of Xbox One). It appears, in this sense, that the classification did perform, especially in the Google s case, quite well, at least confirming what could be expectable to see as a reflex in the stock market prices of the company-related sentiment of the social media users.
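The two sentiment series discussed above, the daily net positive value and its accumulated counterpart, can be sketched as follows. The sketch assumes that the net positive value of a day is the number of tweets labeled positive minus the number labeled negative, and that the accumulated value is its running sum; the label strings and function names are hypothetical.

```python
from itertools import accumulate


def daily_net_positive(labels_by_day):
    """Return positive-minus-negative tweet counts, one value per day.

    labels_by_day is a list with one entry per trading day; each entry
    is the list of sentiment labels assigned to that day's tweets.
    Labels other than "positive" and "negative" are ignored.
    """
    return [
        day.count("positive") - day.count("negative")
        for day in labels_by_day
    ]


def accumulated_net_positive(net_values):
    """Running sum of the daily net positive values."""
    return list(accumulate(net_values))


# Made-up labels for three days, for illustration only.
labels = [
    ["positive", "positive", "negative"],
    ["negative", "neutral"],
    ["positive"],
]

net = daily_net_positive(labels)
acc = accumulated_net_positive(net)
print(net)
print(acc)
```

Because the accumulated series only falls on days when negative tweets outnumber positive ones, it behaves like a price level rather than a daily change, which is consistent with comparing it against the closing price instead of the daily percentage change.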

Chapter 7 Conclusions

7.1 Contributions and conclusions

This thesis addressed the task of finding correlations between stock market values and stock-related Twitter sentiment for selected companies in the IT sector. Sentiment classification is a complex task, especially when performed over short Twitter messages, and it poses additional challenges when applied to micro-blogs in the financial domain: these are often even shorter than Twitter's 140-character limit, contain very specific language and abbreviations, and use many words that take on different meanings and are associated with distinct emotions.

Using word combinations and Part-of-Speech tags as features, the sentiment classification in this study achieved an accuracy of 81.67% for Microsoft Inc. and 80.08% for Google Inc. The highest accuracy was achieved for the negative class: 88.15% for Microsoft Inc. and 87.23% for Google Inc.

The datasets labeled with the CRF classification models built for each company were then used to perform a regression analysis against the companies' stock values over the same time period, which indicated interesting trends and correlations. For Microsoft Inc., the relationship between the net positive value and the stock price fluctuation was not statistically significant at the 5% threshold, perhaps because the data sample used for training the classification model, or the number of tweets studied daily, was smaller than in the Google Inc. case. Nevertheless, the correlation between the traded volume and the total number of tweets appeared to hold in most cases for both companies. This validates the expectation that events that affect the number of trades in a day are also likely to affect, in a similar manner, the volume of chatter about that company on the Twitter microblogging platform.
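As a reminder of how overall and per-class accuracy figures such as those quoted in this section can be obtained from labeled data, the following is a minimal sketch. It assumes that the accuracy of a class is the fraction of tweets of that gold class that received the correct label; the exact definition used in the evaluation chapter may differ, and the label names are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def per_class_accuracy(gold, predicted):
    """For each gold class, the fraction of its items labeled correctly."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    totals = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: correct[label] / totals[label] for label in totals}


# Made-up labels, for illustration only.
gold_labels = ["negative", "negative", "positive", "neutral"]
model_labels = ["negative", "positive", "positive", "neutral"]

print(accuracy(gold_labels, model_labels))
print(per_class_accuracy(gold_labels, model_labels))
```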
An excellent adherence of the classification model to the observed performance of the stock price was also found in the analysis of the accumulated net positive value versus the stock's closing price for both companies. Google Inc. presented a near-perfect adherence, with a regression p-value of 0.00% and a 97.56% explanatory capacity (R² value), while Microsoft Inc. showed a slightly lower but still remarkable R² value of 90.72%. In this sense, the visible correlations between the company-related sentiments and the stock values also support the quality of the classification models built for the companies in the experiment.

The length of the time period and the number of companies in the experiment do not allow this work to draw strong general conclusions from its results. Nevertheless, the accuracy of the classification models and the adherence found between sentiments and stock market values attest to the relevance of the findings presented in this thesis.

7.2 Future work

This work could be expanded in many areas, with the objective of further refining the methodology and reaching more general conclusions.

Twitter sentiment analysis requires a large data sample to be truly effective. Therefore, a first possible approach to improving the classifier's performance measures would be to carry out the study over a longer time period and to expand the manually labeled data sample. Given the complexity and time consumption of manual data labeling, a crowdsourcing approach could be adopted, with the additional benefit of reducing the subjectivity of the data labels.

As for model refinement, relevant improvements to the classification model could come from introducing more complex features, rather than relying solely on combinations of words and POS tags. More extensive effort could also be devoted to removing noisy data, for example by identifying whether a tweet belongs to the discussion of the company's stock or is merely unrelated chatter, as would have been applicable in the Facebook Inc. case.

The presented approach could also be tested on a wider range of companies and perhaps on different knowledge domains. One important limitation in this respect is that the two companies studied ended up showing very positive trends in their stock prices during the analyzed period.
As mentioned, it would also be desirable to understand the effectiveness of the approaches used for companies in a negative trend.

Another possible direction for future work is to extend the methodologies, approaches and findings presented in this thesis to other languages, through the development of new datasets in those languages as well as through the development of the basic classification rules for them.
