Modelling the Stock Market using Twitter


Modelling the Stock Market using Twitter

M. Sebastian A. Wolfram

THE UNIVERSITY OF EDINBURGH

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2010

Abstract

Stock markets are driven by a multitude of dynamics in which facts and beliefs play a major role in affecting the price of a company's stock. In today's information age, news can spread around the globe, in some cases faster than the events themselves unfold. While this can be beneficial for many applications, including disaster prevention, our aim in this thesis is to use the timely release of information to model the stock market. We extract facts and beliefs from the population using one of the fastest growing social networking tools on the Internet, namely Twitter. We combine Natural Language Processing techniques with a predictive machine learning approach to analyze millions of Twitter posts, from which we draw distinctive features to create a model that enables the prediction of stock prices. We selected several stocks from the NASDAQ stock exchange and collected Intra-Day stock quotes over a period of two weeks. We built different feature representations from the raw Twitter posts and combined them with the stock price in order to build a regression model using the Support Vector Regression algorithm. We were able to build models of the stocks which predicted discrete prices close to a strong baseline. We further investigated the prediction of future prices, on average predicting 15 minutes ahead of the actual price, and evaluated the results using a Virtual Stock Trading Engine. These results were in general promising, but also contained some random variation across the different datasets.

Acknowledgements

I would like to thank Miles Osborne not only for supervising my thesis, but also for being a great mentor during the entire process. I will dearly miss his first question at the beginning of every weekly meeting: "Are we rich yet?" I am dedicating this work to my father, without whom I would not have had the opportunity to begin, nor been able to finish, this master's degree. I am very thankful to my wife, who has greatly supported me in the most difficult moments, especially because she sacrificed a lot for the pursuit of my goals. I want to thank my mom for her constant support and motivational talks, and my sister for sending me a lot of pictures of my cute niece, Amelie.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(M. Sebastian A. Wolfram)

Table of Contents

1 Introduction
  1.1 Main Findings
  1.2 Report Structure
  1.3 Chapter 1 Summary
2 Background
  2.1 Efficient Market and Random Walk Hypotheses
  2.2 Simulated Stock Markets
  2.3 Social Media and Social Networking
  2.4 Literature Review
  2.5 Chapter 2 Summary
3 Methods
  3.1 Framework Design
  3.2 Keyword Expansion
  3.3 Intra-Day Stock Quote Extraction
  3.4 Support Vector Regression
  3.5 Evaluation Methods
    3.5.1 Stock Selection
    3.5.2 Error Measure
    3.5.3 Virtual Stock Trading Engine
  3.6 Chapter 3 Summary
4 Experimental Setup
  4.1 Twitter Data
  4.2 Raw Data Cleanup
  4.3 Feature Construction
  4.4 Dataset Construction
  4.5 Chapter 4 Summary
5 Implementation and Results
  5.1 Simple Bag of Words
  5.2 Moving Average and Stop Words
  5.3 Weighted Query Terms
  5.4 15 Minute Predictions
  5.5 Parameter Validation
  5.6 Accumulating Training Data
  5.7 Chapter 5 Summary
6 Conclusion
  6.1 Discussion
  6.2 Conclusion
  6.3 Future Work
A Stock Charts
B Acronyms
Bibliography

List of Figures

2.1 Percentage of global internet users visiting Twitter.com on a daily basis
2.2 Twitter Input/Output Methods (image from Krishnamurthy et al., 2008)
3.1 Prediction Framework Design
3.2 Example of a maximum margin including its support vectors, indicated as double circles on the margin lines (image from Chen et al., 2005)
3.3 The soft margin loss setting for a linear SVM (image from Schölkopf and Smola, 2002)
3.4 Intel Corporation stock chart snapshot (INTC)
4.1 Samples of posts that have been removed by the Raw Data Cleanup process, where each line represents a separate post
5.1 AAPL Simple Bag of Words - experimentation results
5.2 GOOG Simple Bag of Words - experimentation results
5.3 FSLR Simple Bag of Words - experimentation results
5.4 INTC Simple Bag of Words - experimentation results
5.5 AAPL Simple Moving Average - experimentation results
5.6 GOOG Simple Moving Average - experimentation results
5.7 FSLR Simple Moving Average - experimentation results
5.8 INTC Simple Moving Average - experimentation results
5.9 AAPL query term frequencies of "ipod" and "apple" against the Apple stock price. For the "apple" term, correlation can be observed between high term frequencies and major price changes, as on July 21, where both the price and the term count rise significantly
5.10 AAPL Weighted Query Terms Filter Method - experimentation results
5.11 GOOG Weighted Query Terms Filter Method - experimentation results
5.12 FSLR Weighted Query Terms Filter Method - experimentation results
5.13 INTC Weighted Query Terms Filter Method - experimentation results
5.14 AAPL 15 Minute Predictions - experimentation results
5.15 GOOG 15 Minute Predictions - experimentation results
5.16 FSLR 15 Minute Predictions - experimentation results
5.17 INTC 15 Minute Predictions - experimentation results
5.18 AAPL Multiple Equally Sized Datasets - experimentation results
5.19 AAPL Multiple Equally Sized Datasets - prediction error results
5.20 GOOG Multiple Equally Sized Datasets - experimentation results
5.21 GOOG Multiple Equally Sized Datasets - prediction error results
5.22 FSLR Multiple Equally Sized Datasets - experimentation results
5.23 FSLR Multiple Equally Sized Datasets - prediction error results
5.24 AAPL Accumulating Training Data - experimentation results
5.25 AAPL Accumulating Training Data - prediction error results
5.26 GOOG Accumulating Training Data - experimentation results
5.27 GOOG Accumulating Training Data - prediction error results
5.28 FSLR Accumulating Training Data - experimentation results
5.29 FSLR Accumulating Training Data - prediction error results
A.1 Apple Inc. stock chart snapshot (AAPL)
A.2 Google Inc. stock chart snapshot (GOOG)
A.3 First Solar, Inc. stock chart snapshot (FSLR)

List of Tables

3.1 Google Sets query expansion results
5.1 Simple Bag of Words - experimentation results
5.2 Simple Moving Average - experimentation results
5.3 Weighted Query Terms Filter Method - experimentation results
5.4 15 Minute Predictions - experimentation results
5.5 Multiple Equally Sized Datasets - experimentation results
5.6 Accumulating Training Data - experimentation results

Chapter 1

Introduction

One popular area amongst researchers and financial analysts for pattern recognition and machine learning applications is the highly dynamic and data-intensive domain of financial markets. Besides the evident motivation of gaining an advantage in investment opportunities, predicting the price of a security using various statistical tools, machine learning techniques or fundamental and technical approaches is still a field of extensive ongoing research, and no methods have yet been discovered which can reliably accomplish such a task. Efforts that have attempted to solve this problem have shown merely small and unstable successes. Furthermore, stock market prediction research stands against widely accepted theories which imply that predicting the price of a security is an impossible task. One such theory is the Random Walk hypothesis (Malkiel, 1996), which states that the price movement of a stock is no more predictable than the random selection of successive steps in the positive, negative or equal direction of the value of the stock. Moreover, the efficient-market hypothesis (EMH) (Fama, 1965) says that the prices of securities reflect all available information about the current financial standing of the company, while new information made available or otherwise introduced immediately corrects the value of the stock. Therefore, these hypotheses say that any attempt to predict market values rests solely on chance and that investors placing orders do so at the securities' intrinsic value rather than at an anticipated lower buying or higher selling value. On the other hand, an experiment conducted by (Lebaron, 1999), in which an artificial stock market was created to study the decision processes of stock traders based on the timely introduction of information, revealed a lag between the time that information was introduced and the time the market would adjust itself. To exploit this, research in the analysis of textual data has found subtle success in predicting stock market prices using text mining and Natural Language Processing (NLP) techniques.

These methods usually extract information from various sources on the web, including news wires, personal and company websites and blogs, as well as social networking communities and micro-blogs.

In today's information age, opinions, facts and random chatter are created and exchanged at extraordinary rates. There are many different types of media individuals use to share information, but it is the infrastructure of wireless networking, in combination with small and inexpensive mobile devices, which made this explosion of fast data and information exchange possible. For individuals involved in this social ensemble, reading and submitting status updates have become a new way of life. While many people send their information privately to other groups or individuals, developments in social networking have created networks which are open to the world. A virtual environment for exchanging information and communicating is a great sandbox in which to learn about the opinions of groups that relate to topics of interest. Unfortunately, not all data roaming in the social network cloud is meaningful information, which is a problem both for individuals trying to pay attention to status updates and for researchers trying to mine for relevant topics while ignoring the noise and spam that surrounds them.

Using Twitter as a rapid information source has proved useful in various scenarios, such as analyzing and predicting the spread of, and beliefs about, the recent swine flu pandemic (Ritterman et al., 2009; Harvey, 2009), but also for potential early-alert systems for earthquakes, since people post Twitter messages as soon as events happen. Generally, these submissions occur within 20 seconds of stronger occurrences in areas with higher technology density (Earle, 2010). Twitter is also an enormous discussion forum for many technical and economic topics, letting people express their sentiment about products, services and entire organizations. Directly gathering this information from the population has been shown to be a great source of valuable input for analyzing a company's branding success and incorporating it in its overall branding strategy (Jansen and Zhang, 2009).

In this thesis, the problem of stock market prediction is coupled with a complex Information Retrieval (IR) task, and we attempt to solve it using NLP techniques, by transforming the raw Twitter posts into linguistic textual representations such as the bag of words model. We use different filtering methods to reduce the dimensions of our raw data and finally approach the task of predicting a price by building a regression model using the Support Vector Regression (SVR) machine learning algorithm from statistical learning theory (Vapnik, 1998).

Our data for text analysis comes from the micro-blogging community Twitter and is made available through the University of Edinburgh, School of Informatics, using the Twitter Streaming API. As we will describe in the next chapter, there are various incentives to choose Twitter as our data source, but primarily the timely release of millions of posts worldwide. These releases may even be fast enough to retrieve relevant information about stocks capable of predicting future prices before the market adjusts itself. In addition to the Twitter data, we downloaded Intra-Day stock price information for several NASDAQ stocks in order to build and train a regression model and predict a discrete stock price.

1.1 Main Findings

Most research on modelling stock prices using NLP techniques with textual data has shared three distinct aspects: First, most data sources came from news articles or blogs relating to companies rather than micro-blogs. Second, inference was usually accomplished by classifying the direction of the stock price rather than predicting the actual amount. And third, the training of prediction models was mostly done using End of Day stock price data rather than Intra-Day minute stock quotes, resulting in systems that predict the stock price for the next trading day. Furthermore, as we will explore in our literature review in the next chapter, current research has found only subtle successes in modelling stock prices as well as predicting discrete future price values. We therefore asked: is it possible to use micro-blog data to model stock prices? Additionally, can we predict the price of specific stocks some period of time into the future? And finally, can we make a reasonable profit? From our experiments, we found the following results:

- We can use Twitter posts to build a close model of stock prices, and we showed that the regression line of the model approaches our strong baseline in all test scenarios.

- In experiments which attempted to predict future stock prices, our results varied across different stock selections; however, in several cases we were able to attain significant profits in the evaluation with our Virtual Stock Trading Engine.

- Finally, our results indicate that using Twitter as a source of information to predict stock prices contradicts the EMH.

Although our results did not perform better than our strong baseline, higher accuracies should be attainable if more time is spent on handling Twitter spam and noise, as well as on feature exploration using techniques we have not yet explored.

1.2 Report Structure

We describe the findings stated in the previous section through a detailed explanation of our methods, experimental setup and results in the following chapters, as outlined below.

Chapter 2 introduces the background to our thesis, including financial market theories, social media, micro-blogging and Twitter. The chapter also discusses related research in the field of market prediction using NLP techniques, as well as other related topics, in order to motivate and support our hypothesis.

In Chapter 3 we discuss the methods used to conduct our experiments, beginning with the overall framework design. We then describe the learning algorithm used as well as the different evaluation methods applied to the experiment results.

In Chapter 4 we lay out the experimental setup necessary to conduct the experiments. The chapter includes discussions of data pre-processing, feature exploration and dataset requirements.

Chapter 5 discusses the implementation of different feature construction approaches, which are tested with different stocks and parameter settings. The chapter details all our findings in sequential order.

In Chapter 6 we analyze the results obtained from our experiments and draw conclusions from the work we conducted. We close our thesis by suggesting future work.

1.3 Chapter 1 Summary

We introduced our research goal of predicting the stock market using Twitter and motivated our objective by giving a brief overview of theories governing the financial markets, including related research, and explained a few relevant Twitter benefits and scenarios. We concluded the chapter by pointing out our main findings and laying out the structure for the rest of this thesis.

Chapter 2

Background

In this chapter we introduce the major topics and theories relating to our project in order to explain the motivation for the foundations of our framework. First, we explain the theory behind the EMH and the Random Walk hypothesis, and follow the discussion by introducing an experiment based on virtual stock markets that contradicts the EMH. Then, we continue with an introduction to social media and the micro-blogging website Twitter. Finally, the chapter concludes with an exploration of relevant work and techniques applied to the area of IR in relation to the task of predicting stock market prices. These publications relate to our hypothesis and further help motivate our objective.

2.1 Efficient Market and Random Walk Hypotheses

There have been many attempts to predict stock prices, and the quest to understand the dynamics behind financial markets is an ongoing research area both in academia and in finance. Standing in the way of such research are theories which claim that financial market dynamics are stochastic and informationally efficient, implying that predicting the price movement of securities is not possible.

The theory of random walk states that the future path of the price level of a security is no more predictable than the path of a series of cumulated random numbers (Fama, 1965). This means that successive stock price changes are independent while the change in value follows some probability distribution. The independence assumption implies that the probability distribution of the price $p_t$ of a security at time $t$ is independent of the probability distribution of $p_{t-n}$, where $n$ is the number of previous time units.

Additionally, the EMH states that the efficiency of financial markets causes stock prices to reflect all the information and data known about a security at the point when a trader places an order, making it impossible to beat the market. Therefore, a stock trader who places an order to purchase or sell stocks does so at the stock's intrinsic value instead of an anticipated lower buying or higher selling amount. The EMH is divided into three versions characterized by different strengths. The Weak version asserts that a security reflects all the information available to the public. The Semi-Strong version incorporates the Weak version with the addition that any new information is instantly reflected in the price, not allowing a trader to gain any advantage. Finally, the Strong version incorporates the first two with the addition that non-public information, such as insider information and other unknown facts, is also reflected in the price. While there is evidence against the Strong version of the EMH (Findlay and Williams, 2000), our goal in this thesis is to find evidence that helps reject (or support) the Weak and Semi-Strong versions, through the use of timely releases of information and opinions on Twitter, which could be used to form a prediction model and beat the market before it adjusts itself to the intrinsic value.

2.2 Simulated Stock Markets

One way to study the dynamics of financial markets is to create an artificial simulation and let virtual agents trade by giving them trading rules. Researchers have attempted to create computer simulations of financial markets, and we introduce one such work here because it showed indications contradicting the Random Walk theory, which in part forms the basis of our hypothesis.

In an experiment conducted by (Lebaron, 1999), an artificial stock market was created to study the decision processes of stock traders based on the timely introduction of information. The simulated trading agents would act like their human counterparts, which was accomplished by applying dynamic rules that were constantly evaluated and updated so that agents would use optimal rules whenever they discovered them. Using those rules, the agents built forecasts of future prices and dividends and, during trading sessions, accumulated rules that worked best and pruned off ones that did not. This was achieved by the use of a genetic algorithm.

Since the artificial market was a multi-agent framework, the setting did not allow agents to interact with each other directly; however, they did act upon changes indirectly when agents' behaviors affected price changes. The rules translated into actions which triggered trading behaviors. One of the findings was that changes in certain parameters governing aspects of time dramatically changed the behavior of trading agents, indicating that the time taken to react to new information or price changes was relevant to the strategies (rules) the agents would optimize. It was found that fast-reacting agents selected technical rules, while slow-reacting agents selected fundamental rules. One interpretation is that this difference reveals a lag between the time that information was introduced and the time the agents had to act before the market would adjust itself. We view this time lag as a basis of our thesis and hope to discover patterns in Twitter data which may lead to a valuable short-term prediction of future stock prices, enabling the execution of a basic profitable trading strategy.

2.3 Social Media and Social Networking

In countries with the infrastructure to offer their citizens affordable and reliable internet and mobile communication services, people actively participate in Social Media and Social Networking through the ever-growing set of social networking websites (SNW) and services available on the internet. This phenomenon has existed for many years: some of the first community websites launched in the mid 90s on services like Geocities. However, those websites were not considered social web services as we know them today. A study of social networking sites (Boyd and Ellison, 2008) offers the following definition:

We define social network sites as web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site.

Some of the best examples of SNW available and popular today are Facebook and the micro-blogging community Twitter. Other similar services, such as MySpace and Friendster, have lost popularity but remain among the many available choices.

Moreover, services such as LinkedIn, YouTube and Flickr focus on specialized social networking groups or services. LinkedIn, for example, has a target audience of business professionals, giving members the opportunity to stay connected with previous colleagues and to increase their outreach to new career opportunities. YouTube's focus is mainly on user creation and distribution of video content. While these websites have specialized motives, most of them have in common the ability to enable the exchange of media through online social interaction. Examples of social media include, but are not limited to, electronic media such as images, sound and video, blogs, slogans, events, and other information or ideas. Further, with worldwide growing mobile connectivity, the exchange of social media seems to have no end, as it enables users to access and create content on the fly.

Besides the websites mentioned so far, a great deal of social networking comes from the blogging community. There, individual users create web pages known as blogs, which contain paragraphs or longer articles listed in inverted chronological order (starting with the newest post at the top of the website), usually including meta-tags such as release date and author. The term blog comes from the combination of the words web and log. There are many reasons for people to blog, such as expressing their opinions or emotions on certain topics, or simply keeping a diary of specific events. The popularity of blogging was further fueled by the release of journals of the daily activities of interesting people, including serious commentaries on important issues (Nardi et al., 2004). Many celebrities and influential persons have blogs to communicate with their followers. From the perspective of text analysis, it is often difficult to analyze blogs unless information about page visits, registered users and number of comments is publicly available.

With the popularity of blogs came the invention of micro-blogs. These follow a similar format to regular blogs, with the main difference of being much shorter. The two most popular micro-blogging SNW are, at the time of this writing, Facebook and Twitter. Figure 2.1 shows the percentage of global internet users visiting Twitter on a daily basis from mid 2008 until August 2010. These percentages are still on the rise and already outperform search engines like Yahoo and Bing.

Figure 2.1: Percentage of global internet users visiting Twitter.com on a daily basis.

While there are some conceptual and usage differences between Facebook and Twitter, one of the more significant ones, and a major aspect of this thesis, is the fact that most Twitter users have public profiles, in contrast to Facebook, where users restrict much more of their personal information, as well as status updates, from users who are not part of their network. In fact, Twitter accounts are public by default, so that new posts are automatically submitted to the public Twitter timeline. We are therefore interested in Twitter status update releases, since we get access to millions of users' opinions and their daily chatter. Twitter is increasingly becoming a good source of data due to its variety of users from different backgrounds and professions. As with blogs, many celebrities, authors and other influential figures use Twitter to stay connected with their audiences. Besides individual Twitter users, there are also many organizational Twitter accounts which release posts into the public stream for either commercial or informational purposes. These organizations include news outlets, companies, research and nonprofit organizations, and many more.

A second distinction of Twitter is its message limit of 140 characters. This restriction was initially implemented to conform to the character limit of the Short Message Service (SMS) used in mobile phone text communication and may seem to contradict the expressiveness of social media. Nonetheless, it has turned out to be popular and gives rise to additional benefits. For instance, the character limit requires people to express their opinions and comments in a concise manner. This results in posts that are to the point, without much of the noise found in traditional blogs and articles. Also, advertisements, tags, links and many other sources of unrelated text usually found on blogs, news sites and other informational web pages do not exist in the raw Twitter feed, making it easier to process.

Figure 2.2: Twitter Input/Output Methods. (Image from Krishnamurthy et al., 2008)

Nonetheless, the absence of such data does not imply that potentially important information is lost, because Twitter incorporates some of this data inside the actual posts. For instance, the hash tag is used to relate posts to specific topics: e.g. #twitter in a post represents a tag for the topic twitter. Moreover, due to the shortness of the posts, individuals and organizations are much faster at publishing messages. Organizations that use Twitter to inform their followers about news, product or service updates do not have to spend much time writing lengthy articles or dealing with the bottlenecks of editing and formatting. Additionally, users can publish news from mobile devices as soon as events happen. In fact, there is an entire array of possible input methods, as shown in Figure 2.2. Finally, it is also possible to analyze the importance of a Twitter account by the use of Twitter meta-data. Information about the number of posts, followers (the number of users that are following the account), following (the number of other users the account is following) and the network of links between users could be used to determine the rank of a specific user and therefore their impact on their audience.

The combination of millions of publicly released posts, the speed of their release and the limit on character length makes Twitter an ideal data source for our thesis, as it enables us to perform real-time search on the beliefs of the population.

2.4 Literature Review

One of the more recent works was published by (Schumaker and Chen, 2009). They attempted to predict the actual price of stocks listed in the S&P 500 using an SVR algorithm, applying text mining techniques to financial news articles and transforming them into feature representations including bag of words, noun phrases and named entities. Their objective was to find out how accurate the predictions of the proposed models could be for the task of forecasting the actual stock price 20 minutes after the release of a news article. Moreover, they were also interested in exploring the best techniques for analyzing and decomposing news articles using various IR algorithms. They investigated over 9000 news articles relating to stocks in the S&P 500. They tested the data on four different models: first a simple linear regression model, followed by three models using SVR. Their main finding was that the best performance was achieved by combining the article terms with the stock price at the release of the article.

(Mittermayer, 2004) proposed a system that categorized press-release articles using a Support Vector Machine (SVM) algorithm. These categorizations were then used in a trading system that attempted to predict price trends immediately after the release of an article. He used press releases rather than news articles, claiming that such text would contain a better source of unexpected information. Results showed that the system performed better than a random exchange of securities and returned an average profit of 0.11% per trade. With certain established trading rules, Mittermayer was also able to make slight profits after taking transaction costs into consideration.

A similar approach was taken by (Fung et al., 2002), who used pattern recognition methodologies to model stock price trends using a regression algorithm in combination with an SVM classifier that categorized features extracted from news articles to predict either a stock price increase or decrease. They implemented an incremental K-means clustering algorithm for filtering articles and associating them with directional price trends. Their system performed better than random, resulting in moderately profitable successes.

With respect to the use of Twitter as a data source, a more recent work, conducted by (Tayal and Komaragiri, 2009), compared traditional blogs with micro-blogs to determine the predictive power on stock prices of either data source.

Their research focused on sentiment analysis of blogs and micro-blogs and found that, in their experiments, micro-blogs consistently outperformed blogs in predictive accuracy. They obtained their two data sources from the web service Google Blogsearch and from Twitter. Their system used the stock name to filter and reduce the text, after which they performed sentiment analysis using a lexicon of positive and negative terms. In their experiments, they predicted the actual stock price of the following day from the models of each data source. They also found that the character limit of Twitter helped produce more concise sentiment results, since one Twitter post usually relates to one topic.

A paper by (Yi, 2009) compared three sources of text from social media and built predictive models using SVR for each text source in order to predict a real-valued price. One of the three data sources of this work was also Twitter, and results showed that Twitter was the best performing source of information.

Research by (Lavrenko et al., 2000) focused on constructing language models from news stories and stock prices by identifying features in articles that indicate whether or not a particular article type shows patterns that potentially influence the behavior of specific stocks. This model provided evidence that, in time series, news stories can be associated with trends.

Rather than focusing on the content of news articles, (Peramunetilleke and Wong, 2002) used news article headlines to classify stock movement into up, down or steady directions. This was done using different document and term weighting techniques, including term frequency-inverse document frequency (tf-idf) and term frequency-category discrimination frequency (tf-cdf). Results showed better performance than random guessing.

In contrast to the methods introduced so far, which mostly applied either SVM or SVR models to textual feature representations of posts relating to stocks, the work by (Huang et al., 2005) also uses SVM, but in combination with financial macroeconomic variables of the NIKKEI 225 Index. While this work uses a different type of data, its significance lies in the comparison of different classification methods on high-dimensional time series data, which found that, of all classification methods tested, SVM performed best due to the algorithm's advantage of structural risk minimization compared to empirical risk minimization.

As we will describe in Chapter 3, this element makes SVM less vulnerable to the overfitting problem.

Additionally, in contrast to the machine learning algorithms introduced above, the work conducted by (Thomas and Sycara, 2000) implemented two classification methods to analyze text posted on discussion forums, with a focus on the genetic algorithm. They analyzed different numerical representations extracted from the posts, including the number of messages as well as the total number of words posted per day for a given stock. Due to high noise in their dataset, they found that aggregating several runs of the genetic algorithm into a single predictor helped improve performance. Nonetheless, they were only able to show results for data sources that contained more than 10,000 posts concerning a stock.

There has also been research in areas not related to financial markets which uses documents and query terms to forecast trends and discrete values. For instance, Google Flu Trends is a service which uses aggregated Google search data to estimate flu activity. The research, published in Nature (Ginsberg et al., 2009), found a correlation between the frequency of keyword searches relating to flu topics and the number of people actually reporting flu symptoms to their doctors. They compared their results with agencies such as the U.S. Centers for Disease Control and Prevention (CDC), which delivers reports about flu outbreaks within 1-2 weeks, and found that their forecasts not only matched the reports well, but also detected flu signals earlier than the delayed releases from the CDC. While Google Flu Trends is updated on a daily basis, the system is capable of producing near real-time results, due to the analysis of user-submitted query terms in real time.

In the same way that Google Flu Trends can complement data released by agencies like the CDC, research by (O'Connor et al., 2010) shows that using Twitter as a source of sentiment detection for consumers' opinions on presidential job approval can be a good supplement to expensive and slow traditional polling techniques. In this research, one billion Twitter messages posted between 2008 and 2009 were analyzed by lexical sentiment analysis. While the results fluctuated across different dataset instances, the highest correlation between Twitter sentiment and actual polls was 80%.

Finally, on the subject of feature selection for text classification, (Forman, 2003) motivated the use of certain heuristics in a study comparing 12 feature selection methods on different text sources, such as Reuters or the Text REtrieval Conference (TREC), performing classification with the SVM algorithm. He analyzed his results using a variety of measures, including accuracy, precision and F-measure.

His experiments on dataset preparation found that, to decrease the dimensionality of a bag of words feature representation without losing classification accuracy, the rare-word count cutoff threshold should be set low.

2.5 Chapter 2 Summary

In this chapter we introduced theories of financial markets and their implications for our task. We described the EMH and contrasted it with other research, forming arguments and motivations in favor of our hypothesis. In addition, the chapter gave an introduction to Social Media and Social Networking, listing the major contributing services of the social online communication movement. Finally, we introduced related work most relevant to our goal. A wide range of papers implement the SVM and SVR algorithms due to their predictive advantage in the field of text classification and regression. While many papers focus on predicting price movements as well as discrete prices, only a few apply these predictions to time series data of Intra-Day stock quotes. To our knowledge, there has not been any research that uses Twitter posts as the main data source in order to forecast real-valued prices within a short period of time (several minutes).

Chapter 3

Methods

In this chapter we describe the methods and algorithms we chose to construct our framework, which is divided into four major components: Data Pre-Processing, Feature Selection and Construction, Regression, and Evaluation. We concentrate on explaining the approach to our prediction task and leave the individual implementations of data processing and feature exploration for the next chapter. We start by briefly introducing the individual components of our framework.

3.1 Framework Design

To accomplish the objectives explained in Section 1.1, we developed the prediction framework illustrated in Figure 3.1. The framework is coded in the Python programming language. Python is often used for processing text due to its ease of use in handling files for input/output operations. In the design shown in Figure 3.1, the four major components of the framework are represented by the dashed boxes, labeled with their framework component names; within them are depicted the individual processes that play a major role in the overall function of each component. The first component (Pre-Processing Framework) is responsible for data collection, pre-processing and filtering. Section 3.2 discusses the methods we used to come up with relevant keywords for filtering irrelevant data out of the Twitter posts. Section 3.3 describes the methods used to extract quotes from the NASDAQ stock market. The second component (Feature Selection and Construction Framework) involves several NLP techniques, which we define in Chapter 4 and explore and evaluate in Chapter 5. The third component (Regression Framework) handles the implementation of the SVR algorithm, which we explain in Section 3.4. The final component (Evaluation Framework) is used to quantify our results. We describe the methods for the different error measures and evaluation rules in Section 3.5.

Figure 3.1: Prediction Framework Design

3.2 Keyword Expansion

One of the first tasks involved in pre-processing our data is finding relevant keywords that we can use to filter irrelevant posts out of the raw dataset. This section describes how we constructed the list of query words used to search and filter the pre-processed dataset. Having a dataset consisting of millions of Twitter posts requires a way to search and filter so that we end up with posts that relate to our task. Therefore, we create a list of query terms which relate to the company whose stock prices we want to forecast. For our experiments, we chose four different stocks of companies listed in the NASDAQ index: Google Inc., Apple Inc., First Solar, Inc. and Intel Corporation, with the symbols GOOG, AAPL, FSLR and INTC respectively. For the rest of this paper, we will refer to a stock using either the company name or the stock symbol. A discussion of the choice of these stocks is given in Section 3.5.1. For each of these symbols, we created a list of no more than 5 keywords that best described the company and its products and services. For example, for the technology company Apple, the initial terms were:

apple, mac, ipod, steve, jobs

We chose only five keywords, since the rest would be discovered using a query expansion tool. Query expansion can be compared to a thesaurus, where original query terms are complemented with terms of similar meaning. This helps broaden search results, since using a single term may leave out relevant documents which could have been included in the result set if a related query term had been used. For our purpose we used Google's query expansion web service, Google Sets. The resulting set of 43 additional terms included words such as iphone, windows, macintosh and google, but also terms like howto, englisch or cool. Table 3.1 shows the complete list obtained from Google Sets. While the list was comprehensive in describing the company, it still lacked some information about stock symbols and company names. We therefore included a list of stock symbols and company names gathered from Google Finance.

Table 3.1: Google Sets query expansion results

Initial key terms: apple, mac, ipod, steve, jobs

After query expansion: software, windows, osx, video, computer, itunes, macintosh, tools, free, freeware, linux, download, opensource, podcast, web, audio, design, iphone, macosx, hardware, microsoft, tutorial, howto, programme, community, musik, cool, englisch, google, web2.0, rss, media, browser, os, mobile, forum, shareware, imac, downloads, player, radio, podcasting, webdesign

From the list generated by Google Sets and the list of related companies on Google Finance, we manually concatenated the final list of query terms. This step could, however, easily be automated by creating a Python script which gathers the information from these services and outputs the data to a file.
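A minimal sketch of what such a merging step could look like is given below. This is not the code used in the thesis: the input and output file names are hypothetical, chosen only for illustration, and the gathering of the term lists themselves is assumed to have happened already.

    # merge_query_terms.py - illustrative sketch; all file names are hypothetical.
    def load_terms(path):
        # Read one term per line, lower-cased and stripped of whitespace.
        with open(path) as f:
            return [line.strip().lower() for line in f if line.strip()]

    def merge_term_lists(*paths):
        # Concatenate several term files, preserving order, dropping duplicates.
        seen, merged = set(), []
        for path in paths:
            for term in load_terms(path):
                if term not in seen:
                    seen.add(term)
                    merged.append(term)
        return merged

    if __name__ == "__main__":
        terms = merge_term_lists("initial_terms.txt",
                                 "google_sets_terms.txt",
                                 "google_finance_names.txt")
        with open("query_terms_aapl.txt", "w") as out:
            out.write("\n".join(terms))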

3.3 Intra-Day Stock Quote Extraction

To perform regression and predict the value of a stock, we have to create a dataset that combines the stock quotes with instances of relevant Twitter posts. The stock price is declared as the target label and is used to compare the results of our predictions against a baseline as well as against the actual stock price. Two types of stock quotes have been used in stock market prediction with text regression. The first is End of Day stock market data, which refers to the stock price recorded after the last trade of the day is completed but before the extended after-hours market opens. This price is useful for statistics that range over long periods of time. The second type is Intra-Day stock quotes. These are recorded at different time intervals during the opening hours of the stock market. Generally, the time steps are in the range of one to twenty minutes, depending on the organization recording the data. While it is simple to download free End of Day data from many financial websites, such as Yahoo Finance, it is very difficult to find free-of-charge Intra-Day historical stock quotes. Moreover, Intra-Day stock quotes are more relevant to our prediction task, as we should get more accurate regression estimates in combination with real-time Twitter posts, as opposed to using one price for an entire day of Twitter feeds.

Our initial raw Twitter dataset, provided by the University of Edinburgh, School of Informatics (this dataset is no longer available, as Twitter requested that sharing of the data be stopped), contained Twitter posts ranging from November 11th 2009 to February 1st 2010. Unfortunately, we were not able to find free-of-charge Intra-Day historical stock prices for that date range. Therefore, we created a method to download new stock quotes on the fly. We developed a web crawler in Python which downloaded up-to-date stock information and parsed out the price of a stock at one-minute intervals. To download stock quotes, we crawled a service on Google Finance which outputs a simple string of key/value pairs of live stock information. We ran this script during the regular opening hours of the NASDAQ stock exchange as well as during the extended pre- and after-market hours. In the end, we crawled a list of four top NASDAQ stocks and collected the data for a period of two weeks, from July 19 to July 30, 2010. For a complete list of the stocks and sample stock charts over the two-week period, refer to Appendix A.
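A minimal sketch of such a crawler follows. Since the Google Finance service used in 2010 is no longer available, the endpoint URL and the key/value parsing below are placeholders rather than the actual service, and the sketch is written in present-day Python 3 rather than the Python of the original implementation:

    # quote_crawler.py - sketch only; QUOTE_URL is a placeholder, not the
    # actual Google Finance service crawled for the thesis.
    import time
    import urllib.request
    from datetime import datetime, timezone

    QUOTE_URL = "https://example.com/finance/info?q=NASDAQ:{symbol}"

    def fetch_price(symbol):
        # Download the quote string and parse out the last-trade price.
        # Assumes a key/value format such as "l:259.10,..." where 'l' is
        # the last price; the real format depended on the service.
        raw = urllib.request.urlopen(QUOTE_URL.format(symbol=symbol)).read().decode()
        for pair in raw.strip("{} \n").split(","):
            key, _, value = pair.partition(":")
            if key.strip(' "') == "l":
                return float(value.strip(' "'))
        return None

    def crawl(symbols, outfile, interval=60):
        # Append one timestamped, tab-separated price line per symbol
        # every `interval` seconds.
        with open(outfile, "a") as out:
            while True:
                stamp = datetime.now(timezone.utc).isoformat()
                for sym in symbols:
                    price = fetch_price(sym)
                    if price is not None:
                        out.write(f"{stamp}\t{sym}\t{price}\n")
                out.flush()
                time.sleep(interval)

    if __name__ == "__main__":
        crawl(["AAPL", "GOOG", "FSLR", "INTC"], "intraday_quotes.tsv")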

3.4 Support Vector Regression

Since we are primarily concerned with the prediction of a real-valued dollar amount, we decided on the use of a regression algorithm well suited to text analysis. In particular, we chose the SVR algorithm (Smola and Schölkopf, 2004). SVR is the regression counterpart to the popular SVM algorithm (Vapnik, 1999) used for classification. SVM's popularity over the last decade is in part due to its superior Structural Risk Minimization (SRM), as demonstrated in (Gunn et al., 1997) but first introduced in (Vapnik and Chervonenkis, 1974). It stands in contrast to the established Empirical Risk Minimization (ERM) principle, particularly known from neural networks. The SRM principle addresses the problem of overfitting, where growing model complexity fits the training data increasingly well, resulting in inaccurate predictions when new data is observed, by finding a balance between the model's complexity and the closeness of fitting the training samples correctly.

Figure 3.2: Example of a maximum margin including its support vectors, indicated as double circles on the margin lines. (Image from Chen et al., 2005)

Even though the SVM algorithm was designed for classification problems, it was soon extended to regression tasks as well. SVM is generally used for two-class problems, and the transition from classification to regression is a small step but contains a significant difference in the loss function. SVM tries to find an optimal separating hyperplane between two classes. The hyperplane is optimal in that it is the only one that maximizes the so-called margin. In the simplest case, the maximum margin is the longest distance that separates the two closest, linearly separable points from the two opposing classes. Such points are known as the support vectors and are retained to build the model used for generalization. See Figure 3.2 for an example of the margin including its support vectors. In more complex problems, SVM employs different kernels to map non-separable data points into a higher-dimensional space in which the classes can be separated linearly. The choice of kernel depends on the domain; in the context of statistical text analysis, a linear kernel function $k(x_i, x_j) = \langle x_i, x_j \rangle$ proved to be the best choice in our experiments.

Given a set of training samples for a linear regression task

D = \{(x_1, y_1), \ldots, (x_n, y_n)\}    (3.1)

and the linear function

f(x) = \langle w, x \rangle + b    (3.2)

we wish to find the weight vector $w$ of the optimal function $f(x)$ by minimizing

\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)    (3.3)

subject to

y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i
\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*
\xi_i, \xi_i^* \ge 0    (3.4)

The constant $C > 0$ is also known as the regularization constant (or regularizer) and determines the trade-off between the flatness of the function and the amount up to which deviations larger than $\varepsilon$ are permitted. SVR is accomplished by the use of a different loss function than in SVM (Smola, 1996), one that includes a distance measure and allows sparseness in the support vectors. One example of such a function is the $\varepsilon$-insensitive loss function

|\xi|_\varepsilon := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases}    (3.5)

Finally, $w$ from the regression function given in equation (3.2) is defined as

w = \sum_{i=1}^{n} \beta_i x_i    (3.6)

and

b = -\frac{1}{2} \langle w, (x_r + x_s) \rangle    (3.7)

where $x_r$ and $x_s$ are support vectors. The $\beta_i$ are the coefficients of the samples; samples with non-zero coefficients are the support vectors, so SVR uses only a small subset of the data to construct the final model. Figure 3.3 illustrates such a case, where points outside the shaded area contribute to the cost to a certain extent, as the deviations are penalized in a linear fashion. Besides its use for predicting the value of a stock price, SVR has also been found to work well in time series forecasting applications, such as the works of (Mukherjee et al., 1997; Thissen et al., 2003; Müller et al., 1997).
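The thesis does not name the SVR implementation it used, so the following sketch substitutes scikit-learn's SVR purely for illustration, with made-up toy data standing in for the real feature vectors. It shows the linear kernel together with the settings C = 0.1 and ε = 0.5 that Section 3.5.1 reports as working best:

    # svr_sketch.py - illustrative only; scikit-learn stands in for the
    # (unnamed) SVR implementation used in the thesis, and the posts and
    # prices below are made-up toy data.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import SVR

    # One "document" per minute (all posts in that minute), labelled with
    # the stock price at that minute.
    posts = ["apple ipod sales rumor", "mac software update coming",
             "long line for the new iphone", "apple earnings beat estimates"]
    prices = np.array([259.1, 259.4, 259.2, 260.0])

    # Bag-of-words features, as used in Chapter 4.
    X = CountVectorizer().fit_transform(posts)

    # Linear kernel with C = 0.1 and epsilon = 0.5.
    model = SVR(kernel="linear", C=0.1, epsilon=0.5)
    model.fit(X, prices)
    print(model.predict(X[:2]))  # real-valued price predictions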

Figure 3.3: The soft margin loss setting for a linear SVM. (Image from Schölkopf and Smola, 2002)

3.5 Evaluation Methods

In this section we define the evaluation methods that will be applied to our experiments in order to compare and draw conclusions from our results.

3.5.1 Stock Selection

As mentioned before, we chose four different stocks from the NASDAQ stock exchange: Google Inc. (GOOG), Apple Inc. (AAPL), First Solar, Inc. (FSLR) and Intel Corporation (INTC). All experiments were conducted over the same date span, from Monday, July 19, 2010 through Friday, July 30, 2010. Our aim was to select two popular stocks (Google and Apple) that were frequently mentioned on Twitter. Second, we wanted to add a stock that was less popular and from a specialized domain; for this we chose the solar company First Solar. Finally, we added one stock which did not have a significant change in value over the period of our experimentation, in order to see whether the learning algorithm would have difficulty finding distinctive patterns. A snapshot of the INTC chart covering the time period of our experimentation is shown in Figure 3.4. The corresponding charts for AAPL, GOOG and FSLR can be found in Appendix A, in Figures A.1, A.2 and A.3 respectively. All experiments are evaluated on all four stock datasets except when specified otherwise.

We carried out multiple runs of individual experiments with different parameter settings and kernel selections for the SVR algorithm and found that the linear kernel worked best, with the parameter settings C = 0.1 and ε = 0.5.

Figure 3.4: Intel Corporation stock chart snapshot (INTC)

Other experiments use a validation set to determine the optimal values of C and ε, which is explained in more detail in Section 5.5.

3.5.2 Error Measure

For each experiment we calculated the Mean Squared Error (MSE) against a strong baseline. The MSE is defined as

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y(x_i) - t_i)^2    (3.8)

where $t$ is the target value, $y$ the predicted value and $N$ the number of samples in the testing set. The baseline is calculated by taking the Simple Moving Average (SMA) of the series of stock prices in the testing set. The SMA is defined as

\mathrm{SMA}_t = \frac{1}{T} \sum_{i=1}^{T} P_{t-i}    (3.9)

where $t$ represents the latest value in the time series, $T$ is the number of time series steps and $P$ is the price at each step. We used the target price of each testing sample as the starting point and summed it with the 59 previous ticks. Since we captured stock quotes every minute, the running average spanned a period of 1 hour (a total of 60 ticks), making it a very strong baseline for our algorithm to beat. Often, a random baseline is also included for comparison; however, we believe that in our case a random baseline would be too weak to be relevant.
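As a sketch under the definitions above, the error measure and the 60-tick SMA baseline can be computed as follows (plain Python written for this description, not taken from the thesis code):

    # baseline_sketch.py - equations (3.8) and (3.9) in code form.
    def mse(predicted, targets):
        # Mean Squared Error between predictions y(x_i) and targets t_i.
        return sum((y - t) ** 2 for y, t in zip(predicted, targets)) / len(targets)

    def sma_baseline(prices, window=60):
        # Baseline prediction for minute t: the mean of the 60 one-minute
        # ticks ending at t (the target tick plus the 59 previous ones).
        return [sum(prices[t - window + 1 : t + 1]) / window
                for t in range(window - 1, len(prices))]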

3.5.3 Virtual Stock Trading Engine

Finally, to better understand the meaning of our results and the impact of the MSE values, we created a Virtual Stock Trading Engine to evaluate whether it is possible to make a profit from the predictions the model generated. The engine imitates a day-trading agent and follows general rules to create a fairly realistic simulation environment. At the start of a trading session the agent is given an initial capital; in all our evaluations, we set the capital of each trial to 10,000 units. The trial begins by looping through the time series of testing examples, starting with the oldest instance. Although the test set contained instances belonging to the pre- and after-market hours, we decided not to include these instances, allowing the agent to place orders only within the regular opening hours of the NASDAQ stock exchange, which are between 9:30AM and 4:00PM Eastern Daylight Time (EDT). However, in our final experiments, where we predict prices during the entire two-week period, we relax this restriction and also allow trades to be carried out during extended hours. Every transaction was accompanied by a transaction cost, or commission, which we set to 1 unit based on the popular online broker firm Interactive Brokers (while this broker offers low transaction costs, customers must have a minimum of $10,000 in starting capital to open an account; on this basis we chose our starting capital of 10,000 units). The agent was able to place regular market or short orders at any time during the regular opening hours. Moreover, for each of the stocks in our experiments, we set a minimum target profit value covering at least twice the commission: once for a purchase order and once for a selling order (or the equivalent for short orders). Unfortunately, since we were not able to capture the ask or bid prices, we replaced these values with the current price at each time interval. During a trading session the agent follows these rules (restated as a simplified code sketch at the end of this section):

- Given the current capital, the current stock price and the predicted stock price, the agent calculates the potential profit per stock, how many stocks can be purchased with the current capital, and the total profit given the total number of stocks the agent can purchase with the available capital. This calculation also takes the commission into account.

- If the potential profit is bigger than the target profit, the agent places an order purchasing the maximum number of stocks possible. This applies to both market and short orders.

- At each time interval, the agent checks whether stocks can be sold or shorted. Since we used the predicted value to calculate the potential profit, the agent must sell/short the orders at least 15 minutes after the purchase was made.

- We repeat this process until there are no testing instances during market hours left, after which the total profit/loss is returned.

It is important to note that once an order has been placed and paid for, the agent cannot place another order for at least 15 minutes, after which the orders are liquidated and ready for further transactions. After this point, it is not guaranteed that a new order will be placed. A new order depends entirely on the target profit threshold and on whether the variables of capital, current price and predicted price meet that desired profit. In our experiments, we found that different target profit values rendered different results; we therefore ran the Virtual Stock Trading Engine with several different target profit thresholds and averaged the results.
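The simplified sketch below restates these rules in code. It is an illustration rather than the engine itself: it covers only long (market) orders with a fixed 15-minute liquidation, whereas the actual engine also places short orders and averages over several target-profit thresholds.

    # trading_engine_sketch.py - simplified restatement of the rules above;
    # long orders only, numbers mirror Section 3.5.3.
    COMMISSION = 1.0      # units per transaction
    HOLD_MINUTES = 15     # minimum holding period

    def run_session(prices, predictions, capital=10_000.0,
                    target_profit=2 * COMMISSION):
        # prices[t] is the actual quote at minute t; predictions[t] is the
        # model's prediction for the price HOLD_MINUTES later.
        t = 0
        while t + HOLD_MINUTES < len(prices):
            price, predicted = prices[t], predictions[t]
            shares = int((capital - 2 * COMMISSION) // price)
            potential = shares * (predicted - price) - 2 * COMMISSION
            if shares > 0 and potential > target_profit:
                # Buy now and liquidate exactly HOLD_MINUTES later; no new
                # order may be placed while the position is open.
                capital -= shares * price + COMMISSION
                capital += shares * prices[t + HOLD_MINUTES] - COMMISSION
                t += HOLD_MINUTES
            else:
                t += 1
        return capital  # final capital; profit/loss = capital - 10,000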

3.6 Chapter 3 Summary

This chapter presented the general framework, including a brief introduction to its individual components. We motivated our choice of algorithms and described the methods of obtaining the different data collections required for our task. Finally, we introduced the different methods used for the evaluation of our experimental results.

Chapter 4

Experimental Setup

In this chapter we describe the experimental setup required for our experiments. Besides explaining the main processes of data cleanup and feature selection and construction, we also discuss data requirements and dataset construction.

4.1 Twitter Data

In IR, text is usually organized into documents, where a document can be a journal article, web page, e-mail, news story, book, etc. In our case, the documents are micro-blogs and therefore short sentences or single paragraphs. Whereas most documents require a lengthy process before being published, Twitter posts, or tweets, are released within seconds by millions of users each day. Due to this huge number of posts, we must put more emphasis on problems concerning the relevance and evaluation of information, including filtering out the large amount of noise surrounding the relevant information. The raw Twitter posts are gathered at the Informatics Forum at the University of Edinburgh using the Twitter Streaming API. Twitter releases chunks of data at short time intervals (about 15 minutes) which constitute only a subset of the full public Twitter timeline. The size of each file is not constant and is managed by Twitter. In its raw form, the data consists of one line per post, with each line having the following tab-separated fields (a minimal reader for this format is sketched after the list):

Date (e.g. Sun Jul 18 10:33: )
Username (max 20 characters)
Text (max 140 characters)
Source (e.g. web)
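The reader sketch below is written for this description rather than taken from the thesis code; it assumes only the four-field tab-separated layout listed above:

    # read_tweets.py - minimal reader for the raw tab-separated format.
    def read_posts(path):
        # Yield (date, username, text, source) tuples, skipping malformed lines.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) == 4:
                    yield tuple(fields)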

Source (e.g. web)

4.2 Raw Data Cleanup

Raw Twitter data contains a very large amount of noise that is not relevant to the query task at hand. Not only does such noise make the prediction of stock prices much more difficult if it is not filtered out correctly, but it also contributes to longer processing times. After analyzing the raw data, it became apparent that much of the noise comes from non-English languages, automated bot postings, and many sources of spam. In this thesis we are only concerned with posts written in Latin-script languages such as English. Therefore, we wanted to exclude languages like Chinese due to their different character set. An easy way to filter out non-Latin languages is to check whether a post contains Unicode characters, in which case it is excluded from the dataset. However, many posts written in Latin-script languages may still contain some Unicode characters, and penalizing those may remove posts that could be relevant to the IR task. For this reason, we checked each character in a post and counted the number of ASCII and Unicode characters. We removed the post if the ratio of ASCII characters was below some threshold. We found that requiring at least 95% ASCII characters worked very well. We cleaned up the data set by removing around one third of the posts. Figure 4.1 shows a snapshot of the data that has been removed using this process. While the reduction was already very extensive, we looked into further methods of additional cleanup in order to remove posts submitted by unwanted automated bots and other sources of spam. (Mowbray, 2010) analyzed statistics of the behavior of Twitter users and found that regular users do not submit more than 100 tweets per day. He also discovered that, starting from June 2009, a huge number of Twitter accounts started publishing tweets that exceeded the 100 mark and in some cases even went over 1,000. In his paper, he correlates this sudden rise of automated postings with literature releases that occurred a few months prior. These documents include the Twitter API handbook released in April 2009 as well as marketing books such as Twitter Marketing Tips (Brooks, 2009) or Dominate Your Market with Twitter: Tweet Your Way to Business Success (Jon Smith, 2009). (Krishnamurthy et al., 2008) created a detailed characterization of Twitter and analyzed, among other things, the characteristics of user accounts. They found that there are three distinct Twitter groups. The first group has many followers but at the same time does not follow many accounts in return.

Figure 4.1: Samples of posts that have been removed by the Raw Data Cleanup process, where each line represents a separate post.

They label accounts in this group broadcasters; these include radio stations that use Twitter to broadcast the current songs being played, as well as news or media outlets, such as the New York Times or the BBC, which broadcast current headlines. The second group is labeled acquaintances and represents users whose ratio of followers to following is close to 1. Users who use Twitter on a regular basis fall into this group. A third group has a much larger number of following accounts compared to followers and is characterized as potential spammers, as these accounts try to connect with any user they can in the hope of being followed in return. As a result, these accounts start spamming all the followers they manage to obtain. Another interesting finding, by (Huberman et al., 2009), describes two types of networks amongst Twitter users. Even though users may have many followers and followees in their network, there is only a small subset of users to whom they post tweets directly. Huberman et al. define the former as the dense and the latter as the sparse network. The sparse network proves to be the more influential network, since users belonging to it are more engaged in back-and-forth communications. On the other hand, accounts that constantly submit posts to their entire follower base rather than a subset form a separate group, which in many cases can also be characterized as spammers but could also be broadcasters. Since the release of the research papers mentioned in this section, new Twitter features and tools have been made available which simplify the automation of posting tweets. Given these findings and the complexity of determining different groups of users while distinguishing spammers from non-spammers, we decided not to remove posts generated by automated bots, to ensure that we do not lose meaningful data such as broadcasts from news outlets. Instead, we leave the tasks of spam detection and Twitter account ranking and categorization as future work.
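As a minimal sketch of the character-ratio language filter described earlier in this section (the helper name and the sample posts are our own illustrations):

def is_mostly_ascii(text, threshold=0.95):
    """Keep a post only if at least `threshold` of its characters are ASCII."""
    if not text:
        return False
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count / len(text) >= threshold

# hypothetical posts: the first is kept, the second (non-Latin script) is removed
posts = ["apple enjoys solid q3 on strong ipad, iphone sales", "\u682a\u4fa1\u304c\u4e0a\u6607"]
kept = [p for p in posts if is_mostly_ascii(p)]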

After cleaning up the dataset from noise, the next objective was to further filter the posts in order to retrieve a new dataset containing stock-related messages. Here we applied the query expansion process introduced in Section 3.2, which used the Google Sets automated keyword expansion web service.

4.3 Feature Construction

Our first set of features was deliberately simple, serving as a base for future improvement. We created a vector space model using a basic bag of words approach, which is a standard in IR and text mining applications (Croft et al., 2009). This model is beneficial due to its support for term weighting, ranking, and relevance feedback. In this model each document is part of a t-dimensional vector space, where t is the number of indexed words. A document, in our case a Twitter post, is represented by a vector of such indices:

D_i = (d_{i1}, d_{i2}, ..., d_{it})    (4.1)

where d_{ij} represents the weight of the jth word. A corpus of n Twitter posts is represented as a matrix where each row is a separate bag of words representation of one Twitter post and each column describes the weight attributed to a word for a given document:

         Term_1  Term_2  ...  Term_t
Doc_1    d_{11}  d_{12}  ...  d_{1t}
Doc_2    d_{21}  d_{22}  ...  d_{2t}
...
Doc_n    d_{n1}  d_{n2}  ...  d_{nt}

As input we used our preprocessed and filtered dataset relating to the specific stock in question. For the bag of words approach we needed to split each post into individual words, which we accomplished with a tokenizer. Our tokenizer uses Python's built-in shlex library, a module for lexical text analysis that is useful for parsing text and creating tokens. The shlex library is very powerful in terms of the granularity of tokenizing text and has additional features such as ignoring certain characters, which may be important for specialized queries.

For example, if our tokenizer encounters a Twitter topic keyword, i.e. #finance, we do not want to tokenize this sequence into # and finance, since hash tags are used to identify Twitter topics, similar to tags in blogs. Additionally, the tokenizer would completely split Uniform Resource Locators (URLs) into fragments due to their many non-alphabetic characters. Posts that contain frequent occurrences of specific URLs may carry important information for the regression task. In order to retain occurrences of URLs in posts, we used regular expressions to identify, index, and then remove any URLs before allowing the tokenizer to continue. Finally, we added rules that determined whether a token should be kept. For example, numbers found in tweets can relate to anything and are in most cases useless noise that should be excluded. Additionally, tokens that are one character long, such as the words a and I or any punctuation and symbols, will not add much to the predictive power. We were careful not to add too many such rules, for several reasons. First, numbers can in some cases carry important information; for instance, when users chat about specific products, a number will indicate which model or version they refer to. As an example, in the string Windows 7 the tokenizer would ignore the number seven. Dates and times are also ignored. A second problem with our parser is that strings such as I.B.M. will not be retained, since the tokenizer would produce 5 individual tokens of I. B. M. It was beyond the scope of this thesis to find an optimal tokenizer that best fits the task of parsing Twitter posts for predicting stock quotes; we therefore leave this refinement for future work. The final output of the tokenizer is a list of key/value pairs where each line consists of a distinct word and its frequency. At this point the frequency is not required for the construction of the bag of words, but it will be used in calculating term weights as described in Section 5.3.
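A minimal sketch of such a tokenizer follows. It preserves hash tags and URLs as described above, though the exact rules of our implementation differ in detail:

import re
import shlex
from collections import Counter

URL_RE = re.compile(r'https?://\S+')

def tokenize(post):
    """Index URLs first, then shlex-tokenize the rest, keeping #hashtags whole."""
    urls = URL_RE.findall(post)
    lexer = shlex.shlex(URL_RE.sub(' ', post.lower()))
    lexer.commenters = ''    # '#' is a comment character by default; keep it literal
    lexer.quotes = ''        # tweets often contain unmatched apostrophes
    lexer.wordchars += '#'   # do not split '#finance' into '#' and 'finance'
    tokens = [t for t in lexer if len(t) > 1 and not t.isdigit()]
    return Counter(tokens + urls)

print(tokenize('flashlight app secretly lets you enable iphone tethering #macworld'))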

4.4 Dataset Construction

The final step before training a regression model is to construct a dataset that follows the input format required by the SVR algorithm. In Section 3.4 we introduced the concepts behind SVR and the reasons why the algorithm is well suited for text regression. There are numerous SVR implementations available on the web; we decided to employ the LIBSVM (Chang and Lin, 2001) toolkit due to its cross-platform implementations as well as its numerous successes in the research literature. LIBSVM requires the following format for both training and testing datasets:

<label_1> <index_{11}>:<value_{11}> ... <index_{1t}>:<value_{1t}>
<label_2> <index_{21}>:<value_{21}> ... <index_{2t}>:<value_{2t}>
...
<label_n> <index_{n1}>:<value_{n1}> ... <index_{nt}>:<value_{nt}>

where t is the index of the last word in the bag of words and n is the number of samples in the dataset. Each <index>:<value> pair consists of the index of a distinct word from the bag and the frequency of that word in the current sample. The index is an incremental integer value starting at one, enumerating the set of features. This representation is sparse: zero values are omitted, so that only non-zero values are represented with their respective index. Furthermore, <label_n> is the target value used for regression; in our case, the price of the stock we are interested in at the time the Twitter post was released. Before matching the target price with a post, we had to convert the date-time of the recorded stock quotes to match the time zone of the Twitter posts. In our case, the Twitter posts used the Coordinated Universal Time (UTC) time zone, whereas the stock prices were recorded in EDT.
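A minimal sketch of writing this sparse format is given below; the label values and feature indices in the example call are hypothetical:

def write_libsvm(samples, path):
    """samples: list of (target_price, {feature_index: frequency}) pairs,
    with 1-based feature indices as LIBSVM requires."""
    with open(path, 'w') as out:
        for price, features in samples:
            pairs = ' '.join('%d:%g' % (idx, val)
                             for idx, val in sorted(features.items()) if val)
            out.write('%g %s\n' % (price, pairs))

# two hypothetical posts matched to hypothetical stock quotes
write_libsvm([(261.7, {3: 2, 17: 1}), (262.1, {1: 1, 3: 1})], 'train.svm')

Note how the `if val` test drops zero values, producing the sparse representation described above.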

4.5 Chapter 4 Summary

In this chapter we laid out the experimental setup requirements and described the remaining components that play an important role in our framework design. Raw data cleanup is the first process, which helps remove spam and posts written in other languages. We explained the creation of the bag of words feature vector with the use of a tokenizer to split posts into individual words. Finally, we looked at the requirements needed to transform the feature vector into a format compatible with the LIBSVM tool.

Chapter 5

Implementation and Results

In this chapter we discuss the implementation of different feature construction approaches and evaluate their performance on multiple datasets. We then analyze the results and explore new ideas and improvements. We begin our description with the most basic feature representation, followed by improvements over our results.

5.1 Simple Bag of Words

Our first objective was to test the performance of the bag of words feature set which we obtained from the filtered Twitter posts. The posts were filtered using keywords obtained by the query expansion algorithm explained in Section 3.2. In this experiment we matched the text with the stock price at the time of the release of the post. At this point, we were not yet interested in forecasting a future price, but rather in testing whether we could build a regression model of the posts that fits the time line of our selected stocks. We added two additional filters to the already pre-processed dataset: one filtered the number of features and the other the number of posts. The dimensionality of our feature space was initially relatively high and contained, after preprocessing, on average over 250,000 distinct terms, most of which were redundant and therefore had low frequency counts (e.g. zzzzzzzzzzzz). We applied a threshold on the frequency of each term and kept only features with three or more counts (Joachims, 1998). Additionally, we only included posts that contained three or more distinct activated features, i.e. features whose attribute value was non-zero. This helped remove posts with no meaningful information, either because they were too short or because they contained numerous repetitions of the same words. The model for this experiment was built using two different training sets.

Table 5.1: Simple Bag of Words - experimentation results (baseline MSE and prediction MSE for each stock; full dataset sizes: AAPL 56,187, GOOG 97,931, FSLR 91,350, INTC 29,948; reduced dataset sizes: AAPL 14,046, GOOG 24,482, FSLR 22,837, INTC 7,487).

Figure 5.1: AAPL Simple Bag of Words - experimentation results

The first contained all training examples retained after preprocessing the raw data. The second contained a reduced form of the full training set, in order to speed up training of the model and also to test whether we could achieve performance similar to that of the complete set. In order to reduce the training set and maintain the same proportions of examples with respect to the time series, we removed every nth example, rather than cropping the entire set from the bottom or top. Before training, we also removed 10% from the bottom of the training sets and kept those examples for testing. The bottom 10% corresponds to examples with dates at the end of the time series. The training sets comprised samples from Monday, July 19 to Thursday, July 29. The test sets contained samples from the remaining day, Friday, July 30, 2010. Table 5.1 shows the results of the simple bag of words experiment for all four stocks. While none of the experiments performed better than the baseline, Apple and Google were much closer than Intel Corporation and especially First Solar. As described in Section 3.5.2, the baseline of each sample is calculated by taking the SMA of the last 60 ticks (one tick per minute) starting from the current price at the release date of the Twitter post.

Figure 5.2: GOOG Simple Bag of Words - experimentation results
Figure 5.3: FSLR Simple Bag of Words - experimentation results
Figure 5.4: INTC Simple Bag of Words - experimentation results

None of the results seemed to follow the actual price line. The prediction results for AAPL, for instance, simply overlap the actual price line, as seen in Figure 5.1, and exhibit very strong noise. In Figure 5.4 the predicted price for INTC was not even in the same region and showed no correlation with the actual price. We found the reason for these extreme error score differences by investigating the historical stock charts, which can be found in Appendix A. For AAPL (Figure A.1), the average price over the time period of our training data is similar to the average price predicted in the experiment. This is the case for all four stocks. The regression model therefore built a price line that correlated with the average of the training data, making it ineffective for our task. To build a better regression model, we needed features that represented the current state of the price. We decided to include the price of the stock as a new feature in the training set; this approach is discussed in Section 5.2. The examination of the results on the full datasets compared to the reduced sets indicated that the performance was rather similar. However, since the model performance in this experiment was poor, we decided to investigate the results of the two dataset sizes in the next experiments, rather than inferring a meaning at this stage. Another point to notice is the difference in the size of the datasets across stock symbols. While we expected the Apple set to be among the biggest, it was almost half the size of Google's or First Solar's sets. Moreover, we expected to find fewer matches concerning FSLR, which, to the contrary, contained almost as many posts as the GOOG dataset. To find clues about the reasons behind these numbers, we decided to look at the bag of words representations for each stock and compare them with the query terms initially created. After ordering the terms of the FSLR bag of words by frequency, besides finding many stop words at the top of the list, we also found words from the original list of expanded query terms such as video with 52,177 counts, world with 28,149 counts, and house with 20,783 counts. On the other hand, terms like green, wind, energy, and solar had frequencies of 6,685, 4,377, 4,198 and 1,389 respectively. The comparison of the other stocks showed that the AAPL bag of words had query terms with top frequencies such as iphone, media, google, and ipad. Similarly, the GOOG bag of words contained query terms with high frequencies such as youtube, msn, twitter and business. At first glance, these terms seem to have more relevance to the company than the top query terms of FSLR. We concluded that the performance of the query expansion algorithm was responsible for these deviations. Furthermore, the query terms generated by the algorithm are questionable, given that the term video was generated for the First Solar query terms but not for Google's list.

Figure 5.5: AAPL Simple Moving Average - experimentation results

We will therefore explore the tf-idf algorithm in Section 5.3 to determine weights and possible solutions for optimizing query term generation, as well as feature selection improvements.

5.2 Moving Average and Stop Words

In the research conducted by (Schumaker and Chen, 2009), which analyzed news articles to predict companies' stock prices, it was found that adding both the bag of words features and the price of the stock at the release of a news article greatly improved results over their baseline. We decided to add a similar feature, but rather than adding the stock price current at the release of the Twitter post, we used the SMA to account for sudden dips or spikes in price movements. Our new feature was therefore calculated similarly to the baseline described in Section 3.5.2: we averaged the price of the stock at the release of the Twitter post with the previous 59 minute ticks. Additionally, we made two minor improvements: first, removing the most common stop words found in the bag of words representations of all stocks and, second, performing stemming on every token using the stemmer from the NLTK toolkit. These improvements are standard techniques used in IR and may help to further decrease the prediction error (Croft et al., 2009).
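A minimal sketch of the 60-tick average used both for the baseline and for the new feature (the tick series in the example is hypothetical):

def sma(quotes, window=60):
    """Average of the last `window` one-minute ticks, ending at the tick
    current when the tweet was released."""
    recent = quotes[-window:]
    return sum(recent) / len(recent)

minute_prices = [262.0 + 0.01 * i for i in range(120)]  # hypothetical minute ticks
feature_value = sma(minute_prices)   # appended to the post's feature vector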

Table 5.2: Simple Moving Average - experimentation results (baseline MSE and prediction MSE with the SMA feature; full dataset sizes: AAPL 56,187, GOOG 97,931, FSLR 91,350, INTC 29,948; reduced dataset sizes: 14,046, 24,482, 22,837, 7,487; reduced with stop word removal and stemming: 13,943, 22,675, 24,238, 7,437).

Table 5.2 shows the results after running the experiments. We ran the experiments on the entire dataset as well as the reduced set described in the previous section and found that the results were again reasonably similar.

Figure 5.6: GOOG Simple Moving Average - experimentation results
Figure 5.7: FSLR Simple Moving Average - experimentation results

Figure 5.8: INTC Simple Moving Average - experimentation results

We also found that adding the SMA as a feature brought the MSE score for all stocks very close to the baseline, without outperforming it. As expected, stemming and removing stop words increased the performance in all cases, as shown in the last row of Table 5.2. On the other hand, the baseline also improved in 3 out of 4 cases, because stemming and stop word removal also affected the size of the training and test sets, which in turn affected the baseline calculation. Figures 5.5, 5.6, 5.7 and 5.8 show the results of the experiment on the time line, with the predicted price plotted against the actual price. As the results were still not satisfactory, we turned our focus to finding possible improvements relating to the query terms.

5.3 Weighted Query Terms

The analysis of the results obtained so far showed that our model is capable of building a regression line which follows the curve of the actual stock price. However, the performance was not good enough: the prediction line still deviated too strongly from the actual stock prices, indicating either that the list of keywords we used to construct our features retained tweets that were not relevant enough, or that the retained posts did not have terms that would help discriminate the direction of the stock price. In this section we look at the process of feature construction in more detail, starting by analyzing the Twitter posts that were retained after applying the query term filters used to remove irrelevant information. Our goal was to understand whether there was a useful correlation between price changes and keyword frequencies of posts. As an example, we will focus our discussion on the Apple stock. We counted the frequencies of each keyword in the entire corpus and found that keywords such as iphone, media, google, web, apple, and ipad had the highest frequency counts, while terms such as msft, aapl, mot, macosx, and goog had very low frequencies.

Most of the latter keywords are the stock symbols of the companies that compete with Apple, while the former keywords relate more to the Apple company and its product line. Then there are also keywords that are generally common, such as media, google, and web. We decided to take some of the keywords and plot them against the stock price, to see if there was a correlation between the price and the number of mentions of each word and to observe whether any keyword had a bigger impact than others. We found that keywords with a low frequency did not show noticeable correlations to price changes. On the other hand, keywords that had a very high frequency count and related directly to the company, including currently popular products and services, did show considerable correlation with price changes, e.g. apple and ipad. Yet other high frequency terms that did not relate to the company, or that described less popular products or services, such as the terms ipod or media, did not show similar correlations. Figure 5.9 depicts the keyword counts of two keywords, apple and ipod, as well as the price of the AAPL stock. The interval on the x-axis is represented in hours. The frequency of the keywords is aggregated over every hour; similarly, the price of the stock is averaged over every hour. While we do not expect the frequencies to model the price, we are able to see spikes in keyword mentions of the term apple during several strong price changes of the stock. We also included the term ipod to demonstrate, as a comparison, that it did not show comparable correlation with the price. Given that a number of terms showed correlations with price changes, we decided to investigate the text of these posts during the periods of high activity, to see what sort of language was used and what meaning and sentiment it reflected. Below we show five random samples from July 21, as we witnessed considerable spikes both in price and in the frequencies of key terms on that date. (Usernames are replaced by asterisks to protect their privacy.)

- theres a #mendeley iphone app.. going to check it out.
- apple enjoys solid q3 on strong ipad, iphone sales
- flashlight app secretly lets you enable iphone tethering #macworld
- apple profits soar thanks to iphone and ipad: sky news other technology companies have been posting good profits wi...
- the ipad clearly cannibalized mac sales last quarter except the opposite

Figure 5.9: AAPL query term frequencies of ipod and apple against the Apple stock price. For the apple term, we can observe correlation between high term frequencies and major price changes, as on July 21, where both the price and the term count rise significantly.

The samples exhibit sentiment that indicates a positive trend for Apple, as the discussions touched on the topic of Apple's positive quarterly profits. The increased mention of the keyword apple on Twitter therefore correlates with the increase in the price of the stock. These are the type of posts we would like our keyword filter method to retain. While we mentioned many benefits associated with the short size of Twitter posts in Section 2.3, a drawback is that more query terms are required to select posts that may contain important information capable of building a close regression model. Since the previous results showed that the majority of key terms did not have noticeable correlations with price changes, we did not want to penalize these key terms by removing them entirely from the filter process. But at the same time, we did not want to select tweets just on the basis of matching a seemingly unimportant key term. We therefore assigned weights between 0 and 1 to the key terms. This was mainly a semi-manual process in which we examined the price changes and frequencies of the key terms in charts we generated, such as the one in Figure 5.9. Generating the weighted query terms was the first part of improving our filter method. We also needed to assign a weight to each occurrence of the query terms in a Twitter post.

For this task, we counted the occurrences of every query term appearing in each Twitter post as well as the total number of times each word appeared in the entire corpus. With these counts we used the tf-idf algorithm to calculate the weights of each post. This algorithm has two parts:

tf_{ik} = f_{ik}    (5.1)

where tf_{ik} is the term frequency weight of term k in post i and f_{ik} is the number of occurrences of term k in the post.

idf_k = log(N / n_k)    (5.2)

where idf_k is the inverse document frequency weight for term k, N is the number of posts in the Twitter dataset, and n_k is the number of posts in which term k occurs. To calculate the final weight, equations (5.1) and (5.2) are multiplied. Usually equation (5.1) is normalized in order to avoid favoring longer documents, since they may contain a higher query term count regardless of the general importance of the term. Since Twitter posts are limited to 140 characters, we decided to remove the normalization factor. In our first experiments, the pre-processing step retained all tweets that contained at least one of the query terms. With the calculation of the query term weights as well as the tf-idf weights described above, we want to reduce the dataset by selecting only posts that adequately match the weighted query terms. To compare how closely a document matches our weighted query terms, we apply the cosine similarity measure defined as:

Cosine(D_i, Q) = (sum_{j=1}^{t} d_{ij} q_j) / (sqrt(sum_{j=1}^{t} d_{ij}^2) * sqrt(sum_{j=1}^{t} q_j^2))    (5.3)

where D_i is the vector of weighted query terms occurring in the Twitter post and Q is the list of weighted query terms. The numerator is the inner product of the post weights and the query term weights; the denominator normalizes the resulting weight by the product of the norms of both vectors. We found that using a threshold of 0.35 gave the best results, returning the most relevant Twitter posts. As a remark on our previous experimentation with different dataset sizes, we would like to point out that the cosine similarity measure reduced the initial dataset considerably, removing posts that did not reach the threshold. We therefore stopped experimenting with two separate dataset sizes as described in Sections 5.1 and 5.2.
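The following sketch mirrors equations (5.1) to (5.3); the query weights and post weights in the example are hypothetical:

import math

def tfidf(tf, df, n_posts):
    """Unnormalized tf-idf per (5.1) and (5.2): tf_ik * log(N / n_k)."""
    return tf * math.log(n_posts / df)

def cosine(doc, query):
    """Cosine similarity (5.3) between two sparse term -> weight mappings."""
    num = sum(w * query[t] for t, w in doc.items() if t in query)
    den = (math.sqrt(sum(w * w for w in doc.values()))
           * math.sqrt(sum(w * w for w in query.values())))
    return num / den if den else 0.0

query = {'apple': 1.0, 'ipad': 0.9, 'media': 0.2}          # weighted query terms
post = {'apple': tfidf(2, 150, 10000), 'video': tfidf(1, 900, 10000)}
keep_post = cosine(post, query) >= 0.35                    # threshold from the text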

Table 5.3: Weighted Query Terms Filter Method - experimentation results (baseline MSE and prediction MSE with the SMA feature for AAPL, GOOG, FSLR, and INTC).

Figure 5.10: AAPL Weighted Query Terms Filter Method - experimentation results

After running the experiments on the new datasets we achieved improvements over the previous results, as shown in Table 5.3. It should be noted that the MSE of the baseline also improved, due to the fact that we retained a smaller subset of posts. Figures 5.10, 5.11, 5.12, and 5.13 show the prediction results of AAPL, GOOG, FSLR, and INTC respectively against the actual price of each stock. The reduction of the datasets had the highest impact on the INTC stock. This is contrary to the assumptions made in Section 3.5.1, where we presumed that FSLR would have fewer samples than the other stocks.

Figure 5.11: GOOG Weighted Query Terms Filter Method - experimentation results

Figure 5.12: FSLR Weighted Query Terms Filter Method - experimentation results
Figure 5.13: INTC Weighted Query Terms Filter Method - experimentation results

Table 5.4: 15 Minute Predictions - experimentation results (baseline MSE, 15 minute prediction MSE, and profit/loss with 10,000 units starting capital for AAPL, GOOG, FSLR, and INTC).

Figure 5.14: AAPL 15 Minute Predictions - experimentation results

Figure 5.15: GOOG 15 Minute Predictions - experimentation results
Figure 5.16: FSLR 15 Minute Predictions - experimentation results
Figure 5.17: INTC 15 Minute Predictions - experimentation results

5.4 15 Minute Predictions

Up until now, experiment predictions were made using the price at the time of the release of the Twitter posts. Yet our goal is to predict the stock price t minutes into the future, so that an agent could use a trading rule and decide whether to act upon the prediction and buy or short a security, or stay idle until a better prediction is made. Since the results of the experiments in Section 5.3 approached the baseline fairly well in all cases, we decided to transform our datasets to use stock quotes from 15 minutes after the Twitter posts were released. The reason we chose a 15 minute prediction is merely that new chunks of Twitter posts are released to us every 15 minutes. In this set of experiments we started using our Virtual Stock Trading Engine introduced in Section 3.5.3. Table 5.4 shows the results after transforming the target values in the datasets, while Figures 5.14 to 5.17 again show the predicted price against the actual price for all four stocks. As we expected, the results decreased slightly in every test case, because in the previous dataset the SMA feature was much closer to the target value, whereas now it is 15 minutes further away. Nonetheless, the results are still satisfactory enough to continue experimenting in this domain. The trading engine showed mixed results: with 10,000 units starting capital and 1 unit transaction cost (based on the transaction costs of the online broker firm Interactive Brokers), only AAPL and GOOG successfully turned the investment into profits during one trading session.

5.5 Parameter Validation

In all previous experiments, we used 90% of the time series data for training and the final day (10%) for testing. In a real world application, we would probably not train a model to predict prices for an entire day, since that model may ignore new and unseen events that could not have been captured during training. Also, while the MSE approached the baseline fairly well in all cases, we did not manage to beat the baseline in any test case. In particular, we wanted to know whether a model would perform similarly or better if it could capture features in a shorter time interval. We therefore asked if it was possible to train several models using subsets of the dataset and forecast prices during the entire two week period rather than just on the last day. Specifically, can we take a few hours, or even several minutes, of Twitter posts and predict the stock price with similar accuracy?

Each new model takes the form of

<train_1> <test_1^{+15min}>  <train_2> <test_2^{+15min}>  ...  <train_t> <test_t^{+15min}>    (5.4)

where <train_t> represents a subset of n training instances, <test_t^{+15min}> contains one testing sample which is at least 15 minutes away from the training samples, and t is the number of possible train/test subsets for the two week period. We made some modifications to the Virtual Stock Trading Engine in order to accommodate the dataset transformations. Finally, while all previous experiments built a model using the same SVR parameters as reported in Section 3.5.1, and we reported the experiments with the most effective parameters, we still continued experimenting with different control parameters. Around 60% of the time, the same parameters performed best. However, as an additional improvement, we wanted to find out whether a validation set could always select the optimal parameters for the algorithm, under the assumption that a trained model was capable of forecasting twice the distance into the future. The reason for this assumption is that all samples contain the 15 minute forecast price as the target value. This means that, in a real world scenario, validation could only happen after observing the actual target value, which would be revealed after 15 minutes. The same assumption is made (after validation) for the case of predicting the final price, which must therefore be twice the distance (30 minutes) into the future. For the experiments conducted in this section, the datasets then take the form of

<train_1> <val_1^{+15min}> <test_1^{+30min}>  ...  <train_t> <val_t^{+15min}> <test_t^{+30min}>    (5.5)

where the training set now also includes a validation set, with <val_t^{+15min}> 15 minutes into the future for validation and parameter optimization, and <test_t^{+30min}> 30 minutes into the future for the actual prediction using the optimized parameters. Table 5.5 shows the results of the 15 minute predictions as well as the 30 minute predictions including 15 minute validation. In the first three rows of the table we included results from the validation set without making any future forecasts, in order to compare the performance of validation on a simple model. This is equivalent to a 0 minute wait period before using samples for validation or testing.
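A minimal sketch of this windowing, assuming a time-ordered list of one-per-minute samples (the window size and step are illustrative, not the values used in our runs):

def rolling_windows(samples, n_train=100, step=1):
    """Yield (train, validation, test) per (5.5): the validation sample lies
    15 minutes past the training window and the test sample 30 minutes past."""
    for start in range(0, len(samples) - n_train - 30, step):
        last = start + n_train - 1
        train = samples[start:start + n_train]
        val = samples[last + 15]
        test = samples[last + 30]
        yield train, val, test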

Table 5.5: Multiple Equally Sized Datasets - experimentation results (for each of AAPL, GOOG, FSLR, and INTC: baseline MSE, prediction MSE, and profit/loss under three settings: 0 minute validation & 0 minute prediction; no validation & 15 minute prediction; 15 minute validation & 30 minute prediction).

Figure 5.18: AAPL Multiple Equally Sized Datasets - experimentation results
Figure 5.19: AAPL Multiple Equally Sized Datasets - prediction error results

Figure 5.20: GOOG Multiple Equally Sized Datasets - experimentation results
Figure 5.21: GOOG Multiple Equally Sized Datasets - prediction error results
Figure 5.22: FSLR Multiple Equally Sized Datasets - experimentation results
Figure 5.23: FSLR Multiple Equally Sized Datasets - prediction error results

We also evaluated the 15 minute predictions without validation by plotting the prediction results as dots (in red) against the actual price line (in blue), as seen in Figures 5.18, 5.20, and 5.22 for AAPL, GOOG, and FSLR respectively. Furthermore, we plot on a separate figure the difference between the predicted and the actual price, as seen in Figures 5.19, 5.21 and 5.23. Each point on the graph comes from a single prediction during the two week interval; the points are connected to form a line. While we cannot draw direct comparisons to our previous experiments, because we are using different test sets, the results for the 15 minute prediction without validation are similar to previous results. These results show that using shorter training sets to build a model for short term prediction is possible. Once again, we do not have any valuable results for INTC, since the dataset was too small. A point to note on all graphs showing the prediction dots against the actual price line is the deep square humps and dips, which look like irregularities. These are in fact the price values during the pre- or after-market hours, which were included in these experiments since predictions were now made across multiple days. In terms of the experiments over different forecasting distances, we found mixed results concerning the Virtual Stock Trading Engine. The first experiment used samples for validation and testing that were 0 minutes away from the training instances. While this is not a possible real world scenario, our aim was to test the performance of the parameter optimization. As expected, in all cases of the first experiment a profit was attained. The second experiment's results are rather unexpected, since in two out of the three cases the profit increased over the previous experiment. However, in the case of FSLR, the profit decreased substantially, as anticipated. Finally, the third experiment had the disadvantage of predicting 30 minutes into the future, and as we expected, more test cases showed significant losses. In the case of AAPL, however, we saw a large increase in profits. We assume that the reason is probably increased random noise as we attempt to predict further into the future.

5.6 Accumulating Training Data

Our final experiment built upon the previous experiments, which used multiple small datasets. This time, however, we asked if we could increase the accuracy over time by keeping all previously seen samples and reusing them in all subsequent training sets. Therefore, rather than using equally sized training sets, we began by constructing a small training set containing 100 training instances and a test set with one sample. The samples drawn for this initial dataset were again from the very beginning of the time series. The following training sets were built by accumulating previous training data and concatenating it with new samples.
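A minimal sketch of this accumulation scheme; `fit` and `predict` stand in for the LIBSVM training and prediction calls and are hypothetical:

def accumulate_and_predict(samples, fit, predict, n_initial=100):
    """samples: time-ordered (features, price) pairs, one per minute.
    After each prediction, one newly seen sample joins the training set."""
    train = list(samples[:n_initial])
    errors = []
    i = n_initial
    while i + 15 < len(samples):
        model = fit(train)
        features, actual = samples[i + 15]   # target 15 minutes ahead
        errors.append((predict(model, features) - actual) ** 2)
        train.append(samples[i])             # keep every sample seen so far
        i += 1
    return sum(errors) / len(errors)         # MSE over the whole period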

Table 5.6: Accumulating Training Data - experimentation results (baseline MSE and prediction MSE with accumulating training data for AAPL, GOOG, FSLR, and INTC).

Figure 5.24: AAPL Accumulating Training Data - experimentation results

We added one new sample after every prediction was made; however, this new sample was also 15 minutes away from the last predicted value. In this experiment the training set gradually grew, containing more and more samples. Using this method, we expected to see higher errors at the beginning of the timeline while, as the number of training samples increased, the error should gradually decrease. Figures 5.24, 5.26, and 5.28 show the actual prices against the predicted prices. The first aspect to notice is that there are many more prediction dots. This is due to the fact that we accumulate one sample at a time to build each new training set, rather than having fewer fixed-size training sets as in Section 5.5. Moreover, the predicted values model the actual price lines more accurately from a graphical point of view.

Figure 5.25: AAPL Accumulating Training Data - prediction error results

Figure 5.26: GOOG Accumulating Training Data - experimentation results
Figure 5.27: GOOG Accumulating Training Data - prediction error results
Figure 5.28: FSLR Accumulating Training Data - experimentation results
Figure 5.29: FSLR Accumulating Training Data - prediction error results

Nonetheless, the aim of this experiment was to test whether accumulating training data would decrease the prediction error over time. We can measure this by comparing Figures 5.25, 5.27, and 5.29 with the equivalent figures from the previous experiment, 5.19, 5.21, and 5.23 respectively. It is not very clear whether accumulating training data improved results over time, but certain decreases in the spikes show this effect. A decrease in error is certainly noticeable in the case of the GOOG stock and, to a certain extent, also in the case of AAPL. FSLR, on the other hand, shows signs of strong variations towards the right of the graphs in both Figure 5.23 and Figure 5.29. These spikes seem to correlate with the strong decline in the actual price seen in Figures 5.22 and 5.28, which may not have been captured in the trained model. From an error score perspective, we again have results similar to before.

5.7 Chapter 5 Summary

In this chapter we presented different experiments to validate whether it is possible to model the stock market using Twitter, and further tested whether we can predict future stock prices. We started with the basic bag of words approach in the first experiment. We added the SMA as a feature in the second experiment, which significantly improved results. We then fine-tuned the results by exploring feature selection techniques. We next turned our attention to a new set of experiments and tested the prediction of future price movements, including the use of a validation set to fine-tune model parameters. Moreover, experiments were conducted on subsets of the dataset in order to predict prices during the entire two week period.

Chapter 6

Conclusion

6.1 Discussion

The experiments described in the previous chapter can be divided into two categories. The first set of experiments concentrated on building a regression model of the current stock price using Twitter posts. That is, the target value in each data sample used the price of a stock that was current at the release of a Twitter post. The second set of experiments attempted to use a target price that was set in the future. A further distinction can be made in the use of the data to create training and test sets. The initial experiments used 90% of the data in chronological order for training and the final 10% for testing. In contrast, later experiments divided the dataset into several training and testing sets in order to predict prices during the entire two week period of available data. Our discussion begins by analyzing the first category of experiments. Results from the simple bag of words model in Section 5.1 indicate one major problem: training on the entire data from the first nine of the ten business days merely predicted an average of the training period. Thus we saw a straight line with many noise spikes, as depicted in Figure 5.1. We expected that using a lot of data for training would help accumulate distinctive support vectors. However, in time series prediction, the older data becomes, the more it loses its predictive power. Since we gave all the data the same importance, that is, we did not decrease the weight of older data, the algorithm used the entire dataset equally, creating the average lines we have shown. To overcome this problem, we adjusted our feature vector by using the SMA of the last 60 minutes of the stock price. This method had proved successful in the research conducted by (Schumaker and Chen, 2009), where they used the last known price rather than the SMA.

An additional improvement that may help in reducing the importance of older data could be borrowed from an algorithm in Reinforcement Learning (RL). In RL, learning environments are described by a set of finite or infinite states in which the learner finds itself. While there are different learning approaches, learned values for each state are generally stored in a value function. A concept called Eligibility Traces is used to give more importance to recently visited states while decreasing the influence of states that have not been visited in a while (Sutton and Barto, 1998). These values are then updated in the value function. This same principle could be applied to the posts, by giving older posts smaller weight than more recent posts. Furthermore, posts that lie too far in the past could be pruned off. While the new SMA feature did not yet address the problems of time series data we mentioned, it improved the results significantly. As shown in Figure 5.5, the prediction line adjusted to the actual price line but still contained many spikes, which indicated not only that we had not yet found features that were relevant to our task, but also that there were a lot of samples which were probably noise or spam. We started addressing this problem in Section 5.3 by analyzing the query terms used to reduce and filter the raw dataset of irrelevant posts. One key problem with the short size of Twitter posts is that using too few query terms will inevitably skip posts which may have strong predictive content. As we described for our use of the query expansion algorithm in Section 3.2, a second problem arises: using too many keywords that relate to our query will include many posts that are not relevant to our task. It is essential to find the right balance, which we attempted by applying weights to the query terms. However, other methods should also be taken into consideration in future work, such as calculating information gain on individual features in order to remove the least helpful features and apply appropriate weights to more informative features. While these improvements helped increase the accuracy significantly towards the baseline, as seen in Table 5.3, one of the most important contributors to our remaining prediction error is spam and noise. Twitter's rapid growth in popularity has triggered a constant battle between relevancy and spam in Twitter posts, which has been fueled by the releases of the Twitter API. Furthermore, we should take into account who is posting information that is considered relevant and influential, which posts are being re-tweeted most often, and which users have the strongest reach. This information can be obtained from a combination of Twitter meta-data and the network of connected users. This knowledge could then be used to assign additional weights to different posts. We believe that the highest improvements in our thesis can be gained from identifying and removing spam as well as identifying and ranking relevant sources (i.e. user accounts).
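To make the eligibility-trace analogy above concrete, one hypothetical realization (our own sketch, not something we implemented) would weight each post by its age,

w_i = \lambda^{(t_{now} - t_i)}, \quad 0 < \lambda < 1,

where t_i is the minute at which post i was published. Every minute of age then shrinks a post's influence by a constant factor, and posts whose weight falls below a small threshold could be pruned from the training set.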

Given our experimentation with different stocks, we obtained comparable results in most cases, but had expected the error measures of First Solar (FSLR) and Intel Corporation (INTC) to be worse than those of Google (GOOG) and Apple (AAPL). We found that the FSLR and INTC datasets did not contain considerable fluctuations in price or noticeable spikes, and that those factors contributed to the accuracy observed. Future experimentation should therefore include more data with various fluctuations, as well as stocks that are not part of the technology domain. In the second category of experiments, we examined the prediction of future prices as well as the implementation of new training/testing intervals. We changed the target value of all samples in our dataset to use the price 15 minutes ahead of the current value. Results in Table 5.4 show an increase in the MSE, more than doubling for Apple, with similar results for the other tested stocks. Since this was expected, we moved on to address the problem of the datasets. As we explored above, using 90% of the dataset for training and the remaining 10% for testing proved to exhibit problems in time series prediction. For this reason, we created multiple shorter datasets, in order to capture relevant information in close proximity to the prediction at hand. To test our results for the 15 minute predictions, we used the Virtual Stock Trading Engine described in Section 3.5.3. The stock trading engine did not take into account all the predictions made by the model; it only selected those that would yield a potential profit above a threshold of twice the commission rate. While we found variations across datasets, generally our findings indicate that using Twitter as a source of near real-time information to predict the price ahead of time can yield reasonable profits before the market adjusts itself. As the time difference increases, the profits become less stable, as shown in Table 5.5. From the results using smaller datasets, we can see that the results are similar to our initial experiments. While the MSE results are still not better than the baseline, the graphs of the predictions indicate closer matches, as seen in Figure 5.24, with less noise compared to previous experiments.

6.2 Conclusion

We set out to build a regression model of stocks using Twitter and Intra-Day minute data. We used several NLP techniques to pre-process the raw Twitter text, including word tokenization, stop-word removal, and stemming. We implemented filtering techniques that included keyword expansion, term weighting using the tf-idf weighting scheme, and the cosine similarity measure to reduce the dataset and create a feature vector space of the tokenized and weighted terms.

In our experiments, we found that the best feature vector, with the lowest error measure, was obtained from our feature selection methods with the addition of a new feature which we constructed using the average stock price over the last 60 ticks of minute stock quotes. The results show that the predictions were very close to our strong baseline for Apple, and similarly Google's MSE score was close to its baseline. Finally, we also found that predicting the future price can be achieved at short distances (15 minutes) into the future, but accuracy becomes unstable as the forecast distance increases (30 minutes). From our work, we conclude that information and beliefs can be extracted from the population, giving a small but significant advantage in predicting market prices. In Section 6.3 we point out possible improvements for strengthening our claims.

6.3 Future Work

For future work, we would be interested in exploring additional features as well as filtering methods using Twitter meta-data. For example, the meta-data contains information about the numbers of followers and following users; these values can be used to determine important and influential users. The meta-data field statuses_count may be useful for distinguishing spammers from real users, as real users publish on average fewer than 100 posts per day (Mowbray, 2010). In parallel with new, improved features, the problem of spam must also be addressed. Twitter meta-data, Twitter statistics, and Twitter-specific functions such as re-tweets and hash tags could be a starting point. Important information may have been lost in the process of tokenizing the data. Therefore, creating proper rules that are specific to the Twitter dataset, in conjunction with the topic of financial markets, may lead to additional improvements. Additionally, using larger datasets including stocks from different domains, as well as recording stock data from time periods with more volatile price changes and stock volumes, would be an important direction of research. Sentiment analysis has proved able to discriminate the beliefs of the population on different topics; therefore adding lexical knowledge about positive and negative terms could potentially lead to additional meaningful features.

Appendix A

Stock Charts

The following figures, A.1 (AAPL), A.2 (GOOG), and A.3 (FSLR), illustrate snapshots of the charts of three of the four stocks we selected for our experiments. The chart for INTC can be found in Section 3.5.1. The snapshots are from the time period between July 19th, 2010 and July 30th, 2010. The data was retrieved from the Google Finance website.

Figure A.1: Apple Inc. stock chart snapshot (AAPL)
Figure A.2: Google Inc. stock chart snapshot (GOOG)

Figure A.3: First Solar, Inc. stock chart snapshot (FSLR)


More information

Recognizing Informed Option Trading

Recognizing Informed Option Trading Recognizing Informed Option Trading Alex Bain, Prabal Tiwaree, Kari Okamoto 1 Abstract While equity (stock) markets are generally efficient in discounting public information into stock prices, we believe

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Forecasting stock markets with Twitter

Forecasting stock markets with Twitter Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Predicting Stock Market Fluctuations. from Twitter

Predicting Stock Market Fluctuations. from Twitter Predicting Stock Market Fluctuations from Twitter An analysis of the predictive powers of real-time social media Sang Chung & Sandy Liu Stat 157 Professor ALdous Dec 12, 2011 Chung & Liu 2 1. Introduction

More information

A CRF-based approach to find stock price correlation with company-related Twitter sentiment

A CRF-based approach to find stock price correlation with company-related Twitter sentiment POLITECNICO DI MILANO Scuola di Ingegneria dell Informazione POLO TERRITORIALE DI COMO Master of Science in Computer Engineering A CRF-based approach to find stock price correlation with company-related

More information

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet

More information

Analyzing Parts of Speech and their Impact on Stock Price

Analyzing Parts of Speech and their Impact on Stock Price Analyzing Parts of Speech and their Impact on Stock Price Robert P. Schumaker Computer and Information Science Dept. Cleveland State University Cleveland, Ohio 44115, USA rob.schumaker@gmail.com Word Count:

More information

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Anthony Lai (aslai), MK Li (lilemon), Foon Wang Pong (ppong) Abstract Algorithmic trading, high frequency trading (HFT)

More information

Stock Market Forecasting Using Machine Learning Algorithms

Stock Market Forecasting Using Machine Learning Algorithms Stock Market Forecasting Using Machine Learning Algorithms Shunrong Shen, Haomiao Jiang Department of Electrical Engineering Stanford University {conank,hjiang36}@stanford.edu Tongda Zhang Department of

More information

Textual Analysis of Stock Market Prediction Using Financial News Articles

Textual Analysis of Stock Market Prediction Using Financial News Articles Textual Analysis of Stock Market Prediction Using Financial News Articles Robert P. Schumaker and Hsinchun Chen Artificial Intelligence Lab, Department of Management Information Systems The University

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

On the Predictability of Stock Market Behavior using StockTwits Sentiment and Posting Volume

On the Predictability of Stock Market Behavior using StockTwits Sentiment and Posting Volume On the Predictability of Stock Market Behavior using StockTwits Sentiment and Posting Volume Abstract. In this study, we explored data from StockTwits, a microblogging platform exclusively dedicated to

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

JamiQ Social Media Monitoring Software

JamiQ Social Media Monitoring Software JamiQ Social Media Monitoring Software JamiQ's multilingual social media monitoring software helps businesses listen, measure, and gain insights from conversations taking place online. JamiQ makes cutting-edge

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Slide 7. Jashapara, Knowledge Management: An Integrated Approach, 2 nd Edition, Pearson Education Limited 2011. 7 Nisan 14 Pazartesi

Slide 7. Jashapara, Knowledge Management: An Integrated Approach, 2 nd Edition, Pearson Education Limited 2011. 7 Nisan 14 Pazartesi WELCOME! WELCOME! Chapter 7 WELCOME! Chapter 7 WELCOME! Chapter 7 KNOWLEDGE MANAGEMENT TOOLS: WELCOME! Chapter 7 KNOWLEDGE MANAGEMENT TOOLS: Component Technologies LEARNING OBJECTIVES LEARNING OBJECTIVES

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Sentiment Analysis on Big Data

Sentiment Analysis on Big Data SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social

More information

BeeSocial. Create A Buzz About Your Business. Social Media Marketing. Bee Social Marketing is part of Genacom, Inc. www.genacom.

BeeSocial. Create A Buzz About Your Business. Social Media Marketing. Bee Social Marketing is part of Genacom, Inc. www.genacom. BeeSocial M A R K E T I N G Create A Buzz About Your Business Social Media Marketing Bee Social Marketing is part of Genacom, Inc. www.genacom.com What is Social Media Marketing? Social Media Marketing

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

Role of Social Networking in Marketing using Data Mining

Role of Social Networking in Marketing using Data Mining Role of Social Networking in Marketing using Data Mining Mrs. Saroj Junghare Astt. Professor, Department of Computer Science and Application St. Aloysius College, Jabalpur, Madhya Pradesh, India Abstract:

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Text Opinion Mining to Analyze News for Stock Market Prediction

Text Opinion Mining to Analyze News for Stock Market Prediction Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, March 2014 ISSN 2074-8523; Copyright SCRG Publication, 2014 Text Opinion Mining to Analyze News for Stock Market Prediction Yoosin Kim 1, Seung Ryul

More information

Challenges of Cloud Scale Natural Language Processing

Challenges of Cloud Scale Natural Language Processing Challenges of Cloud Scale Natural Language Processing Mark Dredze Johns Hopkins University My Interests? Information Expressed in Human Language Machine Learning Natural Language Processing Intelligent

More information

Capturing Meaningful Competitive Intelligence from the Social Media Movement

Capturing Meaningful Competitive Intelligence from the Social Media Movement Capturing Meaningful Competitive Intelligence from the Social Media Movement Social media has evolved from a creative marketing medium and networking resource to a goldmine for robust competitive intelligence

More information

smart. uncommon. ideas.

smart. uncommon. ideas. smart. uncommon. ideas. Executive Overview Your brand needs friends with benefits. Content plus keywords equals more traffic. It s a widely recognized onsite formula to effectively boost your website s

More information

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016 Network Machine Learning Research Group S. Jiang Internet-Draft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draft-jiang-nmlrg-network-machine-learning-00

More information

Modelling Stock Volume Using Twitter

Modelling Stock Volume Using Twitter Modelling Stock Volume Using Twitter Andrew Cropper Submitted to Oxford University for the degree of MSc Computer Science September 2011 Abstract Stock market prediction is nothing new. For years researchers

More information

Financial Market Efficiency and Its Implications

Financial Market Efficiency and Its Implications Financial Market Efficiency: The Efficient Market Hypothesis (EMH) Financial Market Efficiency and Its Implications Financial markets are efficient if current asset prices fully reflect all currently available

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING MEDIA MONITORING AND ANALYSIS GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING Searchers Reporting Delivery (Player Selection) DATA PROCESSING AND CONTENT REPOSITORY ADMINISTRATION AND MANAGEMENT

More information

I N D U S T R Y T R E N D S & R E S E A R C H R E P O R T S F O R I N D U S T R I A L M A R K E T E R S. Social Media Use in the Industrial Sector

I N D U S T R Y T R E N D S & R E S E A R C H R E P O R T S F O R I N D U S T R I A L M A R K E T E R S. Social Media Use in the Industrial Sector I N D U S T R Y T R E N D S & R E S E A R C H R E P O R T S F O R I N D U S T R I A L M A R K E T E R S Social Media Use in the Industrial Sector Contents Executive Summary...3 An Introduction to Social

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

A Proposed Prediction Model for Forecasting the Financial Market Value According to Diversity in Factor

A Proposed Prediction Model for Forecasting the Financial Market Value According to Diversity in Factor A Proposed Prediction Model for Forecasting the Financial Market Value According to Diversity in Factor Ms. Hiral R. Patel, Mr. Amit B. Suthar, Dr. Satyen M. Parikh Assistant Professor, DCS, Ganpat University,

More information

How can we discover stocks that will

How can we discover stocks that will Algorithmic Trading Strategy Based On Massive Data Mining Haoming Li, Zhijun Yang and Tianlun Li Stanford University Abstract We believe that there is useful information hiding behind the noisy and massive

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Online Reputation Management Services

Online Reputation Management Services Online Reputation Management Services Potential customers change purchase decisions when they see bad reviews, posts and comments online which can spread in various channels such as in search engine results

More information

Deposit Identification Utility and Visualization Tool

Deposit Identification Utility and Visualization Tool Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in

More information

A U T H O R S : G a n e s h S r i n i v a s a n a n d S a n d e e p W a g h Social Media Analytics

A U T H O R S : G a n e s h S r i n i v a s a n a n d S a n d e e p W a g h Social Media Analytics contents A U T H O R S : G a n e s h S r i n i v a s a n a n d S a n d e e p W a g h Social Media Analytics Abstract... 2 Need of Social Content Analytics... 3 Social Media Content Analytics... 4 Inferences

More information

Beating the NCAA Football Point Spread

Beating the NCAA Football Point Spread Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over

More information

HOW-TO GUIDE. for. Step-by-step guide on how to transform your online press release into an SEO press release PUBLIC RELATIONS HOW-TO GUIDE

HOW-TO GUIDE. for. Step-by-step guide on how to transform your online press release into an SEO press release PUBLIC RELATIONS HOW-TO GUIDE HOW-TO GUIDE for OPTIMIZING PRESS RELEASES Step-by-step guide on how to transform your online press release into an SEO press release PUBLIC RELATIONS HOW-TO GUIDE Presented by NASDAQ OMX GlobeNewswire

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

Keywords social media, internet, data, sentiment analysis, opinion mining, business

Keywords social media, internet, data, sentiment analysis, opinion mining, business Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction

More information

Big Data and High Quality Sentiment Analysis for Stock Trading and Business Intelligence. Dr. Sulkhan Metreveli Leo Keller

Big Data and High Quality Sentiment Analysis for Stock Trading and Business Intelligence. Dr. Sulkhan Metreveli Leo Keller Big Data and High Quality Sentiment Analysis for Stock Trading and Business Intelligence Dr. Sulkhan Metreveli Leo Keller The greed https://www.youtube.com/watch?v=r8y6djaeolo The money https://www.youtube.com/watch?v=x_6oogojnaw

More information

CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques

CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques Chris MacLellan cjmaclel@asu.edu May 3, 2012 Abstract Different methods for aggregating twitter sentiment data are proposed and three

More information

Can Twitter provide enough information for predicting the stock market?

Can Twitter provide enough information for predicting the stock market? Can Twitter provide enough information for predicting the stock market? Maria Dolores Priego Porcuna Introduction Nowadays a huge percentage of financial companies are investing a lot of money on Social

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

The Use of Twitter Activity as a Stock Market Predictor

The Use of Twitter Activity as a Stock Market Predictor National College of Ireland Higher Diploma in Science in Data Analytics 2013/2014 Robert Coyle X13109278 robert.coyle@student.ncirl.ie The Use of Twitter Activity as a Stock Market Predictor Table of Contents

More information

Digital Marketing Capabilities

Digital Marketing Capabilities Digital Marketing Capabilities Version : 1.0 Date : 17-Apr-2015 Company Framework Focus on ROI 2 Introduction SPACECOS is a leading IT services and marketing solutions provider. We provide the winning

More information

Making Your Marketing Interactive

Making Your Marketing Interactive Making Your Marketing Interactive New Opportunities to Engage Customers with Live Chat Companies around the world are using live chat to boost online sales, reduce customer service costs and increase customer

More information

SEO Guide for Front Page Ranking

SEO Guide for Front Page Ranking SEO Guide for Front Page Ranking Introduction This guide is created based on our own approved strategies that has brought front page ranking for our different websites. We hereby announce that there are

More information

Leveraging Global Media in the Age of Big Data

Leveraging Global Media in the Age of Big Data WHITE PAPER Leveraging Global Media in the Age of Big Data Introduction Global media has the power to shape our perceptions, influence our decisions, and make or break business reputations. No one in the

More information

CS229 Project Report Automated Stock Trading Using Machine Learning Algorithms

CS229 Project Report Automated Stock Trading Using Machine Learning Algorithms CS229 roject Report Automated Stock Trading Using Machine Learning Algorithms Tianxin Dai tianxind@stanford.edu Arpan Shah ashah29@stanford.edu Hongxia Zhong hongxia.zhong@stanford.edu 1. Introduction

More information

I.e., the return per dollar from investing in the shares from time 0 to time 1,

I.e., the return per dollar from investing in the shares from time 0 to time 1, XVII. SECURITY PRICING AND SECURITY ANALYSIS IN AN EFFICIENT MARKET Consider the following somewhat simplified description of a typical analyst-investor's actions in making an investment decision. First,

More information

Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction

Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction Proceedings of the Federated Conference on Computer Science and Information Systems pp. 1349 1354 DOI: 10.15439/2015F230 ACSIS, Vol. 5 Sentiment Analysis of Twitter Data within Big Data Distributed Environment

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Contents. Meltwater Quick-Start Guide

Contents. Meltwater Quick-Start Guide Meltwater Quick-Start Guide Contents Introduction... 2 Meltwater at a Glance... 2 Logging in... 3 Account Management... 3 Searches... 4 Keyword Search... 6 Advanced Search... 7 Source Selections... 9 Inbox...

More information

Online Ensembles for Financial Trading

Online Ensembles for Financial Trading Online Ensembles for Financial Trading Jorge Barbosa 1 and Luis Torgo 2 1 MADSAD/FEP, University of Porto, R. Dr. Roberto Frias, 4200-464 Porto, Portugal jorgebarbosa@iol.pt 2 LIACC-FEP, University of

More information

Social Media Implementations

Social Media Implementations SEM Experience Analytics Social Media Implementations SEM Experience Analytics delivers real sentiment, meaning and trends within social media for many of the world s leading consumer brand companies.

More information

Executive Dashboard Cookbook

Executive Dashboard Cookbook Executive Dashboard Cookbook Rev: 2011-08-16 Sitecore CMS 6.5 Executive Dashboard Cookbook A Marketers Guide to the Executive Insight Dashboard Table of Contents Chapter 1 Introduction... 3 1.1 Overview...

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE

CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE Michael Diederich, Microsoft CMG Research & Insights Introduction The rise of social media platforms like Facebook and Twitter has created new

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Sentiment Score based Algorithmic Trading

Sentiment Score based Algorithmic Trading Sentiment Score based Algorithmic Trading 643 1 Sukesh Kumar Ranjan, 2 Abhishek Trivedi, 3 Dharmveer Singh Rajpoot 1,2,3 Department of Computer Science and Engineering / Information Technology, Jaypee

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

II. RELATED WORK. Sentiment Mining

II. RELATED WORK. Sentiment Mining Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

More information

INTRODUCTION TO THE WEB

INTRODUCTION TO THE WEB INTRODUCTION TO THE WEB A beginner s guide to understanding and using the web 3 September 2013 Version 1.2 Contents Contents 2 Introduction 3 Skill Level 3 Terminology 3 Video Tutorials 3 How Does the

More information

Social Media - The Ideal Workstation For Technical Professionals

Social Media - The Ideal Workstation For Technical Professionals IHS GLOBALSPEC RESEARCH REPORT 2014 SOCIAL MEDIA USE IN THE INDUSTRIAL SECTOR Contents Executive Summary 3 How to Use Social Media 4 Define Your Goals and Objectives 4 Understand What Platforms Your Audience

More information

MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS 2014. November 7, 2014. Machine Learning Group

MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS 2014. November 7, 2014. Machine Learning Group Big Data and Its Implication to Research Methodologies and Funding Cornelia Caragea TARDIS 2014 November 7, 2014 UNT Computer Science and Engineering Data Everywhere Lots of data is being collected and

More information

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach Unlocking The Value of the Deep Web Harvesting Big Data that Google Doesn t Reach Introduction Every day, untold millions search the web with Google, Bing and other search engines. The volumes truly are

More information