Social Media Aided Stock Market Predictions by Sparsity Induced Regression


Delft Center for Systems and Control

For the degree of Master of Science in Systems and Control at Delft University of Technology

October 26, 2014

Faculty of Mechanical, Maritime and Materials Engineering (3mE)
Delft University of Technology

Copyright. All rights reserved.

Abstract

Prediction of the stock market has been a research topic for decades. Recently, data from social media like Google and Twitter have been included in prediction models. These data serve as indicators of sentiments that are potentially useful for prediction. Current prediction methods, however, are cumbersome to interpret: it is not known beforehand which data are relevant for the prediction and hence which data should be added to the model. To improve the interpretability, and thereby the credibility, of the results, this thesis uses sparse regression methods that automatically discard data that are not useful for the prediction. Current methods induce sparsity via $\ell_1$-regularization, as in the LASSO. In contrast to traditional applications, this thesis assumes that a sparse, time-varying regression vector is estimated from time series data that arrive sequentially over time. The data can thus not be treated in batch form, where constant behavior over a window is assumed, and the performance of current sparse regression methods is therefore limited. A new Weighted Sparse Kalman Filter (Weighted-SKF) is therefore proposed that induces sparsity in the Kalman filter equations. The Kalman filter is able to track time-varying behavior, while the sparsity ensures that interpretable results are obtained. Simulations demonstrate that the Weighted-SKF outperforms current regression methods in identifying the time-varying support and regression vector. Moreover, the time-varying usefulness of social media data is demonstrated: the Weighted-SKF includes social media data in its prediction model only during large declines in the stock market.


Table of Contents

Abstract
Acknowledgements

1 Introduction
1-1 Stock Markets
1-2 Social Media Aided Stock Market Prediction
1-3 Sparse Models
1-4 Approach and Goals of the Thesis
1-5 Outline of the Thesis

2 The LASSO and its Application to the Stock Market
2-1 Introduction
2-2 Mathematical Framework and Notations
2-3 The LASSO and its Properties
2-3-1 Background and Origin of the LASSO
2-3-2 Sparsity via $\ell_1$-Regularization
2-3-3 The Oracle Properties and the Adaptive LASSO
2-3-4 Tuning Parameter $\lambda$
2-4 Application of the LASSO to the Stock Market
2-4-1 Classical LASSO Applications
2-4-2 Stock Market Application of the LASSO
2-4-3 Challenges with the Stock Market Application of the LASSO
2-5 Summary

3 Sparse Time-Varying Regression
3-1 Introduction
3-2 Windowed LASSO
    Working Principle
    Limitations
    Summary
3-3 The Dynamic LASSO
    Working Principle
    Limitations
    Summary
3-4 Kalman Filtered LASSO
    The Kalman Filter Equations
    Working Principle
    Limitations
    Summary
3-5 The Weighted Sparse Kalman Filter
    Inducing Sparsity in the Kalman Filter
    Working Principle
    Limitations
    Summary
3-6 Summary

4 Simulation Results
4-1 Introduction
4-2 Estimation Performance for $P > N > S$
    Dataset
    Discussion Simulation Results
4-3 Estimation Performance for $P > N \leq S$
    Dataset
    Discussion Simulation Results
4-4 Weighted-SKF and Time-Varying Support
    Dataset
    Discussion Simulation Results
4-5 Simulations with Real Social Media Data
    Dataset
    Discussion Simulation Results
4-6 Summary

5 Conclusions and Recommendations
5-1 Conclusions
5-2 Recommendations

A Proofs
A-1 The Soft Thresholding Function
A-2 Calculating $\lambda_{\max}$

Glossary
List of Symbols
List of Acronyms


List of Figures

1-1 (a) Google search volume for DJIA and (b) the actual DJIA from Yahoo!
2-1 Illustration of the minimization problem (2-2) and the definition of the vectors
2-2 From left to right: graphical representation of the $\ell_2$-, $\ell_1$- and $\ell_0$-norm in the $(x_1, x_2)$ plane
2-3 Graphical representation of the soft thresholding function (2-5). Small values of $x$ are set exactly to zero, while for the nonzero coefficients a bias is introduced via $\lambda$
2-4 Graphical representation of the shrinkage effects of (a) the $\ell_0$-norm, (b) the $\ell_2$-norm and (c) the $\ell_1$-norm for the orthonormal case $A^T A = I$. The 45-degree dotted line serves as a reference and represents the values before shrinkage
2-5 Graphical representation of the constraint $\ell_1$-region (left) and $\ell_2$-region (right), with the contour lines of the objective function $\|Ax - y\|_2^2$ in the $(x_1, x_2)$ plane. A feasible solution is obtained where the constraint region is entered. For the LASSO, this is likely to happen at a vertex where either $x_1$ or $x_2$ is zero and hence a sparse solution is obtained
2-6 Graphical representation of the shrinkage of coefficients in $x$; in all situations $\lambda = 4$. Left: the LASSO with the biased nonzero coefficients. Middle: Adaptive LASSO with $\gamma = 0.5$. Right: Adaptive LASSO with $\gamma = 2$. It is seen that the bias for larger coefficients is eliminated. Figures borrowed from (Zou, 2006)
2-7 Illustration of the $A$-matrix of Scenario 1: $N > P > S$
2-8 Illustration of the $A$-matrix of Scenario 2: $P > N > S$
2-9 Illustration of the data vector $a_t$ of Scenario 3: $P > N \leq S$
3-1 Plots of four coefficients and their estimates. The most accurate constant representation of $x_t$ in a window is $\hat{x}_t^{\mathrm{mean}}$. The $\hat{x}_t^{\mathrm{LASSO}}$ is only close to $\hat{x}_t^{\mathrm{mean}}$ at some instances. Moreover, applying a window obstructs accurate tracking of time-varying behavior
3-2 True and estimated coefficient for increasing $\lambda_2$ in D-LASSO. The values $x_t$ are those calculated at $T = 50$ to illustrate the effect of $\lambda_2$

3-3 $x_t$ and its estimates with $\lambda = 0.2$. The red, circled line labeled $\hat{x}_t^D$ gives the estimates of D-LASSO based on data that is available until time step $t$. The black, squared line labeled $\hat{x}_t^D$ at $t = 50$ gives the estimates calculated at $t = T$
3-4 (a) A nonzero coefficient and its KF estimate and (b) a zero coefficient and its KF estimate. In contrast to Windowed LASSO, the mean of the KF estimates over $W$ time steps accurately approximates the true mean
4-1 Generated nonzero coefficients in $x_t$ with constant support
4-2 At each time step, the MSE($x_t$) calculated over 15 runs is shown for $P = 25$, $N = 8$ and $S = 3$. KF-LASSO converges to a lower MSE($x_t$) than LASSO by using dynamic information of $x_t$. Weighted-SKF outperforms all other methods since it also uses dynamic information in support estimation
4-3 MSE($x_t$) of Weighted-SKF for several choices of the tuning parameters. When the parameters are not set exactly, but close to the true value, the Weighted-SKF still performs well
4-4 (a) The average MSE($x_t$) and (b) the percentage of how many times (out of 75 in total) the support is estimated correctly, both as a function of $N$
4-5 Nonzero coefficient and its estimate by (a) KF, (c) SKF and (e) Weighted-SKF. Zero coefficient and its estimates by (b) KF, (d) SKF and (f) Weighted-SKF. The SKF nonzero coefficient is heavily biased, while this is compensated in the Weighted-SKF
4-6 Generated nonzero coefficients in $x_t$ with time-varying support
4-7 (a) IEN and the estimated size of the support by the Weighted-SKF of Algorithms 3 and 4; (b) tracking of a coefficient added to the support
4-8 Additions to and deletions from the support, tracked by the Weighted-SKF. Detection of additions via the IEN is fast. Deletion is somewhat slower since the mean of the estimate needs to be below the threshold for $W$ time steps
4-9 Normalized DJIA closing price and the number of Google searches
4-10 Normalized DJIA closing price and periods where the Weighted-SKF includes Google data
4-11 DJIA closing price and buy and sell moments of the Weighted-SKF with Model

List of Tables

3-1 Performance of the various algorithms
4-1 Speed of detecting a change in the support with the IEN and finding the correct support with Weighted-SKF of Algorithm 4, for 30 runs
4-2 Financial Google search terms
4-3 Top 5 financial search terms with correlation coefficient
4-4 MAPE and DA of the Weighted-SKF and (Mao et al., 2011) for 3 Models
4-5 Returns of the Weighted-SKF and (Mao et al., 2011) for 3 Models


Acknowledgements

This thesis is the result of a year of research. The main motivation for this research is that I greatly enjoy exploring whether methods and knowledge developed in technical research areas are also applicable outside these original areas. As such, I was curious whether knowledge, insights and methods of Systems & Control could be applied to economic problems such as stock market prediction. I believe that by combining knowledge and applications of several research areas, valuable contributions and original solutions can be obtained for the encountered problems. I would like to thank my supervisor, Prof. dr. ir. Michel Verhaegen, for his enthusiasm and support for this thesis proposal and for the discussions during the past year. Moreover, I would like to thank Joep Kooijman, with whom I collaborated for a large part of last year, for his feedback on my report and the fruitful discussions we had. Finally, I would like to thank my family, especially my parents, who have always supported me in every way they can. They are the ones who made it possible for me to get this far and I am deeply grateful to them.

Delft University of Technology
October 26, 2014


Does it mean this, does it mean that, that's all anybody wants to know. I'd say what any decent poet would say if anyone dared ask him to analyze his work: if you see it, then it's there!

Freddie Mercury


Chapter 1

Introduction

Prediction of the stock market has been a research topic for decades. Recently, attempts have been made to improve the accuracy of the predictions by including data from social media like Google and Twitter. Data from social media are regarded as indicators of sentiments that potentially carry useful information in addition to financial data. Current prediction methods, however, do not give results that are easily interpretable. Beforehand it is often not known which data are relevant for the prediction and hence which data should be added to the model. To obtain interpretable results, regression methods that induce sparsity are required: data that are not useful for the prediction are automatically discarded from the model.

The goal of this chapter is to introduce the various aspects of social media aided stock market prediction. Section 1-1 discusses stock markets and the influence of sentiments on investors. Section 1-2 discusses recent studies that use data from social media to improve the predictions, with examples of studies that use Google and Twitter data. The principle of sparse regression methods is introduced in Section 1-3, which discusses how sparsity can improve interpretability and gives two examples of current applications. The goals of this thesis are presented in Section 1-4 and an outline is given in Section 1-5.

1-1 Stock Markets

The market in which stocks of publicly held companies are traded is called the stock market. In exchange for capital, the one who invests in stocks receives part of the ownership of a company. When a company is profitable, the investor makes money by receiving dividends. Moreover, the stock price increases when demand is high and the investor makes money by selling his stocks at a higher price. On the other hand, the investor can lose money when the company is not profitable and when stocks are sold at a lower price.

The stock market is thus a network of sellers and buyers, and the stock price is determined by supply and demand. Behavioral economics argues that decisions made by investors on the stock market are influenced by social and emotional factors. People therefore make irrational decisions and their behavior does not follow economic models (Smailovic et al., 2013). If these emotions could be captured and used for stock market predictions, more reliable predictions might be obtained. For example, during the financial crisis of 2008 nobody seemed able to predict what would happen on the stock market, since the models used for prediction are based on fundamental price movements and not on sentiments. So even though the models are quite involved and the trading strategies carefully chosen, the returns were negative when the stock market was dominated by emotions (Anderluh, 2011). This emphasizes that the performance of prediction models can be improved by capturing sentiments and emotions in the models.

1-2 Social Media Aided Stock Market Prediction

Recently, data from social media has been used to account for the emotions and sentiments that influence investors' behavior. The availability of vast amounts of social media data makes it possible to capture part of these emotions and sentiments. Nowadays, a huge amount of information is shared on sites such as Twitter, Google and Facebook, which makes it possible to use the sentiments of large groups of people for prediction by extracting this information from these networks. One of the first papers to address this phenomenon, which also received a lot of attention from the media, was that of (Bollen et al., 2010). This paper used several public mood indicators derived from Twitter messages (e.g. the Calm index) and it showed that these indicators sometimes predict the Dow Jones Industrial Average (DJIA) three days ahead. Similar examples that use social media to enhance predictions are the work of (Asur and Huberman, 2010) and (Jiang et al., 2013).

In the past five years, social media aided predictions have primarily focused on data from Twitter and Google. First, three illustrative examples of Twitter based predictions are briefly discussed. (Oliveira et al., 2013) performed sentiment analysis on the content of tweets and investigated the posting volume of tweets. It was found that the Twitter posting volume is relevant for modeling the trading volume of the next day. The second study relating to Twitter is that of (Mao et al., 2012). They investigated whether Twitter posting volume is correlated with financial time series at three different levels: the stock market as a whole, the industry sector and individual company stocks. It was found that Twitter posting volume was primarily helpful for prediction at the level of the stock market as a whole. The third publication is that of (Rao and Srivastava, 2012). Twitter data was used to model the movements in oil, gold and forex prices, and it was found that including social media data reduced the prediction error for forecasting the DJIA.

[Figure 1-1: (a) Google search volume for the term DJIA and (b) the actual DJIA closing price from Yahoo!]

Two Google based prediction methods are briefly discussed next. The work of (Beer et al., 2013) proposed a novel investor sentiment indicator based on search volumes on Google. It was shown that this sentiment indicator contributes to short-term market returns. The second publication, that of (Preis et al., 2013), investigated search volumes and their relation to future trends. It was found that the search volumes of certain search terms are "early warning signs", especially during stock market falls and financial crises.

These studies illustrate the potential of using Google and Twitter data for improving stock market predictions. This thesis focuses on data from Google, since this data gives promising results while being more easily available than Twitter data. Figure 1-1a shows the search volume for the search term DJIA and hence depicts the interest in this term over time. Figure 1-1b illustrates the DJIA closing price. A visual examination of Figure 1-1 already shows the resemblance between the social media data and the stock market. It is noted that the improvements in prediction accuracy achieved by the previously mentioned studies are difficult to quantify, since the results of adding social media to the prediction differ per scenario. Although a conclusive answer is yet to be found to the question whether data from Google and Twitter can predict the stock market, the previously mentioned studies showed promising results.

1-3 Sparse Models

Stock markets are complex, dynamic, time-varying systems of supply and demand. The majority of the aforementioned studies employed neural networks or a Support Vector Machine (SVM) for prediction. These methods can be regarded as a black box: many types of information serve as inputs to the algorithm and the prediction is obtained via an unknown, nonlinear mapping. In terms of prediction accuracy these nonlinear models often perform well; however, the currently used models have some drawbacks.

First, inputs to these nonlinear models have to be selected carefully. When additional inputs, such as social media data, are included in the model, the performance may deteriorate when the added inputs contain no useful information. Currently, inputs are selected during a pre-processing step, before they are added to the model. Due to the time-varying character of the stock market, it may well be that inputs that are at first not useful become useful later, as time progresses. However, these inputs cannot be added to the model halfway through the prediction, since they have already been discarded during pre-processing.

Moreover, black box models lack interpretability. Because of the complexity of the system, the understandability of predictions is important. Human traders are more likely to trust a prediction if it comes with valid reasoning. Therefore, the system of stock market predictions should be understandable and interpretable. Interpretability, and thereby credibility, of the forecasts is increased when it is clearly understood which inputs are selected and to what extent the prediction is based on which inputs.

This thesis therefore proposes to use sparse methods that automatically discard data that is not useful. Currently, such methods are primarily used in system identification problems. For example, in image reconstruction with MRI scans, it is known that a subset of the measurements is already sufficient to reconstruct the image. Sparse methods are able to exploit this knowledge and to identify this subset, thereby accelerating the imaging (Lustig et al., 2007). Another illustrative example is the application of sparsity to climate prediction (Chatterjee et al., 2012). This study investigated which variables are most relevant for the prediction of land climate, in order to gain better insight and interpretation. The goal is to obtain the most relevant variables for prediction from a set of ocean climate variables, such as temperature, sea level pressure and wind speed. It was shown that off-coast temperature and precipitation are most relevant for predicting the land climate.

However, current applications of sparse regression methods, such as the aforementioned two examples, assume that multiple measurements of the signal of interest can be obtained. For applications to the stock market, on the other hand, it is argued that it is not possible to take sufficient measurements at one time step. Instead, a sparse signal is estimated from observations that are acquired sequentially over time, and it is assumed that this sparse signal is time-varying. The performance of current sparse regression methods is thus expected to be limited when these are applied to time series for stock market prediction. Therefore, this thesis proposes a new sparse regression method that is suitable for stock market predictions.

1-4 Approach and Goals of the Thesis

Based on the discussion of stock market predictions and sparse regression methods in the previous sections, the goal of this thesis is formulated: to propose a sparse regression method that is able to retrieve a sparse, time-varying regression vector from time series data that is acquired sequentially over time. The first research question therefore is:

1. How is sparsity induced to obtain interpretable results from a regression model?

To understand how interpretability is translated into the mathematical notion of sparsity, it is discussed how sparsity is induced and what effects this has on the regression model. This question is the subject of Chapter 2.

2. What challenges arise when such interpretable regression models are applied to time series? How do current methods meet these challenges?

In contrast to current applications, this thesis assumes that a sparse, time-varying regression vector is estimated from time series data that arrives sequentially over time. Challenges will thus arise when current methods are applied to time series for stock market prediction. Section 2-4 puts these challenges in a mathematical framework and discusses them in detail. Chapter 3, Sections 3-2 to 3-4, discusses the most common and best performing sparse regression methods proposed in current literature, to what extent these methods are applicable to time series, and what causes their limited performance in this new application.

3. Is it possible to propose a sparse, time-varying regression method that performs well with time series data for stock market predictions?

The performance of current methods is limited when these are applied to time series data that is acquired sequentially over time. Therefore, a new sparse regression method is proposed. The proposed method should be able to (i) accurately identify the important data (the nonzero coefficients of the regression vector, called the support) and (ii) accurately estimate the values of the nonzero entries in the coefficient vector. This is the subject of Chapter 3, Section 3-5. Moreover, Chapter 4, Sections 4-2 to 4-4, discusses the results of numerical simulations of the current methods and the proposed method.

4. Can the proposed sparse regression method for time series be employed to improve stock market predictions with social media?

Finally, when the proposed method has been designed, it is investigated whether this method can be employed to predict stock markets with the aid of social media data. This is the subject of Chapter 4, Section 4-5.

1-5 Outline of the Thesis

The outline of the remainder of the thesis is summarized in this section. Chapter 2 discusses how sparsity is induced by $\ell_1$-regularization and how the Least Absolute Shrinkage and Selection Operator (LASSO) can be employed to obtain interpretable results. The mathematical framework and the properties of the LASSO are discussed in that chapter. Moreover, the challenges that arise when the LASSO is applied to social media aided stock market prediction are discussed in detail.

Chapter 3 discusses three sparse regression methods that have been proposed in current literature for time-varying systems. The working principle of these methods is discussed, as well as their limitations when they are applied to stock market predictions. The second part of Chapter 3 introduces the newly proposed Weighted Sparse Kalman Filter (Weighted-SKF) that accurately estimates the time-varying support and the time-varying coefficients in the regression vector for stock market predictions.

Chapter 4 discusses the simulation results of the most promising method in current literature, the Kalman Filtered LASSO (KF-LASSO), and compares its performance to the newly proposed Weighted-SKF for various datasets. Moreover, simulation results on real social media data and stock market data are discussed for the Weighted-SKF.

Conclusions are summarized by answering the research questions of Section 1-4 in Chapter 5. Moreover, this chapter gives some recommendations for future research.

Chapter 2

The LASSO and its Application to the Stock Market

2-1 Introduction

In Chapter 1 it was stated that sparse regression methods are employed in this thesis to obtain interpretable results for stock market predictions. This chapter discusses how sparsity is induced by $\ell_1$-regularization and how the Least Absolute Shrinkage and Selection Operator (LASSO) can be employed to obtain interpretable results. Section 2-2 discusses the mathematical framework and the notations used throughout this report. Then, in Section 2-3, the LASSO is introduced, which induces sparsity via $\ell_1$-regularization in a least squares problem. The properties of the LASSO are discussed and the Adaptive LASSO is introduced as an extension to the LASSO for unbiased estimates. Section 2-4 discusses how the framework of the LASSO can be applied to the stock market in order to obtain interpretable results, and what challenges arise in that setting. A summary of this chapter is given in Section 2-5.

2-2 Mathematical Framework and Notations

Data is collected in the data matrix $A \in \mathbb{R}^{N \times P}$ that contains $N$ observations and $P$ variables, or features. Each column of the $A$-matrix is called a feature vector, $f_p \in \mathbb{R}^N$. Data of the output is collected in $y$ and the regression, or coefficient, vector is $x$. A linear relation between $A$, $y$ and $x$ is described via (2-1), where $\epsilon$ is zero-mean white noise, $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2 I)$:

$$y = Ax + \epsilon \tag{2-1}$$
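To fix ideas, the following Python sketch (synthetic data; the dimensions and names are illustrative, not taken from this thesis) instantiates model (2-1):

    import numpy as np

    rng = np.random.default_rng(0)
    N, P = 100, 8                                # N observations, P features
    A = rng.normal(size=(N, P))                  # columns are the feature vectors f_p
    x_true = np.array([1.5, 0., 0., -2., 0., 0., 0.5, 0.])   # sparse coefficient vector
    eps = 0.1 * rng.normal(size=N)               # zero-mean white noise
    y = A @ x_true + eps                         # Eq. (2-1)

Only three of the eight coefficients are nonzero here, which is exactly the kind of structure the sparse methods of this chapter are designed to recover.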

The goal is to estimate $x$ and thus to obtain the relation between the data in $A$ and the phenomenon in $y$. Hence, the following Ordinary Least Squares (OLS) minimization problem is of interest:

$$\min_x \; \|Ax - y\|_2^2 \tag{2-2}$$

Figure 2-1 illustrates problem (2-2) with all defined vectors.

[Figure 2-1: Illustration of the minimization problem (2-2) and the definition of the vectors.]

Without loss of generality, it is assumed throughout this report that the output $y$ is zero-mean and that the feature vectors $f_p$ are standardized to have zero mean and unit variance (Zou and Hastie, 2005):

$$\sum_{i=1}^{N} y_i = 0, \qquad \sum_{i=1}^{N} f_{i,p} = 0 \qquad \text{and} \qquad \sum_{i=1}^{N} f_{i,p}^2 = 1 \qquad \text{for } p = 1, 2, \ldots, P$$

By making $f_p$ and $y$ zero-mean, no intercept term is needed in objective function (2-2). Moreover, standardizing the feature vectors assures that all features are approximately on the same scale and that useful solutions for $x$ are obtained. This is particularly useful when regularization is applied in Section 2-3.

In Chapter 1 it was already discussed that beforehand it is unknown which features contribute to the prediction of $y$. Hence, a practical solution would be to add all features to the $A$-matrix and have the useless features automatically discarded from the regression. For problem (2-2) this means that certain coefficients in $x$ are set to zero, in order to exclude irrelevant features from the regression. The LASSO is such a method and it is the subject of Section 2-3.

2-3 The LASSO and its Properties

This section discusses the LASSO, which uses $\ell_1$-regularization, and how this encourages sparsity. Section 2-3-1 discusses the origin of the LASSO and what distinguishes $\ell_1$-regularization from other regularization methods. In Section 2-3-2 it is discussed and illustrated how $\ell_1$-regularization encourages sparse solutions. Section 2-3-3 discusses two desirable properties that the LASSO should satisfy and introduces the Adaptive LASSO, which satisfies both properties. Finally, Section 2-3-4 discusses tuning of the LASSO parameter $\lambda$.

2-3-1 Background and Origin of the LASSO

This section discusses the origin of the LASSO and what distinguishes $\ell_1$-regularization from other regularization methods such as $\ell_0$- and $\ell_2$-regularization. The $\ell_p$-norm of $x$ is defined as

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$

With this $\ell_p$-norm, an extended version of the OLS problem in (2-2) is introduced as in (2-3):

$$\min_x \; \|Ax - y\|_2^2 + \lambda \|x\|_p \tag{2-3}$$

The tuning parameter $\lambda$ controls the trade-off between the two terms and thus the amount of regularization applied by the second term. The three most commonly used norms are the $\ell_0$-, $\ell_1$- and $\ell_2$-norm. A graphical representation for the two-dimensional case $P = 2$ is shown in Figure 2-2.

[Figure 2-2: From left to right: graphical representation of the $\ell_2$-, $\ell_1$- and $\ell_0$-norm in the $(x_1, x_2)$ plane.]

The effects of these three norms in problem (2-3) are briefly discussed below.

The $\ell_0$-norm. Regularization with the $\ell_0$-norm, $\|x\|_0$, is also known as subset selection. The $\ell_0$-norm counts the nonzero entries of $x$ and it leads to interpretable models, since some coefficients in $x$ become exactly zero. However, a drawback of subset selection is that coefficients are retained in the model whenever their value becomes nonzero. A slight change in the dataset can therefore result in a completely different set of coefficients remaining nonzero, which limits the prediction accuracy (Tibshirani, 1996).

The $\ell_2$-norm. Regularization with the $\ell_2$-norm, $\|x\|_2$, is also known as ridge regression. In ridge regression, all variables are continuously shrunk and hence it is more stable than subset selection. A drawback of ridge regression, however, is that variables are not set exactly to 0, so that the resulting model is less interpretable (Tibshirani, 1996).

The $\ell_1$-norm. To incorporate the desired behavior of both $\ell_0$- and $\ell_2$-regularization, (Tibshirani, 1996) proposed to apply an $\ell_1$-norm, $\|x\|_1$, in (2-3). This method is known as the LASSO. The LASSO shrinks some variables towards 0 and sets others exactly to 0, and thus obtains sparse solutions. This ensures that interpretable models are obtained, as in $\ell_0$-regularization, while the results are more stable, as in $\ell_2$-regularization.

Next, Section 2-3-2 discusses in more detail how application of the $\ell_1$-norm ensures that some coefficients are shrunk exactly to zero and hence why the LASSO is able to encourage sparsity.
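Before moving on, the contrast between the $\ell_1$ and $\ell_2$ penalties of (2-3) is easy to reproduce numerically. The following is a minimal sketch assuming scikit-learn is available; its alpha parameter plays the role of $\lambda$ up to a scaling:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(1)
    N, P = 50, 10
    A = rng.normal(size=(N, P))
    x_true = np.zeros(P)
    x_true[[0, 3]] = [2.0, -1.5]                 # sparse truth, S = 2
    y = A @ x_true + 0.1 * rng.normal(size=N)

    # l1-regularization (LASSO): some coefficients become exactly zero
    lasso = Lasso(alpha=0.1, fit_intercept=False).fit(A, y)
    # l2-regularization (ridge): all coefficients shrink, none exactly zero
    ridge = Ridge(alpha=0.1, fit_intercept=False).fit(A, y)

    print("LASSO zero coefficients:", int(np.sum(lasso.coef_ == 0)))   # typically P - S
    print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))   # typically 0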

2-3-2 Sparsity via $\ell_1$-Regularization

This section discusses how the $\ell_1$-norm in the LASSO of (2-4) encourages sparsity. First, this is illustrated for the orthonormal case, $A^T A = I$, and then for the general case without the assumption of orthonormality.

Sparsity via $\ell_1$-Regularization for the Orthonormal Case

$$\min_x \; \|Ax - y\|_2^2 + \lambda \|x\|_1 \tag{2-4}$$

For the orthonormal case, $A^T A = I$, the solutions of (2-4) can be computed as in (2-5):

$$\hat{x}_i = \operatorname{sign}\!\left(\hat{x}_i^{\mathrm{OLS}}\right) \left( \left|\hat{x}_i^{\mathrm{OLS}}\right| - \frac{\lambda}{2} \right)_+ \tag{2-5}$$

In (2-5), $\operatorname{sign}(\cdot)$ denotes the sign and $(\cdot)_+$ is $\max(\cdot, 0)$, so only positive values are retained. The proof is given in Appendix A-1. Equation (2-5) is called the soft thresholding function (Fan and Li, 2001). A graphical representation of the soft thresholding function (2-5) is shown in Figure 2-3. The horizontal axis shows the coefficient values before shrinkage and the vertical axis the values after shrinkage. Without shrinkage, the diagonal line is obtained, while for the soft thresholding function it can be seen that, when $\lambda$ is sufficiently large, the small coefficients are set exactly to zero. This comes at the cost of a biased solution for the nonzero coefficients.

[Figure 2-3: Graphical representation of the soft thresholding function (2-5). Small values of $x$ are set exactly to zero, while for the nonzero coefficients a bias is introduced via $\lambda$.]
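Equation (2-5) is straightforward to implement. A minimal Python sketch (illustrative names; valid as the LASSO solution only under the orthonormal assumption $A^T A = I$):

    import numpy as np

    def soft_threshold(x_ols, lam):
        # Elementwise soft thresholding, Eq. (2-5): sign(x) * max(|x| - lam/2, 0)
        return np.sign(x_ols) * np.maximum(np.abs(x_ols) - lam / 2.0, 0.0)

    x_ols = np.array([3.0, -0.4, 0.9, -2.5])
    print(soft_threshold(x_ols, lam=2.0))        # [ 2.  -0.   0.  -1.5]

The small entries -0.4 and 0.9 are set exactly to zero, while the surviving entries are shrunk by $\lambda/2$, illustrating both the selection and the bias discussed above.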

For comparison purposes, the soft thresholding function of $\ell_1$-regularization is compared with the shrinkage effects of the $\ell_0$- and $\ell_2$-norm in Figure 2-4, for the orthonormal case. This figure is borrowed from (Tibshirani, 1996). It shows that the $\ell_0$-norm indeed sets coefficients exactly to zero, but that this is a discrete process. The $\ell_2$-norm is shown to continuously shrink all coefficients, but none exactly to zero.

[Figure 2-4: Graphical representation of the shrinkage effects of (a) the $\ell_0$-norm, (b) the $\ell_2$-norm and (c) the $\ell_1$-norm for the orthonormal case $A^T A = I$. The 45-degree dotted line serves as a reference and represents the values before shrinkage.]

Sparsity via $\ell_1$-Regularization for the General Case

(Tibshirani, 1996) showed that the various shrinkage effects of the $\ell_p$-norms discussed in the previous sections also hold for the general, non-orthonormal case. This is illustrated in Figure 2-5, which is borrowed from (Tibshirani, 1996). The figure shows the contour lines of the objective function $\|Ax - y\|_2^2$, and the black area is the constraint region of the $\ell_1$- and $\ell_2$-norm, respectively. The OLS objective $\|Ax - y\|_2^2$ can be rewritten, up to a constant, as $(x - \hat{x}^{\mathrm{OLS}})^T A^T A (x - \hat{x}^{\mathrm{OLS}})$, where $\hat{x}^{\mathrm{OLS}}$ is the OLS estimate. The minimum of this function is obtained when $x = \hat{x}^{\mathrm{OLS}}$ and hence the contour lines are centered at the OLS estimate. However, due to the regularization term, this solution is infeasible and the solution to (2-4) is obtained where the contour lines hit the constraint region for the first time. Due to the shape of the $\ell_1$ constraint region, (Tibshirani, 1996) stated that the contour lines are likely to hit the constraint region at a vertex, where either $x_1$ or $x_2$ is zero. This yields a sparse solution. The constraint region of the $\ell_2$-norm, on the other hand, has no vertices, and solutions at zero will rarely occur.

[Figure 2-5: Graphical representation of the constraint $\ell_1$-region (left) and $\ell_2$-region (right), with the contour lines of the objective function $\|Ax - y\|_2^2$ in the $(x_1, x_2)$ plane. A feasible solution is obtained where the constraint region is entered. For the LASSO, this is likely to happen at a vertex where either $x_1$ or $x_2$ is zero, and hence a sparse solution is obtained.]

2-3-3 The Oracle Properties and the Adaptive LASSO

This section discusses two important properties that the LASSO should satisfy, and an extension of the LASSO, the Adaptive LASSO, is introduced. The Adaptive LASSO prevents the bias that occurs in the LASSO for nonzero regression coefficients.

The Oracle Properties

The LASSO of (2-4) continuously shrinks all coefficients and sets small coefficients exactly to zero. It was already noted that the nonzero estimates of the LASSO are biased. This means that the LASSO is unable to accurately retrieve the values of the true coefficient vector $x$. More formally stated, the LASSO does not satisfy the desirable Oracle Properties. When an algorithm exhibits the Oracle Properties, it behaves as if it knew the true subset of nonzero coefficients in advance. In order to discuss the two Oracle Properties, the definition of the support is introduced.

Definition 1 (Support). The support, $\mathcal{S}$, consists of the nonzero coefficients in $x$:

$$\mathcal{S} \triangleq \left\{ i \in \{1, \ldots, P\} : x_i \neq 0 \right\}$$

Moreover, $x^{\mathcal{S}}$ denotes the subvector of $x$ belonging to $\mathcal{S}$, i.e. consisting of only the nonzero entries of $x$. Furthermore, the size of the support is defined as $S \triangleq |\mathcal{S}|$.

Without loss of generality it is assumed that the first $q$ coefficients of $x$ are nonzero and hence form $x^{\mathcal{S}}$, and that coefficients $q+1, \ldots, P$ of $x$ are zero and hence form $x^{\mathcal{S}^c}$. Here, $(\cdot)^c$ denotes the complement of a set. Then the first $q$ columns of $A$ form $A_{\mathcal{S}}$ and columns $q+1, \ldots, P$ form $A_{\mathcal{S}^c}$. If now $\Sigma_{11} = \frac{1}{N} A_{\mathcal{S}}^T A_{\mathcal{S}}$ and $\Sigma_{21} = \frac{1}{N} A_{\mathcal{S}^c}^T A_{\mathcal{S}}$, the two Oracle Properties can be formulated as follows (Zou, 2006):

1. $\lim_{t \to \infty} \mathrm{Prob}[\hat{\mathcal{S}}_t = \mathcal{S}] = 1$
2. $\sqrt{t}\,(\hat{x}_t^{\mathcal{S}} - x^{\mathcal{S}}) \to \mathcal{N}(0, \sigma_\epsilon^2 \Sigma_{11}^{-1})$

Property 1 means that the support is consistently estimated. Property 2 means that the nonzero values are estimated consistently. (Zhao and Yu, 2006) showed that whether Property 1 holds can be checked by testing whether the Irrepresentable Condition (IC) holds, since it is an if and (almost) only if condition. The IC is shown in (2-6), where the inequality holds element-wise, it is assumed that $\Sigma_{11}$ is invertible, and $\eta$ is any constant larger than zero:

$$\left| \Sigma_{21} \Sigma_{11}^{-1} \operatorname{sign}(x^{\mathcal{S}}) \right| \leq \mathbf{1} - \eta \tag{2-6}$$
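The IC of (2-6) can be checked numerically for a given design matrix and candidate support. A minimal sketch (the function and variable names are illustrative, not from this thesis):

    import numpy as np

    def irrepresentable_condition(A, support, sign_xs, eta=0.05):
        # Check (2-6): |Sigma21 @ inv(Sigma11) @ sign(x_S)| <= 1 - eta elementwise
        N, P = A.shape
        mask = np.zeros(P, dtype=bool)
        mask[support] = True
        A_S, A_Sc = A[:, mask], A[:, ~mask]
        Sigma11 = A_S.T @ A_S / N
        Sigma21 = A_Sc.T @ A_S / N
        lhs = np.abs(Sigma21 @ np.linalg.solve(Sigma11, sign_xs))
        return bool(np.all(lhs <= 1.0 - eta))

    rng = np.random.default_rng(2)
    A = rng.normal(size=(200, 6))                # near-orthogonal columns: IC tends to hold
    print(irrepresentable_condition(A, support=[0, 1], sign_xs=np.array([1.0, -1.0])))

Strong correlation between the columns in $A_{\mathcal{S}}$ and those in $A_{\mathcal{S}^c}$ makes the left-hand side large, which is exactly the situation in which support recovery by the LASSO fails.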

The LASSO is able to satisfy Oracle Property 1; however, since it gives biased estimates of the nonzero coefficients, Oracle Property 2 cannot be satisfied. In order to satisfy both Oracle Properties, the Adaptive LASSO is discussed next.

The Adaptive LASSO

A straightforward method to get unbiased estimates of $x$ is to apply a two-step method consisting of a LASSO and an OLS. The LASSO is applied for estimation of the support. The OLS is then run over the reduced set consisting of only $x^{\mathcal{S}}$. The OLS gives unbiased estimates and hence both Oracle Properties are satisfied. However, (Zhao and Yu, 2006) concluded that when the IC of (2-6) fails, the amount of shrinkage applied to the nonzero coefficients is too large. There is then no guarantee that the correct support is obtained, and hence the two-step method fails.

(Zou, 2006) therefore advised to reduce the amount of shrinkage on the nonzero coefficients in the LASSO directly and proposed the Adaptive LASSO: a method that gives unbiased estimates, while retaining the convex computational advantage of the LASSO. The Adaptive LASSO is given in (2-7):

$$\min_x \; \|Ax - y\|_2^2 + \lambda \sum_{i=1}^{P} w_i |x_i| \tag{2-7}$$

The coefficients $w_i$ are stacked in the weight vector $w$. (Zou, 2006) proposes to approximate this vector as $w = 1 / |\hat{x}^{\mathrm{OLS}}|^{\gamma}$, with $\gamma$ an extra tuning parameter. Their study showed that the estimates of the Adaptive LASSO satisfy both Oracle Properties. A graphical representation of the LASSO versus the Adaptive LASSO is shown in Figure 2-6 for the orthonormal case, $A^T A = I$. It can be seen in Figure 2-6 that the Adaptive LASSO has no bias for the larger coefficients and it thus satisfies not only Property 1, but also Property 2.

[Figure 2-6: Graphical representation of the shrinkage of coefficients in $x$; in all situations $\lambda = 4$. Left: the LASSO with the biased nonzero coefficients. Middle: Adaptive LASSO with $\gamma = 0.5$. Right: Adaptive LASSO with $\gamma = 2$. It is seen that the bias for larger coefficients is eliminated. Figures borrowed from (Zou, 2006).]

It is noted that when (2-6) is violated in practice, the results of the LASSO are not necessarily useless. It means that no formal proof can be given that the support is always correctly estimated. In practice, however, these results can still be useful, and therefore the evaluation of support estimation will be performed based on numerical simulations in Chapter 4.
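Problem (2-7) reduces to a standard LASSO after rescaling the columns of $A$ by $1/w_i$, which gives a simple way to sketch the Adaptive LASSO (again a hypothetical illustration assuming scikit-learn; alpha corresponds to $\lambda$ up to a scaling):

    import numpy as np
    from sklearn.linear_model import Lasso

    def adaptive_lasso(A, y, lam, gamma=1.0):
        # Sketch of (2-7) with weights w_i = 1/|x_ols_i|^gamma.
        # Substituting z_i = w_i * x_i turns (2-7) into a standard LASSO in z.
        x_ols, *_ = np.linalg.lstsq(A, y, rcond=None)
        w = 1.0 / (np.abs(x_ols) ** gamma + 1e-12)   # small constant avoids division by zero
        model = Lasso(alpha=lam, fit_intercept=False).fit(A / w, y)
        return model.coef_ / w                        # map z back to x

    rng = np.random.default_rng(3)
    A = rng.normal(size=(100, 10))
    x_true = np.zeros(10)
    x_true[[1, 4]] = [3.0, -2.0]
    y = A @ x_true + 0.1 * rng.normal(size=100)
    print(np.round(adaptive_lasso(A, y, lam=0.05), 2))

Because large OLS estimates receive small weights, the corresponding coefficients are barely shrunk, which removes the bias shown in Figure 2-6.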

2-3-4 Tuning Parameter $\lambda$

It has already been noted that $\lambda$ in the LASSO problem of (2-4) controls the amount of shrinkage applied to $x$. When $\lambda = 0$, the LASSO reduces to the OLS: on average the correct solution for $x$ is found, but the solution differs between datasets. In other words, the OLS estimates are unbiased with high variance. When $\lambda$ is increased, biased estimates are obtained, but the variance is decreased. In the extreme case, when $\lambda \to \infty$, $x$ will contain only zeros.

The goal is to find the optimal parameter value. In this context, optimal can be defined in two ways. The first is that $\lambda$ is optimal when the prediction error is minimized. The second is that $\lambda$ is regarded as optimal when interpretable results are obtained (e.g. discarding extra coefficients to increase interpretability at the cost of an increased prediction error). The three most common approaches to select $\lambda$ are discussed below.

The first choice of $\lambda$ was proposed by (Donoho et al., 1993). They proved that, theoretically, the smallest error $E[\|x - \hat{x}\|_2^2]$ is obtained for $\lambda = \sigma_\epsilon \sqrt{2 \log(N)}$. This is consistent with the later findings of (Chen et al., 1998), who proposed to use this $\lambda$ when the variance of the error is known. In practice, however, this value of $\lambda$ often does not lead to minimal error.

Therefore, the second method that is often employed to find $\lambda$ is cross-validation. Among others, (Zou and Hastie, 2005) and (Angelosante et al., 2009) used cross-validation to find $\lambda$. The main idea of cross-validation is that the dataset is split into $K$ parts and that the algorithm is trained over all parts except the $k$-th part. The training is then validated on this $k$-th part. This procedure is iterated over $k = 1, \ldots, K$ for many different values of $\lambda$. It is noted that this may be a computationally expensive procedure and that it may yield unstable estimates, i.e. the optimal value found for $\lambda$ changes suddenly when the dataset changes slightly (Hirose et al., 2013). Furthermore, when using cross-validation it is implicitly assumed that the goal is to minimize the prediction error, since cross-validation aims at minimizing this error. When the emphasis is more on retrieving an interpretable model, cross-validation may not be the most appropriate method to apply.

The third method that can be employed to choose $\lambda$ was used by (Farahmand et al., 2011a), among others. They set $\lambda = \rho \lambda_{\max}$, with $\rho \in [0, 1]$ and where $\lambda_{\max}$ is the smallest value of $\lambda$ for which (2-4) gives $x = 0$. By changing $\rho$, the sparsity of the solution is controlled. Furthermore, $\lambda_{\max}$ is given by (2-8), where $\|\cdot\|_\infty$ is the infinity norm, i.e. the maximum absolute entry of a vector. A proof is given in Appendix A-2.

$$\lambda_{\max} = 2 \|A^T y\|_\infty \tag{2-8}$$

The third tuning method is preferred in this thesis. A set of various values of $\rho$ is chosen to perform simulations, and the value that yields the smallest prediction error is then selected. This is computationally less expensive than cross-validation. Moreover, in the situations considered in this report, an extensive cross-validation does not guarantee better tuning of $\lambda$, since the performance of the LASSO is inherently limited regardless of the tuning of $\lambda$. This originates from the challenges that arise when the LASSO is applied to the stock market, which is discussed in Section 2-4.
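Before turning to that application, a minimal sketch of this third tuning method (assuming scikit-learn, whose Lasso scales the penalty by 1/(2N), hence the conversion below) computes $\lambda_{\max}$ from (2-8) and sweeps $\rho$:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(4)
    A = rng.normal(size=(80, 12))
    x_true = np.zeros(12)
    x_true[[2, 7]] = [1.0, -1.0]
    y = A @ x_true + 0.1 * rng.normal(size=80)

    lam_max = 2 * np.max(np.abs(A.T @ y))        # Eq. (2-8)
    for rho in [0.001, 0.01, 0.1, 0.5, 1.0]:
        alpha = rho * lam_max / (2 * A.shape[0]) # match sklearn's 1/(2N) scaling
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(A, y).coef_
        print(f"rho = {rho:5.3f}   nonzero coefficients: {int(np.sum(coef != 0))}")

As $\rho$ increases toward 1 the solution becomes sparser, reaching $x = 0$ at $\rho = 1$, so $\rho$ directly controls the sparsity of the solution.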

2-4 Application of the LASSO to the Stock Market

The LASSO introduced in Section 2-3 will be employed to obtain interpretable results for stock market predictions. However, current applications of the LASSO differ from applications to time series in the stock market. The framework of current LASSO applications is discussed in Section 2-4-1. Section 2-4-2 discusses the framework of LASSO applications in the stock market and how this framework differs from current literature. The challenges that arise with this new framework are discussed in Section 2-4-3.

2-4-1 Classical LASSO Applications

In current literature, two scenarios can be distinguished in which a LASSO is commonly applied. These two scenarios are discussed in this section.

Scenario 1: The LASSO for $N > P > S$

The first scenario considered is that of $N > P > S$. Hence, more measurements ($N$) are available than the number of variables, or features ($P$). The $A$-matrix is thus tall, as shown in Figure 2-7. This scenario is often of interest in current LASSO literature. A well-known example is that of the prostate cancer data of (Stamey et al., 1989). This dataset contains measurements of the prostate-specific antigen (PSA) on $N = 97$ men with prostate cancer, and $P = 8$ features are collected, for example the volume of the cancer, the weight of the prostate and the age of the men. The goal was to find a linear relation between the PSA and a subset of the 8 features. It was found that the cancer volume and the prostate weight are sufficient to predict the level of PSA (Tibshirani, 1996).

[Figure 2-7: Illustration of the $A$-matrix of Scenario 1: $N > P > S$.]

Scenario 2: The LASSO for $P > N > S$

The second scenario to which the LASSO can be applied is $P > N > S$. In this situation, more features than measurements are available. The data matrix $A$ is thus fat, as illustrated in Figure 2-8, and an underdetermined problem needs to be solved. However, since $S$ features are sufficient to predict $y$, the problem can be reduced to an overdetermined $N \times S$ problem by applying the LASSO. One such example is given in the work of (Zou and Hastie, 2005). They considered the problem of microarray classification and gene selection for datasets with thousands of genes, $P > 1000$, and fewer than a hundred samples, $N < 100$. Also for this ($P > N$) situation the LASSO is able to discard irrelevant features.

[Figure 2-8: Illustration of the $A$-matrix of Scenario 2: $P > N > S$.]

Time-Varying Behavior for Scenarios 1 and 2

Scenarios 1 and 2 are the classical scenarios that are focused on in current literature. Model (2-9) describes the time-varying behavior of $x_t$ for Scenarios 1 and 2:

$$x^{\mathcal{S}}_{t+1} = C_t x^{\mathcal{S}}_t + \zeta_t, \qquad \zeta_t \sim \mathcal{N}(0, \sigma_{\mathrm{sys}}^2 I)$$
$$y_t = A_t x_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_\epsilon^2 I) \tag{2-9}$$

The LASSO problem that relates to (2-9) is given in (2-10):

$$\min_{x_t} \; \|A_t x_t - y_t\|_2^2 + \lambda \|x_t\|_1 \tag{2-10}$$

Here $A_t \in \mathbb{R}^{N \times P}$, $y_t \in \mathbb{R}^N$, $C_t \in \mathbb{R}^{S \times S}$ and $x_t^{\mathcal{S}} \in \mathbb{R}^S$. Hence, the dynamic model (2-9) is only defined for the nonzero coefficients. In practice, it may occur that zero coefficients are added to the support and hence become nonzero. It is assumed that these additions to the support do not originate from the dynamics in (2-9).

The LASSO of (2-10) shows that these scenarios assume that multiple measurements are available at each time step, so that the matrix $A_t$ is obtained with $N$ rows (measurements). Data is thus treated in batch form, since it is assumed that the system is constant over the $N$ time steps. For applications to the stock market, however, it is argued that a sparse signal is estimated from data that is acquired sequentially over time. Treating data in batch form is then inappropriate. This is discussed next.
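To make the batch assumption explicit, the following sketch (synthetic data; the parameter values are illustrative) solves (2-10) at each time step from a fresh batch of $N$ measurements, while the nonzero coefficients drift slowly in the spirit of (2-9):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(5)
    P, N, T = 20, 8, 5                       # Scenario 2: P > N > S
    support = [0, 5, 9]                      # S = 3
    x_t = np.zeros(P)
    x_t[support] = [1.0, -0.5, 2.0]

    for t in range(T):
        x_t[support] += 0.01 * rng.normal(size=3)        # slow drift, cf. (2-9)
        A_t = rng.normal(size=(N, P))                    # batch of N measurements
        y_t = A_t @ x_t + 0.05 * rng.normal(size=N)
        x_hat = Lasso(alpha=0.05, fit_intercept=False).fit(A_t, y_t).coef_
        print(f"t = {t}   estimated support: {np.flatnonzero(x_hat)}")

The batch of $N$ rows per step is exactly what is unavailable in the stock market setting discussed next, where only a single row arrives per time step.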

2-4-2 Stock Market Application of the LASSO

The two scenarios of Section 2-4-1 are not suitable for stock market predictions and time series. Therefore, this thesis introduces a third scenario that relates to stock market predictions. So far, this scenario has received little attention in current literature.

Scenario 3: The LASSO for $P > N \leq S$

The two scenarios encountered in LASSO applications so far assume that sufficient measurements, $N$, are available to solve the LASSO problem. In the first scenario this is obvious, since $N > P$, and in the second scenario the problem is solvable by discarding features and hence transforming the problem from $N < P$ to $N > S$. For applications to the stock market and time series, however, it is argued that $N = 1$ and the linear Autoregressive model with exogenous inputs (ARX) of (2-11) is obtained. The stock price in $y_t$ can thus only be formed from the features that are available at that time (previous stock prices, financial indicators or social media data):

$$y_t = \underbrace{\begin{bmatrix} y_{t-1} & y_{t-2} & \cdots & y_{t-d} & f_{t,1} & f_{t,2} & f_{t,3} & \cdots & f_{t,P} \end{bmatrix}}_{a_t} x_t + \epsilon_t \tag{2-11}$$

This amounts to the data vector $a_t$, as illustrated in Figure 2-9, instead of a data matrix $A_t$.

[Figure 2-9: Illustration of the data vector $a_t$ of Scenario 3: $P > N \leq S$.]

For this scenario, the dynamic model of (2-12) is obtained, where $a_t \in \mathbb{R}^P$, $y_t$ is a scalar, $C_t \in \mathbb{R}^{S \times S}$ and $x_t^{\mathcal{S}} \in \mathbb{R}^S$:

$$x^{\mathcal{S}}_{t+1} = C_t x^{\mathcal{S}}_t + \zeta_t, \qquad \zeta_t \sim \mathcal{N}(0, \sigma_{\mathrm{sys}}^2 I) \tag{2-12a}$$
$$y_t = a_t x_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_\epsilon^2) \tag{2-12b}$$

To represent the stock market, this thesis assumes model (2-12) with slowly time-varying nonzero coefficients and a slowly time-varying support set. The (non-classical) LASSO problem that relates to (2-12) is given in (2-13):

$$\min_{x_t} \; (a_t x_t - y_t)^2 + \lambda \|x_t\|_1 \tag{2-13}$$

Problem (2-13) is the problem of interest in this thesis. The challenges that arise in solving problem (2-13) are discussed in Section 2-4-3.
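The sequential arrival of data in this scenario can be sketched as follows (a hypothetical illustration; the names and dimensions are chosen for clarity, with the lagged outputs counted separately from the exogenous features):

    import numpy as np

    def build_a_t(y_hist, features_t, d):
        # Row vector a_t of (2-11): d most recent outputs, newest first, then features
        lags = np.asarray(y_hist)[-1:-d-1:-1]    # y_{t-1}, y_{t-2}, ..., y_{t-d}
        return np.concatenate([lags, features_t])

    rng = np.random.default_rng(6)
    d, P_exo, T = 3, 4, 10
    y_hist = list(rng.normal(size=d))            # initial output history
    x_t = rng.normal(size=d + P_exo)             # regression vector (kept fixed here)

    for t in range(T):
        a_t = build_a_t(y_hist, rng.normal(size=P_exo), d)
        y_t = float(a_t @ x_t) + 0.05 * rng.normal()   # one scalar measurement: N = 1
        y_hist.append(y_t)

At every step exactly one new equation $y_t = a_t x_t + \epsilon_t$ becomes available, which is why the batch LASSO of (2-10) cannot be applied directly in this scenario.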

2-4-3 Challenges with the Stock Market Application of the LASSO

This section discusses two challenges that arise when the LASSO is applied to the stock market, as in Scenario 3 of Section 2-4-2.

The first challenge is that only $N = 1$ measurement of each feature is available at every time step. Since it is assumed that more than one feature is of interest in stock market predictions, $S > 1$, the new scenario of $P > N \leq S$ is obtained. It is known that when $P > N$, the LASSO selects at most $N$ variables (Zou and Hastie, 2005). This limits the performance of the LASSO, since at least $S$ variables should be selected.

The second challenge is that the relevance of the various features is assumed to vary over time. Social media data, for example, may be more useful during declines in the stock market, when sentiments play an important role. In contrast to most applications in Scenarios 1 and 2, this means that the coefficient vector $x_t$ is also time-varying. The LASSO in (2-13), however, does not take the dynamics of $x_t$ into account, and its ability to track this time-varying behavior is thus limited.

Therefore, Chapter 3 discusses extensions of the LASSO that have been proposed for time-varying problems. The performance and limitations of the most promising methods presented in current literature are discussed. Moreover, a new sparse regression method is introduced for $N = 1$.

2-5 Summary

Section 2-3 introduced the Least Absolute Shrinkage and Selection Operator (LASSO). The LASSO extends the Least Squares (LS) problem with an $\ell_1$-norm over the coefficient vector $x$. This ensures that the LS problem is solved while sparsity in $x$ is induced. The $\ell_1$-norm shrinks all coefficients and sets some exactly to zero. This results in interpretable models, as in subset selection ($\ell_0$), while the prediction accuracy is close to that of ridge regression ($\ell_2$). Moreover, the LASSO should have two properties: (i) it should correctly identify the nonzero coefficients that form the support, $\mathcal{S}_t$, and (ii) the values of the coefficients in the support should be accurately estimated. The LASSO gives biased nonzero estimates and the second property is therefore not satisfied. The Adaptive LASSO is proposed to compensate for this.

Furthermore, Section 2-4 discussed the challenges that arise when the LASSO is applied to stock market prediction. In current literature it is assumed that sufficient measurements, $N$, are available at each time step to solve the LASSO. However, it is argued that time series used for stock market prediction contain $N = 1$ measurement per time step, and the LASSO is therefore not directly applicable to stock markets. Moreover, the coefficient vector $x_t$ is time-varying and the LASSO cannot accurately track this time-varying behavior. Therefore, Chapter 3 will discuss extensions of the LASSO that have been proposed for time-varying problems, and a new sparse regression method will be introduced for $N = 1$.


More information

Big Data: a new era for Statistics

Big Data: a new era for Statistics Big Data: a new era for Statistics Richard J. Samworth Abstract Richard Samworth (1996) is a Professor of Statistics in the University s Statistical Laboratory, and has been a Fellow of St John s since

More information

A New Interpretation of Information Rate

A New Interpretation of Information Rate A New Interpretation of Information Rate reproduced with permission of AT&T By J. L. Kelly, jr. (Manuscript received March 2, 956) If the input symbols to a communication channel represent the outcomes

More information

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

More information

A New Quantitative Behavioral Model for Financial Prediction

A New Quantitative Behavioral Model for Financial Prediction 2011 3rd International Conference on Information and Financial Engineering IPEDR vol.12 (2011) (2011) IACSIT Press, Singapore A New Quantitative Behavioral Model for Financial Prediction Thimmaraya Ramesh

More information

Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

More information

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

More information

Degrees of Freedom and Model Search

Degrees of Freedom and Model Search Degrees of Freedom and Model Search Ryan J. Tibshirani Abstract Degrees of freedom is a fundamental concept in statistical modeling, as it provides a quantitative description of the amount of fitting performed

More information

Lasso on Categorical Data

Lasso on Categorical Data Lasso on Categorical Data Yunjin Choi, Rina Park, Michael Seo December 14, 2012 1 Introduction In social science studies, the variables of interest are often categorical, such as race, gender, and nationality.

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

A semi-supervised Spam mail detector

A semi-supervised Spam mail detector A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

8. Linear least-squares

8. Linear least-squares 8. Linear least-squares EE13 (Fall 211-12) definition examples and applications solution of a least-squares problem, normal equations 8-1 Definition overdetermined linear equations if b range(a), cannot

More information

Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS

Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS Ridge Regression Patrick Breheny September 1 Patrick Breheny BST 764: Applied Statistical Modeling 1/22 Ridge regression: Definition Definition and solution Properties As mentioned in the previous lecture,

More information

24. The Branch and Bound Method

24. The Branch and Bound Method 24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no

More information

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

More information

Using simulation to calculate the NPV of a project

Using simulation to calculate the NPV of a project Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Master s Thesis. A Study on Active Queue Management Mechanisms for. Internet Routers: Design, Performance Analysis, and.

Master s Thesis. A Study on Active Queue Management Mechanisms for. Internet Routers: Design, Performance Analysis, and. Master s Thesis Title A Study on Active Queue Management Mechanisms for Internet Routers: Design, Performance Analysis, and Parameter Tuning Supervisor Prof. Masayuki Murata Author Tomoya Eguchi February

More information

Data analysis in supersaturated designs

Data analysis in supersaturated designs Statistics & Probability Letters 59 (2002) 35 44 Data analysis in supersaturated designs Runze Li a;b;, Dennis K.J. Lin a;b a Department of Statistics, The Pennsylvania State University, University Park,

More information

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

In this chapter, you will learn improvement curve concepts and their application to cost and price analysis.

In this chapter, you will learn improvement curve concepts and their application to cost and price analysis. 7.0 - Chapter Introduction In this chapter, you will learn improvement curve concepts and their application to cost and price analysis. Basic Improvement Curve Concept. You may have learned about improvement

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Recognizing Informed Option Trading

Recognizing Informed Option Trading Recognizing Informed Option Trading Alex Bain, Prabal Tiwaree, Kari Okamoto 1 Abstract While equity (stock) markets are generally efficient in discounting public information into stock prices, we believe

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

From Sparse Approximation to Forecast of Intraday Load Curves

From Sparse Approximation to Forecast of Intraday Load Curves From Sparse Approximation to Forecast of Intraday Load Curves Mathilde Mougeot Joint work with D. Picard, K. Tribouley (P7)& V. Lefieux, L. Teyssier-Maillard (RTE) 1/43 Electrical Consumption Time series

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

Nonlinear Iterative Partial Least Squares Method

Nonlinear Iterative Partial Least Squares Method Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

More information

discuss how to describe points, lines and planes in 3 space.

discuss how to describe points, lines and planes in 3 space. Chapter 2 3 Space: lines and planes In this chapter we discuss how to describe points, lines and planes in 3 space. introduce the language of vectors. discuss various matters concerning the relative position

More information

Special Situations in the Simplex Algorithm

Special Situations in the Simplex Algorithm Special Situations in the Simplex Algorithm Degeneracy Consider the linear program: Maximize 2x 1 +x 2 Subject to: 4x 1 +3x 2 12 (1) 4x 1 +x 2 8 (2) 4x 1 +2x 2 8 (3) x 1, x 2 0. We will first apply the

More information

Numerical methods for American options

Numerical methods for American options Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment

More information

A Simple Model of Price Dispersion *

A Simple Model of Price Dispersion * Federal Reserve Bank of Dallas Globalization and Monetary Policy Institute Working Paper No. 112 http://www.dallasfed.org/assets/documents/institute/wpapers/2012/0112.pdf A Simple Model of Price Dispersion

More information

THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING

THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING 1. Introduction The Black-Scholes theory, which is the main subject of this course and its sequel, is based on the Efficient Market Hypothesis, that arbitrages

More information

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

More information

QUANTIFYING THE EFFECTS OF ONLINE BULLISHNESS ON INTERNATIONAL FINANCIAL MARKETS

QUANTIFYING THE EFFECTS OF ONLINE BULLISHNESS ON INTERNATIONAL FINANCIAL MARKETS QUANTIFYING THE EFFECTS OF ONLINE BULLISHNESS ON INTERNATIONAL FINANCIAL MARKETS Huina Mao School of Informatics and Computing Indiana University, Bloomington, USA ECB Workshop on Using Big Data for Forecasting

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network

Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network Dušan Marček 1 Abstract Most models for the time series of stock prices have centered on autoregressive (AR)

More information

Evaluating the Lead Time Demand Distribution for (r, Q) Policies Under Intermittent Demand

Evaluating the Lead Time Demand Distribution for (r, Q) Policies Under Intermittent Demand Proceedings of the 2009 Industrial Engineering Research Conference Evaluating the Lead Time Demand Distribution for (r, Q) Policies Under Intermittent Demand Yasin Unlu, Manuel D. Rossetti Department of

More information

Mathematical finance and linear programming (optimization)

Mathematical finance and linear programming (optimization) Mathematical finance and linear programming (optimization) Geir Dahl September 15, 2009 1 Introduction The purpose of this short note is to explain how linear programming (LP) (=linear optimization) may

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Applications to Data Smoothing and Image Processing I

Applications to Data Smoothing and Image Processing I Applications to Data Smoothing and Image Processing I MA 348 Kurt Bryan Signals and Images Let t denote time and consider a signal a(t) on some time interval, say t. We ll assume that the signal a(t) is

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

A Description of Consumer Activity in Twitter

A Description of Consumer Activity in Twitter Justin Stewart A Description of Consumer Activity in Twitter At least for the astute economist, the introduction of techniques from computational science into economics has and is continuing to change

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

How To Find Local Affinity Patterns In Big Data

How To Find Local Affinity Patterns In Big Data Detection of local affinity patterns in big data Andrea Marinoni, Paolo Gamba Department of Electronics, University of Pavia, Italy Abstract Mining information in Big Data requires to design a new class

More information

Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

More information

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Anthony Lai (aslai), MK Li (lilemon), Foon Wang Pong (ppong) Abstract Algorithmic trading, high frequency trading (HFT)

More information

The degrees of freedom of the Lasso in underdetermined linear regression models

The degrees of freedom of the Lasso in underdetermined linear regression models The degrees of freedom of the Lasso in underdetermined linear regression models C. Dossal (1), M. Kachour (2), J. Fadili (2), G. Peyré (3), C. Chesneau (4) (1) IMB, Université Bordeaux 1 (2) GREYC, ENSICAEN

More information

JetBlue Airways Stock Price Analysis and Prediction

JetBlue Airways Stock Price Analysis and Prediction JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Systems of Linear Equations

Systems of Linear Equations Systems of Linear Equations Beifang Chen Systems of linear equations Linear systems A linear equation in variables x, x,, x n is an equation of the form a x + a x + + a n x n = b, where a, a,, a n and

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

More information

Linear Programming for Optimization. Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc.

Linear Programming for Optimization. Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc. 1. Introduction Linear Programming for Optimization Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc. 1.1 Definition Linear programming is the name of a branch of applied mathematics that

More information

1 Introduction. Linear Programming. Questions. A general optimization problem is of the form: choose x to. max f(x) subject to x S. where.

1 Introduction. Linear Programming. Questions. A general optimization problem is of the form: choose x to. max f(x) subject to x S. where. Introduction Linear Programming Neil Laws TT 00 A general optimization problem is of the form: choose x to maximise f(x) subject to x S where x = (x,..., x n ) T, f : R n R is the objective function, S

More information

Option Portfolio Modeling

Option Portfolio Modeling Value of Option (Total=Intrinsic+Time Euro) Option Portfolio Modeling Harry van Breen www.besttheindex.com E-mail: h.j.vanbreen@besttheindex.com Introduction The goal of this white paper is to provide

More information

6. Cholesky factorization

6. Cholesky factorization 6. Cholesky factorization EE103 (Fall 2011-12) triangular matrices forward and backward substitution the Cholesky factorization solving Ax = b with A positive definite inverse of a positive definite matrix

More information

Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

More information

Lecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization

Lecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization Lecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization 2.1. Introduction Suppose that an economic relationship can be described by a real-valued

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

A CRF-based approach to find stock price correlation with company-related Twitter sentiment

A CRF-based approach to find stock price correlation with company-related Twitter sentiment POLITECNICO DI MILANO Scuola di Ingegneria dell Informazione POLO TERRITORIALE DI COMO Master of Science in Computer Engineering A CRF-based approach to find stock price correlation with company-related

More information

Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Big Data Techniques Applied to Very Short-term Wind Power Forecasting

Big Data Techniques Applied to Very Short-term Wind Power Forecasting Big Data Techniques Applied to Very Short-term Wind Power Forecasting Ricardo Bessa Senior Researcher (ricardo.j.bessa@inesctec.pt) Center for Power and Energy Systems, INESC TEC, Portugal Joint work with

More information