
Algorithmic Trading Strategy Based On Massive Data Mining

Haoming Li, Zhijun Yang and Tianlun Li
Stanford University

Abstract

We believe there is useful information hiding behind the noisy, massive data that can give us insight into the financial markets. Our goal in this project is to find a strategy for selecting profitable U.S. stocks every day by mining public data. To achieve this we build models that predict the daily return of a stock from a set of features. These features are constructed from quoted and external data that is available before the prediction date. We consider both regression and classification approaches and implement several supervised learning algorithms. To capture the dynamic nature of the financial market, we carefully design out-of-sample testing and cross-validation procedures to ensure that our historical test results are reasonable and achievable in the real market. Finally, we construct stock portfolios based on our forecast models and illustrate their performance to show that our strategy indeed works.

I. Introduction

How can we discover stocks that will rise in the future? The general answer is to gather as much relevant and non-trivial information as possible. One way to obtain such information is to mine the huge amount of financial and Internet data that cannot be easily understood by hand. This data allows us to define various features for each individual stock; for example, we can distinguish stocks by their historical performance, trading volume, or sensitivity to external economic and financial variables. We can then use machine learning models to discover the underlying relation between these features and the actual performance of stocks, and finally select the stocks that are predicted to have the highest returns.

The report is organized as follows. In Part II we discuss what data we use and how we collected and processed it. Part III introduces our methodology for constructing features. Part IV presents the machine learning models we implement and the procedure for dynamically training and testing them. Part V gives the results and the performance of our daily selected portfolio, together with some discussion and analysis. Part VI draws the summary.

II. Data Description

We collected daily trading data for 2666 U.S. stocks trading (or once traded) on NYSE or NASDAQ from 2000-01-01 to 2014-11-10. This dataset includes each stock's daily open price, close price, highest price, lowest price and trading volume. The data is collected from Quandl, a free online database. We also collected data that is not directly related to each stock but may contain additional information for forecasting purposes: the daily quotes of 5 commodity futures contracts (gold, crude oil, natural gas, corn, cotton), 2 foreign currencies (EUR, JPY) and 1 interest rate (the 10-year Treasury rate), all from 2000-01-01 to 2014-11-10. The aggregate size of all data files is 1.11 GB.
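As an illustration of this data-collection step, below is a minimal Python sketch of pulling one ticker's daily OHLCV from Quandl. The API key, the WIKI/AAPL dataset, and the "EXTERNAL/SERIES" code are placeholders of ours, not the exact series used in the report.

```python
# Illustrative sketch only: download one stock's daily quotes and one external
# series from Quandl, then align them on common trading dates.
import quandl
import pandas as pd

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder, not a real key

START, END = "2000-01-01", "2014-11-10"

# Daily open/high/low/close/volume for one ticker (WIKI database).
aapl = quandl.get("WIKI/AAPL", start_date=START, end_date=END)
prices = aapl[["Open", "High", "Low", "Close", "Volume"]]

# A hypothetical external series (e.g., a commodity future or FX rate);
# "EXTERNAL/SERIES" stands in for whichever Quandl code is actually used.
external = quandl.get("EXTERNAL/SERIES", start_date=START, end_date=END)

# Keep only trading dates present in both series before building features.
merged = prices.join(external, how="inner", rsuffix="_ext")
print(merged.head())
```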

III. Targets and Features Construction

Since our goal is to predict the daily return of each stock, we naturally define our target as stock i's daily return on day t, for all i and t:

\[ \mathrm{Target}_{i,t} = \frac{\mathrm{ClosePrice}_{i,t}}{\mathrm{ClosePrice}_{i,t-1}} - 1 \quad (1) \]

Note that we can also focus only on the direction of the move, regardless of its magnitude. An alternative definition of the target is:

\[ \mathrm{Target}_{i,t} = \mathrm{sign}\!\left(\frac{\mathrm{ClosePrice}_{i,t}}{\mathrm{ClosePrice}_{i,t-1}} - 1\right) \quad (2) \]

The first definition gives a regression problem and the second a classification problem; both setups are tried.

We then construct features that help distinguish (or, one might say, define) each stock on each day. These features should be relevant to performance and must be available before the trading day. Stock performance is known to be correlated with dozens of variables; for simplicity, our model employs only a relatively small number of features. The features we construct fall into two categories. The first category, which we call direct features, contains variables constructed from the explicit (and lagged) market data of the stocks, e.g., open, close, high, low and volume. The second category, indirect features, concerns the information carried by external factors: for each external variable we construct one feature reflecting how a specific stock may be affected on a given day when that external variable changes. We now describe the construction of the features by category.

Direct features. Based on our raw data, we constructed 4 direct features:

\[ \mathrm{RET}_{i,t} = \frac{\mathrm{ClosePrice}_{i,t-1}}{\mathrm{ClosePrice}_{i,t-2}} - 1 = \mathrm{Target}_{i,t-1} \quad (3) \]

\[ \mathrm{HL}_{i,t} = \frac{\mathrm{HighPrice}_{i,t-1}}{\mathrm{LowPrice}_{i,t-1}} \quad (4) \]

\[ \mathrm{VOL}_{i,t} = \ln\left(\mathrm{TradingVolume}_{i,t-1}\right) \quad (5) \]

\[ \mathrm{VOLCHNG}_{i,t} = \ln\left(\frac{\mathrm{TradingVolume}_{i,t-1}}{\mathrm{TradingVolume}_{i,t-2}}\right) \quad (6) \]

These four features are properly lagged so that they can be computed before day t, and they are all relevant since they measure trends or the relative strength of each stock.

Indirect features. The intuition behind the indirect features is to compute the sensitivity of each stock's return to each external economic index and multiply this sensitivity by the latest change in that index. As mentioned above, we have data for 8 external indices (5 commodity futures, 2 foreign currencies and 1 interest rate). For index j, we define the corresponding feature for stock i on day t as:

\[ \mathrm{EXTRN}_{j,i,t} = \beta_{j,i,t} \cdot \Delta\mathrm{index}_{j,t-1} \quad (7) \]

where

\[ (c, \beta_{1,i,t}, \ldots, \beta_{8,i,t})^{T} = \arg\min_{\beta} \lVert X_{i,t}\,\beta - y_{i,t} \rVert^{2} \quad (8) \]

with

\[ X_{i,t} = \begin{pmatrix} 1 & \Delta\mathrm{index}_{1,t-2} & \cdots & \Delta\mathrm{index}_{8,t-2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \Delta\mathrm{index}_{1,t-T-1} & \cdots & \Delta\mathrm{index}_{8,t-T-1} \end{pmatrix} \quad (9) \]

\[ y_{i,t} = \begin{pmatrix} \mathrm{Target}_{i,t-1} \\ \vdots \\ \mathrm{Target}_{i,t-T} \end{pmatrix} \quad (10) \]

Here T is a window-length parameter used to compute the sensitivity. Intuitively, these 8 indirect features estimate how much stock i's return on day t is expected to move in response to the change in index j on day t-1. Since everything defined here is lagged, the values of these 8 indirect features are available before day t. In total we construct 12 features per stock. Cross-sectional averages of these features are shown in Figure 1.
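To make equations (1)-(10) concrete, here is a minimal pandas/NumPy sketch of how the target and the 12 features might be computed for a single stock. The DataFrame layout (`px` with Close/High/Low/Volume columns, `idx_chg` with the daily changes of the 8 external indices) and the default window T = 60 are our assumptions, not the report's actual code.

```python
import numpy as np
import pandas as pd

def build_features(px: pd.DataFrame, idx_chg: pd.DataFrame, T: int = 60) -> pd.DataFrame:
    """px: one stock's daily Close/High/Low/Volume; idx_chg: daily changes of the
    8 external indices, aligned on the same trading dates; T: sensitivity window."""
    f = pd.DataFrame(index=px.index)

    # Target, eq. (1); taking np.sign of this column gives the classification target of eq. (2).
    f["Target"] = px["Close"] / px["Close"].shift(1) - 1

    # Direct features, eqs. (3)-(6): lagged one extra day so they are known before day t.
    f["RET"] = f["Target"].shift(1)
    f["HL"] = (px["High"] / px["Low"]).shift(1)
    f["VOL"] = np.log(px["Volume"].shift(1))
    f["VOLCHNG"] = np.log(px["Volume"].shift(1) / px["Volume"].shift(2))

    # Indirect features, eqs. (7)-(10): regress the last T targets on the index
    # changes of the preceding day, then scale yesterday's index change by the
    # fitted sensitivity beta_j.
    ext_cols = [f"EXTRN_{j + 1}" for j in range(idx_chg.shape[1])]
    ext = pd.DataFrame(np.nan, index=px.index, columns=ext_cols)
    for t in range(T + 2, len(px)):
        y = f["Target"].values[t - T:t]                                # Target_{t-T} .. Target_{t-1}
        X = np.column_stack([np.ones(T), idx_chg.values[t - T - 1:t - 1]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)                   # beta[0] is the intercept c
        ext.iloc[t] = beta[1:] * idx_chg.values[t - 1]                 # eq. (7)

    return pd.concat([f, ext], axis=1).dropna()
```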

Figure 1: The upper panel shows the cross-sectional averages of the direct features (RET, HL, VOL, VOLCHNG). The lower panel shows the cross-sectional averages of the indirect features (gold, crude oil, natural gas, corn, cotton, USDvsEUR, USDvsJPY and the Treasury rate).

IV. Learning Algorithms

Now that we have specified our targets and features, we must implement specific machine learning algorithms. Since we defined the targets in two ways (numerical and categorical), we have two ways of predicting the performance of stocks: classification and regression. In the classification setup we try to predict the direction of a stock's move on a given day, while in the regression setup we predict the exact return. For simplicity we first try linear models: logistic regression as the classification model and linear regression as the regression model. We then implement SVM models (classifier and regression) to explore possible non-linear regularities using kernels. Before feeding the features into the models, we also normalize and center them to mean 0 and standard deviation 1.

Our model is dynamic rather than fixed, depending on the date for which a return is to be predicted. Specifically, our procedure for training and testing models is as follows (a code sketch of this loop appears below):

1. Specify a training window parameter W.
2. To predict the performance of stocks on date T_i, use the samples from days T_{i-1}, T_{i-2}, T_{i-3}, ..., T_{i-W} as the training set.
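A minimal sketch of this rolling training loop, using scikit-learn's logistic regression as the classifier. The `panel` data structure (mapping each trading day to that day's feature matrix and up/down labels) is our assumption about how the inputs might be organized, not the report's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def walk_forward_predict(panel, dates, W=5):
    """panel[d] is assumed to hold (X_d, y_d): the feature matrix and the up/down
    labels of all stocks on trading day d. For each day, retrain on the previous
    W days and score that day's stocks."""
    scores = {}
    for i in range(W, len(dates)):
        train_days = dates[i - W:i]
        X_train = np.vstack([panel[d][0] for d in train_days])
        y_train = np.concatenate([panel[d][1] for d in train_days])

        # Normalize features to mean 0 / std 1, as the report does, using only
        # the training window to avoid look-ahead.
        scaler = StandardScaler().fit(X_train)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(scaler.transform(X_train), y_train)

        # decision_function values are later used to rank the "best classified" stocks.
        X_today = scaler.transform(panel[dates[i]][0])
        scores[dates[i]] = clf.decision_function(X_today)
    return scores
```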

After we have generated predictions for every day, we need to measure how good they are. We could compute each day's classification accuracy for the classification models and the root-mean-square error for the regression models, but comparing across these two categories would then be awkward. We therefore use a more practical and more easily visualized measure of performance: the performance of the stocks actually selected by our models. The methodology is as follows:

1. Specify a portfolio size N (the number of stocks to be picked).
2. Every day choose N stocks according to the predictions generated by the models: for the regression models we choose the N stocks predicted to have the highest return, and for the classification models we choose the N stocks that are classified most confidently, i.e., those with the largest classification scores.
3. Compute the actual return of each day's stock basket and compare this time series to a market index such as the S&P 500.

Furthermore, we call a day successful if the portfolio we selected earns a higher return than the market index, and unsuccessful otherwise. We can then compute a success rate for each model:

\[ \mathrm{SR} = \frac{\text{number of successful days}}{\text{number of total days}} \quad (11) \]

We then compare the different models and present the one with the best success rate. The results are shown in Figures 2 and 3.

V. Discussions and Analysis

The linear models behave well. From the portfolio-return graphs we can see that the linear classification model and the linear regression model give similar results, and both approaches capture the underlying regularities. The support vector machines do not behave as well. One possible reason is that for such extremely noisy data, simple linear models can perform better. The SVM classifier's performance is especially poor; one possible reason is that in the non-linear case the classification confidence (the decision-function value we sort on as an indicator of potential success) is less meaningful. Using more data to train each day's model might improve the SVM's performance, but the computational cost would increase significantly since we retrain the model for each trading day.

Another interesting question is whether our models behave stably over time. From the graphs we find that the two linear models behave relatively stably before the end of 2008. To quantify this we compute SR over successive 80-day windows and plot the resulting time series in Figure 4.

VI. Conclusion

We derived an approach for predicting the daily returns of U.S. stocks based on their trading data and external financial indices. Our linear models work well in both the regression framework and the classification framework. The best model turns out to be the linear classifier, logistic regression, which gives a 56.65% success rate and a 2000% cumulative return over 14 years. However, as time passes the models tend to behave less stably, especially after 2008. In the future we could look for methods that give more stable predictions. From the machine learning perspective, we could try mixing different models or training the models with more data each day. From the investment perspective, we could predict alphas rather than raw returns. We could also try to extract more information from text data: news and social networks can be excellent sources of information for predicting stock returns.
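To make the evaluation in equation (11) and the 80-day stability check of Section V concrete, here is a small sketch assuming two aligned pandas Series of daily portfolio returns and benchmark (e.g., S&P 500) returns; the non-overlapping windowing is our reading of "compute SR every 80 days", not necessarily the report's exact procedure.

```python
import numpy as np
import pandas as pd

def success_rate(portfolio_ret: pd.Series, benchmark_ret: pd.Series) -> float:
    """Eq. (11): fraction of days on which the selected basket beats the benchmark."""
    aligned = pd.concat([portfolio_ret, benchmark_ret], axis=1, join="inner")
    wins = aligned.iloc[:, 0] > aligned.iloc[:, 1]
    return float(wins.mean())

def rolling_success_rate(portfolio_ret: pd.Series, benchmark_ret: pd.Series, window: int = 80) -> pd.Series:
    """SR recomputed over consecutive non-overlapping `window`-day blocks,
    used to check whether model performance is stable over time."""
    aligned = pd.concat([portfolio_ret, benchmark_ret], axis=1, join="inner")
    wins = (aligned.iloc[:, 0] > aligned.iloc[:, 1]).astype(float)
    return wins.groupby(np.arange(len(wins)) // window).mean()
```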

Figure 2: The left figure shows the result of the logistic regression model. In this implementation we set the window parameter W = 5 and the portfolio size N = 100. We first implement the linear models. Logistic regression gives SR = 56.65%; the figure shows the cumulative return of the stocks selected by this model. The right figure shows the linear regression model, which gives SR = 55.82%, together with the cumulative return of the selected portfolio. For the linear models, both the classification setup and the regression setup behave well: $1 invested in 2000 would have become about $20 by 2014 if we had continually implemented the trades suggested by these two models. This success indicates that our approach did extract information from the data we collected.

Figure 3: The left figure shows the SVM classifier. We use a Gaussian kernel and, since the financial data is highly noisy, we set the parameter C to 0.85. Surprisingly, the SVMs give worse results than the linear models: the SVM classifier achieves SR = 49.56% (the figure shows the return of the selected portfolio). The right figure shows SVM regression, which gives SR = 53.18%. Adjusting the window parameter W and the regularization parameter C did not improve these results much.

Figure 4: All models tend to behave less well as time goes by, especially after 2008. We believe that from that time on, daily stock price changes depend on more factors than our features capture.
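For reference, a minimal scikit-learn sketch of how the SVM variants in Figure 3 might be configured. Only the Gaussian (RBF) kernel and C = 0.85 come from the report; the pipeline structure and the remaining defaults are assumptions.

```python
from sklearn.svm import SVC, SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Classifier for the up/down target; its decision_function supplies the ranking
# score used to pick the "best classified" stocks each day.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=0.85))

# Regression counterpart for the real-valued daily-return target.
svm_reg = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=0.85))

# Usage (hypothetical arrays):
#   svm_clf.fit(X_train, y_sign);  scores = svm_clf.decision_function(X_today)
#   svm_reg.fit(X_train, y_ret);   preds  = svm_reg.predict(X_today)
```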