Knowledge Discovery in Stock Market Data



Alfred Ultsch and Hermann Locarek-Junge

A. Ultsch, Databionics Research Group, University of Marburg, Germany, e-mail: ultsch@informatik.uni-marburg.de
H. Locarek-Junge and C. Weihs (eds.), Classification as a Tool for Research, Studies in Classification, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-642-10745-0_68, © Springer-Verlag Berlin Heidelberg 2010

Abstract This work presents the results of a Data Mining and Knowledge Discovery approach to stock market data using Databionic techniques. Stock market data are analyzed using methods that were learned from nature and previously applied primarily to DNA microarray data. It is demonstrated that new insights into the stock markets can be discovered by sensible preprocessing of daily returns (Relative Differences), by a projection that has the potential to show emergent structures in the data (U-Matrix), and by a nontrivial clustering of the data (U*C).

1 Introduction

An issue that is the subject of intense debate among academics and financial professionals is the Efficient Market Hypothesis (EMH). It states that security prices fully reflect all available information at any time. The implications of the EMH are profound. Most individuals who buy and sell stocks in practice, however, do so under the assumption that the securities they are buying are worth more than the price they are paying, while the securities they are selling are worth less than the selling price. Empirical evidence has been mixed, but has generally not supported strong forms of the EMH; for example, low P/E stocks have greater returns. Earlier papers also refuted the assertion that higher returns could be attributed to higher beta, which has been accepted by efficient market theorists as explaining the anomaly in neat accordance with modern portfolio theory. One can also identify losers as stocks that have had poor returns over some number of past years. Winners would be those stocks that had high returns over a similar period. Some trading rules say that in trends one should buy winners and sell losers. While proponents of the EMH don't believe that it is possible to beat the market, some believe that stocks can be divided into categories based on risk factors. However, these risk factors are considered to be stable over time.

In this paper, we analyze a very large stock market to find out whether there exist groups of stocks and clusters of time such that the groups behave similarly, in the sense that the probability of rising or falling stock prices within the groups can be forecast and differs from randomness, which would challenge the EMH.

2 Daily Returns on Stocks

The primary data in this paper are the adjusted daily closing prices of stocks traded in the USA. The prices of 7031 stocks were collected from Yahoo Finance (finance.yahoo.com) for the period from January 1, 2000 to March 1, 2008 (observation period). This resulted in 2047 trading days. A total of 14,390,410 stock prices were obtained in this way. The Standard & Poor's 500 Index (S&P 500) gives an overall picture of the market situation during the observation period (see Fig. 1). The S&P 500 is one of the most commonly used benchmarks for the overall U.S. stock market. It can be seen that the observation period includes rising as well as falling market conditions. For each day t and each price p(t), the daily return was calculated as the Relative Difference RelDiff(t):

$$\mathrm{RelDiff}(t) = \frac{2\,\bigl(p(t) - p(t-1)\bigr)}{p(t) + p(t-1)}$$

[Fig. 1 S&P 500 during observation period. Plot of the S&P 500 index over the trading days; period: 1 Jan 2000 to 1 March 2008]
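The Relative Difference can be computed directly from a price series. The following is a minimal sketch, not the authors' code: the function name `rel_diff` and the sample prices are hypothetical, and the result is scaled to percent on the assumption that the percentage values quoted for RelDiff refer to 100 times the formula above.

```python
import numpy as np

def rel_diff(prices):
    """Relative Difference returns in percent for a 1-D series of prices.

    Implements 2 * (p(t) - p(t-1)) / (p(t) + p(t-1)); the factor 100 is an
    assumption made so that the bounds read as -200% and +200%.
    """
    p = np.asarray(prices, dtype=float)
    prev, curr = p[:-1], p[1:]
    denom = curr + prev
    # Leave NaN where both prices are zero and the ratio is undefined.
    out = np.full(denom.shape, np.nan)
    np.divide(2.0 * (curr - prev), denom, out=out, where=denom != 0)
    return 100.0 * out

# Hypothetical closing prices; the last step models a default (price 0).
prices = [100.0, 110.0, 110.0, 0.0]
returns = rel_diff(prices)
print(returns)
```

With these sample prices the last return is exactly -200, the lower bound of the range, while an unchanged price gives a return of 0.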

Relative Difference has several advantages over other formulas for return, such as LogRatio, $\log\bigl(p(t)/p(t-1)\bigr)$, or Ratio, $\bigl(p(t) - p(t-1)\bigr)/p(t-1)$. See Ultsch (2009) for a detailed discussion. The most important for this investigation is that RelDiff has a symmetric and finite range: if a company defaults ($p(t) = 0$), then RelDiff $= -200\%$. If a company has exorbitant gains ($p(t-1) \ll p(t)$), then RelDiff approaches $+200\%$. This allows returns to be modeled with finite variances.

In Ultsch (2009) it was shown that returns measured in RelDiff can be modeled with a mixture of distributions using one Normal (Gaussian) and two LogNormal distributions. The definition of the logarithm was generalized to negative numbers as $\log'(x) = \mathrm{sign}(x)\,\log(|x|)$. An initial LogNormal, Gaussian, LogNormal (LGL) model was fitted to the data using the Expectation Maximization algorithm (e.g. Izenman 2008). Figure 2 shows the empirical probability distribution measured with a kernel density estimator, Pareto Density Estimation (PDE) (Ultsch 2003). The LGL model is depicted in Fig. 2 using dashed lines for each component and a solid line for the mixture. The quality of the model was assessed with a quantile/quantile plot, which showed an extremely good fit (see Ultsch 2009, Fig. 5).

This model can be naturally interpreted as follows: the central Gaussian $N(0, 1.7)$, with a fraction of 75% of all returns, represents a random result for returns. Furthermore, there are two non-random distributions for returns, losses (12.5%) and wins (12.5%), which are lognormally distributed.

[Fig. 2 The Log-Gauss-Log model of all returns. Panel title: A Log-Gauss-Log model of Returns; x-axis: Returns in %]
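To make the mixture concrete, the sketch below evaluates a Log-Gauss-Log density and the posterior probability of each component for a given return. It is an illustration under stated assumptions, not the fitted model: the weights (12.5%, 75%, 12.5%) come from the text, 1.7 is read as a standard deviation, the lognormal parameters `LN_MU` and `LN_SIGMA` are hypothetical, and the loss component is assumed to be a lognormal mirrored onto negative returns. An EM fit would alternate between such a posterior step and a parameter update; only the posterior step is shown.

```python
import numpy as np

WEIGHTS = {"loss": 0.125, "random": 0.75, "win": 0.125}  # fractions from the text
SIGMA_RANDOM = 1.7           # assumption: 1.7 interpreted as a standard deviation
LN_MU, LN_SIGMA = 1.0, 0.8   # hypothetical lognormal parameters

def gen_log(x):
    """Generalized logarithm sign(x) * log(|x|); set to 0 at x = 0 (assumption)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * np.log(np.abs(x[nz]))
    return out

def normal_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def lognormal_pdf(x, mu, sigma):
    """Lognormal density; zero for x <= 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    pos = x > 0
    z = (np.log(x[pos]) - mu) / sigma
    out[pos] = np.exp(-0.5 * z ** 2) / (x[pos] * sigma * np.sqrt(2.0 * np.pi))
    return out

def posteriors(returns):
    """Posterior probability of each component; columns: loss, random, win."""
    x = np.atleast_1d(np.asarray(returns, dtype=float))
    weighted = np.column_stack([
        WEIGHTS["loss"] * lognormal_pdf(-x, LN_MU, LN_SIGMA),  # mirrored onto x < 0
        WEIGHTS["random"] * normal_pdf(x, SIGMA_RANDOM),
        WEIGHTS["win"] * lognormal_pdf(x, LN_MU, LN_SIGMA),
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)

post = posteriors([-8.0, 0.0, 8.0])   # returns in percent
labels = np.array(["loss", "random", "win"])[post.argmax(axis=1)]
print(labels)
```

With these illustrative parameters, a return of 0% is assigned entirely to the random component, while returns of -8% and +8% fall into the loss and win components respectively.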

3 Knowledge Discovery in Market Activities

Using the model described in the previous section, it can be decided whether a return belongs to the Random, Losses or Wins class using the Bayes decision. We define

$$\mathrm{UnitWin} = p(\mathrm{return} > \mathrm{Random}) - p(\mathrm{return} < \mathrm{Random}),$$

where the probabilities are calculated with Bayes' theorem on the model developed above. UnitWin gives $-1$ for Losses, $0$ for Random and $+1$ for Wins. In Fig. 3 UnitWin is shown for all returns. The advantage of UnitWin is that differences in returns within the same group are zero. UnitWin is therefore a good measure for comparing the performance of different stocks over all trading days.

[Fig. 3 UnitWin as a function of stock's returns. x-axis: Return as RelDiff [%]; y-axis: UnitWin]

The market activity on each day can be measured as the average number of non-random returns for that day. This gives

$$\mathrm{Activity}(t) = \operatorname{mean}_i\bigl(|\mathrm{UnitWin}(t, i)|\bigr)$$

The distribution of Activity is shown in Fig. 4 using PDE. It can be seen that Activity can be modeled as a mixture of Gaussians (GMM, see Fig. 4). Using this GMM, active days and inactive days can be distinguished. We found that the market was active for 2,045 days during our observation period.

[Fig. 4 Distribution of market activity for all days. Panel title: Classes of trading days; x-axis: Activity]

The next question is whether there are days with more than average performance of the stock market. We defined DailyPerformance(i) as the sum of all UnitWins for stock i. We found that the DailyPerformance consisted of three different distributions: a Gaussian around zero, i.e. passive performance or sideways movement of stocks, and a winner and a loser distribution, both lognormally distributed. Furthermore, we found that only 568 (8%) of all stocks dominated the performance of the stock market (market leaders).

4 Types of Market States

With UnitWins, returns can be compared for sets of stocks and groups of days. The similarity, respectively dissimilarity, of market days was defined as the Euclidean distance of the UnitWins of a set of stocks. Using this distance definition, the different types of market days (winning, losing and passive) were compared for the market leaders. For each of these groups a clustering procedure was performed using Emergent Self-Organizing Feature Maps (ESOM) with the U-Matrix display (Deboeck and Ultsch 2000) and the clustering algorithm U*C (Ultsch 2007). Figure 5 shows an example of a U-Matrix. This 3D landscape is interpreted as follows: data in valleys are close in the high-dimensional input space. Data separated by mountains are in different clusters. The U*C clustering resulted in three clusters for Winner days ($w_1, \ldots, w_3$), four classes for Loser days ($l_1, \ldots, l_4$) and only one class for the Passive days. As a next step, the transition frequencies for each class were counted. The results are shown in Fig. 6. It is remarkable that some states are rather persistent. For class $l_3$, one of the loser classes, the probability that the next day is also a loser class is 74%.
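The aggregates used so far (Activity, the Euclidean distance between market days, and the transition frequencies between day classes) can be sketched in a few lines. This is an illustrative sketch only: the UnitWin matrix, the day labels and the function names are hypothetical, and UnitWin is simplified to hard values of -1, 0 and +1, whereas the definition above is a difference of probabilities. The ESOM and U*C clustering that produce the day classes are not reproduced here.

```python
import numpy as np

# Hypothetical UnitWin matrix: rows are trading days t, columns are stocks i.
unit_win = np.array([
    [ 1,  1,  0,  1],
    [ 0,  0,  0,  0],
    [-1, -1,  0, -1],
    [-1,  0, -1, -1],
])

# Activity(t): mean over all stocks of |UnitWin(t, i)|.
activity = np.abs(unit_win).mean(axis=1)

def day_distance(a, b):
    """Euclidean distance between the UnitWin vectors of two market days."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def transition_frequencies(labels):
    """Row-normalised counts of transitions from a day's class to the next day's class."""
    classes = sorted(set(labels))
    index = {c: k for k, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for today, tomorrow in zip(labels[:-1], labels[1:]):
        counts[index[today], index[tomorrow]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any outgoing transition stay at zero.
    freqs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return classes, freqs

# Hypothetical sequence of day classes, one label per trading day.
day_classes = ["w1", "w1", "passive", "l3", "l3", "l3", "passive"]
classes, freqs = transition_frequencies(day_classes)
print(activity, day_distance(unit_win[0], unit_win[2]))
print(classes, freqs, sep="\n")
```

A statement such as "the next day is also a loser class" corresponds to summing one row of `freqs` over all loser-class columns.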

[Fig. 5 U-Matrix of the winner days for the market leaders]

[Fig. 6 Transition frequencies for the market classes w1, w2, w3, passive, l1, l2, l3 and l4]

For class $w_3$ it was also observed that with a probability of 60% the next day is also a winning day. Other states are unstable. For example, in loser state $l_2$, with 58% probability the next day is either passive or winning.

5 Discussion

This paper is an example of knowledge discovery in stock market data. Knowledge Discovery is defined as the discovery of understandable knowledge which is new and useful. We have found that there are three types of returns: random, losses and wins. Using the Bayesian decision, a meaningful aggregation of collective behavior could be defined (UnitWin). Market activity was found to be either active or inactive. Daily performance could be classified as passive, winning or losing. UnitWin can be used for the definition of a sensible distance function. It has the advantage that inner-group differences, e.g. within passive stocks or days, are zero. Inter-group differences contribute a precise and finite value to the distance function. Using this distance function, a clustering of the winner and loser market days was possible. The usefulness of these clusters can be seen in the transition frequencies to other states. Some of the states suggest the buying (e.g. $l_2$), others the selling of stocks (e.g. $w_1$). It was not intended that this work be used for the generation of buy-or-sell signals. It may, however, be useful for calculating measures of the overall state of a market day.

6 Conclusion

The EMH is the backbone of classical capital market theory. It has been tested empirically quite often, using econometric testing and event studies. Several anomalies have been found, but they could mostly be explained by applying risk measures and models for investor utility. In this paper, knowledge discovery is applied to stock market data. We found that there are three types of returns: random, losses and wins. A meaningful aggregation of collective behavior was defined, and market activity was found to be either active or inactive, while performance could be classified as passive, winning or losing. A clustering of the winner and loser market days was possible, where some of the states suggest the buying, others the selling of stocks. It was not intended that this work be used for the generation of buy or sell signals. It may, however, be useful for calculating measures of the overall state of a market day.
The authors will try to test these properties out-of-sample and in various other markets to find out whether the method works only in the sample period or reflects a general property of the stock market, which remains to be proven with an independent test set. This challenge for the EMH remains future work.

References

Deboeck, G. J., & Ultsch, A. (2000). Picking stocks with emergent self-organizing value maps (Vol. 10, pp. 203–216). Prague: Neural Networks World, Institute of Computer Science.
Izenman, A. J. (2008). Modern multivariate statistical techniques: Regression, classification, and manifold learning. Berlin: Springer.
Ultsch, A. (2003). Pareto density estimation: A density estimation for knowledge discovery. In D. Baier & K. D. Wernecke (Eds.), Innovations in classification, data science, and information systems: Proceedings 27th annual conference of the German Classification Society (GfKl) (pp. 91–100). Berlin, Heidelberg: Springer.
Ultsch, A. (2007). Analysis and practical results of U*C clustering. Proceedings 30th annual conference of the German Classification Society (GfKl 2006). Berlin, Germany.
Ultsch, A. (2009). Is log ratio a good value for measuring return in stock investments? In A. Fink, B. Lausen, W. Seidel & A. Ultsch (Eds.), Advances in data analysis, data handling and business intelligence (pp. 505–511). Springer.