An Investigation Into Stock Market Predictions Using Neural Networks Applied To Fundamental Financial Data


An Investigation Into Stock Market Predictions Using Neural Networks Applied To Fundamental Financial Data. Submitted by Luke Biermann for the degree of BSc in Computer Science of the University of Bath, 2006.

AN INVESTIGATION INTO STOCK MARKET PREDICTIONS USING NEURAL NETWORKS APPLIED TO FUNDAMENTAL FINANCIAL DATA

Submitted by Luke Biermann

COPYRIGHT

Attention is drawn to the fact that copyright of this dissertation rests with its author. The Intellectual Property Rights of the products produced as part of the project belong to the University of Bath (see ). This copy of the dissertation has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with its author and that no quotation from the dissertation and no information derived from it may be published without the prior written consent of the author.

Declaration

This dissertation is submitted to the University of Bath in accordance with the requirements of the degree of Bachelor of Science in the Department of Computer Science. No portion of the work in this dissertation has been submitted in support of an application for any other degree or qualification of this or any other university or institution of learning. Except where specifically acknowledged, it is the work of the author. This thesis may be made available for consultation within the University Library and may be photocopied or lent to other libraries for the purpose of consultation.

Signed: ... Luke Biermann

Abstract

Artificial Neural Networks (ANNs) have the ability to find non-linear correlations between time-series, which makes them ideal for the stock price prediction domain. Previous research has investigated this, but very rarely by applying ANNs to fundamental data (that is, data concerning the business or macro-economic environment). This paper applies ANNs to fundamental data to make stock price predictions, in order to consider whether it is a worthy application. The findings suggest that applying ANNs to fundamental data can be just as effective as applying them to technical data (that is, data concerning the stock); however, the complexity increases dramatically. During the course of the investigation the problems of ANN optimisation and bespoke settings are examined and an experimental studies framework is developed. Pruning is used extensively in order to improve the generality of ANN learning and to determine the most suitable fundamental data inputs. As a result, the pitfalls and benefits of using pruning are also discussed.

Acknowledgements

This dissertation could not have been completed without the help and support of a number of people. Warm thanks must go to: Alwyn Barry, my project supervisor, for his guidance and advice; Libby Christensen, for her meticulous proof reading and continued support; and my family and friends, for their support and encouragement.

Contents

1 Introduction
  Hypothesis
  Time Management
2 Literature Review
  Artificial Neural Networks
    Background to Neural Networks
    How Artificial Neural Networks Work
    The Transformation Function (Nonlinearity Function)
    Learning
    Pre-Processing
    Topology
  Investment and Financial Prediction
    The Efficient Market Hypothesis (EMH)
    Technical Analysis
    Fundamental Analysis
  Investment and Financial Prediction using Artificial Neural Networks
    Pre-processing for Financial Data
    Input Data
    Topology for Financial Applications
    Testing
3 Requirements
  Requirements Analysis
  Requirements Specification
    Overview
    Implementation
    Input Data Types
    Pre-processing
    3.2.5 Topology
    Pre-Experimental Studies
    Suitability Assessment
  Non-Functional Requirements
  Conclusions
4 Design
  Topology Optimisation Strategy
    Pruning Algorithms
    Accuracy Measures
  Input Data Considerations
    Constructing the input time-series
    Pre-processing Considerations
    Finding the most appropriate input data
  Suitability Experiment
  ANN Implementation
  Conclusions
5 Artificial Neural Network Software
  Available Neural Network Packages and Systems
    Fast Artificial Neural Network (FANN)
    Aine
    Annie
    Auratus
    Others
  Stuttgart Neural Network System
    Installing JavaNNS
    Creating an ANN
    The use of JavaNNS in the investigation
  Conclusions
6 Development
  Development Process
  Pre-Experimental Studies Framework
  Pre-Experimental Studies - Finding Consistency
    Random presentation of input-target pairs
    Activation Functions: tanh vs logistic
    Random weights: Small or large
  6.4 Accuracy Studies
    Accuracy with a Sine Wave
    Accuracy with the Bumps function
  Accuracy and Generalisation Studies
    Hidden Layers Study
  Accuracy and Generalisation Studies: Part 2 - Online vs Batch Error Propagation
  Pruning Algorithms and Parameters
  Selection of Input Data - Phase One
  Conclusions from Experimental Studies
7 Suitability Experiment and Results
  Suitability Experiment Configuration
  Adjustment to Pruning Parameters
  Results of the Suitability Experiment
  General Conclusions
8 Conclusions
  Pitfalls
    Domain
    ANN
    Hypothesis
    JavaNNS
  Contribution to the Field
  Further Work
  Personal Reflections
A Gantt Chart
B Bumps Function
C Pre-experimental Study Results
  C.1 Hidden Layer Study Results
  C.2 Online vs Batch Error Propagation Results
  C.3 Pruning Algorithms and Parameters Results
D Input Selection Charts
E Revised Pruning Parameters Pre-experimental Study
F Suitability Experiment Pruning Analysis

Chapter 1
Introduction

Accurate prediction of stock market variation is largely seen as the optimum get rich quick scheme. As a result, the field has been extensively researched and analysed. Since the start of the 20th century people have been dedicating their careers to the search for the ideal trading/investing strategy and as a result the field has rapidly expanded. Strategies have included studying astrology, reading tree stumps, trading psychology using technical data 1 and company analysis using fundamental data 2 [Dreman, 1998]. Since the 1990s a significant transition has taken place. As technology has developed in the machine learning field, the initial linear 3 strategies and theories are making way for non-linear advanced pattern analysis, which was previously not possible. The results of this have been at least promising and in some cases astonishing, possibly redefining the way financial markets will trade in the future. One of the most highly developed areas is the use of Artificial Neural Networks (ANNs). Companies including Goldman Sachs and Fidelity Investments are currently using this technology for commercial benefit [Shachmurove, 2002]. However, the vast majority of research and execution has trained ANNs with technical data. Research analysts base their stock forecasts on company analysis, which uses fundamental data. For this reason, as well as the fact that little research currently exists, fundamental data is applied to neural networks for this investigation. The ANN algorithms and techniques best suited to this form of time-series data will be developed. Analysis will potentially include the structure and format of the inputs, optimisation of the configuration, and component integration, for example integrating fuzzy logic to train the artificial neural network. The aim of this project is to provide a detailed study into the tailoring of ANNs and fundamental data input. This is to achieve the most accurate stock predictions possible, comparing them to the standard linear strategies employed and to technical data ANNs if time permits.

1 Technical Data: data detailing the stock's opening price, closing price, intra-day high, intra-day low, and volume. Usually used in conjunction with indicators including Exponential Moving Averages and Fast and Slow Stochastics, to mention only a limited few. These details are purely trading ones and tell nothing about the company as a business. This information is best used for short term forecasting.
2 Fundamental Data: data detailing a company's earnings, dividend yield, debt, p/e ratio etc. This informs the reader about the business and its profitability and potential pitfalls. Scandals like Enron's could not have been predicted from technical data but could have been with fundamental data and a keen eye.
3 Linear: a linear function holds two properties: additivity (f(x + y) = f(x) + f(y)) and homogeneity (f(ax) = af(x)). Linearity means that the system is the sum of its parts. Being linear means the system's solutions are easier to compute. In nonlinear systems these assumptions cannot be made; the system is more than the sum of its parts, which causes nonlinear systems to be extremely hard to model (quoted from Wikipedia).

A number of problems have been identified which will need to be confronted if this project is to be successful. Firstly, as fundamental data is infrequently used, resources detailing this information are not plentiful. The largest pitfall with Artificial Neural Networks is that they require a significant amount of training data in order to succeed. Therefore, a reliable source, or set of sources, must be identified from which sufficient data can be extracted, allowing for adequate network training and testing. Secondly, all stock predictions are limited by the accuracy of the input data. Data stretching back ten or more years usually contains a number of inaccuracies or missing data. Therefore, all input data must be classified based upon its accuracy. This will require the creation of an accuracy analysis plan and also an enhancement plan for dealing with issues including missing data and probable inaccuracies. Strategies for dealing with these issues are discussed in [Numerical Algorithms Group, 2002]. There is no perfect generic configuration of Artificial Neural Networks and many papers discuss this matter [Kolarik and Rudorfer, 1994, Bodis, 2004]. Potential configurations and models for time-series data must be identified that may prove well suited to the problem. These will then be analysed individually and in combination to determine the configuration which will be incorporated into the final design. This is likely to require a number of experiments or small tests, which will require an appropriate testing environment. Configuration analysis will look at how many hidden layers and how many neurons to utilise and how to structure and present the input data. At this stage fundamentals will be analysed for use as input data. As the network will be limited to only several inputs, it will be necessary to find the optimum combination of fundamentals and macro-economic data to provide the best forecasts. During optimisation, and once the system is completed, it will need testing against traditional stock selection and prediction methods. For this testing to be effective and accurate it must be unbiased and mimic the real live investing environment as far as possible. For this it is believed that accurate trading data over a significant period of time will be required. As a result of initial reading and investigation, this period should be two years, should not be included in the training data, and should include at least twenty separate predictions across a number of stocks. The predictions will be compared to the real prices and to predictions made using traditional methods.

1.1 Hypothesis

Artificial Neural Networks have been applied to technical financial data to make accurate stock price predictions. The link between fundamental financial data and a company's stock price has been made. Therefore, this investigation will attempt to conclude the validity of the following hypothesis: Artificial Neural Networks can be applied to Fundamental Data to make accurate stock price predictions.

1.2 Time Management

Without carefully breaking the project into tasks and developing a time plan, problems are likely to occur. The final documentation must be delivered on 8th May; therefore there are 31 weeks in which to complete background research, develop the

system, draw conclusions and document fully. Figure A.1 in Appendix A shows the Gantt chart that has been developed as part of the time management plan. The project has been divided into three main sections. These are: Literature Review, Implementation and Write-up. Before the implementation can begin the literature investigation must be completed, to prevent naive assumptions and unnecessary mistakes. The implementation can be broken into three main sections: initial ANN development, configuration refinements and the final suitability experiment. It would appear from Figure A.1 that approximately 50% of the implementation time has been given to the initial ANN development; however, this is due to the following assumptions:

- Implementation is forecast to start a week before the literature survey is to be submitted. This means that realistically the progress made during this week will be somewhat limited due to the pressing literature survey deadline.
- This period falls over Christmas and the New Year, as well as semester one assessments. Therefore, productivity will be lower because of external engagements.
- The time given for the first four activities in the implementation phase is seen as approximate and what is scheduled in the Gantt chart is the worst case scenario.

The pre-experimental studies may begin earlier if:

- The Christmas and assessment periods are not as disruptive as anticipated.
- A suitable ANN library or system can be found, reducing the time taken to build the initial ANN skeleton.

The pre-experimental studies and ANN refining phases are cyclic. This is because refinements will be made as a result of pre-experimental studies, which will then lead to the next phase of pre-experimental studies, starting the next cycle. This period of the investigation is viewed as being of top priority, as at this stage a suitable configuration is being determined and the configuration of the ANN is often critical to success. Finally, the write-up largely follows a waterfall life-cycle, with slight overlaps at the transitions between stages. This is because each stage naturally follows the previous in this manner. Before beginning a stage in the implementation it must be planned. Therefore, it seems appropriate to document the planning, thus the next stage of the write-up corresponds with the next stage of the development. Once the initial planning has been completed the write-up may revert to the previous section in order to conclude it. It is for this reason that the overlap during transitions has been included. Four weeks have been allowed for refining the write-up. Proof reading and making alterations always take longer than anticipated, thus extra time has been allowed.

Chapter 2
Literature Review

2.1 Artificial Neural Networks

2.1.1 Background to Neural Networks

In 1943 W. McCulloch and W. Pitts created the first mathematical model of a biological neuron. From this work Neural Networks were born [Kartalopoulos, 1996]. Artificial Neural Networks (ANNs) have been around for a number of decades now. Initially there was massive hype and excitement surrounding the neural network concept after Rosenblatt developed the Perceptron. However, after Minsky and Papert demonstrated that the early perceptron model was computationally incomplete in their 1969 book Perceptrons, funding dramatically tailed off and research came to a near standstill. In the 1970s and 80s several events rejuvenated enthusiasm. In 1974 Werbos first developed the idea of back propagation; this was then independently rediscovered by Parker, and by Rumelhart and McClelland, in the 1980s. Then in 1982, John Hopfield presented a paper to the National Academy of Sciences in which he not only attempted to model the brain but also to make useful devices. By using mathematical analysis he showed that such neural networks could work and what was feasibly possible. In 1986 Rumelhart and McClelland published Parallel Distributed Processing, which discussed how the back propagation algorithm could allow for multiple layered networks with hidden layers. This filled the computational incompleteness in the original perceptron model and re-fuelled interest and research. This enthusiasm has grown largely due to the rapid expansion of computing power and memory [Shachmurove, 2002], which has allowed ANNs to be a more feasible technology for real-life applications. This has kick-started a positive cycle of increased research, creating increased enthusiasm, which in turn has led to more research. ANNs are now used in a large variety of scenarios including handwriting recognition, signal processing, medical imaging, economic modelling, financial engineering, and more. This is largely due to ANNs' powerful pattern recognition properties, their ability to detect patterns analogous to human thinking and their ability to offer a highly flexible approximation tool [Refenes, 1995, Hui, 2000, Dorsey and Sexton, 2000].

2.1.2 How Artificial Neural Networks Work

Neural network programs learn as data is presented to them. This allows them to perform many different and unspecified tasks, which is in direct contrast to typical computer programs, where the human programmer focuses on accomplishing a particular, well-specified task [Lane and Neidinger, 1995]. Artificial Neural Networks, as the name suggests, attempt to create a man-made computation which mimics the human brain. The human brain contains around 100 billion neurons linked together into a network. A neuron contains a cell body or soma, branching extensions called dendrites, and an axon. The junction between the dendrites and axons is called the synapse. A single neuron may have 1000 to 10,000 synapses and may be connected with some 1000 neurons. The synapse is made up of three parts: the presynaptic terminal that belongs to the axon, the cleft, and the postsynaptic terminal that belongs to the dendrite. It is believed that neurons compute a simple threshold calculation. This computation is completed using biochemical and electrical signal processing. The input signals are collected and combined in the soma. If the combined signal strength exceeds the predetermined threshold then the neuron sends an electrical signal along a single path, the axon. At the end of the axon is a tree of paths called axonic endings, which allows the axon to connect to multiple dendrites, which pass on the input. The electrical output signal is the result of transforming the input signals. When the signal reaches the end of the axon it is converted into a chemical messenger, known as a neurotransmitter, which crosses the cleft in the synapse to the dendrite of the next neuron. At the postsynaptic terminal the charge generated by the neurotransmitter is weighted, determined by the association with that particular input [Refenes, 1995, Kartalopoulos, 1996]. For a more detailed discussion of this see [Kartalopoulos, 1996], where a number of pages delve into a high degree of detail in an easy to comprehend manner. However, this goes beyond what is necessary to understand existing ANN designs and as such is not discussed here.

Figure 2.1: A Biological Neuron

Although ANNs have varying degrees of resemblance to their biological cousins, they do have similar characteristics. They contain layers of neurons, or processing elements, which are interconnected. However, it is worth noting that the artificial neuron does not describe the biological neuron. Each connection has a weighting which determines the bias to certain data. Similarly to biological neurons, this is set before the main processing takes place. The input signals are combined in a bias term using a summation function (sometimes called the activation function).

This combined signal is titled the internal stimulation or activation level. The activation level is then passed to the transformation function 1, which determines the output. If the activation level exceeds the threshold then the neuron fires, ensuring that the output is bound. This is unlike the biological neuron, which does not need to bind the output. This component is added to the artificial neuron to ensure the neuron output is controllable when large activating stimuli are received. This is not necessary in the biological world because conditioning of stimuli is completed by the sensory inputs. A good example of this is that what we hear as twice as loud is actually about ten times louder, but the ear dampens the sound [Kartalopoulos, 1996, Trippi, 1996].

The power of neural computing resides in the threshold concept. It provides a way to transform complex interrelationships into simple yes-no situations. When the combination of several factors become overly complex, the neuron model conceives an intermediate yes-no node to retain simplicity. [Shachmurove, 2002]

2.1.3 The Transformation Function (Nonlinearity Function)

As mentioned previously, the purpose of the transformation function is to determine the neuron's output and to bind it. There are three popular functions for this exercise: hard limiters, sigmoids and pseudolinear functions. Hard limiter functions were the transformation functions for the first generation of neural networks. They produce values in the range [0, 1], depending on the total input. For example, any values below t, the threshold value, may be changed to zero, and anything above is rounded to 1. This is a very simple function but elegantly fulfils the basic requirements of the transformation function.

Figure 2.2: Hard Limiter Function

Sigmoid functions introduced the second generation of transformation functions, which made a better approximation of the biological process. Sigmoid functions are the most widely used in all types of learning. There are two forms of sigmoid function: asymmetric and symmetric. A typical form would be:

X_T = \frac{1}{1 + e^{-X}}    (2.1)

1 Different nonlinearity functions are used for the transformation function, depending on the paradigm and the algorithms used.

Figure 2.3: Sigmoidal Function

where X_T is the transformed (or normalised) value of X. The sigmoid is made asymmetric to incorporate the threshold previously mentioned [Lane and Neidinger, 1995]. The sigmoid has become so popular because it is easy to compute and satisfies the following property:

\frac{d}{dt}\,\mathrm{sig}(t) = \mathrm{sig}(t)\,(1 - \mathrm{sig}(t))    (2.2)

The pseudolinear function (sometimes known as piecewise linear) is the simple bridge between the hard limiter and the sigmoidal functions, as it derives a linear approximation of the sigmoidal function. The pseudolinear function is of the form:

Figure 2.4: Pseudolinear Function

f(x) = \begin{cases} 0 & \text{if } x \le x_{min} \\ mx + b & \text{if } x_{min} < x < x_{max} \\ 1 & \text{if } x \ge x_{max} \end{cases}

[Shachmurove, 2002]
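To make the three transformation functions concrete, the sketch below implements each of them and applies them to the weighted-sum activation level of a single artificial neuron, as described in section 2.1.2. The input values, weights, threshold and cut-off points are arbitrary example figures, not values used in this investigation.

import math

def hard_limiter(x, threshold=0.0):
    # First-generation transformation: 0 below the threshold, 1 at or above it.
    return 1.0 if x >= threshold else 0.0

def sigmoid(x):
    # Logistic sigmoid as in equation (2.1); satisfies sig'(x) = sig(x)(1 - sig(x)).
    return 1.0 / (1.0 + math.exp(-x))

def pseudolinear(x, x_min=-1.0, x_max=1.0):
    # Piecewise-linear bridge between the hard limiter and the sigmoid.
    if x <= x_min:
        return 0.0
    if x >= x_max:
        return 1.0
    m = 1.0 / (x_max - x_min)   # slope of the ramp between the two cut-offs
    b = -m * x_min              # intercept so that f(x_min) = 0
    return m * x + b

def neuron_output(inputs, weights, transform):
    # Weighted sum of the inputs (the activation level) passed through a transformation function.
    activation = sum(w * x for w, x in zip(weights, inputs))
    return transform(activation)

inputs, weights = [0.4, -0.2, 0.7], [0.5, 0.3, 0.8]   # arbitrary example values
for transform in (hard_limiter, sigmoid, pseudolinear):
    print(transform.__name__, round(neuron_output(inputs, weights, transform), 3))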

During the last few years the paradigm for biological neural systems has progressed significantly. The focus has been on spiking neurons. These more accurately model the biological neuron in terms of temporal patterns and neuronal activity, as the biological neuron fires pulses which are not represented in the hard limiter and sigmoidal functions. Although research has shown this form of model to have higher computational ability in some paradigms, integrative models of hard limiting and sigmoids have frequently been shown as adequate for the financial domain. For this reason, further discussion of this topic is left to the reader 2. What has been examined so far is the basic architecture of an ANN. It is quite simple, with no learning or adaptation. Warren McCulloch and Walter Pitts' work inspired the further development of ANNs and the introduction of training algorithms to determine the weightings for each neuron input. This depends on the paradigm and will be discussed next.

2.1.4 Learning

Like the biological neural network, the ANN must first be taught. This is where a number of different ANNs are produced, as different training techniques are applied. Learning starts with an error function, which is expressed in terms of the thresholds and weights of the ANN. The objective is to reach a minimum error by adjusting the weights of every input at every neuron [Shachmurove, 2002]. At this point the ANN is said to be in a steady state. Selecting the correct learning algorithm is one of the most important parameters when designing an ANN. A number of common learning algorithms implemented in the financial domain are discussed below. The list of learning algorithms continues to extend and detailing them all would be beyond the context of this literature survey. Due to the high degree of existing research into ANNs in finance, an assumption is made that if a learning algorithm has not been employed within a suitably matched system then it is not appropriate for this study and thus does not warrant investigation. This assumption is necessary due to time constraints. Learning is largely broken into two categories: Supervised and Unsupervised learning.

Supervised Learning

The ANN's output is compared to the target output. Adjustments are then made to the neuron weightings relative to the error size. The intention is to minimise the error, possibly to zero. Error minimisation requires a special circuit called the teacher 3 or supervisor. The network either stops learning after a number of iterations or when a sufficiently low error is reached [Kartalopoulos, 1996]. Training presents the same data repeatedly, altering the weights on each iteration. Before training starts, the weights are non-zero and randomly generated. Below, we consider the different supervised learning architectures and algorithms.

The Perceptron

The Perceptron is a basic ANN component and probably the most discussed. First devised by Frank Rosenblatt, the perceptron paradigm requires supervised learning. Once training has occurred the perceptron (neuron) is fixed. That is, the input values, synaptic weights and threshold values cannot change. If one wishes to change any of these, the network must be re-trained. Thus, different sets of patterns create different perceptrons. From this idea the multi-layer perceptron framework has formed.

2 [Maas and Schmitt, 1997] is a good introduction to the research surrounding spiking neurons and their complexity.
3 The teacher notion comes from biological studies, where a human learns by comparing results with what is known to be correct.
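As an illustration of the supervised learning loop just described, the sketch below trains a single perceptron with a hard-limiter output on the logical AND function using the classic perceptron learning rule. The task, learning rate and epoch count are arbitrary examples and are not settings taken from this project.

# A minimal perceptron trained with the perceptron learning rule on the AND function.
training_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.1

def predict(x):
    # Weighted sum plus bias, passed through a hard limiter.
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0.0 else 0

for epoch in range(20):                           # repeated presentation of the same patterns
    for x, target in training_data:
        error = target - predict(x)               # supervised: compare output to target
        weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
        bias += learning_rate * error

print(weights, bias)
print([predict(x) for x, _ in training_data])     # expected: [0, 0, 0, 1]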

Multi-layer Perceptron (MLP)

Largely seen as the most popular ANN framework, as stated in [Shachmurove, 2002], the perceptrons are grouped into layers. Perceptrons in a layer can only communicate with the previous and next layers 4. Generally there are three categories of layer: input, hidden and output. The input layer takes the raw input data as its input, the hidden layers take the output from the previous layer as their input, and the output(s) from the output layer are seen as the results of the ANN. Hidden layers are titled hidden because one does not see or know their inputs or outputs. They are often characterised as feature detectors [Shachmurove, 2002]. Generally, it is accepted to have n input perceptrons for n inputs, but there is no recognised optimum number of hidden perceptrons or layers, and the number of output perceptrons depends on the number of outputs necessary [Kartalopoulos, 1996, Hastie et al., 2001].

No indication is provided as to the optimal number of nodes per layer. There is no formal method to determine this optimal number; typically one uses trial and error. [Kartalopoulos, 1996]

Different learning algorithms have been applied to MLPs, but the most common are the Delta Rule and Back Propagation [Kartalopoulos, 1996].

Delta Learning Algorithm

The Delta learning algorithm uses the least-square-error (LSE) minimisation method to determine the weights of inputs. The idea is to adjust the weight w_{ij}, where i is the ith input to the jth neuron, so that the error between the neuron's target and actual outputs is minimised [Kartalopoulos, 1996]. This is used in the back propagation algorithm. The LSE is calculated by:

E = 0.5 \sum_i (t_i - x_i)^2    (2.3)

where t_i is the target output at neuron i and x_i is the actual output at neuron i. The resulting adjustment to weight ij is:

\Delta w_{ij} = n\,(t_j - x_j)\,x_i    (2.4)

where w_{ij} is the weight for input i on neuron j, n is the learning rate 5, t_j is the target output for j, and the x variables are the outputs from j and i.

4 When perceptrons can only communicate with the next layer the topology is termed feedforward. If there are communication loops the network is titled recurrent.
5 Learning rate: a small positive scalar. This scalar may decrease at every iteration as learning progresses, or it may be a constant fixed value throughout. If the learning rate can decrease, the rate depends on the speed of convergence to the optimum solution and on the termination procedures. If the learning rate is set too high then the algorithm may overstep the optimum weights. [Kolarik and Rudorfer, 1994] has a good investigation into this, clearly showing evidence that a too large learning rate should be avoided.

Weight w_{ij} is then updated using:

w_{ij}(k + 1) = w_{ij}(k) + n\,(t_j - x_j)\,x_i    (2.5)

where k is an iteration count.

Back Propagation

The backpropagation learning procedure has become the single most popular method to train networks. [Lane and Neidinger, 1995]

Back-propagation (BP) was discovered in 1974 by Paul Werbos and has since been extensively used for teaching MLPs. The method steps back through the layers, adjusting the weights at each neuron until it gets back to the input layer. This is because the only outputs that can be compared are those from the output layer; we do not independently know what the output from a hidden neuron should be, and thus cannot compare its actual output to a target. The algorithm starts by calculating the error at the output layer compared to the target. From this, the algorithm computes the rate at which the error varies with the activation level of the output neuron. The algorithm next steps back one layer and re-calculates the weights of the output layer so that the error is minimised, limited by the learning rate. This process is then repeated, stepping back a layer at a time until the input layer is reached. The whole process is continued until the weights no longer change or a sufficiently small error is achieved; at this point the ANN moves on to the next pair of input-target patterns and repeats the process [Kartalopoulos, 1996]. BP has faced some criticism, especially concerning its realism as a biological process, as biological neurons do not seem to work backward to adjust the efficacy of their synaptic weights. The algorithm requires a large number of calculations, making training slow. One must also be careful when choosing the learning rate, because if it is set too high then the connection weights can get stuck in local minima, meaning the predictions are inaccurate. However, there is a reason for its popularity. The extensive research involving it has shown BP to be the most suitable learning algorithm for the financial domain [Quah and Srinivasan, 1999]. This also means it has the most powerful optimisation mechanisms and processes. On a wider scale it is recognised as a good algorithm for generalisation.
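To make the backward pass concrete, the following sketch trains a small fully connected network with one hidden layer by gradient descent on the squared error of equation (2.3), using the sigmoid derivative property of equation (2.2). The 2-4-1 topology, learning rate, epoch count and XOR training task are arbitrary illustrative choices and are not the configuration used in this investigation.

import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input patterns
T = np.array([[0], [1], [1], [0]], dtype=float)               # target outputs (XOR)

W1 = rng.uniform(-0.5, 0.5, size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.uniform(-0.5, 0.5, size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)
lr = 0.5                                   # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(20000):
    # Forward pass: weighted sums followed by the sigmoid transformation function.
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Error gradient at the output layer; equation (2.3) gives E = 0.5 * sum of squared differences.
    delta_out = (Y - T) * Y * (1 - Y)               # uses sig' = sig(1 - sig), equation (2.2)

    # Step back one layer, distributing the output error through the hidden-to-output weights.
    delta_hidden = (delta_out @ W2.T) * H * (1 - H)

    # Weight changes proportional to the learning rate and the propagated errors.
    W2 -= lr * H.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_hidden
    b1 -= lr * delta_hidden.sum(axis=0)

Y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(Y.round(2))   # typically close to [[0], [1], [1], [0]], depending on the random start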

Unsupervised Learning

Unsupervised learning does not require a teacher. The network receives inputs and tries to create categories for the input pattern. If it cannot match the input pattern with a stored category it creates a new pattern to be stored. Even though this form of network does not require a teacher, it does need guidelines as to how to form groups. This form of ANN is commonly used for classification problems because by its very nature it is calculating classifications. Unsupervised ANNs are sometimes applied to a dataset as a pre-processor for a supervised learning ANN, to speed up the latter's learning phase [Hui, 2000].

Kohonen ANN

This is the most common form of neural network which implements unsupervised learning. To prevent confusion, Kohonen is the type of network 6, not the learning algorithm; however, the architecture is built with the learning strategy in mind. To form a self-organised map, first the input data is coded into features, forming an input space of N-dimensional vectors. Input items are then randomly drawn from the input distribution and presented to the network one at a time. The mapping from the external input patterns to the network's activity patterns is accomplished in two phases. First, similarity matches are identified. During this phase the network is trying to couple the input pattern with an internally formed pattern. At this point the weights are adapted to realise the fit or highlight incompatibility [Refenes, 1995]. For financial forecasting using ANNs, unsupervised networks have in general been found inappropriate. Therefore, they will not play a part in further investigations. However, additional information on unsupervised and Kohonen ANNs can be found in [Hastie et al., 2001]. The conclusion drawn from this section is that the choice of ANN design is dependent on the type of learning algorithm chosen, which is dependent on the inputs and outputs. In section 2.3 we discuss the ANN architectures, topologies and learning algorithms that have previously been used in the domain of stock market forecasting. Following this section a refined judgement can be made as to what is most suitable for this project. However, initially it is believed that a supervised learning algorithm, probably back-propagation, would be most appropriate. This is because the actual stock price could be used as the target output, and back-propagation has been highly implemented for financial applications, the domain of interest.

2.1.5 Pre-Processing

Pre-processing is a vital part of the system. There are three categories of general pre-processing that apply to all time-series input data. However, in section 2.3.1, where we investigate pre-processing for financial ANNs, a number of other pre-processing steps and operations are uncovered that must also be considered for financial time series data. One may have an excellent data and ANN combination, but without the correct data pre-processing results may prove inaccurate. It is stated in [Hui, 2000] that one of ANNs' drawbacks is their sensitivity to the format of the input data. Therefore, the pre-processing phase of the implementation must be given due care and attention, to ensure that the results give a true representation of how effectively the ANN can process the real trends and correlations in the underlying data.

Neural networks plugged to relatively unprocessed data usually fail. [Zemke, 2002]

6 Kohonen ANNs have a single layer of neurons, which makes them distinctly recognisable from MLPs.

Neural networks are data-dependent, so the algorithms are only as good as the data shown to them. [Shachmurove, 2002]

In this section we will identify the major pre-processing computations which have proved beneficial. Within section 2.3 this information will be discussed more finely and its merits assessed for financial and investing applications. There are a number of essential pre-processing steps which must be considered for the ANN to be effective.

Visual Inspection

Visual inspection is invaluable [Bodis, 2004, Zemke, 2002]. At this point one is looking for trends, missing values, outliers and any irregularities, as this helps decide which pre-processing steps, including detrending and normalisation, are required. Outliers within stock market data may include stock splits, where the price of the stock immediately reduces by a factor depending on the split ratio. If this is not considered then the split will appear as a crash in the stock price, whereas it is actually just a realignment and should not have any meaningful bearing on the profitability of the stock.

Detrending

This is the process of removing seasonal and/or general trends from the data. Strong trends in the variables can lead to correlations and regressions, because the network learns general features more easily than the actual relationships between variables. This problem is particularly relevant to financial time-series, which are dominated by trends. The growth of time series is removed using detrending. A simple way to accomplish this is to take relative values rather than absolute:

y = x_t - x_{t-1}    (2.6)

which removes the linear trend. A similar computation can be used to remove seasonality,

y = x_t - x_{t-s}    (2.7)

where s is the time interval that shows similar patterns. An alternative way is to use the relative values as a percentage difference:

y = \frac{x_t - x_{t-1}}{x_{t-1}}    (2.8)

Finally, taking the logarithms of these values is commonly used to smooth out trends [Bodis, 2004].
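The detrending transformations of equations (2.6) to (2.8), together with the logarithmic variant, can be expressed in a few lines; the sketch below applies them to a short made-up series. The series values and the seasonal interval s = 4 are invented for illustration only.

import numpy as np

# Made-up quarterly series with a trend and a seasonal pattern, purely for illustration.
x = np.array([10.0, 12.0, 11.0, 14.0, 13.0, 15.0, 14.0, 17.0, 16.0, 18.0])
s = 4                                            # seasonal interval, e.g. quarterly figures

first_difference = x[1:] - x[:-1]                # equation (2.6): removes the linear trend
seasonal_difference = x[s:] - x[:-s]             # equation (2.7): removes the seasonal pattern
percentage_change = (x[1:] - x[:-1]) / x[:-1]    # equation (2.8): relative differences
log_difference = np.diff(np.log(x))              # logarithms further smooth out the trend

print(first_difference)
print(seasonal_difference)
print(percentage_change.round(3))
print(log_difference.round(3))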

Of all the ANN implementations investigated, only [Bodis, 2004] actively tackled the process of detrending, while [Hui, 2000, Dorsey and Sexton, 2000, Kolarik and Rudorfer, 1994, Lina et al., 2002] do not make it clear whether detrending is incorporated or not. [Vanstone et al., 2004] introduce a counter opinion. They examine two research papers which conclude that using raw input data achieved significantly better results, stating:

The use of raw data is preferred to differences, to avoid destruction of fragile structure inherent only in the original time series. [Vanstone et al., 2004]

This process could have a strong bearing on the final results. Detrending may be essential for some inputs because fundamental data like earnings usually follow a seasonal pattern. However, the price of a stock does not move with these seasonal changes, as they are factored in throughout the year or cycle. As a result of the discussion in [Vanstone et al., 2004] it is worth running tests where detrending is and isn't incorporated and comparing the results. This topic does require further investigation.

Normalisation

In a highly volatile market ANNs can get disturbed by the dramatic fluctuations in the time-series. Normalisation is a way of preventing this by standardising the possible numerical range that the variables can take. The main reason for this is to translate the data to within the normal operating range of the transformation function. Otherwise it is possible to have transformation functions that tend towards zero, which can bring training to a virtual standstill (known as network paralysis). Variables are usually normalised in order to have a zero mean and unit standard deviation. A common equation for this is:

X_i^{scl}(t) = \frac{X_i(t) - \bar{X}_i}{\sigma_{x_i}}    (2.9)

where X_i^{scl}(t) is the scaled variable, X_i(t) the original variable, \sigma_{x_i} the standard deviation of X_i(t), and \bar{X}_i the mean value of X_i(t). [Bodis, 2004] In [Hui, 2000] the data is scaled to between [0, 1], and [Trippi, 1996, Kartalopoulos, 1996, Refenes, 1995] suggest this is the norm. Normalisation is a good method to reduce noise, but there are others, including neighbourhood averaging. This is substituting for the average of the surrounding area, thus smoothing the time-series. An alternative way to reduce noise is by increasing patterns through instance multiplication. This is where instances are cloned with some features supplemented with a form of artificial noise; methods include Gaussian noise with zero mean and a deviation between the noise level already present in the feature/series and the deviation of that series. This can be useful when there are limited instances for an interesting type. Such data forces the ANN to look for important characteristics while ignoring the noise. Also, by relatively increasing the number of interesting cases, training will pay more attention to their recognition [Zemke, 2002]. [Zemke, 2002] states that normalisation should precede any feature selection, as non-normalised series may confuse the process. Unlike detrending, this is a pre-processing step which is not debated, even if only in its simplest scaling form. The necessity to reduce noise is dependent on the stocks chosen. Quiet, little-known stocks usually contain minimal noise in comparison to trends and real patterns of movement. This is because they tend to have lower trading volume and trades are based on fundamental analysis rather than technical analysis. In other words, the trading is because of underlying changes in the company rather than because of market sentiment. Popular stocks contain a large amount of noise because people trade on market psychology, as there is a lot of sub-conscious psychology surrounding stock selection and trading. As such, it would be interesting to sample a cross-section of large and small companies, based on their market capitalisation.
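A brief sketch of the scaling operations described above, applied to a made-up series: the zero-mean, unit-standard-deviation scaling of equation (2.9), the [0, 1] scaling used in [Hui, 2000], and a simple neighbourhood average. The series values and the window length are illustrative only.

import numpy as np

x = np.array([10.0, 12.0, 11.0, 14.0, 13.0, 15.0, 14.0, 17.0])   # made-up input series

z_scored = (x - x.mean()) / x.std()              # equation (2.9): zero mean, unit std deviation
min_max = (x - x.min()) / (x.max() - x.min())    # scaling to [0, 1], as in [Hui, 2000]

# Neighbourhood averaging: replace each point with the mean of a small surrounding window,
# smoothing the series. A window of three points is an arbitrary illustrative choice.
smoothed = np.convolve(x, np.ones(3) / 3, mode="valid")

print(z_scored.round(3))
print(min_max.round(3))
print(smoothed.round(3))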

2.1.6 Topology

ANN topology is the configuration of the network architecture in terms of hidden layers, neurons and connectivity. It is an important consideration as it can have a considerable bearing on the ANN's computational ability and thus its effectiveness. The optimum network topology is largely paradigm-specific. There are a number of optimisation techniques that can be implemented in order to converge on the best suited topology for the chosen application. Although [Bodis, 2004, Kolarik and Rudorfer, 1994, Shachmurove, 2002] explain there is no set rule for the optimum topology, it is a very important factor to consider. Too many neurons or hidden layers can cause the network to overfit the target output. This affects the network's computational abilities, as rather than approximating to the trends and patterns, the random noise is also fitted [Dorsey and Sexton, 2000]. This results in very negative effects when the network is applied to out-of-sample data, where the noise usually follows a different random pattern. If there are too few neurons and hidden layers the network will approximate too loosely and move to the other end of the scale, where, during learning, the network fails to pick up the correlations and underlying trends. [Lane and Neidinger, 1995] state that a commonly used architecture incorporates three layers (one hidden layer), with full interconnectivity 7. This idea is backed up by [Abraham, 2003, Kolarik and Rudorfer, 1994]. [Dorsey and Sexton, 2000] also employed only one hidden layer; however, after optimisation the number of connections was dramatically reduced from the full interconnectivity initially used. The most common form of action taken to optimise network topologies is a disciplined trial-and-error strategy [Lane and Neidinger, 1995], which is outlined in Constructive and Pruning techniques below. The drawback of these methods is that they involve a slow, laborious process [Lane and Neidinger, 1995]. [Crone, 2005] argues that this drawback is partially the reason why ANNs are not well established in business practice. As topology optimisation is a slow process it is often not sufficiently completed, resulting in inconsistent results and pessimistic reports on ANN reliability. This in turn means there is insufficient confidence to apply them to a commercial purpose. In conclusion, topology optimisation is an important consideration for this study and the accuracy of the results may hinge on it. As this has been identified as an important consideration, we will look at the categories of optimisation in a little more detail.

Analytic Estimation

These are techniques in which algebraic or statistical analysis is used to find the number of hidden neurons to implement. This is based on a study of the input vector space in terms of size and dimensionality. This method is unlikely to produce the optimum topology, but in the absence of a strong model analytical estimation provides a reasonable starting point. Later in this section a suitable starting topology is derived, thus this technique will not be further examined. However, [Refenes, 1995] looks deeper into methods of analytic estimation.

7 Connections are the synapses joining two neurons together. If a network is fully connected it means that every neuron is connected to every neuron in the next layer. This means each neuron sends its output to every neuron in the following layer.

Constructive Techniques

This is a process of trial and error addition. It begins with one hidden layer with one hidden neuron. Then another hidden neuron is added and the results compared. The idea is to slowly add neurons and hidden layers until performance starts to diverge from the peak. By showing that at least one neuron in each layer makes fewer mistakes than a corresponding unit in the earlier layer, we are guaranteed to converge on zero error eventually. Cascade correlation, the tiling algorithm, the neural decision tree, and the CLS procedure are all based on this technique [Refenes, 1995]. The drawbacks with this are that it is a very time consuming process and can create computationally intensive ANNs [Refenes, 1995].

Pruning Techniques

This works in the opposite fashion to the constructive techniques. Beginning with too many hidden layers and neurons, they are progressively removed until divergence from peak performance, each time removing redundant or least sensitive connections. This classification includes the artificial selection technique [Refenes, 1995]. Artificial selection is a two stage process where each hidden neuron's effectiveness is recorded. After convergence has been achieved the procedure then goes back through the neuron effectiveness record, stripping out the neurons which still show little worth. Pruning can improve performance by reducing overfitting and improve speed by reducing the number of unnecessary computations [Zemke, 2002]. [Crone, 2005] applies an alternative heuristic for the problem of topology optimisation. Before testing, a predefined set of architectures is catalogued for use. Then, after all of the architectures have been trialled, the best one is chosen, based on the validation errors and correlation information from the tests. The performance of an ANN usually conforms to a converging trend as the topology is adjusted towards the optimum. This approach appears to include a number of unnecessary trials, as one could dynamically adjust the topology based on the emerging trend and thus find optimum performance more quickly. Therefore, [Crone, 2005]'s strategy will not be employed. [Vanstone et al., 2004] implement the constructive strategy, using 14 input nodes and progressively building the hidden layer. However, the final architecture is not discussed, and a number of the other papers did not show how they settled on their final topology. It appears they start with an intuitive guess as to the optimum topology and then add or remove connections and nodes based on a comparison of performance and error against the previous topology. This appears to be a good way to converge on the best configuration more rapidly, rather than starting at either the minimal or maximal topology as suggested by pruning and constructive techniques in their purest forms. The optimum topologies used in similar scenarios and applications will be investigated, taking the average of these as the initial topology and then, via informed trial and error, using pruning and constructive techniques as necessary. [Dorsey and Sexton, 2000] investigated using a genetic algorithm to dynamically optimise the topology used for forecasting financial time series. The genetic algorithm minimises the number of redundant connections, which is a pruning technique. Starting with 61 connections based on ten input nodes, one hidden layer containing 5 hidden neurons and a single output neuron, the number of connections was dramatically reduced with no substantive loss of predictive capability. This is an interesting result because [Vanstone et al., 2004], which closely corresponds to my particular interest, uses 14 inputs, a similar number to [Dorsey and Sexton, 2000]. [Shachmurove, 2002] states that the hidden layers commonly include at least 75% of the number of input nodes. However, [Crone, 2005, Kolarik and Rudorfer, 1994, Bodis, 2004, Rudorfer, 1995] all conclude with topologies which follow the pattern N-N-1, where N is the number of inputs. This is particularly interesting because [Crone, 2005] uses 14 inputs, which is the same as [Vanstone et al., 2004] and close to [Dorsey and Sexton, 2000], suggesting this is a good number of inputs to start with.
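As a minimal sketch of the pruning idea, the code below builds a fully interconnected N-N-1 weight structure and removes the smallest-magnitude connections. It illustrates only the general principle of magnitude-based pruning, not any of the specific pruning algorithms referenced above; the input count and pruning fraction are arbitrary example values.

import numpy as np

rng = np.random.default_rng(1)

N = 14                                  # illustrative number of inputs, not a project setting
W_hidden = rng.normal(size=(N, N))      # fully interconnected N-N-1 starting topology
W_output = rng.normal(size=(N, 1))

def prune_smallest(weights, fraction):
    # Zero out (at least) the given fraction of connections with the smallest magnitudes.
    flat = np.abs(weights).ravel()
    k = int(len(flat) * fraction)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

W_hidden_pruned = prune_smallest(W_hidden, fraction=0.5)
W_output_pruned = prune_smallest(W_output, fraction=0.5)
print("hidden connections kept:", np.count_nonzero(W_hidden_pruned), "of", W_hidden.size)
print("output connections kept:", np.count_nonzero(W_output_pruned), "of", W_output.size)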

As a result of these findings, it is felt that a good strategy for developing the topology would be to initially have between 5 and 14 inputs 8, with one hidden layer and an architecture of the form N-N-1, which is fully interconnected. Then, as a result of [Dorsey and Sexton, 2000]'s findings, prune the architecture to remove redundant connections and possibly redundant inputs.

2.2 Investment and Financial Prediction

In this section traditional stock market theories and views are detailed, traditional stock market prediction methods are highlighted, and their strengths and limitations analysed. This is concluded with a discussion of the forms of data which have proved influential or highly used in the study of stock price trends and patterns. This will include a description of the data, why it is used and how it may be beneficial or counter-productive in my implementation. There are three somewhat contradictory chains of thought on capital market predictions. All three have been developed from Eugene Fama's original work on The Efficient Market Hypothesis (EMH).

The efficient markets model continues to provide a framework that is widely used by financial economists. [Dimson and Mussavian, March 1998]

2.2.1 The Efficient Market Hypothesis (EMH)

The Efficient Market Hypothesis is the most hotly debated theory in capital markets. It is readily viewed in three forms of dilution: strong, semi-strong and weak. The strong form asserts that markets follow a random walk which cannot be predicted from past prices [Swingler, 1996]. The strong form of EMH holds that a stock price is the evaluation of all data available regarding the company, both publicly and privately, and that as soon as new data is known or published the stock price immediately factors it in [Dimson and Mussavian, March 1998]. This implies it is impossible to gain returns above the market average for a sustained period of time. The semi-strong form says the asset's price reflects all publicly available information [Dimson and Mussavian, March 1998]. This implies that if you know something that the market doesn't you can prosper; however, if this is insider information the purchase or sale of stocks resulting from this knowledge is illegal. As both fundamental and technical analyses rely on publicly available information, this hypothesis implies that neither will help achieve above market average returns. Finally, the weak form of EMH claims that prices fully reflect the information implicit in the sequence of past prices [Dimson and Mussavian, March 1998]. Therefore, technical analysis is deemed pointless because it cannot tell you anything that is not already described in the price. The Efficient Market Hypotheses have faced some strong criticism, particularly the semi-strong and strong forms. This is because they imply that it is impossible to

8 5 was chosen for the lower bound because it is the final number of inputs used in [Dorsey and Sexton, 2000] and is also the lowest number of inputs discussed in [Kolarik and Rudorfer, 1994].

make above market average returns on capital market investments over a sustained period. However, famous investors including Benjamin Graham, Warren Buffett and Peter Lynch have outperformed the market average for 20 years or more. [Dreman, 1998] is particularly critical of these capital market beliefs, concluding that because investors are human they are open to psychological biases, which means that if you can understand these biases you can consistently beat the market. [Swingler, 1996] introduces an interesting view, stating that if asset prices do not follow a random walk, but rather a chaotic one, then anybody who is able to model the price structure and make valid predictions using this model will have access to information which is not publicly available. This means that an individual can outperform the market average over long periods of time, because in effect they know more than the market. [Lakonishok et al., 1994] discusses a number of papers and articles that support this argument. The paper draws on research which concludes that value strategies out-perform the market. Value strategies look at ratios that suggest a company is undervalued. These ratios and indicators compare price to:

- Earnings
- Dividends
- Cash flows
- Operating income
- Book value

[Lakonishok et al., 1994] also looks at:

- Growth rate of sales

All of these ratios fall under the fundamental analysis umbrella and have been shown to have a correlation with the future stock price. The experiments in [Lakonishok et al., 1994] measure the trends for one, two and five year periods. This correlation is also backed up by [Dreman, 1998], where similar results are concluded and the research goes even further than [Lakonishok et al., 1994] by looking at periods up to and in excess of twenty years. Therefore, these data have the potential to be used as inputs for the prediction system and should be included in initial tests. According to the efficient market models, price-earnings ratios (commonly abbreviated to p/e) and dividend-price ratios (commonly called dividend yield) should be useful in forecasting future dividend growth, future earnings growth and future productivity growth. However, [Campbell and Shiller, 2001] found that they are more accurate at forecasting stock price changes, which is contradictory to the strong and semi-strong forms of EMH. The study found that when the dividend-price ratio dropped below the historical average the stock market value followed suit and dropped until the dividend-price ratio once again moved above the historical average. However, [Campbell and Shiller, 2001] also conclude that this correlation is not as strong as for the price-earnings ratio. The price-earnings ratios show an opposite correlation to the dividend-price ratios. As the price-earnings ratio rises above the historical average, the probability of a market drop intensifies. [Campbell and Shiller, 2001] look at a ten year smoothed p/e ratio and its effect in a ten year forecast window. This forecasting is too long-term for my application, as finding sufficient input data to allow for a ten year forecast will be difficult. Also, forecasting ten years ahead is generally very approximate, because a lot can happen to a business in that time frame, and

it would therefore be too risky to employ. However, the correlation between the price-earnings ratio and stock prices over a shorter period of time is fairly well known, as highlighted in [Dreman, 1998, Rey, 2004, Graham and Zweig, 2003]. The findings from [Campbell and Shiller, 2001] are largely backed up in [Rey, 2004], where it is stated that the ability to forecast future stock prices is considerably stronger when fundamental analysis is involved. They include variables such as short-term interest rates, yield spreads between long-term and short-term interest rates, spreads between low- and high-quality bond yields, book-to-market ratios (also known as price-to-book value), dividend-payout (including dividend-price) and price-earnings ratios. This paper mainly examines dividend-price ratios, continuing the work of Campbell and Shiller. In particular it cements [Campbell and Shiller, 2001]'s findings on the correlation between the dividend-price ratio and the future stock price. An examination of fundamental analysis would not be complete without a discussion regarding Benjamin Graham, who is widely recognised as one of the earliest practitioners of fundamental analysis and whose legacy of value investing still remains prominent. Benjamin Graham was the original founder of value investing principles, back in the 1920s to 1960s. He published a number of books concerning his strategies and findings. Graham used fundamental analysis studies to make investing decisions and stock market predictions. He created a number of rigid investing rules based on fundamental analysis. [Graham and Zweig, 2003] examine Graham's rules and conclusions in the light of the modern economy, and Jason Zweig finds they predominantly still hold. The fundamental data that are considered are current ratios, earnings growth, price-earnings ratio, dividend record and price-to-book value. Only past data is used, so forward forecasting ratios are ignored; in some instances diluted 9 figures are used, for instance for price-earnings ratios. Pessimistic figures and moving averages are also considered. [Quah and Srinivasan, 1999] recommends using growth ratios including return on equity (ROE), which has not been mentioned in [Graham and Zweig, 2003, Campbell and Shiller, 2001, Rey, 2004, Dreman, 1998, Swingler, 1996]. Return on equity indicates how well the company is using the capital it receives from investors. It therefore has a bearing on the stock price because it indicates the returns one will receive from an investment. These findings show there is a strong correlation between stock price predictions and particular forms of fundamental data. As a result, there is a good basis for using fundamental analysis with an ANN for stock predictions. Particular ratios appear time and time again as good indicators of future prices. These are price-earnings, dividend-price and price-to-book, which may be highly influential inputs for the ANN. Before an ANN prediction application can be designed, one must understand why particular types of inputs are used. They must be related to classical prediction theories and methodologies, so that an understanding can be developed as to how they can have a bearing. Investment appraisal is classically split into two major disciplines: Technical Analysis and Fundamental Analysis.

9 Diluted means any one-time benefits or costs are discounted or removed. This is so that you get a fairer indication of repeatable values, as these one-time events are not likely to happen in the future and thus may give an unrealistic forecasting viewpoint.

2.2.2 Technical Analysis
Practitioners of technical analysis believe the market is 10% logical and 90% psychological. They believe that stock prices are largely influenced by psychological factors, meaning that if one can understand the psychology the market can be beaten. Technical analysis looks solely at trading information and price movements, basing predictions on trading and price patterns and disregarding information about the company whose ownership the stock represents. Technical analysis is predominantly used for short-term trading. Variances in stock prices due to psychological factors generally dissipate over time, resulting in technical analysis being a poor indicator of long-term price movements. As this investigation concerns incorporating fundamental analysis with an ANN, technical analysis is outside the scope of the research. (The reader is recommended to look at [Elder, May 2002], which is a good book on using technical analysis for trading.)

2.2.3 Fundamental Analysis
Practitioners of fundamental analysis recognise that ownership of a stock means ownership of a proportion of the company. Fundamental data includes company earnings figures, dividend yields and pension deficits, to name a few. Fundamental data also encompasses macro-economic data like unemployment figures and the national interest rate. This is because macro-economic policies and information can affect a company and its results, and thus its stock price. The philosophy behind fundamental analysis is that the market is 90% logical and 10% psychological. Equity research analysts base their stock price predictions largely on fundamental analysis. Fundamental analysis is used in long-term stock predictions, as the long-term performance of a stock is dependent upon the company's performance, because that determines its worth. Since owning a stock means owning a piece of a financial asset (the company), if the company's valuation increases so does the value of the share, that is, the stock price.

Fundamental Data Types
In discussing the merits of the EMH theories a number of fundamental data types are detailed and their potential for predicting the price movements of a stock is highlighted. However, the set of fundamental analysis data types is far larger than this. Here we will briefly describe a number that have not been detailed earlier in section 2.2 but can play a role in stock price movements.

Operating Margin
This describes operating profit as a proportion of sales. The larger the margin the more attractive the firm is, as it means the firm can remain competitive in difficult trading periods. This means its profits are less affected by the industry cycle.

Earnings per share (EPS) Growth
This highlights how earnings are growing in relation to the share capital employed. EPS is used in the price-to-earnings calculation, thus it is an important ratio. This ratio is often viewed over a number of years to see what the growth pattern is.

Price to earnings growth (PEG)
PEG is an indicator used to determine whether a company is undervalued looking to the future. A PEG below one is an indication of good value, and greater than one can be viewed as the opposite.

Debt Ratios
The gearing ratio and debt ratio are used to determine what proportion of the firm's capital or assets is covered by debt, in the form of loans, bonds, etc. The higher the debt the more volatile the stock price can be, because the shareholders' returns are magnified.

National Interest Rate
This is particularly influential for companies with high debt ratios, as a change in the interest rate will change their repayments, thus altering their net profits. The national interest rate can also cause the whole market to move, thus every company in the market is usually affected.

National Production Levels
This can affect the national interest rate and have a strong bearing on manufacturing sectors. It is felt that this indicator may only be useful for making forecasts on manufacturing companies.

National Inflation Figures
The national inflation figures have a significant bearing on the national interest rate. The retail sector in particular is affected by this, but similarly to the national interest rate the whole market tends to move when inflation figures are released. The inflation rate affects firms' sales and costs and therefore can be significant in determining stock prices.

National Consumer Confidence Index
The national consumer confidence rating is most significant to the retail and housing sectors. If consumer confidence is low then these sectors tend to be worst affected, and vice versa when the trend reverses. Similarly to the national production level, this indicator could be influential for the sectors just mentioned.

2.3 Investment and Financial Prediction using Artificial Neural Networks
Due to the high degree of noise in financial data, traditional computing's rigid conditions cannot be met, but ANNs have been shown to handle noise well [Lakonishok et al., 1994]. ANNs are good tools for predictions in the investment and financial domain because the time-series data are usually dynamic in nature, so it is necessary to have non-linear tools in order to discern the relationships. ANNs have proved the best at discovering these types of relationships in comparison to traditional methods, which is backed up in [Shachmurove, 2002] and [Bodis, 2004], as they discuss the performance of a number of ANNs developed in studies. It is also worth noting that financial time-series data often contain missing data, or data that should be treated as such. ANNs perform well with this in comparison with traditional regression analysis [Shachmurove, 2002]. However, [Numerical Algorithms Group, 2002], discussed below, introduces ways of tackling missing and inconvenient data, further improving ANN performance.

ANNs are well researched and implemented for financial forecasting applications. The vast majority of applications apply ANNs to technical analysis data with varying degrees of success. [Hui, 2000, Dorsey and Sexton, 2000, Liu and Leung, 2001, Abraham, 2003, Bodis, 2004] are all papers which include discussions or details regarding ANNs that have been applied to technical analysis data.

2.3.1 Pre-processing for Financial Data
The accuracy of publicly available financial data is often a topic of discussion. Due to the sheer volume of financial data created and provided, the accuracy is not always dependable. Historical databases stretch back as far as the 1960s and even earlier, when data storage was a far harder task, thus it is far from unusual to find missing or anomalous data. Therefore, these hindrances must be considered and overcome if accurate research and prediction models are to be devised. These problems fall under four categories:
- Missing values
- Impossible values
- Inconsistent values
- Unlikely values

[Numerical Algorithms Group, 2002] proposes strategies for dealing with these. For impossible values the problem and reason are usually obvious, for example a negative price when a positive one is expected, therefore the correct value can easily be substituted. On occasions when the correct value is not straightforward to determine, the value should be re-classified as missing. An inconsistent value might be a p/e ratio that disagrees with the price and earnings figures. In such cases a decision must be made as to which figures are most likely to be correct, and these should be taken as the correct values. If this is not easily solved, the methods discussed for unlikely and missing values may prove helpful for this category also. Unlikely values are those which may be correct but logically seem unlikely. An example may be a popular stock which triples in price in one trading session. A greater than single-digit percentage increase is usually deemed high, so a 300% rise appears unrealistic. In situations like this the value should be treated as missing. Alternatively, an investigation could be initiated. When dealing with large quantities of data, however, time constraints may make this unfeasible. In such cases an automatic approach for dealing with missing data is necessary. Two approaches discussed include:

Select a donor: This means selecting an alternative value from the sample. This could be a random selection or a deliberate choice, perhaps the next value in the series, because this is more likely to be a close match than the preceding figure. Sudden, almost random changes can occur that cannot be forecast from the preceding figure, but would be incorporated in the succeeding value.

Predict the value: This method requires a model for computing the value. This should be feasible, as within the time-series there are usually only a small number of missing figures. However, it may fall down over predominantly empty series, as the model has nothing to use to generate the values.

It is worth noting the potential pitfalls associated with automatic value determination. A poor choice of method or model may distort results, even producing misleading patterns in the series. All the strategies described are based on analytical methods; however, visualisation can be very powerful. This is especially true for highlighting bad values.
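To make the two automatic strategies concrete, the following Python sketch (not part of the original study) fills gaps in a fundamental data series using the "select a donor" approach first, followed by a simple predictive fill for anything that remains. The use of the pandas library, and linear interpolation as a stand-in for a prediction model, are assumptions for illustration only.

    import pandas as pd

    def clean_series(s: pd.Series) -> pd.Series:
        # Treat impossible values (e.g. a negative figure where a positive
        # one is expected) as missing, as recommended above.
        s = s.where(s > 0)
        # "Select a donor": take the next value in the series, since it is
        # more likely to be a close match than the preceding figure.
        s = s.bfill()
        # "Predict the value": a simple model (linear interpolation here)
        # for any values that are still missing.
        return s.interpolate(limit_direction="both")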

As a result of this paper, data from within the last twenty years should be used as the input set. This is because since computers, and in particular the internet, became commonplace, financial information has become easier to gather and store. Therefore, one can deduce that financial data sourced from this era is likely to be more copious and accurate, as it is presumed this data has been automatically gathered and stored, as opposed to the manual entry of earlier periods. As a result, the issue of missing or inaccurate data should be less severe and consequently less damaging to the predictions.

2.3.2 Input Data
As ANN performance is largely dependent on the relationship with the input data, it is important to investigate cases where an artificial neural network has been applied to fundamental analysis data. As stated in the introduction, the combination of ANNs with fundamental analysis data for stock price prediction is a relatively unexplored paradigm. However, this pairing has been used quite extensively for similar financial applications.

One study that was found did use fundamental analysis with ANNs for stock prediction purposes. [Vanstone et al., 2004] uses 14 inputs that are predominantly fundamental data and merges them with the five common technical data fields. Back-propagation is the learning technique of choice and the data is sampled at two-month intervals. The system is designed to classify a stock as either a winner or a loser. Each stock is given a point score between 0 and 100; stocks scoring over 50 are labelled winners. To measure the effectiveness of the system it is used to build a portfolio and trade over a two-year period. The conclusion suggests that the system outperforms standard methods of portfolio selection. However, the results were found to be limited and difficult to interpret. This study does not conclusively identify whether fundamental analysis data can be effectively applied to neural networks for stock prediction, because of the inclusion of technical analysis data in the input set. The positive results could be due to the technical analysis data, with the weights for connections to fundamental analysis nodes being insignificant, thus virtually or completely eliminating them from the modelling.

A similar study to [Vanstone et al., 2004] is documented in [Refenes, 1995]. Once again fundamental analysis data is applied to a neural network to make stock picks. This is an interesting study because it employs feature selection, described in section , and documents the steps taken in good detail. More significantly, however, this research veers away from the popular back-propagation algorithm in favour of the Boltzmann machine (BM). The Boltzmann machine is also a type of supervised system; however, it uses a randomised optimisation technique called simulated annealing. In essence this approach allows the BM to develop the relationship between inputs and outcomes and to increase particular input weights for certain outcomes. The results show that the BM has potential for stock prediction, as the results are favourable. However, the notes that follow show that fundamental data has been effectively applied to back-propagation for a number of financial prediction systems. Therefore, as there is more research surrounding the coupling of fundamental analysis with BP, rather than with BM, more information is available to use to optimise the investigation. An interesting continuation of our work would be to switch the BP for a BM and compare the results.
[Rudorfer, 1995] uses an ANN for early bankruptcy detection. An MLP with back-propagation is the ANN of choice. The following five financial ratios were used as the inputs:

- Cash flow / liabilities
- Quick (current) assets / current liabilities
- Quick (current) assets / total assets
- Liabilities / total assets
- Profit or loss / total assets

The output is 1 for insolvent companies and 0 for those that are solvent. The learning algorithm is given a constant learning rate of 0.3 and 5000 learning iterations are executed. The ANN uses a hard-limited transformation function with threshold 0.5. This is necessary to classify the output as 1 or 0. [Rudorfer, 1995] concludes that a company seems to be in danger when the liabilities / total assets or quick assets / total assets ratios have high positive values, and that sound companies have small liabilities / total assets and quick assets / total assets ratios.

As a result of these conclusions, these two ratios could be influential in the stock price prediction ANN. This is because if a company's solvency looks questionable its stock price usually takes a severe hit, but if a company's financials look very healthy its stock can sell at a premium. Whether there is a strong enough correlation between these two ratios and the stock price, in comparison to other potential inputs such as price / earnings or dividend yield, will have to be considered during the implementation stage. If they do not have a strong enough correlation then they may not help the ANN in its prediction. As [Shachmurove, 2002] deduces that it is more effective to have a small number of highly correlating inputs, it may prove more effective not to include them.

The results in [Rudorfer, 1995] were very impressive, with only one incorrect classification of an insolvent company. With only three years of data for twenty companies in the test set, sceptics could speculate that the test set is too limited to be 100% conclusive and that further tests are required. However, [Quah and Srinivasan, 1999] and [Refenes, 1995] discuss a number of papers which tackle the same bankruptcy problem using ANNs, and the results in these are similarly impressive. Such positive results show that ANNs can be effectively combined with fundamental analysis for forecasting applications.
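By way of illustration, a modern re-creation of the kind of classifier described in [Rudorfer, 1995] might look like the sketch below. The use of scikit-learn, the synthetic placeholder data and the hidden-layer size are assumptions; the logistic (sigmoidal) units, constant learning rate of 0.3, 5000 learning iterations and 0.5 hard-limit threshold follow the description above.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Placeholder data: X holds the five ratios per company,
    # y is 1 for insolvent companies and 0 for solvent ones.
    X = np.random.rand(20, 5)
    y = np.array([1] * 10 + [0] * 10)

    clf = MLPClassifier(hidden_layer_sizes=(5,),       # assumed hidden layer
                        activation="logistic",         # sigmoidal units
                        solver="sgd",
                        learning_rate_init=0.3,        # constant learning rate of 0.3
                        max_iter=5000,                 # 5000 learning iterations
                        random_state=0)
    clf.fit(X, y)

    # Hard-limit the output at 0.5 to classify as insolvent (1) or solvent (0).
    prediction = (clf.predict_proba(X)[:, 1] >= 0.5).astype(int)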

2.3.3 Topology for Financial Applications
Supervised learning networks have proved popular, with back-propagation most commonly implemented and with good performance [Hui, 2000, Bodis, 2004, Rudorfer, 1995, Vanstone et al., 2004]. This is because the domain provides the price as a known target output, together with inputs which determine this price via some non-linear correlation. Feedforward networks are most often implemented, but not always in isolation. [Hui, 2000] is a demonstration of a feedforward information flow combined with a Kohonen unsupervised network, largely to good effect. The supervised multilayer perceptron part of the ANN in [Hui, 2000] has two hidden layers, with 20 and 15 neurons in the first and second layers respectively, using unipolar sigmoidal transfer functions. The use of a sigmoidal transfer function is not uncommon for this paradigm, as confirmed by [Kolarik and Rudorfer, 1994, Bodis, 2004, Hui, 2000]; however, [Crone, 2005, Kolarik and Rudorfer, 1994] all found that a single hidden layer gave the best performance. [Crone, 2005] compares 36 different MLP architectures, taking noisy time-series data as the input, so its findings are particularly interesting because the investigation of topologies was extensive.

Regarding the input layer, [Shachmurove, 2002] notes that studies have shown that using a few well-chosen variables will give significantly better results than trying to use every economic indicator as a viable predictor. This means care must be taken when choosing inputs. This will make the input investigation more time consuming: rather than simply putting all the inputs together and seeing which get weighted out, separate investigations must be executed to find redundant data types.

2.3.4 Testing
[Hui, 2000] takes into consideration only two stocks and starts with a specified quantity of capital to invest. The ANN then uses a classification system to instruct the user to buy or sell stocks in the company, depending on whether the ANN predicts the stock price will rise or fall. The result of this trading is then compared to those of a Kohonen and an MLP ANN applied to the same task on the same data. The results for each ANN are measured in Liquid Cash and Portfolio Value, which give the total amount held in cash and stock and therefore the total value of capital held. The results stated that the hybrid ANN developed in the paper was the superior performer; however, according to [Zemke, 2002], a few important factors have not been documented, which could highlight potential pitfalls in the experiments.

The first potential pitfall regards the number of correct predictions, or accuracy as it is titled in [Zemke, 2002]. The hybrid ANN could have made fewer correct predictions, but the predictions it made were fortunate enough to be most favourable. This would give an inaccurate picture of the hybrid ANN's abilities, because with a different set of test data the results could be very different. Recording the number of correct and incorrect predictions gives an indication of the frequency of accurate predictions, which can cement the capital accumulation results or highlight the need for further experiments.

The results in [Hui, 2000] could also potentially be misleading because of the regularity of the trading, which is not detailed in the results. The trading frequency may have coincided with patterns best identified by the hybrid ANN, so its results were more positive because it made the most accurate stock predictions, but purely at the times when the trades were made. Consider the following scenario as an illustration: the hybrid ANN has identified that every time the stock drops by 1% it then climbs dramatically, but the other ANN systems failed to identify this occurrence. Now suppose this coincidentally happens once a month and the ANN systems are queried at that very same time. The hybrid ANN will come out favourably as it can predict this event accurately; but what if that is the only pattern the hybrid ANN can identify and the other ANNs can correctly identify all other correlations? This would give an inaccurate picture of which was the most suited ANN.

To return to the testing strategies documented in [Zemke, 2002], two other methods which could not be used within [Hui, 2000] are worth noting. Measuring the sum of squared deviations from the actual outputs shows how wide of the mark the predictions were. This can only be implemented when the predictor gives a quantifiable output where one can measure the difference between the prediction and the real target value. The other strategy is for classification predictors, similar to [Hui, 2000].
This involves building into the predictor a confidence factor, whereby the predictor assigns a confidence rating to each forecast. [Zemke, 2002] believes this is equally important, but as difficult to develop as the predictor itself. This makes the ANN appear more human by stating when a prediction is not confidently asserted, which is possible with ANNs, as the pattern matching may straddle numerous classification boundaries, thus making the end classification less clear cut.

As a result of these findings, it is felt that it is essential to critically examine the testing procedure, to ensure that it is fair and that the results can provide a definitive conclusion. A number of important pitfalls and indicators have been outlined which will be beneficial to the execution of these goals.

Chapter 3
Requirements

3.1 Requirements Analysis
The literature review has highlighted the wide use of ANNs within stock market prediction systems. As a result, there is much to draw on, yet still much to explore afresh. Fundamental data is rarely used as inputs, never exclusively, and the documentation has provided limited detail. Therefore, the requirements must guide this project towards conclusive findings regarding the suitability of fundamental data as inputs for this domain.

Inputs may have a very strong correlation to the output, but an ANN may fail to ascertain the correlation if it is inadequately configured. Therefore, the requirements must consider the ANN configuration and optimisations. This is difficult because every category of the configuration has a bearing on every other. As a result, one part may be optimised in relation to another, but if the latter is adjusted the former may need to be re-configured. This makes the complexity exponential in nature. [Crone, 2005] states that there are 18 configurable parameters in an ANN system. Therefore, it is virtually impossible to gain 100% optimisation of an ANN within a finite time scale. This must be acknowledged when establishing requirements, to ensure that they are realistic and achievable. Using knowledge developed in the literature review, the following requirements have been established.

This is an experimental project and, as such, many traditional methods of requirements gathering and elicitation, such as viewpoint-oriented elicitation, use-cases and ethnography, are redundant. This is because there will not be multiple users, organisational policies or existing-system compatibility issues. Although this project will not contain the complexities of a software development project, there are critical aspects that must be identified as requirements to ensure they are not overlooked.

3.2 Requirements Specification

3.2.1 Overview
As stated in the Introduction, this dissertation is intended to highlight whether fundamental data are worthy input data for predicting stock market movements with an ANN.

Therefore, requirements must be gathered that ensure the project stays true to the initial aim and that a conclusive answer to this question can be provided.

3.2.2 Implementation
To truly explore the configuration limits, it would be interesting to experiment with all the different parts of an ANN which are changeable; however, this would be a complex challenge not possible within the time frame. As a result, a number of characteristics will be fixed. The literature review maintained that the Multi-Layer Perceptron (MLP) is the most suitable form of neural network and that the back-propagation learning algorithm has been effective with technical data, thus it will be used in this system. This consistency will also allow for direct comparison between technical and fundamental data. The activation function will be sigmoidal, because it is generally seen as the default for this type of ANN and the literature review discusses drawbacks linked to alternative functions. To adjust the configuration of the network topology according to [Refenes, 1995], it must be possible to manually set the connections between layers.

After training is complete the resulting outputs must be returned so that they can be visually compared to the target outputs. This should make it clear whether under- or over-fitting has occurred. If it is deemed that either of these events has happened, a decision must be made as to whether to run tests on the network or to re-train; this decision must be logged to aid future analysis. This requirement is the result of studying [Sarle, 1995], where charts were included which showed over-fitting. The weights for each connection should also be output once training is complete, to determine which inputs were influential. This is important to allow the topology optimisations outlined by [Zemke, 2002].

The terminating error level will be that used for early termination in [Crone, 2005]. With an error level this low the network has comprehensively learnt the patterns and any further learning may cause over-fitting. The number of epochs per training run should initially be set to a value in excess of the length used by most of the papers researched; if a training run gets stuck in a local minimum then it is possible for convergence to take longer. Due to this problem, a chart of the MSE error should be displayed at the end of a training run. This will clearly show whether the training has successfully converged on the minimum error, by trailing off with a near-horizontal line. As a result of viewing the MSE graph, the number of epochs can be adjusted accordingly. However, increasing the number of epochs may not be the solution to convergence problems, because they may be the result of network paralysis, which is discussed in section .

When creating the ANN, if a system can be sourced which already contains all the necessary functionality, it should be used. Otherwise, if a suitable library of ANN functions is available, it must be written in a language that is either known or could be easily learnt, so that combining the functions into a working system is not time consuming. Artificial neural networks require a large number of mathematical computations, therefore the language should be efficient and designed to handle this form of system. This will limit the time taken while training and fulfil one of the non-functional requirements. These requirements are appropriate because this is a mature field with extensive support.
To write a bespoke ANN would require extensive testing and verification that would detract from the main objective of this project.
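By way of illustration, the following sketch shows how an off-the-shelf library could satisfy these requirements: an MLP trained by back-propagation with sigmoidal activations, with the training error curve displayed after the run. The choice of scikit-learn and matplotlib, the placeholder data and the particular tolerance value are assumptions for illustration, not part of the requirements themselves.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPRegressor

    X = np.random.rand(120, 8)       # placeholder fundamental inputs
    y = np.random.rand(120)          # placeholder target share prices

    net = MLPRegressor(hidden_layer_sizes=(8,),
                       activation="logistic",   # sigmoidal activation function
                       solver="sgd",            # trained by back-propagation
                       max_iter=5000,
                       tol=1e-6)                # stand-in for the early-termination level
    net.fit(X, y)

    # Plot the training error per epoch; a curve that trails off towards a
    # near-horizontal line suggests convergence, as discussed above.
    plt.plot(net.loss_curve_)
    plt.xlabel("Epoch")
    plt.ylabel("Training loss (proportional to MSE)")
    plt.show()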

3.2.3 Input Data Types
The input data must be purely fundamental in nature; that is, no data types resident within the technical data set are allowed. This removes the argument that any results are due to the technical data. As a minimum, ten years of data will be required for the training phase and two years for testing. If additional data can be sourced, a decision must be made whether to increase the training or testing phase. Ideally, the testing phase would be increased, because this project is looking at longer term investing, a time frame which is usually recognised as at least 2-5 years. However, due to the frequency of sampling, it may be necessary to increase the training phase if it is too small, to facilitate accurate learning.

The number of input data types must be within the range of five to fourteen, as discussed in the literature review. A strategy must be designed so that only those data types are included that show a strong correlation with the output data and allow predictions to be as accurate as possible. As stated in the project's introduction, this cannot be identified by visual methods of analysis alone, as non-linear patterns may reside that are not easily detected. Therefore, more advanced methods must also be included, such as using the ANN with pruning to remove redundant inputs.

3.2.4 Pre-processing
As the literature review found, pre-processing is considered a very important part of the configuration. There are two opposing views on this topic, for and against. Therefore, this study should consider both processed and unprocessed input data. All inputs must be scaled to between [-1, 1] or [0, 1], dependent on the activation function, otherwise the system is likely to fail (this is discussed in detail elsewhere). Visual inspection, normalisation and de-trending should be applied as pre-processing steps, because the literature review found that without these the ANN is likely to fail. The other pre-processing steps which are often required for stock market data include replacing missing, impossible, inconsistent and unlikely values. However, after an inspection of the input data that will be used, it has been decided that these will not be necessary, as sufficient data can be sourced that is both accurate and complete, yet still not limiting to the investigation. If and when more data is required this decision may need to be reconsidered. As a result of [Zemke, 2002], visual inspection is required to ensure that all data looks accurate and to determine whether de-trending is required.

If the ANN encounters network paralysis, this is usually due to a lack of, or incorrect, pre-processing, therefore pre-processing must be re-assessed. Network paralysis occurs when a local minimum is encountered during training, which can be determined by observing the Mean Square Error (MSE). If the MSE fails to reduce for an extended period it can be assumed that network paralysis has occurred. Initially the network should be re-trained, because the particular training run could have landed on a local minimum. If network paralysis is unavoidable, an investigation into corrective methods will be required.

3.2.5 Topology
The topology of the system is hugely influential in determining how successful training can be. Over- and under-fitting usually occur because of the network topology. As a result of this knowledge, it is important that the network topology does not facilitate either of these events in the final system.

Therefore, a pre-experimental study strategy must be designed so that the topology can be optimised with regard to the input data. The initial network topology should be N-N-1, where N is the number of inputs believed appropriate, because in the literature review this initial topology was deemed most suitable. The number of hidden layers in initial studies must not be limited to only one, because pruning will be employed, which requires the initial topology to be too large rather than too small, and research found a number of papers where results for multiple hidden layers were interesting. Therefore, the ANN system developed must be capable of implementing a topology with an arbitrary number of hidden layers and neurons.

3.2.6 Pre-Experimental Studies
Pre-experimental studies are required to find the optimum input data and topology settings. Included in this is determining which pre-processing steps are most suitable for the data, which inputs are the most influential in combination, which connections are required and how many hidden layers and hidden nodes are optimal. Studies should be designed so that optimal settings can be distinguished as quickly and easily as possible. This is especially important because ANN training is a time-consuming function, so time limitations may mean that a limit on the number of inputs is required. There are currently a large number of variable settings and components in this project. Therefore, in latter stages time constraints may require a number of these to be limited. In this case, priority must be given to prediction quality. If assumptions are required they must be critically reasoned so as not to diminish the ANN's predictions and the investigation's credibility.

3.2.7 Suitability Assessment
This area of the research is where the project's hypotheses will be answered. Therefore, it is essential that the results are as clear as possible, with minimal ambiguities that may leave the findings open to interpretation. Fundamental data is generally used to make long-term predictions, so a long-term forecast will be most appropriate. Therefore, predictions should be no less than one month into the future. At this point it is believed that a six-month forward prediction will be most suitable, as this should remove sufficient psychological trading noise, allowing correlations between fundamental data and the share price to be detectable. This judgement is made because equity analysts give a price prediction for six months into the future. As they use fundamental data, they must feel confident that within six months the share price will reflect the company's fundamental value.

3.2.8 Non-Functional Requirements
[Sommerville, 2001] states that non-functional requirements arise through user needs and because of organisational policies, amongst other reasons. The system developed here will not be used by a range of users or developed for an organisation. Therefore, the non-functional requirements have been developed to ensure the investigation is completed within the time constraints, the findings are accurate, and generic policies and legislation are adhered to.

- Training should not take an unacceptable length of time.
- Results should be clear to interpret.
- All data sourced must be gathered in a legal and ethical manner.
- Pre-experimental studies should only include assumptions where research has shown they are valid or constraints make them unavoidable.
- All studies and tests must be designed with fairness and conclusiveness in mind.

3.3 Conclusions
At this stage the requirements are generally of a high-level nature, because a large amount of the detail cannot be determined until after the pre-experimental studies. In this chapter we have discussed the requirements for the system, which can be established as a result of the literature review. These were derived by analysing the broad life-cycle steps of the project (implementation, pre-experimental studies and the suitability experiment) and the general categories of ANN components (input data, pre-processing steps and topology). Now that a general framework has been outlined, we can progress to the Design stage.

Chapter 4
Design

In order to get meaningful and conclusive results it is imperative to design a fair and complete experimental study. Therefore, the optimisation strategy, pre-processing, input data types, accuracy measures, suitability tests and ANN implementation must all be carefully considered. The literature review highlighted ANN optimisation as a potentially troublesome and time-consuming phase and yet underlined its pivotal importance. Therefore, finding the best way to tackle this is a high priority. The largest limitation with ANN optimisation is the fact that configuration components are inter-dependent and thus optimisation is a continuous, iterative process. When designing a strategy it is important to select a design and implementation life-cycle which encompasses this iterative nature, and to devise a detailed and well-reasoned strategy.

4.1 Topology Optimisation Strategy
The topology optimisation strategy must consider firstly accuracy and secondly speed. When optimising the ANN topology it is important that an effective architecture is ultimately derived; however, the time limitations of this project must not be forgotten. From the literature review it is possible to assert that pruning, as opposed to constructing the network, is largely the optimisation technique of choice for financial prediction ANNs. This is backed up by [Thivierge et al., 2003], where it is stated that constructive algorithms create networks with large interconnected systems that are very deep in layers, which may disrupt generality and increase computational demands. As stated in the introduction to this chapter, optimisation is possibly the most important issue to consider. It should be effective, and the process must be repeatable and executable in a timely fashion. In order to fulfil these requirements, automated processes were explored, leading to pruning algorithms.

4.1.1 Pruning Algorithms
Pruning algorithms are a way of automating the pruning process to improve both speed and efficiency. A large set of potential algorithms was uncovered.

There were two common themes emanating from all the papers researched: firstly, that pruning can greatly help a network's ability to generalise, and secondly, that there is no one perfect pruning technique for all data sets. However, a strategy for determining the best algorithm for a data set was never discussed; instead, exhaustive comparative experiments were devised. For this reason, if time permits, it would be preferable to implement as many algorithms as possible to find the one best suited to this particular problem.

Research found that the majority of pruning algorithms are titled Sensitivity Algorithms, based upon measuring the saliency of each connection and pruning out those that show the least. [Lundin and Moerland, March, 1997, Zell et al., 1998] both explain that the algorithms' common process is to:
- Train the network with back-propagation
- Compute the saliency of each element (link or neuron)
- Prune the elements with lowest saliency
- Retrain the network
- Measure the error and act accordingly (either cease pruning or repeat)

The two most famous algorithms are Magnitude and Optimal Brain Surgery (OBS) [Thimm and Fiesler, 1996]. Here we will discuss these two, as well as Optimal Brain Damage (OBD), which uses a similar technique to OBS; the discussion will include their merits and drawbacks.

Magnitude Algorithm
This algorithm is the automated version of the pruning technique that was established as most appropriate within the literature review. Very simply, at the end of a training run a fixed percentage of the least significant weights is removed (the fixed percentage is determined based upon how quickly the network is to be pruned; a large percentage prunes many connections at once, which is fast but may suffer from over-pruning and leave the network without the capability to learn the patterns accurately). This is an iterative process and after each iteration the MSE error is re-calculated. If it has increased above an allowed level, in terms of both overall error and in relation to the previous iteration's error, then pruning ceases and the network is returned to the best version. [Thimm and Fiesler, February, 1997] compares this algorithm to a number of others on a host of data sets. Findings showed it gave the best generalisation overall and very rarely did worse than an unpruned network. Therefore, this would be a very beneficial algorithm to employ. It should be simple to implement, it is in line with recommendations coming from the literature review and, most importantly, studies have shown it can be very effective.

Optimal Brain Damage (OBD)
Here parameters are observed, rather than connections, because several connections can be controlled by a parameter. The second derivative of the objective function is taken to compute the saliencies of parameters, by predicting the change in the objective function when a parameter is deleted. This is computed by predicting the effect of perturbing the parameter vector. A perturbation $\delta u$ of the parameter vector will change the objective function by

\[ \delta E = \sum_i g_i\,\delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i\neq j} h_{ij}\,\delta u_i\,\delta u_j + O(\|\delta u\|^3) \]

where $\delta u_i$ are the components of $\delta u$, $g_i$ are the components of the gradient $G$ of the error $E$ with respect to $u$, and $h_{ij}$ are the elements of the Hessian matrix $H$ (the square matrix of second partial derivatives of a scalar-valued function) of $E$ with respect to $u$:

\[ g_i = \frac{\partial E}{\partial u_i} \quad\text{and}\quad h_{ij} = \frac{\partial^2 E}{\partial u_i\,\partial u_j} \]

The aim is to find the set of parameters whose deletion has the least effect on the error $E$. However, this is very difficult as the matrix $H$ is big and difficult to compute. Therefore, a simplifying assumption is introduced: that the Hessian matrix is diagonal. [Cun et al., 1990] gives details on implementing this algorithm, but the basics are as follows:
- Train the network
- Compute the second derivative $h_{kk}$ for each parameter
- Compute the saliency for each parameter: $s_k = h_{kk} u_k^2 / 2$
- Sort the parameters by saliency and delete the parameters of lowest saliency (as with the Magnitude algorithm, the low-saliency cut-off is based on a fixed percentage)

There are drawbacks with OBD, however. [Hassibi and Stork, 1993] and [Ragg et al., 1997] state that assuming the Hessian matrix is diagonal is invalid for many networks, because it causes the wrong weights to be deleted.

Optimal Brain Surgery (OBS)
Optimal Brain Surgery was developed as an extension to Optimal Brain Damage, to eradicate the assumption concerning the Hessian matrix. Firstly, the network is trained to a local minimum. Then the error change is calculated by:

\[ \delta E = \left(\frac{\partial E}{\partial w}\right)^{T} \delta w + \frac{1}{2}\,\delta w^{T} H\,\delta w + O(\|\delta w\|^3) \]

where $H = \partial^2 E / \partial w^2$ is the Hessian matrix, and the superscript $T$ denotes vector transpose. As the network is trained to a local minimum the first term vanishes; the third term is also ignored. Next, a weight ($w_q$) must be set to zero so as to minimise the error increase $\delta E$. This is expressed as:

\[ e_q^{T}\,\delta w + w_q = 0 \]

where $e_q$ is the unit vector in (scalar) weight space corresponding to $w_q$. We can now solve:

\[ \min_q \Big\{ \min_{\delta w} \big\{ \tfrac{1}{2}\,\delta w^{T} H\,\delta w \big\} \ \text{such that}\ e_q^{T}\,\delta w + w_q = 0 \Big\} \]

by generating a Lagrangian (a function of the dynamical variables of a system that concisely describes its equations of motion), such that:

\[ L = \frac{1}{2}\,\delta w^{T} H\,\delta w + \lambda\,(e_q^{T}\,\delta w + w_q) \]

where $\lambda$ is a Lagrange undetermined multiplier. Functional derivatives are taken and matrix inversion is used to find the optimal weight change and resulting error change:

\[ \delta w = -\frac{w_q}{[H^{-1}]_{qq}}\,H^{-1} e_q \quad\text{and}\quad L_q = \frac{1}{2}\,\frac{w_q^2}{[H^{-1}]_{qq}} \]

Neither $H$ nor $H^{-1}$ need be diagonal, which removes the assumption made by Optimal Brain Damage. Optimal Brain Surgery should be less computationally expensive than the other two algorithms because it does not require retraining between each pruning step. The procedure for Optimal Brain Surgery is outlined as follows:
- Train the network
- Compute $H^{-1}$
- Find the $q$ that gives the smallest $L_q = w_q^2 / (2[H^{-1}]_{qq})$; if this error increase is smaller than $E$, then the $q$th weight is deleted
- Use $q$ to update all the weights
- Repeat this process until no $q$ provides an $L_q$ smaller than $E$

[Hassibi and Stork, 1993] goes into greater detail explaining this algorithm. That paper also explains how the outcome of the magnitude algorithm differs from that of OBS. The magnitude algorithm attacks the smallest weight (A), thus magnifying another weight (B). In opposition, the OBS algorithm attacks B, and intensifies A. As a result, it is believed that because of the obvious opposition in the algorithms' strategies, trials should clearly highlight which is most appropriate for our data set.

After investigating pruning algorithms, it is clear they are an excellent solution to the problem of finding the optimum combination of inputs, connections and hidden layers. This is because in the process of removing connections, redundant neurons, both input and hidden, are also removed. They should also produce a dramatic reduction in the time required to converge on this optimum, and improve consistency by removing the opportunity for human error.

Depending on time constraints, at least the magnitude pruning algorithm should be implemented. This is the simplest of the discussed algorithms, so it should take the shortest time to implement, and it fits with the pruning actions discussed in [Refenes, 1995, Zemke, 2002], which have been shown to be effective. If time allows, the Optimal Brain Surgery algorithm should be implemented, and finally Optimal Brain Damage. Optimal Brain Surgery should be applied before Optimal Brain Damage because, as explained above, OBD makes assumptions based upon the Hessian matrix, which according to [Ragg et al., 1997] limits its potency for some networks. The potential problem with using these algorithms concerns their complexity. They are complicated functions, which may make them difficult and time consuming to implement. However, once the OBS algorithm is functioning correctly it should be fairly straightforward to modify the code to implement Optimal Brain Damage.
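As an illustration of the magnitude algorithm described above, the following sketch implements the prune-and-retrain loop in Python. The representation of the network as a list of weight matrices, the retrain and mse placeholder functions, the pruning fraction and the error tolerances are hypothetical and are not the settings that will eventually be used.

    import numpy as np

    def magnitude_prune(weights, retrain, mse, prune_fraction=0.1, max_error=0.05):
        # retrain(weights): placeholder for a back-propagation training run.
        # mse(weights): placeholder for measuring the network's MSE.
        best_weights, best_error = [w.copy() for w in weights], mse(weights)
        while True:
            # Rank the remaining (non-zero) weights by absolute magnitude and
            # remove the least significant fixed percentage of them.
            flat = np.concatenate([np.abs(w[w != 0]).ravel() for w in weights])
            if flat.size == 0:
                break
            threshold = np.quantile(flat, prune_fraction)
            for w in weights:
                w[np.abs(w) <= threshold] = 0.0
            retrain(weights)          # retrain the pruned network
            error = mse(weights)      # re-calculate the MSE after pruning
            # Stop if the error rises above the allowed level, either overall
            # or relative to the previous iteration (tolerances are assumed).
            if error > max_error or error > best_error * 1.1:
                break
            best_weights, best_error = [w.copy() for w in weights], error
        return best_weights           # return the best version of the network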

Each pruning algorithm relies on parameters to determine its pruning constraints. [Cun et al., 1990] states that the effectiveness of pruning algorithms is dependent on finding effective parameters. Therefore, it will be necessary to find the optimum parameters in order to get the best results; these depend on the input-output correlations, the prediction volatility (determined by the error surface: if it is highly volatile, with many basins of attraction, there can be a large variance between training runs as the learning algorithm converges on different basins) and the severity of the pruning. For example, if only very few inputs have a correlation with the output, pruning will need to be extensive. In order to find the optimum parameters as quickly as possible, fair tests will have to be established. Therefore, the following plan has been designed. For each parameter, all other parameters will be kept constant and a range of samples taken, including extreme points. Then a chart of the prediction results will be plotted and from this it should be possible to infer the optimum value. This should be repeated for all parameters.

As there will be some variance between companies concerning the error surface and the input-output correlations, it is also likely that there will be a variance in the optimum settings. This investigation would not be fair if the optimum parameters for each individual company were used in the suitability experiment, because to find these settings one must compare the test prediction (the prediction on the data not used in training, in this case the two years of post-training data) with the test target. This would not be possible in a real scenario, and as such must not be allowed here. Therefore, the average optimum parameters for a cross-section of the companies must be calculated and used. This is by no means a perfect solution, but a more suitable alternative cannot be found, and so this drawback must be noted and considered in the analysis of the results.

4.1.2 Accuracy Measures
Before it is possible to converge on optimisations or measure performance it is necessary to determine an appropriate performance metric. Purely measuring the error between predicted and target values is somewhat crude and does not incorporate factors like directional change. Therefore, it is important to establish an accurate measure for determining the quality of the ANN predictions. What follows are four popular measures of error. To accurately measure the performance of the ANN, all four should be incorporated into the performance comparisons to determine when an optimised network has been derived.

The correlation coefficient is a popular measure of absolute prediction accuracy. It measures the average correlation between predicted and actual values:

\[ R = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. $R = 1$ denotes perfect correlation, while $R = 0$ demonstrates no correlation. The magnitude of the value explains the level of correlation, and a negative $R$ denotes that the correlation was inverse.

The correlation coefficient is an interesting indicator because the way it should be interpreted depends on the stage of development. In the network optimisation phase one should focus on its magnitude: if the value is negative but high it is not necessarily bad, as it shows that the network has been able to proficiently detect the correlations; we must then look at ways to invert the correlation learnt. However, in the suitability experiment stage the sign of the value is important: a high negative correlation coefficient would be catastrophic, as it would result in maximum losses, because trading on the predictions would produce buy signals when the price will drop and sell signals when the price will rise. According to [Refenes, 1995] the correlation coefficient, or alternatively $R^2$, should always be reported; therefore it will be used in this study (during latter stages of this project the correlation coefficient is often termed the absolute error).

It is useful to compare performance to a trivial predictor, based on the random-walk hypothesis. The hypothesis asserts that the best estimate of tomorrow's price is today's price. This comparison is called the information coefficient:

\[ T_r = \frac{\sum_{t=1}^{n}(y_t - x_t)^2}{\sum_{t=1}^{n}(x_t - x_{t-1})^2} \]

$T_r$ identifies whether the prediction is more accurate than the trivial predictor. If $T_r \geq 1$ then the predictor is worse than the trivial predictor; otherwise the predictor is better. This comparison is useful because it keeps the project grounded, that is, it makes sure we are considering the ANN predictor against current prediction methods, which is the intention of the project.

Another trivial predictor uses the historical mean, or mean reversion:

\[ T_{\mu} = \frac{\sum_{t=1}^{n}(y_t - x_t)^2}{\sum_{t=1}^{n}(\bar{x} - x_{t+1})^2} \]

Predicting the mean will give $T_{\mu} = 1$, so an output of $T_{\mu} < 1$ implies we are making better predictions than merely predicting the mean. This is an interesting measure and should be included because it is less volatile than the information coefficient. As it compares the predictor to the target mean, it measures the prediction error more generally, giving more of an overall picture. This is also good because it measures the mean squared error normalised by the variance of the test set; thus it incorporates the MSE, which is used during training and is generally viewed as a simple but important error metric.

The final indicator considered here is also the most important: it measures directional change.

One hundred percent accuracy in predicting the level of price changes is the ultimate aim; however, this is almost certainly unobtainable. It is therefore important to consider directional changes in price. Most investment analysts are usually far more accurate in predicting directional changes in asset price than in predicting the actual level. The directional change can be measured as follows [Refenes, 1995]:

\[ d = \frac{1}{n}\sum_{i=1}^{n} a_i \quad\text{where}\quad a_i = \begin{cases} 1 & \text{if } (x_{t+1} - x_t)(y_{t+1} - x_t) > 0 \\ 0 & \text{otherwise} \end{cases} \]

When $d = 1$ the predictor predicts the directional change correctly 100% of the time; $d = 0$ implies 0% accuracy. When $d > 0.5$ the predictor is more accurate than just tossing a coin.

Now that we have decided on the indicators of success, a strategy for dealing with contradictory results is required. Highest importance is given to the price directional change measure. This is because if the directional change prediction is accurate, the only losses that are incurred are when the price does not change sufficiently to cover the initial commission and stamp duty, which are nominal. Therefore, if we have a predictor that can accurately predict price swings then losses are minimised, which is the first objective of investing. Priority falls next to the correlation coefficient measure (or absolute error). This is because at this point the performance of the predictor in comparison to other methods is unimportant, as it is still being refined. With the correlation coefficient we are only comparing the prediction to the target. This allows for the most elementary of comparisons between different versions of the ANN predictor, thus giving the most elementary comparison of performance. These indicators are deemed the most important, and although the other two are interesting, at this point they should only be used to shed additional light on the convergence towards, or divergence from, optimum performance. We give higher precedence to mean reversion than to the information coefficient because [Refenes, 1995] states that it is a good performance estimate for comparing different models. The importance of the information coefficient measure is dependent on one's belief in the random-walk hypothesis, therefore this measure's worth is open to debate.
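For illustration, the four measures can be computed as in the following sketch (numpy assumed), where x is the actual price series and y the prediction. The exact indexing conventions are an assumption based on the formulas above.

    import numpy as np

    def correlation_coefficient(x, y):
        xm, ym = x - x.mean(), y - y.mean()
        return (xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum())

    def information_coefficient(x, y):
        # Compare the predictor with the trivial random-walk predictor.
        return ((y[1:] - x[1:])**2).sum() / ((x[1:] - x[:-1])**2).sum()

    def mean_reversion(x, y):
        # Compare the predictor with simply predicting the historical mean.
        return ((y - x)**2).sum() / ((x.mean() - x)**2).sum()

    def directional_accuracy(x, y):
        # Fraction of steps where the predicted and actual directions agree.
        return np.mean((x[1:] - x[:-1]) * (y[1:] - x[:-1]) > 0)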

4.2 Input Data Considerations

4.2.1 Constructing the input time-series
There will be a two-phase process to constructing the input time-series. Phase one is the construction of the time-series so that all fundamental data is aligned to the share price based upon when it became available to the public, as this is when it would affect the share price. For annual data, that is data which comes purely from the financial report, the data must be aligned with the month in which it was published. This design decision is based on the reasoning that most active investors (those who directly invest in companies and decide when to buy or sell, as opposed to passive investors, who leave these decisions to others, for example by investing in mutual funds) are investing professionals who react quickly to company press releases. This means that company updates are quickly (that is, within a number of hours or at most days) factored into the share price. This reasoning is part of the basis of the Efficient Market Hypothesis, which is generally accepted, if only in its weaker forms. Unfortunately, this is likely to mean that some psychological trading noise is included, as it is most extreme when companies release new information. However, this is the happy medium in a far from perfect situation.

It is possible to argue that the data should be set to a month before it was released, because in the run-up to a release forecasts are made by investment analysts as to what the results will be. These are usually fairly accurate and the share price moves on this information. See Figure 4.1 for an example.

Figure 4.1: This chart shows how analysts' forecasts can move the share price. Severn Trent's share price is shown for the last two years. The two red arrows at the bottom show when the financial year ended and when the compiled results were released. As one can see, the share price trend rose between the two dates. This is most probably from analysts forecasting what the final results would be. Either side are a few months of negative trends, which suggest that something different is happening in the period of interest. The bars at the bottom indicate the trading volumes and can be ignored.

Unfortunately, getting these forecasts will be impractical, so aligning the inputs to a month before they are released would be a step towards discounting the forecasts' effect. However, on occasions when these forecasts are incorrect, there is a large movement in the share price in the month of the release (which would be a month after the ANN sees the release) to correct the incorrect earlier assumptions. This only occurs on these special occasions, but nothing in the input data explains to the ANN when this will occur.

Also, this method could not be implemented in a real-life scenario, as it would be impossible to know the data a month before it was released. The alternative argument would be to set the data to one month after it was released, to avoid the psychological trading noise that surrounds new releases. This noise is demonstrated in Figure 4.2. This would also be unsuitable because it means that the prediction for the month when the data is released would be incorrect, because it would be based upon out-of-date data which has already been forgotten by the market; hence the share price would no longer be based upon it.

Figure 4.2: Using Severn Trent's share price chart again, we can see that in the month when the final results were published the share price showed a large amount of volatility, as highlighted by the circle. This can be seen as psychological noise, as these large short-term swings are usually the result of psychological trading.

The second phase is implementing a lag. This is shifting the price back by six months so that the input data is aligned with the share price of six months hence. This is necessary because, as discussed in the literature review, fundamental data is generally used for longer term investing, thus share price correlations to fundamental data may not become apparent in the short term. Although six months is still viewed as within the short- to medium-term investing horizon, it is unfortunately at the upper limit of what is possible, as ANNs require a large amount of data.
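As an illustration of this two-phase construction, the following sketch (assuming pandas, monthly sampling, a monthly-indexed price series and the hypothetical column name publication_month) aligns each figure with its month of publication and then pairs the inputs with the share price six months ahead.

    import pandas as pd

    def build_input_series(fundamentals: pd.DataFrame, prices: pd.Series) -> pd.DataFrame:
        # Phase one: index each figure by its publication month and carry it
        # forward until the next release, so every month reflects the latest
        # information available to the market at that time.
        monthly = (fundamentals.set_index("publication_month")
                               .resample("M").ffill())
        # Phase two: align each row of inputs with the share price six months
        # ahead, so the network learns a six-month forward prediction.
        monthly["target_price"] = prices.shift(-6)
        return monthly.dropna()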

4.2.2 Pre-processing Considerations
As stated in the requirements section, we must determine whether pre-processed or unprocessed input data is best for this domain. Even for unprocessed inputs the data must be scaled so that it fits within the bounds of the activation function. It is important that when scaling the unprocessed data the relationships between time-series points are kept intact, so as to maintain the fragile structures highlighted in [Vanstone et al., 2004]. Therefore, the following equations will be used to scale the data.

For when the tanh activation function is employed:

\[ y = \frac{2x - (\max + \min)}{\max - \min} \]

For when the logistic activation function is employed:

\[ y = \frac{x - \min}{\max - \min} \]

The pre-processing step which should be investigated is using relative data. This removes trending and accentuates the fluctuations in price, which can be significantly diminished within unprocessed data because the range is larger. However, as the system is attempting to predict the absolute share price, trending is important, as the price follows trends. It is believed that for the best predictions a combination of absolute and relative data will be needed. The absolute data will allow the network to learn the price trends, while the relative data will allow the network to learn the month-to-month fluctuations.

For the annual data the relative values will not be computed. This is because only once in twelve time-steps will there be a fluctuation; the other eleven steps will produce a neutral flat line, which the ANN will have difficulty correlating to a fluctuating share price. This cannot lead to accurate predictions. Figure 4.3 gives a visual explanation of the problem.

Figure 4.3: This chart shows a company's return on capital employed in relative form, plotted against its share price in absolute form. As one can see, the share price is continually changing while the return on capital employed only fluctuates periodically.

To determine whether the relative inputs have improved prediction quality, the error indicators will be observed in cases where relative inputs were included and excluded. If the errors are too similar to confidently arrive at a conclusion, it is appropriate to visually inspect the prediction-to-target charts and decide qualitatively.
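The following sketch (numpy assumed, illustrative only) restates the two scaling equations given above and the relative, month-to-month transformation discussed in this subsection.

    import numpy as np

    def scale_for_tanh(x):
        # Scale to [-1, 1] for the tanh activation function.
        x = np.asarray(x, dtype=float)
        return (2 * x - (x.max() + x.min())) / (x.max() - x.min())

    def scale_for_logistic(x):
        # Scale to [0, 1] for the logistic activation function.
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min())

    def to_relative(x):
        # Month-to-month relative change, which removes trending and
        # accentuates fluctuations in the series.
        x = np.asarray(x, dtype=float)
        return (x[1:] - x[:-1]) / x[:-1]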

The optimum number of inputs within this range will be determined through network optimisation, but initially actions can be taken to approximate each input data type's suitability. Through visual inspection it should be possible to interpret a degree of correlation between data type and share price. Using the visual inspection, we can reduce the set to only those that visually demonstrate a degree of correlation with the share price. Then, using the ANN and pruning, the set can be further condensed until the final input set is gathered.

We must now define a strategy for the latter phase of determining the suitable input data types. This involves using the pruning algorithm. After the visual inspection, the set should be sufficiently small to apply to an ANN. Without this initial step the set of inputs would be too large 10 because training and pruning would be too expensive. However, it is advantageous to use an ANN to conclude the input selection, because ANNs are capable of finding non-linear correlations which would be missed by visual inspection. The pruning algorithms remove inputs which show insufficient correlation with the target. This condenses the inputs down to those that show a general, or long-term, relationship with the share price. These are the inputs necessary for good test predictions, because it is only long-term correlations which persist. Therefore, the data type analysis will be as follows:

- Construct the input time-series.
- Visually inspect the data for linear correlations and remove the data types which do not demonstrate a relationship with the share price.
- Deploy the remaining inputs within the ANN 11.
- Using the pruning algorithm and its predetermined optimum parameters, prune away the inputs which reduce generality and overall prediction performance. These inputs will be determined by the pruning parameters, which constrain the predictive error.

4.3 Suitability Experiment

After evaluation of the initial experiment designed to provide a yes or no verdict on the project's hypothesis, it was concluded that the results would be insubstantial, thus it was discarded 12. In its place, an experiment has been developed which should suggest when the hypothesis is most likely to be correct, providing a more descriptive answer. This gives the reader more detail, which should be beneficial for any further work. The resulting suitability experiment breaks the market into two segments in order to determine when the ANN predictor would be most successful. This is because the way fundamental data should be analysed is dependent on the type of company. Each sector will have different fundamental data types which are seen as influential. Therefore, the following hypothesis for the suitability experiment has been deemed appropriate.

10 In excess of the fourteen inputs calculated as the upper bound of input neurons.
11 Remembering to incorporate any predetermined pre-processing steps.
12 The initial experiment follows so the reader can realise its faults.

50 Hypothesis The prediction quality will be optimum for asset intensive, industrial firms 13. This hypothesis is because their share price should correspond to the tangible measures of their business, rather than the intangibles, such as brand name where the valuation is open to interpretation. As less guessing is involved it is felt that correlations between the data types and the share price should be most readily distinguishable. This hypothesis is also because these industries are seen as boring and unattractive, which means there should be less investing based on hype and psychology, as this is usually most extensive in highly branded, media hungry companies. To allow a conclusive answer to this hypothesis five companies must be selected from the following industries to populate the asset intensive part of the study: Construction & Materials General Industries Oil & Gas Production Chemicals Gas, Water & Multi-utility These industry sectors have been chosen because they are heavily asset intensive, produce commodity or commodity-esque products and generally do not have advertising campaigns aimed at the general public. Five companies must be selected from the following industries to populate the nonasset intensive part of the study: General Retailer Support Services Insurance Media Retail Banking Investment Services These sectors have been chosen because of at least one of the following reasons: They do not require a large investment in assets, thus asset based fundamentals are not usually deemed important in their investment analysis. They either directly or indirectly advertise to a large cross section of consumers. This makes them open to psychological trading because they often appear in the news or are largely visible to all investors. 13 These are companies who manufacture products with commodity-esque characteristics. These have large asset holdings in comparison to earnings and there is generally little differentiation between products within the sector. 42

For all ten companies, section must be completed in full, at which point the four error indicators can be used to measure the accuracy of the predictions. It will then be possible to compute the T-test 14 for each error indicator. The T-test is an excellent measure for this because it allows one to compute a degree of confidence when asserting that two sets are independent. If the level of significance is above ninety-five percent then the results can be seen as statistically significant and support or contradict the hypothesis.

The Inappropriate Suitability Experiment

What follows is the initial suitability experiment that was designed but, after further consideration, dismissed as inappropriate. This was because the trading would be monthly but the ANN is predicting the share price in six months' time. With monthly trading, a trade would be placed based upon a prediction which would not be realised within the trade's duration 15. The alternative would be to trade every six months; however, this would only provide four trades over the two years of test data, which is not sufficient to rule out stochastic variance and establish meaningful data. Portfolio management simulations have been used in similar projects, but they have used technical data and short-term trading. They usually predict what the share price will be at the close of trading each day. This allows approximately 522 test predictions and trades over the same two-year period, which provides a sufficient quantity of results to make meaningful conclusions. As a result, a suitable strategy for employing a portfolio management simulation could not be established.

The Experiment

A portfolio management simulation will be executed to compare the ANN strategy with traditional strategies employed with fundamental data. The most common of these is the buy-and-hold strategy. This is when a stock is purchased and held for a significant period. The initial buy and the eventual sell are the only trades that take place. The profit made from this system is what is made from the share price change and the dividends. For the purpose of this experiment dividends will be excluded; the reasoning will be explained shortly. The second approach, also common with longer-term investing, uses cost averaging. This is when a position is slowly built up through incremental purchases. This system essentially averages the purchase share price over the period, resulting in lower risk and lower potential returns. These two strategies will compete against the ANN over a two-year period, as this is the amount of test data available. The buy-and-hold portfolio will purchase worth of shares equally spread across five companies at the beginning of the two-year period and sell them all at the end. The cost-averaging portfolio will also purchase £2000 worth of shares in each company, but evenly spread over the two years, with purchases made monthly. In the 24th month the portfolio will close all positions.

14 The Student's T-test is a statistical tool to measure the level of independence between two populations. If the T-test shows that the two populations are statistically independent then one can rule out the null hypothesis and assert a suitable alternative. To use the T-test the two populations must have a normal distribution. There are a number of different variations of the T-test, but the one we are interested in is the non-paired version with unequal variance, as these settings match the populations that will be produced.
15 As the prediction is for what the share price will be five months after the trade is closed. 43
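To make the mechanics of the two benchmark portfolios just described concrete, here is a minimal sketch; the monthly prices are randomly generated stand-ins, the per-company allocation is a parameter rather than a figure taken from the dissertation, and trading costs and dividends are handled separately, as described next:

import numpy as np

def buy_and_hold(prices, per_company):
    # Buy an equal value of each company at month 0, sell everything at the final month.
    prices = np.asarray(prices, dtype=float)      # shape: (months, companies)
    units = per_company / prices[0]
    return float(np.sum(units * prices[-1]))

def cost_averaging(prices, per_company):
    # Spread the same per-company value evenly over monthly purchases, closing at the end.
    prices = np.asarray(prices, dtype=float)
    months = prices.shape[0]
    units = (per_company / months / prices).sum(axis=0)   # units accumulated month by month
    return float(np.sum(units * prices[-1]))

# Hypothetical two-year monthly price paths for five companies.
rng = np.random.default_rng(0)
prices = 100.0 * np.cumprod(1.0 + rng.normal(0.005, 0.05, size=(24, 5)), axis=0)
print(buy_and_hold(prices, per_company=2000.0))
print(cost_averaging(prices, per_company=2000.0))

The ANN portfolio differs in that the whole holding is switched each month into the share with the largest forecast gain, net of the trading costs set out below.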

For this strategy, commissions will be charged per transaction 16 and stamp duty will stay unchanged. Finally, the ANN's portfolio will make purchases monthly, based on the predictions. Each month the share which is predicted to make the largest gain is invested in, closing the current position, but only if the gain is forecast to exceed trading costs. The full value of the portfolio is invested in the designated share; this could be greater or smaller than the original amount, dependent on the success of previous trades. At the end of the two years all stamp duties and commissions will be subtracted and the returns from the investments will be compared. Stamp duty is set at 0.5% of the transaction and commission will be set at £15 per transaction. Both are in line with current figures.

Dividends will be excluded because all three portfolios may be entitled to dividend payments, dependent on whether they have a holding in the particular shares on the date that they go ex-dividend 17. In real-world portfolios this information would be considered and factored into trading decisions. As this experiment does not allow for dividend-based decisions, dividends will be excluded to avoid any biases.

Just comparing returns would give a very naive judgement on the success of the investment strategy, because other factors surrounding risk and stochastic variance must be considered. In acknowledgement of this, the yield curves of each strategy will be visually analysed and compared. This will demonstrate the riskiness 18 of the strategy. The cost-averaging strategy should be the most risk averse. The level of risk associated with the ANN prediction portfolio is dependent on the success of the ANN predictions: if 100% accurate predictions were made then losses would be completely eradicated.

Finally, when assessing the success of the ANN strategy we must consider stochastic variance. It is possible for the system to make a profit, even though the majority of its predictions were incorrect, if it was lucky with its correct predictions. That is, the correct predictions made large profits, enough to cancel the many losses; the inverse can also occur. For this reason we must use some statistical indicators. The percentage of correct and incorrect predictions will be calculated, in terms of the share price movement up and down. Another measure of stochastic variance is the percentage of the portfolio return attributable to the largest profitable and unprofitable trades. This will demonstrate how much of the portfolio's profit or loss was dependent on a single non-recurring event and therefore give a measure of the luck of the ANN prediction strategy.

4.4 ANN Implementation

The project supervisor strongly advised an in-depth investigation into existing ANN software, because of the maturity of the field. As a result, a large number of open source neural network libraries and systems were found to be widely available. However, there are a number of important design constraints that any existing libraries or systems must fulfill before they can be deemed appropriate for this project. They are as follows:

16 This is the commission currently imposed by The Halifax, which was a retail broker found offering this strategy.
17 Ex-dividend - this is the date when dividend payees are determined. Holders of ordinary shares in the company on this day are entitled to a dividend payment in line with the number of shares held.
18 Risk - is measured as the volatility of returns.
If returns are highly volatile then the risk is seen as high. 44

53 1. Documentation must be detailed and complete. 2. Testing must be sufficient to determine that functionality operates correctly. 3. The implementation must recognise the time hungry drawback of ANNs and attempt to limit it. 4. Reflecting the requirements the following functionality must either already be incorporated or be easily implemented: (a) It must be possible to manually assert connections between nodes. (b) The learning algorithm must be back-propagation. (c) The weights of each connection must be output so that they can be compared (d) It must be possible to stop learning once a minimum error is reached or the maximum number of epochs have been processed. (e) The activation function must be sigmoidal, between [0,1] (logistic) and [-1,1] (tanh). (f) Once training is complete it must be possible to visually compare the output time-series to the target time-series to determine pattern matching. (g) The MSE errors during training must be output, as raw values or in visual chart form. (h) The system must have the flexibility to have multiple hidden layers and nodes. (i) It must be possible to apply the aforementioned pruning algorithms to the network. This is likely to mean that creating and adding pruning algorithms must be possible. (j) It must be possible to randomise the order in which the network and learning algorithm sees the input-target pairs. This is because, as discussed in the literature review, this can limit the probability of the ANN finding a local minimum. 4.5 Conclusions In this chapter we have identified pruning algorithms as the most appropriate way to optimise the ANN. If time permits, three pruning algorithms will be compared, they are: the Magnitude algorithm, Optimal Brain Surgery and Optimal Brain Damage. To measure the effect of optimisation and quality of test predictions we have investigated and discussed a number of performance measures. These are directional change, correlation coefficient (also known as the absolute error), mean reversion and the information coefficient. The ordering here is their ordering of importance. The next important consideration was the input data. Here a strategy for compiling the input time-series was discussed as well as pre-processing steps and the method for identifying the most suitable data types. Finally, the difficulty with developing an appropriate suitability experiment was considered, but an experiment was outlined which should suggest which types of companies this form of ANN system would be most appropriate for. At this point, no further design and ANN configuration can be completed without experimental studies, thus we are now ready to progress to the development stage. 45

54 Chapter 5 Artificial Neural Network Software This section investigates potential ANN systems or libraries that fulfill the software requirements outlined in section 4.4. There are numerous ANN packages available through open source networks, here we discuss the ones where the overview description advocated potential. Using the software requirements defined in section 4.4 the packages are critically evaluated. 5.1 Available Neural Network Packages and Systems Fast Artificial Neural Network (FANN) An open source package written in C. The documentation demonstrates positive speed comparisons between this package and others. It is well documented, including examples. A number of activation functions and learning algorithms are implemented. Also included is a GUI executable, however this only allows the user to state the number of connections as a percentage of full connectivity, but a requirement is to have the flexibility to add and remove single connections manually Aine Aine is also an open source package, however functionality is limited to the basic necessities and the only documentation is a version update log. This package is written in C++ so speed should be adequate. However, due to the lack of documentation and functionality this package fails to meet some of the requirements and as such has been excluded from further considerations Annie Another open source package written in C++. This Library has additional functionality, including the ability to import and export networks from Matlab. Examples are included, however the code is somewhat complicated to follow and documentation 46

is limited to installation and setup. Therefore, this package fails to fulfill all of the necessary requirements.

Auratus

Written in Java, with extendability extensively considered. The document discusses the development of the package; however, determining the incorporated functionality is difficult. A GUI is included but appears fairly primitive. As this library has been developed with extendability in mind, once a good understanding of the code and functionality has been established the implementation of additional features should be relatively uncomplicated. The speed of Java may be a hindrance, however, and the documentation does not provide adequate support.

Others

Neural Network Toolkit, libnn-utility and GNNS were also considered after initially showing potential; however, they were deemed inappropriate as a result of either limited documentation, limited functionality, or complex or poorly understood code. There is a large pool of free neural network systems available on and elsewhere on the internet. They range in functionality, generality, extendability and included literature. Many are packages created for student or other, often highly specific, research projects, which often makes them too bespoke, lacking in professional coding practices, or lacking sufficiently detailed documentation.

5.2 Stuttgart Neural Network System

As a result of the software investigation SNNS was discovered and deemed suitable. The Stuttgart Neural Network System passed all thirteen criteria stated in the Design as essential.

1. [Zell et al., 1998] is the documentation bundled with the software. It is a 350-page, extensive manual covering everything from user extensions to explanations of the algorithms available.

2. As stated below, SNNS has been utilised in a number of other papers. SNNS has been in existence since at least 1988 and is currently in version 4.2. This means all major bugs have been resolved and all of the functionality that will be employed has been a part of the system since an earlier version. As a result, a high degree of confidence can be asserted concerning the system's reliability.

3. The kernel is written in C, which is an efficient and fast language. Training an MLP with backpropagation on a set of 151 points over epochs took less than two minutes. Finally, having knowledge and experience in C means extensions can be implemented with relative ease.

4. Functionality:

(a) Connections can be manually asserted and removed. It is also possible to fully connect a network with a single click.

(b) There are a number of back-propagation learning algorithms, including batch and online weight updating and backpropagation with momentum or weight decay.

(c) The weights for each connection are outputted in a colour-coded diagram. This makes it extremely easy to observe and compare weights. Alternatively, the exact weight values can be seen when using the edit link function.

(d) Training automatically stops when the designated minimum error or the maximum number of epochs has been reached.

(e) SNNS has a large catalogue of activation functions and learning and pruning algorithms, including hyperbolic tangent and logistic sigmoidal functions.

(f) There is an Analyzer function, which displays a chart of the prediction. This allows one to visualise the results of training and testing. The log panel allows the raw values to be exported, which allows the outputs to be compared against targets using a simple charting package 1.

(g) Similarly to when using the Analyzer function, during training the MSE error is outputted in the log panel at regular intervals 2.

(h) The system allows for an arbitrary number of hidden nodes and layers.

(i) All of the pruning algorithms discussed in the design phase are implemented, including an additional two named skeletonisation and non-contribution. This is hugely beneficial because all three earlier-mentioned pruning techniques can be applied with relative ease, hence the optimum pruning algorithm can be asserted. This improves the conclusiveness of results by placing emphasis on the suitability of the input data, as opposed to the choice of optimisation technique.

(j) Shuffle functionality is incorporated to randomise the training order of inputs if necessary.

The system is also easily extended by writing C functions and linking them to the kernel. This is done by inserting the name of the function in the function table file, kernel/func_tbl.c. Then, depending on the type of function, the code should be added to the corresponding function file; see [Zell et al., 1998] for the complete procedure.

SNNS is a well-known and respected Artificial Neural Network system, which can be seen from the fact that it is the ANN system of choice in a number of papers 3. This is beneficial because it is a good indicator that the system has been thoroughly tested and functions correctly, and also that there are resources for help. The version of SNNS which has been chosen for use is called JavaNNS. This version has the SNNS kernel and functionality but an improved GUI written in Java. This version has been chosen because it has increased usability but, most importantly, it includes a log panel, which allows the results from learning and testing to be easily exported for further analysis, including charting and accuracy measurements.

1 Such as Microsoft Excel.
2 The intervals are determined by the maximum epochs parameter.
3 contains a list of thirty-one papers which either implement SNNS or cite it as popular.

5.2.1 Installing JavaNNS

JavaNNS can be downloaded free of charge. The system is fully contained in a single Jar file, so no installation is required. To start the system enter the following into a command prompt 4:

java -jar JavaNNS.jar

5.2.2 Creating an ANN

Input-target patterns are stored as .pat files. At the top of the file the control parameters are placed, stating how many patterns, how many inputs and how many outputs 5 are contained in the file. There needs to be one pattern file for the training data and one for the test data. Networks can be saved as .net files to be stored and used again. If these files are opened within a text editor, full details regarding the state of the network can be seen, including the exact value of each connection weight. Building the ANN topology is very flexible and easy using the GUI forms, and using the control panel the network can be initialised, trained and pruned.

5.2.3 The use of JavaNNS in the investigation

Contained within JavaNNS is the functionality to normalise and scale input and target data. However, the manual states that using this functionality dramatically reduces performance. As a result, the pattern files should always contain pre-scaled data. All pre-processing steps are to be completed in Microsoft Excel before the pattern files are created. JavaNNS will only be used for the ANN training, testing and pruning. At every step, when results are required or status logged, the error charts and weights graphs will be screen captured if necessary, and the Analyzer can be used to output the results to the log panel, from where they can be copied and pasted into Excel sheets. This allows for the greatest level of detail and flexibility.

4 Ensuring the command prompt is pointing at the JavaNNS directory.
5 When designing an ANN architecture the number of inputs and outputs detailed in the pattern file must match the number of input and output neurons in the topology.
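As a sketch of the Excel-to-JavaNNS hand-off described above, the following writes pre-scaled rows out as a whitespace-separated pattern file with the pattern, input and output counts placed at the top. The header labels are only indicative of the information the file must carry; the exact keywords expected by SNNS/JavaNNS should be taken from [Zell et al., 1998].

def write_pattern_file(path, inputs, targets):
    # Write input-target pairs in the layout described above: counts at the top,
    # then one block of scaled input values and scaled target values per pattern.
    # Header labels are indicative only -- check the SNNS manual for the exact format.
    assert len(inputs) == len(targets)
    with open(path, "w") as f:
        f.write("# pattern file generated from the Excel pre-processing sheets\n")
        f.write("No. of patterns : %d\n" % len(inputs))
        f.write("No. of input units : %d\n" % len(inputs[0]))
        f.write("No. of output units : %d\n" % len(targets[0]))
        for i, (x, y) in enumerate(zip(inputs, targets), start=1):
            f.write("# pattern %d\n" % i)
            f.write(" ".join("%.6f" % v for v in x) + "\n")
            f.write(" ".join("%.6f" % v for v in y) + "\n")

# Example: 151 training patterns, each with 14 scaled inputs and 1 scaled target (dummy values).
write_pattern_file("bp_train.pat",
                   inputs=[[0.5] * 14 for _ in range(151)],
                   targets=[[0.5] for _ in range(151)])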

5.3 Conclusions

There are a number of ANN systems or libraries of functions available to the open source community. A number of these were investigated and JavaNNS was found to be a suitable system for this project. JavaNNS is a complete ANN system which has an array of adjustable parameters and built-in functionality. It is in version 4.2, and is thus stable and should function correctly.

A strategy for using JavaNNS with Microsoft Excel has been developed which will allow all of the experimental set-up, results gathering and findings analysis to be successfully completed. With the software selected to facilitate the experiments, the pre-experimental studies can be designed and the development phase can begin.

59 Chapter 6 Development 6.1 Development Process As a system has been discovered which potentially contains all of the functionality required, the development process will be largely in the form of pre-experimental studies. As such, the focus of this chapter will be on pre-experimental studies and using them to develop an ANN configuration best suited to the problem under investigation. The literature review, requirements and design stages, uncovered a number of parameters and configurations for which there is no one best solution, as such, a number of small studies are required. The development stage must prototype each possible solution, based upon hypotheses and logical deductions. This is to compare settings to find the most appropriate. Therefore, this stage must follow an iterative, cyclic lifecycle. As explained in [Preece et al., 2002], the spiral lifecycle model is regarded as best suited, because prototyping is a prime feature. This model also encourages the consideration of alternatives when addressing problems. Figure 6.1 is the simplified spiral model which will be followed. It is important to remember the potential sideeffects of the spiral lifecycle in order to guard against them. The iterative nature could lead to constant adjustments to the pre-experimental studies, leaving them bloated. This in turn can lead to a loss of focus on the objective of the project and may even lead to studies which become corrupted by an unfair environment. In order to prevent this, it will be important to clearly define the experimental strategies, frameworks and goals, before each study or category of studies. This should keep each study focused with the objectives clearly visible, in turn keeping this whole phase properly managed and ensuring it is effective. Due to the nature of ANNs and the project, some studies may have implications on previous studies and may diminish or even nullify their findings. Unfortunately, previous experience suggests that due to time constraints, repetitions may be out of the question. If this situation arises then logical assumptions must be asserted; with consequences considered in the results analysis. Upon the conclusion of the pre-experimental studies the best configurations will be known, allowing for the suitability test to begin. The following configuration settings are to be determined using the pre-experimental studies detailed above: 51

60 Figure 6.1: Here is the simplified spiral lifecycle which will be implemented for the process of pre-experimental studies. Each spiral is a seperate study, which feeds into the next study. Activation Function - The activation function is key to the effectiveness of ANNs. This is demonstrated by the influence that activation functions have had on evolving the field. We have already asserted that the function should be sigmoidal, so the aim of this study is to determine whether the tanh or logistic version is most appropriate. Random or non-random presentation of Input-Target pairs - Research has shown that randomising the presentation of input-target pairs can allow the ANN to more easily avoid local minima. However, the time dependent nature may mean presentation by time order is more appropriate. The optimum number of hidden layers - The number of hidden layers has a substantial effect on the potency of learning. We know that having too many hidden layers can cause the learning to over-fit the training data, but at this stage we are unaware as to how many hidden layers are too many. Optimum pruning algorithm and parameters - The Design phase has shown that pruning algorithms can improve an ANN ability to generalise during training, but the choice of pruning algorithm is dependent on the problem. We also found that finding the correct parameters is critical and also dependent on the problem. The aim of this study or group of studies is to find the optimum answers to these problem dependent variables. The final stage of development is the first phase of the input selection process. The second stage of input selection is left to the suitability experiment, because following the re-design of the suitability experiment and the consideration involved, it is felt that each stock may prune out different inputs. To ensure the correct inputs are removed, the optimum ANN parameters and algorithms must be determined, therefore this phase cannot be completed until after the pre-experimental studies. 52

6.2 Pre-Experimental Studies Framework

Before any studies can be performed a robust and conclusive framework must be outlined, to ensure all tests are fair and accurate. As development progresses, new studies may become necessary and the framework may no longer be suitable. For this reason the top-level framework will be kept as simple and non-intrusive as possible, and then a more rigid sub-level framework will be established for each group of studies. Currently, the pre-experimental phase contains two types of studies: parametric and algorithmic. With this knowledge a top-level framework is established.

The Top-Level Framework

Before the ANN starts training, the connection weights should be re-initialised. This means learning starts afresh every time and removes any unfair biases towards previous results. For each test, complete training and testing will be repeated five times, re-initialising weights before each and recording every result. After the five iterations the median iteration will be chosen for comparison. This will be decided by visually inspecting the test predictions. Investigations to develop this framework showed that after five iterations at least three showed obvious traits of commonality, thus the consistent prediction could be differentiated from the outliers. See Figure 6.2 for an example.

Figure 6.2: Five tests, all with identical initial parameters. The chart clearly shows a trend. Iteration 2 would be taken as the median result as it lies in the middle of the prediction range and has the average pattern.

The input data will be sourced from Datastream. Datastream is a comprehensive commercial database which contains a whole array of fundamental data types for the vast majority of London Stock Exchange listed companies. The stock that will be used for the pre-experimental studies is British Petroleum (BP). This is because it is a large company which has been listed for years. For this reason accurate data concerning BP is relatively easy to source, and after visual inspection all the time-series extracted were complete and appeared accurate. The data will be taken from December 1990 to December 2005. This is a fifteen-year period, which is the longest period that can realistically be used. This is because in the suitability experiment a number of different companies are required, thus we do not want to overly limit the choice of companies by selecting too large a time period. Thirteen years will be used for the training phase and two years for the testing. This makes 151 input-target

pairs for training and 24 test predictions. These values should be sufficiently large to facilitate adequate learning and provide enough test predictions to accurately measure prediction quality. With a framework defined we can now move to implementing the pre-experimental studies.

6.3 Pre-Experimental Studies - Finding Consistency

Early studies failed to provide any consistency after training. Figure 6.3 shows the share price predictions made during the testing phase and is an example of two test predictions where all settings were identical 1 yet the predictions are vastly different. At this stage consistency is the most important characteristic. Without consistency it is not possible to have confidence in the predictions. In a real-life scenario, which prediction would be chosen? It also becomes very difficult to observe traits which indicate where improvements can be made.

Figure 6.3: Two test predictions for which all parameters and procedures were identical. One can see that they are very different.

A number of possible reasons for the inconsistency have been identified, and these will be investigated through experimental studies.

1 The settings were: Activation function: tanh, Topology: fully connected, Connection initialisation: random [-1,1], Shuffle: no, Epochs: 20000, All inputs (and target) scaled, Learning algorithm: standard BP (online updates).

6.3.1 Random presentation of input-target pairs

Problem Description

Input-target pairs can be presented to the ANN in the correct order or in a randomly shuffled order. [Kolarik and Rudorfer, 1994, Rudorfer, 1995, Lane and Neidinger, 1995] all randomise the presentation of patterns. However, [Vanstone et al., 2004, Shachmurove, 2002, Dorsey and Sexton, 2000] fail to state their procedure. Currently, after five iterations of training and testing, no common prediction can be determined. This could be because the input-target pairs are not presented at random, thus the training is getting trapped in local minima, as explained in [Kolarik and Rudorfer, 1994].

Justification of Choice

A lack of consistency in the predictions is most likely to be due to local minima. As the input-target pairs are not randomised, the training will converge on local minima because they are prominent. As the weights are initially randomised, the local minimum targeted depends on the initial weights. When the input-target patterns are randomised the structure of the error surface is broken, which makes local minima less prominent and allows the learning to avoid them more easily.

Hypothesis

Randomising the input-target pairs will improve prediction consistency by allowing the training runs to avoid local minima more easily and converge on the optimum basin of attraction.

Study Framework

Five iterations of the test will be taken. The following ANN parameters are to be set:

- Each time, the connection weights are to be re-initialised to random values within the range [-1, 1]
- Activation function: tanh
- Learning rate: 0.2
- Learning algorithm: back-propagation with online 2 weight updates
- Epochs:
- Topology:

The set of test predictions is to be displayed in a chart. The first set of five iterations will have the Shuffle facility disabled, while the second set will have Shuffle activated. If it is beneficial to randomly present the input-target pairs to the ANN, then the improvement in test prediction consistency should be clearly visible when the charts are compared.

2 Also known as incremental weight updates.
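The difference between the two presentation modes amounts to the order in which patterns are visited within each epoch of online training. The real runs use JavaNNS's built-in Shuffle switch, so the following is only an illustrative outline:

import random

def epoch_order(n_patterns, shuffle):
    # The visiting order for one epoch of online (incremental) training.
    order = list(range(n_patterns))
    if shuffle:
        random.shuffle(order)   # a fresh random order every epoch
    return order

def train(patterns, epochs, shuffle, present):
    # Online back-propagation outline: `present` performs one forward/backward
    # pass and weight update for a single input-target pair.
    for _ in range(epochs):
        for i in epoch_order(len(patterns), shuffle):
            present(patterns[i])

With shuffling enabled the error surface is traversed in a different order every epoch, which is the mechanism [Kolarik and Rudorfer, 1994] credit with helping training escape local minima.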

Results

Figure 6.4 contains the test prediction charts for when input-target pairs were presented in random order and in time order.

Figure 6.4: The predictions when input-target pairs were presented to the network in random order (top chart), and in time order (bottom chart).

Conclusions

Although consistency is still not sufficient to assert confidence, Figure 6.4 shows a slight improvement when shuffling is active. This is because for the shuffled trials, after prediction seven every iteration makes the same directional changes between predictions. For the non-shuffled trials, only iterations one and five show similar patterns. [Kolarik and Rudorfer, 1994] explains that by randomising the presentation of input-target pairs, local minima are more easily avoided. This is important because local minima can masquerade as optimum weight settings. Unfortunately, due to the hill-climbing characteristics of Neural Network algorithms, in particular back-propagation, Neural Networks are always prone to being caught in local minima. However, by shuffling the patterns this drawback can be reduced. Although the results of this study are not conclusive, they do suggest that randomly shuffling the input-target patterns may be more appropriate, and this is confirmed by prior work. Therefore, all future studies must have input-target pattern presentation randomised.

Further studies are required, however, as the consistency of results is still insufficient to assert confidence in predictions, without which this study cannot find ANNs

with fundamental data effective for stock price predictions.

6.3.2 Activation Functions: tanh vs logistic

Problem Description

Although hyperbolic tangent (tanh) and logistic are both sigmoidal functions, their ranges are different. Tanh has a range of [-1, 1], whereas logistic has a range of [0, 1]. It is believed the mapping range may affect the volatility of predictions. This is because when the range of neuron outputs is larger, which is the case with the tanh activation function, the range of weights can be higher. This magnifies the volatility of the error surface, resulting in any irregular characteristics being more pronounced. This would make it easier for the training to get caught in a local minimum.

Justification of Choice

Initially tanh was considered the most appropriate activation function because of the inclusion of relative data at a later stage. This is because relative data, by default, spans the negative and positive range, as opposed to just the positive range. However, [Bodis, 2004] implements the logistic activation function and also uses pre-processed inputs, hinting that using the tanh function may not be the most appropriate. Scaling allows any range to be translated into another, whilst maintaining the structure. Therefore, the initial assumption that the tanh activation function should be employed because some inputs will be in relative form is redundant. As the range of the logistic function is narrower than tanh, the volatility of predictions may be reduced as the neuron's activation is of a lower magnitude, thus allowing the training to avoid local minima more easily and settle on the optimum weights. This should improve both the consistency of predictions and the performance.

Hypothesis

This study will show that the logistic activation function promotes higher consistency between iterations than the tanh function.

Study Framework

This study should follow exactly the same structure as the study in 6.3.1. The constant ANN settings must be as follows:

- Each time, the connection weights are to be re-initialised to random values within the range [-1, 1]
- Shuffling: yes
- Learning rate: 0.2
- Learning algorithm: back-propagation with online weight updates
- Epochs:
- Topology: fully connected
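For reference, the two candidate activation functions and their output ranges; this is a plain restatement rather than code taken from the project:

import numpy as np

def logistic(x):
    # Logistic sigmoid: outputs lie in (0, 1), steepest where the output is near 0.5.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: outputs lie in (-1, 1), i.e. twice the swing of the logistic.
    return np.tanh(x)

x = np.linspace(-4.0, 4.0, 9)
print(np.round(logistic(x), 3))   # stays within (0, 1)
print(np.round(tanh(x), 3))       # stays within (-1, 1)

The two are related by tanh(x) = 2 * logistic(2x) - 1, which is why rescaling the data, rather than the choice of function itself, determines whether a given input range can be represented.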

66 Results Figure 6.5 contains the test predictions charts when the activation function used is tanh and logistic. Figure 6.5: The top chart shows the test predictions when the logistic activation function is employed. The bottom chart shows the test predictions when the hyperbolic tangent function is employed. Conclusions The charts give conclusive results. The volatility of predictions has been dramatically reduced by using the logistic activation function. All further studies should implement the logistic activation function. As stated in the justification of choice, the range and structure of the un-processed data does not have to be considered because the inputs can always be re-scaled to between [0, 1] without losing any fragile structures. Thus they can be combined with the logistic activation function without any deterioration of prediction quality. In fact, to ensure the validity of the previous statement, the error measures 3 were also calculated for the two populations. The results can be seen in Tables 6.1 and 6.2. Tables 6.1 and 6.2 show that all four error measures show consistently better results for the logistic activation function. On average the directional change is correct 57% 3 See for the error measure calculations and interpretation. 58

67 Table 6.1: The accuracy measures computed on the test predictions when the tanh activation function was used. Tanh activation function Iteration Absolute Direction Mean Information Table 6.2: The accuracy measures computed on the test predictions when the logistic activation function was used. Logistic activation function Iteration Absolute Direction Mean Information Table 6.3: T-tests on the four accuracy measures to determine the independence of the results T-test Results. The t stat is for the two tailed test. Absolute Direction Mean Information df t Stat Significants of the time for the logistic activation function, in comparison to 51% for the tanh function. The absolute error was 0.53 for logistic and for tanh. This is a positive improvement in two ways. Firstly, the magnitude of the absolute error is higher for logistic, which demonstrates that the correlation between prediction and target is higher. Secondly, for the tanh function the value is negative, meaning the correlation is reflexive, which as discussed in has extremely negative trading implications. Finally, Table 6.3 shows that three of the four accuracy measures showed statistical significance above 94.8% when the T-test 4 was applied. This is a strong indicator that the two sets of results are distinct and suggests that the findings are conclusive. This study has demonstrated that the logistic activation function is the optimum for this investigation. 4 This particular T-test was the two samples assuming unequal variances version. 59
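As an illustration of how the directional-change measure and the unequal-variance (Welch) T-test described in footnote 4 can be computed, here is a short sketch; the run scores below are dummy numbers rather than the project's results, and the remaining error measures follow the definitions given in the earlier design section:

import numpy as np
from math import sqrt

def directional_change(pred, target):
    # Fraction of steps where the predicted move (up or down) matches the actual move.
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean(np.sign(np.diff(pred)) == np.sign(np.diff(target))))

def welch_t(a, b):
    # Welch's t statistic and degrees of freedom for two independent samples
    # with unequal variances (the "two samples assuming unequal variances" test).
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

logistic_runs = [0.58, 0.54, 0.57, 0.61, 0.55]   # dummy directional-change scores, five runs
tanh_runs = [0.50, 0.49, 0.53, 0.52, 0.51]
print(welch_t(logistic_runs, tanh_runs))

The t statistic and degrees of freedom can then be converted into the two-tailed significance level reported in Table 6.3 using standard t-distribution tables or a spreadsheet's built-in T-test function.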

6.3.3 Random weights: Small or large

Problem Description

[Shachmurove, 2002] is one of a number of papers which state that initially the weights are set to small random values. "Small" is a fuzzy description which is open to interpretation. Therefore, it is felt that the size of the random weights should be investigated. If the size of the initial weights is a factor in the effectiveness of training we must investigate a suitable range.

Justification of Choice

The initial weights are a factor because of the sigmoidal activation function's characteristics. The sigmoidal function is steepest where its output is around 0.5. Therefore, when the output of a neuron is around 0.5, the activation function's derivative will be largest, thus the weights will change most dramatically. If the weights are initially too large, the outputs of the neurons will be pushed close to either zero or one, which is where the derivative is smallest; if they are too small, the error signal passed back through the network is itself tiny. Either way the weights will update very slowly and learning may fail to converge.

This study should use extreme values in order to find meaningful results. For this reason, there will be three initialisations: [-5, 5], [-1, 1] and [-0.1, 0.1]. This range was considered because, by default, small was initially considered to be [-1, 1]. The very small and large ranges were determined after observing the weights following a number of training runs. The vast majority were greater than [-0.1, 0.1] and less than [-5, 5]. Also observed is that [-5, 5] is 50 times larger than the very small random set, thus the difference in training results should be of a large enough magnitude to observe.

Hypothesis

The most accurate results will be when the initial weights are set within the range of [-1, 1]. As the inputs are already scaled, this should be small enough to prevent the neurons' net values being too large. It is believed the range [-0.1, 0.1] will be too small and result in learning being too slow, which may even prevent convergence, resulting in the learning being ineffective. Highest consistency will be in the [-0.1, 0.1] trials because the range of initial weights is small and, as already discussed, it is believed the weights will not change much in the course of learning. This means there is the least potential for variations in weights between iterations, thus the highest consistency.

Study Framework

Once again the same experimental rules are applied to this study as the previous studies. The constant settings are: Each time the connection weights are to be re-initialised to random values;

69 this time the range is determined by the trial Shuffling: yes learning rate: 0.2 learning algorithm: back-propagation with online weight updates. epoch: topology activation function: logistic Results Figure 6.6 shows the test predictions charts for the different initial weight ranges. The error values can be seen in Tables 6.4, 6.5 and 6.6. Table 6.4: The accuracy measures computed on the test predictions when the very small range for initial random weights was used. Very small range [-0.1, 0.1] Iteration Absolute Direction Mean Information Table 6.5: The accuracy measures computed on the test predictions when the small range for initial random weights was used. Small range [-1, 1] Iteration Absolute Direction Mean Information

70 Figure 6.6: These three charts show the test predictions with the different initial random ranges. Table 6.6: The accuracy measures computed on the test predictions when the large range for initial random weights was used. Large range [-5, 5] Iteration Absolute Direction Mean Information

71 Table 6.7: The accuracy measures for the median results for the very small and small random weights trials. Trial Absolute Direction Mean Information Very small Small Both Figure 6.6 and the error measures, in Tables 6.4, 6.5 and 6.6, show that the large set fails to show the best consistency or most accurate predictions. Therefore, it will be ignored in further considerations. Figure 6.6 shows good consistency for both the small and very small ranges. The very small range has slightly better consistency in both the predictions chart and the performance measures. However, the accuracy of predictions is higher with the small set. Table 6.7 shows the averages for all four prediction accuracy measures. The small range out performs the very small range in all. Conclusions This study has been very informative. The results clearly suggests that the smaller the range of initial random weights, the higher the training consistency. However the effectiveness of training does tail-off, as expected in the justification of choice and hypothesis. For all future studies, weights should be randomly initialised within the range [-1, 1]. 6.4 Accuracy Studies Accuracy with a Sine Wave Problem Description The results so far have shown positive performance values. However, when observing the test prediction charts the accuracy does not look so convincing. After consultation, it was suggested that the correctness of the ANN and parameterisation should be tested by applying it to an accepted problem for which the predictive capabilities of the ANN are known. Justification of Choice If the function which maps the inputs to the target is obvious, an ANN should be able to learn the function sufficiently to make accurate out of sample predictions. ANNs have on numerous occasions been demonstrated predicting a simple sine wave. This is usually done using windowing 5, however because our network does not use 5 This is when the inputs are previous target values. If the window size is ten then the network will have ten input neurons which take the last ten target values (in time order). Figure 6 in [Kolarik 63

windowing, this would be inappropriate. Therefore, a test has been designed which is in line with the ANN implementation in this study, but should also show whether the ANN is working correctly. The following experiment has been designed:

The target output is a sine wave, of the form sin(index), where the index ranges over [0, 8] at 0.05 increments, making a total of 80 samples. The first 80% is used for training and the remaining 20% is to be used for testing. The inputs must be as follows:

- A simple binary function: 1 when the target output is positive and 0 when it is negative. This will give a square wave which cycles at the same frequency and phase as the target sine wave.
- sin(2 × index)
- sin((3 × index) / 5)
- sin(index) + cos(2 × index)

These inputs were chosen because there is a clear function correlating them to the output; however, to the human eye these correlations are not obvious. Therefore, if the ANN accurately predicts the sine wave then we know that the ANN is not malfunctioning and does not have incorrect parameters which would hinder the investigation. As other studies have had extremely accurate predictions in similar scenarios, it is felt that for the ANN to have been successful, the test prediction must overlap the target when plotted on a chart. A sketch of how this dataset can be constructed is given after the study framework below.

Hypothesis

The ANN will pass the experiment by accurately predicting the sine wave.

Study Framework

The same framework applies as in previous studies, but this time the topology is 4-4-1, as there are only 4 inputs.

and Rudorfer, 1994] is a good graphic representation.
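A sketch of the input and target construction just described; the index grid is left as a parameter (80 evenly spaced samples are assumed here, matching the 64/16 train/test split shown in Figure 6.7), and the series would still be scaled to the activation range before being written to pattern files:

import numpy as np

def sine_dataset(n_samples=80, stop=8.0, train_frac=0.8):
    # Build the four input series and the sine target described above, then split 80/20.
    idx = np.linspace(0.0, stop, n_samples)
    target = np.sin(idx)
    inputs = np.column_stack([
        (target > 0).astype(float),       # square wave: 1 while the target is positive, else 0
        np.sin(2 * idx),                  # sin(2 x index)
        np.sin(3 * idx / 5),              # sin((3 x index) / 5)
        np.sin(idx) + np.cos(2 * idx),    # sin(index) + cos(2 x index)
    ])
    split = int(train_frac * n_samples)
    return (inputs[:split], target[:split]), (inputs[split:], target[split:])

(train_x, train_y), (test_x, test_y) = sine_dataset()
print(train_x.shape, test_x.shape)   # (64, 4) (16, 4)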

Results

Figure 6.7 shows the training and testing predictions chart for the sine experiment.

Figure 6.7: The target is plotted with the five test trials. On the x-axis, 1 to 64 are the training results and 1 to 16 are the test results. The chart combines training with testing because, if the prediction is perfect, it should not be possible to see where the training and testing join.

Figure 6.7 shows that the results are not as good as hoped. Between 32 and 33 in the training phase there is clearly a problem and the sine wave has not been accurately learnt. The test predictions then fail to be on the correct gradient and there are some fluctuations in the initial test predictions.

Further investigations were made. The index range was increased to [0, 80] in 0.2 increments, making a total of 400 data-points. Another study used the same range but at a lower resolution of 0.4 increments, making 200 data-points; the chart for this can be seen in Figure 6.8. Figure 6.8 clearly shows that when more of the function is shown to the ANN it learns the correlations more accurately. This second study is interesting because it suggests that the resolution is not important. From the charts it is difficult to differentiate the quality of the predictions between 200 and 400 data-points. However, if we look at the accuracy measures in Tables 6.8 and 6.9, we can see that the 200 data-point trials actually demonstrate better prediction accuracy 6.

Table 6.8: The accuracy measures computed on the test predictions made during the sine wave experiment using 400 input-target pairs. (Sine Wave Experiment, 400 Data-Points; columns: Trial, Absolute, Direction, Mean, Information.)

6 Only the test results are shown because it is only the testing which is important to this project. However, when the accuracy measures were applied to the training, results also favoured the 200 data-point trials.

Figure 6.8: The target is plotted with the five test trials. The chart shows that the sine wave has been accurately predicted.

Table 6.9: The accuracy measures computed on the test predictions made during the sine wave experiment using 200 input-target pairs. (Sine Wave Experiment, 200 Data-Points; columns: Trial, Absolute, Direction, Mean, Information.)

Conclusions

This study has been beneficial because the results have demonstrated that the ANN is functioning correctly with the current parameters. Furthermore, this study has uncovered a potential drawback when using ANNs with fundamental data, which is not the case when using technical data. As stated in the Design chapter, we are limited in how much data can be gathered and presented to the network. This is because we

are restricted to monthly samples with fundamental data, as opposed to daily samples for technical data, so we have only been able to extract 175 samples over the 15-year period examined. With technical data it would be possible to gather approximately 5400 samples. This means that our ANN is exposed to a section of the function equivalent to approximately 3% of the section that would be visible if technical data were used. Our study has shown that the amount of the function that the ANN is exposed to is an important factor in how accurately it learns it. This means that ANNs with fundamental data immediately suffer from a handicap, thus when comparing this project's findings with those from similar studies using technical data, such as [Crone, 2005, Kolarik and Rudorfer, 1994, Bodis, 2004], this handicap must be considered.

Accuracy with the Bumps function

Problem Description

As the last study was found to be very informative, a similar study has been deemed worthy of attention. Although the results of the sine wave study were enlightening and beneficial, it can be argued that the time-series involved showed very little similarity to the financial time-series used in this project. Therefore, showing that the ANN and the established parameters work very well with a sine wave does not sufficiently imply that they are correct for our studies. However, the sine wave study can be extended to establish the inference we require. The Bumps function 7 is well known and commonly employed when testing pattern matching technologies, because it is a difficult function for systems to learn and predict. In [Booker, 2005] it is utilised along with the Blocks, Doppler and HeaviSine functions. The Bumps function has been chosen because when plotted it demonstrates the same random and volatile characteristics evident in financial time-series. By using this function it is possible to establish a similar study to the sine wave study, but one more in line with financial time-series.

Justification of Choice

This study is worthwhile for two reasons. Firstly, as stated in the Problem Description, this study can be used as a control study to establish whether the ANN is set up correctly. By using a well-defined function which mimics financial time-series, we can determine to a higher level of conclusiveness whether the ANN parameters are correct for this investigation and that the ANN is functioning correctly. Secondly, the Bumps function is more complex and harder to learn than the simple sine function, thereby undermining the argument that the sine wave example was trivial, meaning its results are insignificant.

Results regarding exposure to the function, shown in Figure 6.7, were significantly negative, and because they were conclusive on a trivial scenario it is reasonable to assert that a more complex study would provide similar results. Therefore, this trial will not be extended here. However, trials two and three regarding the size of the data-sets (or resolution) will be re-applied, because one could argue that the sine wave study was too trivial for resolution to be a factor. The Bumps function is well known and used because it has proved to be a difficult function to teach to pattern matching

7 See Appendix B for a plot of the function together with its definition.

systems. If the resolution of training sets is shown to be a redundant consideration in this study, then within this project we need not deliberate on it further. Therefore, we would know that the Data Multiplication 8 pre-processing methods described in [Zemke, 2002] will not be necessary.

The target output will be the Bumps function with index (which is t) in the range [0, 1]. The input functions will be:

- 20 - Bumps: the Bumps function inverted and offset.
- Bumps/index + mod(index, 0.5): this divides the Bumps at a varying rate and adds a varying offset.
- Bumps/4, such that the time-series is sub-divided into tenths, with every odd tenth set to zero 9.
- Bumps/4, such that every even tenth block is set to zero. Between inputs 3 and 4 the Bumps function can be reassembled, but at a diluted magnitude.

These inputs were chosen for the same reason as in the sine wave study; a sketch of their construction is given below.

Hypothesis

The ANN will accurately learn and predict the Bumps function using the provided inputs. The performance will not match that shown by the ANN in the sine wave study; however, when visually examining the test predictions it will be clear that they are accurate. Comparing trials, the study will show that resolution, that is the size of the training and testing data-sets, will not affect the accuracy of predictions.

Study Framework

The ANN settings will, once again, be as follows:

- Each time, the connection weights are to be re-initialised to random values within the range [-1, 1]
- Shuffling: yes
- Learning rate: 0.2
- Learning algorithm: back-propagation with online weight updates
- Epochs:
- Topology:
- Activation function: logistic

8 Data Multiplication is used to increase the resolution of time-series data. Remember, however, this does not mean the ANN sees a larger section of the functions, but rather sees the same section in more detail.
9 For example, if the time-series contained one hundred points, set the points within the following blocks to zero: 1-10, 21-30, 41-50, 61-70, 81-90.
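The four Bumps-derived inputs above could be generated along the following lines; bumps(t) stands for the Appendix B definition and is passed in rather than re-defined here, the tenth-blanking masks follow footnote 9, and, as with the sine study, the series would still be scaled before being written to pattern files:

import numpy as np

def bumps_dataset(bumps, n_points=200):
    # Target and the four input series described above, on t in [0, 1].
    t = np.linspace(0.0, 1.0, n_points, endpoint=False)
    target = bumps(t)
    tenth = np.arange(n_points) * 10 // n_points                  # which tenth (0-based) each point sits in
    quarter = target / 4.0
    odd_tenths_zeroed = np.where(tenth % 2 == 1, quarter, 0.0)    # zero the 1st, 3rd, 5th, ... tenths
    even_tenths_zeroed = np.where(tenth % 2 == 0, quarter, 0.0)   # zero the 2nd, 4th, 6th, ... tenths
    safe_t = np.where(t == 0.0, 1.0, t)                           # guard the division at t = 0
    inputs = np.column_stack([
        20.0 - target,                                            # inverted and offset
        target / safe_t + np.mod(t, 0.5),                         # varying division plus varying offset
        odd_tenths_zeroed,
        even_tenths_zeroed,
    ])
    return inputs, target

Summing the last two inputs recovers Bumps/4, which is the diluted-magnitude reconstruction noted above.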

As can be seen in Figure 6.8, predictions in the sine study were extremely consistent. Therefore, to avoid unnecessary additional trials, only two iterations of training and testing will initially be executed. If results fail to show sufficient consistency then the study will revert to five iterations per trial. There will be three trials. The data sets will be created by taking index values in the following manner:

Increment the index in equal steps, making a set of 200.
Increment the index in equal steps, making a set of 500.
Increment the index in equal steps, making a set of 1000.

Once again, the initial 80% will be used for training and the remaining 20% will be used for testing.

Results

Tables 6.10, 6.11 and 6.12 show the accuracy measures for the different data-set sizes. Figure 6.9 shows the corresponding test prediction charts.

Table 6.10: The accuracy measures computed on the test predictions made during the Bumps function experiment, using 200 input-target pairs.
Bumps Experiment, 200 Data-Points
Trial  Absolute  Direction  Mean  Information

Table 6.11: The accuracy measures computed on the test predictions made during the Bumps function experiment, using 500 input-target pairs.
Bumps Experiment, 500 Data-Points
Trial  Absolute  Direction  Mean  Information

Table 6.12: The accuracy measures computed on the test predictions made during the Bumps function experiment, using 1000 input-target pairs.
Bumps Experiment, 1000 Data-Points
Trial  Absolute  Direction  Mean  Information

Figure 6.9: These charts show the results of training and testing with the different sizes of sampling data.

Figure 6.9 shows that the system has learnt and predicted the Bumps function well.

The results show that the prediction quality is unaffected by the resolution of the data-sets.

Conclusions

In conclusion, the system has accurately learnt and predicted the Bumps function. As the Bumps function shows the random and volatile characteristics that can be observed in financial time-series, we can assert that the ANN is functioning correctly and that the parameters are adequate for our dataset. Results suggest that the resolution of the dataset had an insubstantial effect on the quality of learning and testing, and as a result resolution will not be considered further in this project.

6.5 Accuracy and Generalisation Studies

Hidden Layers Study

Problem Description

[Bodis, 2004] states that: The optimum number of hidden layers is highly problem dependent and is a matter for experimentation. [Crone, 2005, Kolarik and Rudorfer, 1994, Bodis, 2004, Rudorfer, 1995] all conclude with a single hidden layer. However, they all use technical data, which means the problem tackled in this project is not the same, thus a study is warranted.

Justification of Choice

As the number of hidden layers is increased, the network's ability to learn the input data also increases, because the ANN can replicate more complicated functions. However, as stated in the Literature Review and the pruning algorithms research, if the ANN over-fits the learning data, out-of-sample predictions usually deteriorate because the network loses generality. Therefore, similarly to constructive algorithms, the number of hidden layers in this study should be incrementally increased until testing performance deteriorates. The accuracy measures have thus far proved good indicators of ANN performance and therefore they will also be used in this study to determine the accuracy of test predictions.

There are two reasons why over-fitting would be particularly damaging in this domain. Firstly, as time progresses companies usually alter their business model and general marketing tactics in order to keep in touch with competitive and rapidly changing modern markets. This will result in investors adapting their evaluation strategy. As a consequence, the functions mapping inputs to the target output will be dynamic over time, with probably only the general trends and patterns being maintained. If the learning over-fits these functions it will learn detailed patterns and correlations which may only exist over the learning period. This would cause the test predictions to be incorrect because the learnt mappings are by then out of date.

Secondly, the share price time-series contains a large amount of noise, which follows a random pattern. If the network over-fits the share price then it will learn the noise patterns as well. The noise is random in nature and this could hamper test predictions, because during the testing phase the noise follows an unrelated pattern. As a result, it is expected that the optimum number of hidden layers is one, because this gives the network the best generalising ability.

Hypothesis

The optimum number of hidden layers will be one. Having two hidden layers will cause the learning to over-fit, which will have a detrimental consequence on test predictions.

Study Framework

For this study the original framework will be re-applied, so five iterations of training and testing will be applied per trial. The following settings will be applied to the network:

Each time, the connection weights are to be re-initialised to random values within the range [-1, 1].
Shuffling: yes
learning rate: 0.2
learning algorithm: back-propagation with online weight updates
epoch:
topology: fully connected, architecture will be trial dependent
activation function: logistic

Both visual inspection and the accuracy measures will be used to determine the performance of testing. Visual inspection will be used to determine which prediction is the median. Then accuracy measures will be computed to compare the performance of the different trials. When a trial shows deteriorated testing predictions the study will cease.
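As an illustration of the trial structure, the sketch below compares a single hidden layer against two hidden layers, using scikit-learn's MLPRegressor as a stand-in for JavaNNS. The hidden-layer width HIDDEN is a placeholder (the actual layer sizes used in the study are not reproduced here), and scikit-learn uses its own weight-initialisation scheme rather than the [-1, 1] range above, so this shows the shape of the procedure rather than an exact reproduction of it.

from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

HIDDEN = 8   # placeholder width only

def run_trial(layers, X_tr, y_tr, X_te, y_te, seed):
    # Fully connected MLP, logistic activations, online (per-pattern) SGD updates.
    net = MLPRegressor(hidden_layer_sizes=layers, activation="logistic",
                       solver="sgd", batch_size=1, learning_rate_init=0.2,
                       shuffle=True, max_iter=1000, random_state=seed)
    net.fit(X_tr, y_tr)
    return mean_squared_error(y_te, net.predict(X_te))

def compare_depths(X_tr, y_tr, X_te, y_te):
    # Five iterations per trial: one hidden layer versus two, as in the framework above.
    for layers in [(HIDDEN,), (HIDDEN, HIDDEN)]:
        errors = sorted(run_trial(layers, X_tr, y_tr, X_te, y_te, seed) for seed in range(5))
        print(layers, errors[2])   # report the median of the five test errors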

Results

Tables 6.13 and 6.14 show the accuracy measures for testing and training.

Table 6.13: The accuracy measures computed on the median test predictions made during the hidden layers study.
Trial Medians - Test Predictions
Hidden Layers  Absolute  Direction  Mean  Information

Table 6.14: The accuracy measures computed on the median training predictions made during the hidden layers study.
Trial Medians - Training Predictions
Hidden Layers  Absolute  Direction  Mean  Information

Table 6.15: T-tests on the four accuracy measures, to determine the independence of the results. The t Stat is for the two-tailed test.
T-test Results
    Absolute  Direction  Mean  Information
df
t Stat
Significance

This study was concluded after only two variations were tested. These results show that having a single hidden layer provided the best test predictions 10. For the two hidden layers trial, four out of the five iterations gave negative absolute error results. All five single hidden layer iterations provided better mean and information error figures than the two hidden layers iterations. Table 6.15 shows the findings from applying the T-test to the results. Three of the four accuracy measures showed statistical significance above 99.5%. This is statistically significant; therefore the two sets of results were independent, and thus the findings of the study are reliable.

In the single hidden layer iterations, the consistency of predictions was visibly better. This could have been forecast, because when a network over-fits during learning it replicates a function which generates the target output over a finite section. Many functions will be able to do this, but they will all be extremely complex because of the volatile nature of the target. Because they are complex, these functions have the potential to be very diverse outside of the fitted finite section. Increased generality is the result of simplifying the mapping functions. If simpler functions have similar outputs over a finite section, they lack the capacity to diverge dramatically outside this section. Figures 6.10 and 6.11 visually demonstrate this explanation. It is now possible to explain why the consistency of test predictions is better with a single layer.

10 See Appendix C.1 for the trial charts and performance results.
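For reference, the T-test used here (two samples assuming unequal variance, i.e. Welch's test, reported two-tailed) can be reproduced with scipy. The two arrays below are invented placeholder figures purely to show the call; they are not the study's results.

from scipy import stats

one_layer  = [0.021, 0.019, 0.023, 0.020, 0.022]   # placeholder errors, one per iteration
two_layers = [0.034, 0.029, 0.031, 0.036, 0.030]   # placeholder errors, one per iteration

# Welch's two-sample t-test; the returned p-value is two-tailed.
t_stat, p_two_tailed = stats.ttest_ind(one_layer, two_layers, equal_var=False)
print(t_stat, 1.0 - p_two_tailed)   # a value above 0.995 corresponds to the 99.5% level quoted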

Figure 6.10: These demonstration charts are designed to give a visual explanation of the reasoning behind the consistency variation between hidden layers. The black line is the time-series being learnt. The red and green lines are the two functions developed to replicate the black line. The rectangular box to the right is the test prediction segment of the function. In this example over-fitting has occurred; outside the learning segment the functions produce dramatically different results.

Figure 6.11: In this example chart the network has learnt the training target only generally. Because the functions are very simple, they diverge less aggressively outside the training segment.

Conclusions

Figure C.1 shows that the two hidden layers network fitted the target during learning more closely than the network containing a single layer. The learning error values also demonstrate this. These findings, together with the findings from the T-test shown in Table 6.15, suggest that the hypothesis was correct. However, it is interesting that with two hidden layers the network still showed some generality during learning. This means that over-fitting can be quite loose; that is, a network can show some signs of generality and still be seen as over-fitting the learning set. It will be interesting to see how the training set is learned after pruning, as it appears the network already loosely fits the training target and pruning should exaggerate this. A single hidden layer provided the most accurate predictions, as well as higher consistency between iterations of learning. Henceforth, this project will exclusively implement a single hidden layer.

6.5.2 Accuracy and Generalisation Studies: Part 2 - Online vs Batch Error Propagation

Problem Description

The speed and volatility of convergence upon the minimum error depend on when the weight updates are made. The speed of convergence can have a decisive effect on the efficacy of learning, and this is why it will be investigated. Online error propagation, which has been employed thus far, adjusts the weights after each input-target pair. Batch error propagation updates the weights less often: the weight corrections are summed over the full epoch and then applied globally.

Justification of Choice

[Magoulas et al., 2001] asserts that online learning helps to escape local minima and provides a more natural approach to learning time-varying functions, and further research validated these statements. This suggests that online error propagation should provide the most accurate results with the highest consistency, because it is better at avoiding local minima. However, batch learning is computationally less intensive. This factor should not be overlooked, as pruning algorithms can be very expensive. Therefore, if the results of this study show batch learning to be the most suitable, it should be favoured, as it would reduce the computational expense during the pruning pre-experimental study, allowing that study to be more extensive.

Hypothesis

The online error propagation technique will provide the most accurate test predictions together with the highest consistency.

Study Framework

As with previous pre-experimental studies, the following ANN settings will be employed:

Each time, the connection weights are to be re-initialised to random values within the range [-1, 1].
Shuffling: yes
learning rate: 0.2
epoch:
topology: fully connected
activation function: logistic
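The difference between the two schemes is only the timing of the weight correction. A minimal sketch is given below, with the back-propagation gradient abstracted into a user-supplied grad_fn; the function and parameter names are illustrative only.

import numpy as np

def train_epoch(weights, patterns, grad_fn, lr=0.2, mode="online"):
    # grad_fn(weights, x, target) returns the error gradient for one input-target pair.
    if mode == "online":
        # Online propagation: adjust the weights after every input-target pair.
        for x, target in patterns:
            weights = weights - lr * grad_fn(weights, x, target)
    else:
        # Batch propagation: sum the corrections over the full epoch,
        # then apply them in one global update.
        total = np.zeros_like(weights)
        for x, target in patterns:
            total = total + grad_fn(weights, x, target)
        weights = weights - lr * total
    return weights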

The median test prediction will be taken from each trial to determine which of the two methods provides the most accurate test predictions. All iterations will be observed to determine which method provides the highest consistency. To establish the consistency of results, visual inspection will be used in combination with the accuracy measures. This is because, on their own, the accuracy measures can make consistency hard to distinguish, since they are sensitive to variation.

Results

Tables 6.16 and 6.17 show the accuracy measures for each trial. Figures C.3 and C.4 in Appendix C.2 show the test prediction charts.

Table 6.16: The accuracy measures computed on the test predictions made using online weight updates.
Online Weight Updates
Trial  Absolute  Direction  Mean  Information
Trial 1 was chosen as the median.

Table 6.17: The accuracy measures computed on the test predictions made using batch weight updates.
Batch Weight Updates
Trial  Absolute  Direction  Mean  Information
Trial 4 was chosen as the median.

Table 6.18: T-tests on the four accuracy measures to determine the independence of the results. The t Stat is for the two-tailed test.
T-test Results
    Absolute  Direction  Mean  Information
df
t Stat
Significance

These results do not clearly provide an answer to the hypothesis. The results are interesting because they marginally favour batch weight updates. Table 6.18 shows the T-test results using the two-samples-assuming-unequal-variance version. None of the four accuracy measures shows statistically significant independence between the sets, which suggests the two sets of results overlap.

Using the method detailed in the study framework, neither method of error propagation can be asserted as providing the most accurate test predictions, as each outperforms the other in two of the four accuracy measures.

Conclusions

These results were somewhat surprising, as they failed to give a conclusive answer to the hypothesis. The T-tests shown in Table 6.18 conclude that the results are statistically inconclusive. This study is interesting because the batch learning method performed so well. The majority of research currently available concludes that the batch learning method is sub-optimal; however, for this problem the study suggests it is as effective as online error propagation. [Rohde] provides an interesting insight as to why this might be. It states that for training on fewer than approximately 500 patterns, batch training is often best, because the error surface is consistent between batches. It argues that: [With online learning] too much time is generally spent performing weight updates and the error surface will be changing radically from one update to the next...[this] may make it hard to distinguish normal variation from spikes in the error curve. We are training with only 151 examples, a very small set in general ANN terms, therefore this insight is relevant. If [Rohde] is correct and online learning causes the error surface to change dramatically between updates, the ANN struggles to learn because its reference point is constantly changing. The training set is small, so the argument that batch training increases the likelihood of getting caught in local minima is diluted. This is because the batches are small enough to provide sufficient natural noise for the error to jump, causing volatility which prevents convergence on local minima. [Rohde] suggests that good batch sizes are often between 20 and 200 samples. Our set fits comfortably within this range.

As these findings are not conclusive, we must look to the findings of other research to determine the best error propagation strategy. [Magoulas et al., 2001] states that the online learning method adequately avoids local minima and provides a more natural approach to learning time-varying functions. [Crone, 2005] created a similar ANN predictor and used online learning. The initial research into this configuration also suggested online learning was optimal. Because of these three sets of arguments, it will be assumed that online learning is most appropriate, and thus it will be used for the remainder of this project.

Pruning Algorithms and Parameters

Problem Description

In Design section 4.1.1, we found that pruning algorithms can improve the generality of predictions. The previous study highlighted the importance of learning generality, so this could potentially be a very influential study.

As JavaNNS, the ANN system being used, already has the Magnitude, Optimal Brain Surgery (OBS) and Optimal Brain Damage (OBD) algorithms implemented, this study will compare the results of each algorithm so that the optimum can be determined. [Thimm and Fiesler, 1996] explains: It is shown that these novel techniques [pruning algorithms] are hampered by having numerous user-tuneable parameters, which can easily nullify the benefits of these advanced methods. This means that before we can compare the three pruning algorithms the optimum parameters must be found. As each of these algorithms uses different techniques, the optimum parameters may vary between them. Therefore, this study must provide sufficient flexibility to capture this.

Justification of Choice

There are two parameters, Maximum error increase and Accepted error, in accordance with the discussion in the Design chapter. Due to time constraints it is not possible to trial all combinations of these for all three algorithms. Therefore, a study has been developed which should provide the most accurate results possible within the time constraints. For each parameter, test samples must be chosen which provide results across the range of potential values. However, the number of samples must be limited, because with each sample the number of trials increases quadratically. If three samples are taken, one from each end of the range and one from the middle, it will be possible to create a curve joining the three points to identify the optimum values 11. To implement this procedure as efficiently as possible, five trials are required for each algorithm. Table 6.19 shows what they will be.

Table 6.19: The parameter settings to be used in the pruning algorithms studies.
Trial  Maximum error increase  Accepted error
These ranges were determined to be appropriate after initial trials.

The five trials will be repeated for all three algorithms. For each experiment the error measures will be calculated and charted to determine the optimum parameters for each algorithm. At this point we can determine the most effective algorithm by pruning the network using the optimum parameters and observing the results. Even if the chosen optimum parameters produce results which are improved upon within the trials, the results of this test should still be taken as the results for the algorithm, else the fairness of the experiment is compromised.

11 With only two values, one from each end of the range, it would only be possible to create a linear best-fit line, which assumes that the performance of the pruned networks has a linear correlation with each parameter, and this is highly unlikely.

Although this method is not ideal, compromises are unavoidable given the time constraints of this project. It is important to note that a fundamental assumption has been made: that the optimum algorithm and parameters found here will also be the optimum for all companies involved in the suitability experiment. This assumption is made because of time constraints as opposed to statistical evidence, so when results are analysed and conclusions drawn, this assumption must be considered.

Hypothesis

As explained in the Design, none of the papers studied gave an explanation as to when each algorithm would be most appropriate. As a result of the limited information, strong hypotheses cannot be generated. However, it is felt that Optimal Brain Surgery will perform better than Optimal Brain Damage because of the assumptions made by OBD 12, which, as stated in the Design, are often inappropriate. It is believed that the Magnitude algorithm will also perform very well, possibly to the highest standard, because [Thimm and Fiesler, February, 1997] compares it to a number of pruning algorithms on different datasets and it was found to give the best generality overall. However, none of the datasets to which those tests were applied were financial, so this hypothesis could prove incorrect.

Study Framework

As this study has the potential to dramatically sway this project's final findings, it is essential that the framework is robust and provides a fair environment to allow for the most accurate results possible. However, time constraints are an important issue. There will be 15 trials with five iterations, which results in a total of 75 training and testing cycles. Therefore the following testing strategy is outlined:

Each time, the connection weights are to be re-initialised to random values within the range [-1, 1].
Shuffling: yes
learning rate: 0.2
learning algorithm: back-propagation with online weight updates
epoch:
topology: fully connected
activation function: logistic

For each trial the training and testing will be repeated five times and the median test prediction will be the result which is recorded and carried forward for comparisons.

12 See the Design for the details of these assumptions.
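As an illustration of how the Magnitude algorithm interacts with the two parameters, the sketch below repeatedly removes the connection with the smallest absolute weight. The interpretation of Accepted error and Maximum error increase as stopping conditions is an assumption made for the sketch; the exact semantics implemented in JavaNNS may differ.

import numpy as np

def magnitude_prune(weights, evaluate, accepted_error, max_error_increase):
    # weights: flat array of connection weights; evaluate(mask) retrains with the
    # masked-out connections removed and returns the resulting error.
    mask = np.ones_like(weights, dtype=bool)
    error = evaluate(mask)
    while mask.any():
        alive = np.where(mask)[0]
        candidate = alive[np.argmin(np.abs(weights[alive]))]   # smallest surviving weight
        trial = mask.copy()
        trial[candidate] = False
        new_error = evaluate(trial)
        # Reject the removal (and stop) if the error grows beyond the accepted
        # level, or if this single removal increases it by too much.
        if new_error > accepted_error or new_error - error > max_error_increase:
            break
        mask, error = trial, new_error
    return mask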

Results

Table 6.20 shows the optimum parameters, found as a result of viewing Figures C.5, C.6 and C.7.

Table 6.20: The optimum pruning parameters as a result of the experiment.
Algorithm  Accepted Error  Maximum Error Increase
Magnitude
OBS
OBD

Table 6.21 details the performance figures for test predictions when these parameters were used.

Table 6.21: The accuracy measures computed after pruning using the optimum parameters.
Algorithm  Absolute  Direction  Mean  Information
Magnitude
OBS
OBD

Although these results are not conclusive, they do favour the Magnitude algorithm over the Optimal Brain Surgery algorithm in all four accuracy measures, and only the directional accuracy measure favours Optimal Brain Damage. When compared to other studies, these results are also statistically inconclusive in determining whether pruning has improved the test predictions. However, the results do show a slight improvement, thus it is felt pruning is warranted in the suitability experiment, especially because by using pruning we may be able to deduce which inputs show the strongest correlation to the share price across a number of sectors and companies, which could be valuable for future investigations. The results of each trial can be seen in Appendix C.3.

Conclusions

The results show that the Magnitude algorithm was the most effective, but only by a marginal amount. The results suggest the hypotheses made were correct; however, this study cannot conclusively assert this, as a more detailed and rigorous study would be required. It is felt, however, that this study is a useful pilot study. As far as this project is concerned, these results are to be viewed as sufficiently conclusive and accurate. This is due to the time constraints preventing deeper investigations 13. Therefore, for the suitability experiment the Magnitude pruning algorithm will be applied with the following parameter settings:

Accepted Error:
Maximum Error Increase: 1.4

13 Pruning runs took in the region of three hours when the parameters were at their smallest. A more rigorous study would require at least a quadratic increase in the number of pruning runs, as at least one additional company should be trialled, which is not possible within the time constraints of this project.

At this point all pre-experimental studies are complete. All that is left to consider before the suitability experiment can begin is the initial phase of the input selection, as defined in the Design.

6.6 Selection of Input Data - Phase One

Introduction and Framework

To establish which inputs show a correlation with the share price, visual inspection is required. Each time-series is to be plotted against the share price. For many of these charts it will be necessary to have two y-axes, as ratios tend to be in a single-figure range while the share price can be in thousands of pence. When looking for correlations, the following patterns will be screened for:

Positive Features
synchronised peaks and troughs
reflexive symmetry
synchronised general trending

Negative Features
Dynamic correlations, or correlations that are not maintained. This is important because if a correlation is learnt which is only short-lived then test predictions could be dramatically incorrect.
Dramatic sudden changes in the input data that have no effect on the share price. If a correlation exists, one would expect a dramatic sudden change to have a consequence on the share price. This change can be minor but must be clearly distinguishable through the noise.

The following potential inputs were extracted from Datastream:

Dividend Yield
Price to Book (PB or PTBV)
Return on Capital Employed (ROCE)
Cashflow Margin
Working Capital Ratio
Acid Test (or Quick Test)
Market Capitalisation
Operating Profit Margin
Price to Earnings (PE)

Price to Cashflow (PCF)
Return on Investment (RI)
Earnings per Share (EPS)
Interest Cover
Earnings Before Interest, Taxation, Depreciation or Amortisation (EBITDA)
Return on Equity (ROE)
Total Sales 14
Retail Index
Gross Domestic Product (GDP)
Consumer Price Index (CPI)

A huge number of data types come under the fundamental data umbrella. To include them all, or even the majority, would be an impossibility within the time constraints outlined for this project. The twenty chosen include all of the data types introduced in the Literature Review as far as possible, as well as an additional selection which could be considered by investors. This is a much larger group than in any of the fundamental data papers considered. Therefore, it is felt that if it is possible to make accurate stock price predictions using fundamental data, it would be possible with this set.

Findings

The charts plotting each data type against the share price can be seen in Appendix D. This study failed to ascertain which data types to discard at this point. Many of the charts showed a strong correlation up to a point, followed by a period when the correlation was broken. Even for the Market Capitalisation and Return on Investment time-series, which are well correlated to the share price when there is no share price lag, it was hard to observe lasting correlations. It was decided that the macro-economic data, that is the Retail Index, Gross Domestic Product and the Consumer Price Index, showed no correlation bar the very general uptrending. However, that could be found in a number of the other data types, which also show at least some periods of peak and trough synchronisation. As a result they were removed from further consideration.

As the simple visual inspection failed to be conclusive, research into alternative techniques was carried out and Weka 15 was introduced as a result. Weka's visual charting tool was used, which reinforced the initial opinion that a confident assertion could not be made regarding which data types to remove. This study did provide a small number of data types that potentially showed a good relationship with the share price. These were Dividend Yield, Return on Investment, Price to Book and Market Capitalisation. Using Weka's visualisation tool a fairly distinct correlation could be determined; Figure 6.12 shows Dividend Yield against Share Price as a demonstration. As a result one can hypothesise that at least these four data types will remain after pruning.

14 Total Sales was removed from consideration at the initial pre-experimental study stage, as it was found to be the cause of the dramatic and very incorrect predicted price drop at prediction 18. See Figure 6.3 for an example.
15 Weka is an open source collection of machine learning algorithms for data mining tasks. Weka is particularly useful because it contains tools for pre-processing, classification, regression and visualisation.
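The twin-axis charts described in the framework above can be produced with matplotlib's twinx; plot_against_price is an illustrative helper rather than part of the project's tooling.

import matplotlib.pyplot as plt

def plot_against_price(dates, ratio, price, ratio_name):
    # Plot a fundamental ratio and the share price on separate y-axes, since the
    # ratio is typically a single-figure value while the price can be thousands of pence.
    fig, ax_ratio = plt.subplots()
    ax_price = ax_ratio.twinx()
    ax_ratio.plot(dates, ratio, color="tab:blue", label=ratio_name)
    ax_price.plot(dates, price, color="tab:red", label="Share price (pence)")
    ax_ratio.set_xlabel("Month")
    ax_ratio.set_ylabel(ratio_name)
    ax_price.set_ylabel("Share price (pence)")
    fig.legend(loc="upper left")
    fig.tight_layout()
    plt.show()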

Figure 6.12: The chart on the left is from Weka's visualisation tool. It shows a snaking pattern from top left to bottom right, which suggests the correlation is reflexive. On the right is the comparison chart, which also shows a reflexive correlation between dividend yield and the share price. To understand Weka's visualisation chart see Figure D.4.

With the macro-economic data types removed there are fifteen inputs. Due to time constraints a smaller group would have been preferable, but the visual inspection phase of the input data selection failed to decisively rule out any more; however, on further consideration this was a positive conclusion. As stated in the suitability experiment section of the Design, every company will have a different business model. This means that when an investor is evaluating the company's investment appeal, the data types which are used should also vary. At this stage it would therefore be unwise to remove potential inputs because they failed to conclusively correlate with one company's share price, as they may be influential in another. Thus it was decided that the only data types excluded at this point are the macro-economic kind, because the small correlation distinguishable in them is also prevalent in a number of others, and it is felt this will be the case for all companies.

6.7 Conclusions from Experimental Studies

In conclusion, a number of configurational problems have been identified and resolved using pre-experimental studies, which has led to a number of ANN parameters being tested and appropriate settings determined. As a result, suitability tests can be executed with the belief that the ANN parameters are not going to be detrimental to findings. It must not be forgotten, however, that a number of studies produced findings which would not be deemed statistically conclusive, or contained assumptions that may prove to be incorrect. These factors must be incorporated into the eventual findings and analysis, as well as providing suggestions for further investigation.

Chapter 7

Suitability Experiment and Results

Pre-experimental studies have outlined an ANN configuration which should provide the most positive results possible for the suitability experiment. At this stage we apply the ANN to ten companies taken from both asset intensive and non-asset intensive industries, evenly distributed by market capitalisation. These results should suggest when this type of system would be most effective and highlight the fundamental data types which are most suitable.

7.1 Suitability Experiment Configuration

As a result of the pre-experimental studies and background research, the following configuration has been identified as appropriate. We cannot be sure this is optimal, as even some pre-experimental studies failed to give conclusive answers. It is, however, felt that the configuration defined here is sufficient to discredit the argument that incorrect ANN parameterisation has caused poor results.

Before each training run the connection weights are to be re-initialised to random values within the range [-1, 1].
Shuffling: yes
learning rate: 0.2
learning algorithm: back-propagation with online weight updates
epoch:
activation function: logistic
pruning algorithm: Magnitude
pruning parameters: Accepted Error and Maximum Error Increase (the next section explains how these values were conceived)

In accordance with the experiment outlined in the Design chapter, the following procedure has been devised for the suitability experiment.

Create a fully connected network, including all fifteen inputs which passed the initial phase of input selection.
Prune the network.
Create a new fully connected network containing the inputs that remain after the initial pruning, plus the six non-annual inputs 1 in relative form.
Prune the network again.
Train and test the network iteratively until a median test prediction can be determined.
Record the results.

The following companies were chosen as the final ten.

Asset Intensive Companies
BP
Severn Trent
Johnson Matthey
DS Smith
Severfield Rowen

Non-Asset Intensive Companies
BATS
Boots
Trinity Mirror
BSS Group
Alba

These were chosen because they belonged to the correct sectors, were of the appropriate market capitalisation and, after visual inspection, their time-series did not appear to contain any unlikely, inconsistent or impossible values.

7.2 Adjustment to Pruning Parameters

Problem Description

During initial studies it was found that a number of networks had their entire hidden layer pruned, thus leaving them corrupt. After diagnostic tests it was found that the Maximum Error Increase (MEI) was the cause of this because it was too large. A new pre-experimental study was devised to find a resolution.

1 An earlier section explains why only the non-annual data can be used in relative form.

Implementation

To ensure the results of this pre-experimental study resolved the aforementioned issue, the range of MEI values considered was [0.1, 0.8]. As in the original pruning parameters study, five trials were implemented. Table 7.1 shows the parameters.

Table 7.1: The pruning parameter settings for each trial.
Trial  Maximum error increase  Accepted error

However, the following adjustments were made due to limited time and to improve the conclusiveness of results.

Rather than just applying this experiment to BP, it was applied to three companies: BP, Severn Trent and British American Tobacco (BATS). Applying the study to three companies better suggests whether the optimum parameters found are consistent across the whole market.

The network was limited to five inputs chosen at random for every company, because the size of the network had to be reduced to shorten the pruning time. Three were chosen from the price-related ratios and two from the annual data. This was to prevent too many annual data types being chosen, since they only fluctuate yearly, which limits the ANN's ability to learn month-to-month share price movements. As we took a subset of inputs, it would be possible to argue that the results were an artefact of the inputs chosen. To nullify this argument the inputs were chosen at random.

Although the initial results were not sufficient to determine the Magnitude algorithm as the best, and the parameter range was different, [Thimm and Fiesler, February, 1997] compares pruning algorithms over eleven datasets and the Magnitude algorithm gave the best overall generalisation across the range. As the pruning algorithm would be applied to ten different datasets, these findings are important. It was decided at the design stage that if time constraints became a factor then at least the Magnitude pruning algorithm would be applied. Unfortunately, because this unexpected setback occurred at a late stage in the project, time has become the limiting factor. As a result the study implemented only the Magnitude algorithm. The rest of the ANN configuration and study framework was as stated in the original pre-experimental study.

Results

Figures E.1, E.2 and E.3 in the Appendices show the Accepted Error and Maximum Error Increase charts, which are interesting for two reasons. Firstly, one can observe that there are similarities between the charts belonging to BP and Severn Trent, which belong to the asset intensive segment, but the charts for BATS 2, which belongs to the non-asset intensive segment, are quite different. This suggests, as the hypothesis states, that the two segments are evaluated differently.

2 BATS would be placed in the non-asset intensive segment of the suitability experiment because, although they are a manufacturing firm, their earnings are largely due to investments rather than tobacco sales.

Secondly, although the charts are variable, the optimum values were similar (see Table 7.2).

Table 7.2: The optimum pruning parameters for each company.
Company  Maximum Error Increase  Accepted Error
BP
Severn Trent
BATS
Average

As a result of this study, the averages shown in Table 7.2 were used as the parameters in the suitability test.

7.3 Results of the Suitability Experiment

The results of this experiment were interesting and suggest, to varying degrees, that the initial hypothesis was correct. Tables 7.3 and 7.4 contain the accuracy of the predictions.

Table 7.3: The accuracy measures of test predictions for each asset intensive company in the suitability experiment.
Asset Intensive Companies
Company  Market Cap (£ mil)  Absolute  Directional  Mean  Information
BP
Severn Trent
Johnson Matthey
DS Smith
Severfield Rowen
Mean

Table 7.4: The accuracy measures of test predictions for each non-asset intensive company in the suitability experiment.
Non-Asset Intensive Companies
Company  Market Cap (£ mil)  Absolute  Directional  Mean  Information
BATS
Boots
Trinity Mirror
BSS Group
Alba
Mean

T-tests for two samples assuming unequal variance were calculated for each accuracy measure. The two-tailed test is used because these two populations should be from distinct segments. The results are shown in Table 7.5. The averages for all four accuracy measures favour the asset intensive set. Although the Absolute, Mean and Information errors are all shown to be less than 70% conclusive, the directional error is 99.6% conclusive.

Table 7.5: T-test results. The T-test calculations are for two samples assuming unequal variance. The t Stat is for the two-tailed test.
    Absolute  Directional  Mean  Information
df
t Stat
Significance

This is seen as statistically significant, thus the results suggest the original hypothesis was correct. When these results are compared to similar studies which use technical data, the results are quite positive. [Bodis, 2004, Dorsey and Sexton, 2000] both make stock market predictions using technical data and measure the directional error. [Dorsey and Sexton, 2000] has a range of directional error of [58.46%, 66.80%], which is worse than our asset intensive stock predictions, but better than our non-asset intensive predictions. [Bodis, 2004]'s predictions take the range [48.90%, 66.23%], which straddles both segments. As both of these papers conclude that their results are satisfactory, it is fair to assert that making stock price predictions using ANNs with fundamental data shows potential, in some cases more so than when technical data is used.

However, one must remember that technical data systems are usually used to make daily predictions, which approximates to 270 trades per annum. Fundamental data systems, including the one developed here, are for less frequent trading. This means that technical data systems have the advantage of quantity, because as volume increases the volatility of results decreases, as the distribution of results converges on the norm. The system implemented here would be making two predictions a year. Therefore, with a directional accuracy of approximately 68% 3, there is approximately a 10% chance that both predictions would be in the wrong direction, potentially causing significant losses. The chance of making one prediction in the wrong direction is lower, but still approximately 22%, which makes the system quite a high risk. Therefore, although the percentage of predictions in the right direction for fundamental data systems is at least as high as for technical data systems, due to the difference in trading volume it needs to be higher to be equally viable for real-life application.

At this point it is worth discussing the values of mean reversion and information coefficient (commonly referred to as the mean and information errors respectively during this project). Throughout the pre-experimental studies and suitability experiment, the values for these indicators have not been particularly satisfactory. This is because values greater than one indicate the predictor is inferior to the trivial predictor to which it is compared. None of the predictions gave an information coefficient value of less than one, and some were as high as eight. This would suggest that the ANN predictor is poor by traditional standards, but this can be explained. In general, predictions failed to follow the correct price trending gradient because the predictions were too flat. This resulted in a poor information coefficient, because every prediction failed to intersect the current-to-previous range of the real price. Two reasons are suggested as to why this occurred. Firstly, the input which determines the gradient of trending was not included, and as the same input set was used for all tests this could be the problem. In initial tests Total Sales was included, which produced a more accurate gradient, but part way through the predictions would veer off, causing them to be dramatically incorrect 4.
3 The average from the asset intensive predictions.
4 This is documented in section 6.6, where example charts are shown.

When this was investigated, it was found that the correlation between total sales and the share price reversed in the middle of the test prediction period, sending the ANN predictions in the opposite direction. This leads to the second explanation for the poor information coefficients. The correlations between the share price and the inputs are probably dynamic, as businesses often adapt their business model, making investors adjust their evaluation model in response. This makes it difficult for the ANN to predict the correct gradient, because in order to do this it must learn dynamic correlations, which would be impossible if the nature of the dynamic is random.

This suggests that using the random walk hypothesis to make predictions would give better accuracy than the ANN predictor. However, this does not necessarily mean it is a better prediction method for real-life application, because the most important characteristic of a real-life predictor is that it accurately predicts direction changes, as this prevents losses. Table 7.6 shows the directional change accuracy when the Random Walk Hypothesis 5 was applied to the asset intensive companies.

Table 7.6: The directional accuracy when the Random Walk Hypothesis was applied to the asset intensive companies.
Company  Directional Accuracy
BP
Severn Trent
Johnson Matthey
DS Smith
Severfield Rowen

These results show that for all five companies the ANN predictor had higher directional accuracy, potentially suggesting it is superior for real-life use. For the suitability experiment, only two predictions resulted in mean reversion values greater than one, indicating that in most cases the predictor was more accurate than simply taking the average share price.

In the Design chapter we hypothesised that combining relative inputs with absolute inputs would give the most accurate test predictions. Table F.5 in Appendix F shows that for every company within the asset intensive segment, the predictions improved as a result of including the relative data types. However, for the non-asset intensive segment the results were less conclusive, thus the null hypothesis can be asserted. This is probably because the system failed to make good test predictions for these companies, suggesting that these inputs were unsuitable; therefore pre-processing into relative form was irrelevant. These results suggest that the hypothesis only holds for asset intensive companies. However, if suitable inputs can be found for non-asset intensive companies then this should be re-assessed.

To potentially aid further work in this field, Tables F.1, F.2 and F.3 in the appendices show which inputs remained after pruning for each company. The input that was pruned away the least often was return on investment, which remained in the final topology for eight of the ten networks. Price to earnings was the second most influential, being included in seven of the ten pruned networks. Joint third were dividend yield and earnings per share. These results confirm the findings of [Campbell and Shiller, 2001, Rey, 2004], which state that future share prices correlate with dividend yield and price to earnings. Also, all of these were in real format, as opposed to being converted into relative form. None of the indicators were pruned out in every network; however, the least popular inputs were price to cashflow and return on investment in their relative forms. These findings back up the statement in [Vanstone et al., 2004] which suggests that raw data is preferable to pre-processed data.

5 The Random Walk Hypothesis states that the best predictor of the next share price is the current price. In other words, it predicts that the next share price will be the same as the current price.
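Returning to the accuracy measures discussed above, a sketch of how the directional accuracy and the two ratio-style measures could be computed is given below. The precise formulations used by the project are defined in its testing chapter and are not reproduced in this section, so the definitions here (direction matched against the previous month's price, and RMSE ratios against a trivial benchmark) are assumptions made purely for illustration.

import numpy as np

def directional_accuracy(pred, actual, previous):
    # Fraction of predictions whose direction of change from the previous
    # month's price matches the actual direction of change.
    return float(np.mean(np.sign(pred - previous) == np.sign(actual - previous)))

def relative_error(pred, actual, benchmark):
    # Ratio of the predictor's RMSE to a trivial benchmark's RMSE; values above
    # one mean the predictor is inferior to that benchmark. Using the previous
    # price as the benchmark mirrors the random-walk comparison (information
    # coefficient); using the mean price mirrors the mean-reversion measure.
    rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
    return rmse(pred - actual) / rmse(benchmark - actual)

# Random-walk benchmark: predict that the next price equals the current price,
# i.e. pass benchmark = previous.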

The average number of inputs that remained after pruning was 8.6 out of a possible 21. This experiment also found that the vast majority of connections were removed as a result of pruning. Out of the 462 connections which were possible, the average number of connections after pruning was just twenty. This suggests that when fully connected networks are employed, generalisation is dramatically reduced. This is further evident because, of the four networks that included more inputs than the average, three belonged to the non-asset intensive set, which were shown to provide poor prediction accuracy. This consolidates the findings in [Dorsey and Sexton, 2000], where after pruning the network was also dramatically reduced, with positive effect.

When the companies have been placed in their appropriate segments, further conclusions can be drawn by looking at the inputs which remain after pruning. The results are especially interesting because they are somewhat contradictory. For the asset intensive segment, cashflow margin was quite important because it remained in three of the five pruned networks; however, for the non-asset intensive segment it was always pruned out. The exact opposite can be said for operating margin, as it was redundant in all five asset intensive networks, but remained in four of the five non-asset intensive networks. Dividend yield remained in four of the five asset intensive networks, but only two non-asset intensive. These findings can be explained by the nature of the segments and the companies within them. Dividend yield is important to an asset intensive company's share price because a large amount of capital is required for the company to grow. As a result, these types of companies tend to grow very slowly, meaning earnings growth is not fast, and thus the share price does not rise quickly. To compensate for this, these firms usually provide large dividends to make the investment more appealing. Therefore, the dividend yield would be seen as highly influential on the share price. Alternatively, non-asset intensive companies find it easier to grow because less capital is required, thus dividend yield is less important. The reasoning behind the importance placed on operating and cashflow margins is less obvious and is probably a result of subtle company-related factors, rather than the somewhat primitive segmentation currently employed. To determine whether these are significant characteristics, a larger cross-section of companies would be required, as the sample size used here is relatively small and could potentially be misrepresentative.

A problem did arise during this experiment that is worthy of discussion. For a few of the companies, the outlier prediction was actually best. It was often very different from the consistent test predictions, and sometimes the error chart showing the change in MSE during training suddenly plummeted, as if the training had jumped out of a local minimum. Table 7.7 shows an example of the outlier prediction against the consistent one. This suggests that the error surface is volatile and unsmooth.
ANNs learn using gradual, incremental movement across the error surface, which makes them susceptible to finding and getting caught in local minima. Methods can be used to reduce this problem, for instance the use of online error propagation and the random presentation of input-target pairs; however, these can only reduce the likelihood of the problem rather than remove it. Genetic algorithms are an alternative soft computing technology, which makes a more random and scattered movement across the error surface. This gives them a higher chance of finding the global as opposed to a local minimum,

which suggests they may be more appropriate for this domain.

Table 7.7: The accuracy measures computed on both the outlier and consistent test predictions.
BSS Group Test Predictions
Prediction Type  Absolute  Direction  Mean  Information
Consistent
Outlier

7.4 General Conclusions

A number of pre-experimental studies have been executed to find the ANN settings which are domain dependent, and the suitability experiment has shown when a system of this nature can be effective. With the completion of each pre-experimental study, a suitable ANN parameter was found and, iteratively, the complete set of parameters was established, at which point the suitability experiment could begin.

The initial studies tackled problems with the consistency of learning. The causes were found to be the time-ordered presentation of input-target pairs and the hyperbolic tangent activation function. Another study explored the effect of different ranges of initial random weights. Findings suggested that the smaller the size and range, the higher the consistency of test predictions; however, if they were too small, test prediction quality deteriorated. A range of [-1, 1] was found to be most suitable.

The accuracy studies examined the system's ability to learn predetermined, commonly trialled functions, to verify that the system was functioning correctly. These studies tested its ability to learn the basic sine wave and the Bumps function, which shares common characteristics with financial time-series. Both studies provided results suggesting the system learnt the functions well and that the resolution of the input datasets did not have a significant bearing.

From section 6.5 onwards, we focused on improving the quality of test predictions. The results suggested that a single hidden layer was optimum and, although the pre-experimental study in section 6.5.2 was inconclusive, research suggested that online error propagation was preferable to batch learning.

The final pre-experimental study concerned the pruning algorithms and parameters. Although the study contained assumptions, the Magnitude algorithm was concluded to be the most suitable, with optimum values found for the Maximum Error Increase (MEI) and the Accepted Error (AE). A limited version of this study was carried out at the beginning of the suitability experiment because the MEI level of 1.4 was too high. The results suggested that revised values for the MEI and AE would be better.

To conclude the development process, the initial phase of input selection was conducted.

This phase led to only the macro-economic data types being discarded, because for all other inputs the degree of correlation with the share price was too difficult to measure visually.

The suitability experiment has provided results which suggest that the hypotheses for both itself 6 and the project 7 were correct when predictions were made on asset intensive companies. Although the T-test suggests the results are statistically significant, further experiments are required over a larger cross-section of companies to be sure that these findings are a true representation of the entire market. This is especially necessary as the revised pruning parameters study suggested that it may be necessary to look at asset intensive and non-asset intensive companies as two distinct segments and to tackle ANN configuration independently for each. This implies that the suitability experiment was biased towards asset intensive companies, which does limit how conclusive these results can be in determining the validity of the suitability experiment's hypothesis. However, this potential bias does not have any implications concerning the validity of the conclusions drawn on the project's hypothesis, thus it should be viewed as successful.

As a final point which should not be overlooked, the Information Coefficient 8 was always greater than one, which suggests the prediction quality is inferior to using the Random Walk Hypothesis. However, when looking at it from a real-life application standpoint, the directional change results suggested that the ANN system would actually perform better. This compares the ANN system to a traditional prediction method and suggests it is superior.

6 The hypothesis for the suitability experiment is detailed in an earlier section.
7 See 1.1 for the project's hypothesis.
8 The Information Coefficient is often referred to as the Information Error in this project.

Chapter 8

Conclusions

The findings of this project suggest that using artificial neural networks with fundamental data to make stock price predictions has potential. However, an array of pitfalls was uncovered which makes their development difficult and potentially limits their application. These pitfalls are a result of both the domain and the ANN technology.

8.1 Pitfalls

Domain

The suitability experiment suggests that this form of system may not be effective for all companies. The findings imply that if a company's share price evaluation is based on non-tangible factors, then an ANN system may struggle to make accurate share price predictions. Furthermore, the pruning results imply that the set of suitable input data types is dependent on at least the sector and may even be company specific; it may also change over time. An investigation into this would be an interesting extension to this project, but if these informed assumptions are accurate then they highlight the complexity of the problem.

The correlation between fundamental data and the share price is dependent on the size of the share price lag, highlighting another critical factor that is difficult to determine. For example, [Buffett and Clark, 2002] states that Warren Buffett believes it can be as long as ten years before a company's share price represents the firm's true value based on fundamental measures. This outlines a potentially serious limitation of using an ANN system. The only way to leave the choice of lag up to the ANN is by windowing every input time-series. However, if we were to use the same fifteen inputs as in this investigation, with a window size of just five 1, then 75 inputs would be required. [Shachmurove, 2002] states that too many inputs can have a detrimental effect on results. Therefore, pruning would be essential, but pruning a network of this size would be enormously computationally expensive 2, making this exercise unfeasible with standard resources and time-frames.

The size of the fundamental data set is another implicit problem which would ideally require a large number of ANN inputs.

1 This would mean the correct lag must be within a five month range.
2 Using a computer with a 1.8GHz Pentium 4 with 256MB of RAM, it took in the region of three hours to prune a fully connected network using the Magnitude algorithm.
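A sketch of the windowing this would require is shown below; window_inputs is an illustrative helper that turns each monthly series into a block of lagged copies, so fifteen series with a window of five become the 75 inputs mentioned above.

import numpy as np

def window_inputs(series_matrix, window=5):
    # series_matrix: one column per fundamental data type, one row per month.
    # Returns, for each usable month, the current value and the previous
    # window-1 values of every series, stacked side by side.
    n_samples, n_series = series_matrix.shape
    lagged = [series_matrix[window - 1 - k : n_samples - k, :] for k in range(window)]
    return np.hstack(lagged)        # shape: (n_samples - window + 1, n_series * window)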

Fundamental data is anything which gives information about the company or the business environment. As a result, the set is large and diverse, and different data types may be of interest at different times. For example, when an industry is in a boom, industry-wide indicators may be driving a company's share price, while in industry recessions the company's own figures may be most important. If a lawsuit is filed against a company, the driving factor may move from earnings to liquidity figures. This means that to achieve accurate predictions a large number of different fundamental data inputs are required; however, citing [Shachmurove, 2002] once more, having too many inputs may have detrimental effects on predictions. The dynamic nature of the relationships will not be learnt well either, especially correlation changes resulting from events like lawsuits, because it is not possible to forecast their occurrence.

Another difficulty occurs when sourcing the fundamental data. A large proportion of the companies examined contained either missing data or data which appeared inaccurate. This dramatically limited the choice of companies unless the techniques introduced in [Numerical Algorithms Group, 2002] were implemented. However, artificial data imputation is not to be taken lightly: a poor technique may disrupt the structure of the inputs and change the correlation with the share price. Due to time constraints an appropriate investigation into these techniques could not be executed.

Lastly, the pre-experimental study into pruning parameters in section 7.2 showed that it may be necessary to find optimum settings separately for asset intensive and non-asset intensive companies. If this is the case then this project may not be a fair test, because the optimum settings were discovered using BP, which is an asset intensive company. More segmentation may also be possible, for instance segmenting by market capitalisation, as this has an effect on a stock's liquidity 3, and thus its price. If this is the case then the complexity of using fundamental data is once again apparent, because one must be sure that the market has been segmented correctly and that the optimum settings have been found for each segment. It may be that this type of system is equally effective for all companies; one must just be sure that the system's settings are correct for the segment to which each company belongs.

ANN

The majority of the limitations and problems were because of the ANN technology. There are a number of parameters and they are usually domain specific and inter-dependent. As can be seen from the pre-experimental studies, finding the correct configuration can require a large number of experiments and the results may not be conclusive. As a result, the findings of this project may be inaccurate, and further projects may find a better ANN configuration, giving more accurate and consistent predictions.

Because of the relative simplicity of ANN learning methods, or at least back-propagation learning, they are susceptible to getting caught in local minima. This is a sizeable problem in this domain because the error surface is irregular and volatile. We determine this from the consistency problems found in the early pre-experimental studies and the number of outliers observed in the suitability experiment (see Figure 8.1 for an example).
8.1.2 ANN

The majority of limitations and problems were a result of the ANN technology itself. There are a number of parameters, and they are usually domain specific and inter-dependent. As can be seen from the pre-experimental studies, finding the correct configurations can require a large number of experiments, and the results may not be conclusive. As a consequence, the findings of this project may be inaccurate, and further projects may find a better ANN configuration, giving more accurate and consistent predictions.

Because of the relative simplicity of ANN learning methods, or at least of back-propagation learning, they are susceptible to getting caught in local minima. This is a sizeable problem in this domain because the error surface is irregular and volatile. We infer this from the consistency problems found in the early pre-experimental studies and from the number of outliers observed in the suitability experiment (see Figure 8.1 for an example). Although we were able to dramatically improve the consistency of predictions after refinement, outliers were still fairly common, and even between consistent results the error measures varied enough to undermine confidence. It is felt that in a real-life application, where the choice of test prediction is critical, these variations would make it very hard to assert that the optimum prediction has been chosen.

Figure 8.1: This chart shows the change in Mean Square Error during training. In two of the training runs the error can be observed to drop dramatically a number of times; this is when the learning jumps out of a local minimum. The black error line even follows a completely different path, suggesting it has found a different minimum.

Compounding the local minima problem, the findings in section 7.3 suggest that within this domain the global minimum can be a very narrow basin in comparison to the local minima. Thus, the consistent predictions are often just the largest local minimum. This is believed because, for a number of suitability tests, an outlier result was the most accurate although infrequently repeated. This suggests that other technologies may be more appropriate. For instance, genetic algorithms are less susceptible to local minima because they observe the error surface more expansively [4]. Another potentially interesting continuation of this work would be to repeat it, exchanging the ANN learning for a genetic algorithm or similar alternative technology.

Moving to the input data requirements implicit in ANN technology, it is felt that this style of prediction system is more suited to technical data. This is because, from the technical data viewpoint, all companies are the same: the market's opinion of the stock [5] is what is critical. As a result, the same data types are suitable for all companies, which immediately reduces the complexity of the problem. Secondly, a debate looms regarding the efficiency of stock markets, and it can be argued that they are random in nature and unpredictable. It is felt that, because of this, it is impossible to make highly accurate predictions with current technology. To compensate, a large volume of predictions is needed. This is not possible with fundamental data because of its long-term nature, but it suits technical data perfectly. Finally, technical data is prevalent and can be obtained free of charge from numerous vendors [6]; because of this it is very easy to get hold of large quantities of accurate data, which is a critical requirement for effective ANN learning. In comparison, as was found in this investigation, the number of vendors providing fundamental data is limited and it is often very expensive. Furthermore, the data often contains missing or inaccurate values, which means advanced data mining techniques are required.

[4] Genetic algorithms parallelise the search using a population of search points and stochastic search steps [Braun and Ragg, 1996].
[5] This is what technical data indicators are designed to measure.
[6] For example, one free online service offers technical data for the majority of companies, covering a number of years.
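To illustrate why a narrow global basin is so easily missed, the sketch below runs plain gradient descent on an invented one-dimensional error curve with a broad local minimum and a much narrower, deeper global one. It is purely illustrative and is not the project's error surface.

```python
# A hedged, purely illustrative sketch of the "narrow global basin" problem.
# The 1-D error curve is invented: a broad basin around w = 0 and a much
# narrower but deeper dip near w = 4.5.
import math

def error(w):
    broad = 0.5 * w ** 2                                # wide, shallow basin
    narrow = -12.0 * math.exp(-20.0 * (w - 4.5) ** 2)   # narrow, deeper dip
    return broad + narrow

def gradient(w, eps=1e-5):
    # a numerical derivative is good enough for a sketch
    return (error(w + eps) - error(w - eps)) / (2 * eps)

def descend(w, rate=0.002, steps=5000):
    for _ in range(steps):
        w -= rate * gradient(w)
    return w

finals = [round(descend(float(start)), 2) for start in range(-5, 6)]
print(finals)
# Ten of the eleven starts settle near w = 0 (the broad local minimum); only
# the start at w = 5 reaches the narrow global minimum near w = 4.5, so
# repeated runs "agree", but on the wrong basin.
```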

8.2 Hypothesis

The original hypothesis was:

Artificial Neural Networks can be applied to fundamental data to make accurate stock price predictions.

Results showed that not all stock price predictions could be classified as successful. This suggests that the hypothesis is too broad and in need of refinement. Post-results analysis suggested pitfalls in the pre-defined assumptions, and limitations in the experiments, as the possible causes. This means it has not been possible to give a conclusive verdict on the correctness of the hypothesis. However, a group of the results suggests it is correct, if only in a more refined context. This can be asserted through inference, as the results showed a prediction directional accuracy similar to, if not greater than, that found when technical data has been employed, and those studies have been regarded as successful. The results here should be seen as a benchmark for further investigations, which may provide further evidence validating this hypothesis. This is because fundamental data is an extremely varied and complex dataset, so ANN preparation is critical. Although this project has been extensive, it is felt that more work is needed to determine the most appropriate ANN configuration.
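Directional accuracy, the measure leaned on here, is simply the proportion of test points at which the forecast moves in the same direction as the actual price. The sketch below shows one common way of computing it; the price figures are invented and the exact definition used in the experiments may differ slightly.

```python
# A hedged sketch of a directional accuracy measure: the fraction of test
# points where the predicted move from the last known price has the same
# sign as the actual move. The price series are invented.

def directional_accuracy(actual, predicted):
    hits = 0
    steps = len(actual) - 1
    for t in range(1, len(actual)):
        actual_move = actual[t] - actual[t - 1]
        predicted_move = predicted[t] - actual[t - 1]  # move implied by the forecast
        if actual_move * predicted_move > 0:
            hits += 1
    return hits / steps

actual = [100.0, 104.0, 101.0, 105.0, 103.0]
predicted = [100.0, 103.0, 102.0, 104.0, 106.0]
print(directional_accuracy(actual, predicted))  # 0.75 with these made-up figures
```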

JavaNNS

Using JavaNNS was very beneficial to the project. No time was lost developing the system, because of the range of algorithms and functionality it already contained. If it had been necessary to build a new system from scratch or from a library of functions, many of the pre-experimental studies could not have been completed within the time-frame, so potentially damaging assumptions would have been necessary. After initial difficulties, based largely on the subtleties and ambiguities common to open-source systems, JavaNNS was very easy to use and allowed for swift network creation and deployment. Neural elements such as topology, connections, learning parameters and activation functions were easy to configure through the GUI and had sufficient flexibility to allow for thorough examinations during the pre-experimental studies.

Having all three pruning algorithms already implemented was beneficial, because pruning is used extensively in the later pre-experimental studies and in the suitability experiment. Not having to develop and test this functionality was advantageous from a time-frame perspective, and it also allowed for a comparison of algorithms to find the one most appropriate for this domain. Although the pruning took a substantial period of time, this was due to the nature of the algorithms rather than the speed of the system. When the real-time graphical tools were minimised, learning and testing were generally completed in a matter of seconds, which meant that pre-experimental studies where pruning was not involved could be completed relatively quickly.

There were some system ambiguities which complicated the process of learning and testing. To output the test predictions during testing, the Training Set in the Control Panel had to be switched to the test set before running the Analyzer function. Also, after pruning one had to remember to look up the output neuron's new node number, so that the Analyzer was calculating the output from the output neuron rather than from a hidden neuron. Because of the design of the Control Panel it was also quite easy to forget to tick the Shuffle checkbox. On several occasions errors occurred due to these features of the system, which meant that studies had to be repeated. In spite of these somewhat minor issues, it can be concluded that JavaNNS was an appropriate choice of ANN system, because it allowed the project to focus on the experiments at its heart.

8.3 Contribution to the Field

This project has contributed to both the financial and the computer science communities. This section discusses where and to what extent.

This project reveals the complexities surrounding ANNs and the bespoke considerations which must be investigated for most applications. It also attempts to provide a framework for uncovering the bespoke optimum settings in a fair and scientifically valid manner. This may be of particular interest to individuals new to the ANN field. To our knowledge, this is the most detailed investigation into applying ANNs to fundamental data for stock price predictions available to the scientific community. As a result, bespoke considerations unique to this domain are also discussed, for example the duration of the share price lag and when to present company releases to the ANN. It is also the first investigation to apply the Magnitude, Optimal Brain Surgeon and Optimal Brain Damage pruning algorithms to financial time-series data [7].

For parties interested in financial predictions, this project has highlighted the pitfalls and complexities encapsulated in the paradigm, as well as providing potentially innovative solutions to the problem. For example, to our knowledge this is the first investigation in which the ANN has been asked to choose the financial inputs using pruning techniques. Also of financial interest, this investigation potentially provides arguments both for and against the Efficient Market Hypothesis. If the findings regarding the non-asset intensive companies are true of the whole segment, then one can argue that for some companies the semi-strong form of the EMH may be correct, as the ANN failed to make accurate stock price predictions based on the publicly available information [8]. The predictions made for asset intensive companies provide evidence defying the Efficient Market Hypothesis in all dilutions, because they show it is possible to make stock price predictions based on publicly available, historic data. This hybrid view of the validity of the EMH requires further investigation [9].

Finally, an interesting by-product of the pre-experimental studies is the finding regarding the optimum frequency of error propagation. The common belief is that the online method is superior to batch; however, the findings in this project highlight a niche where this common belief may be too broad and simplistic.

[7] Again, this is an opinion based on the author's background research.
[8] This argument is incomplete because not ALL publicly available information has been used, and the experimentally dependent limitations, for example the time lag problem discussed in section 8.1.1, could have been the reason for the poor predictions.
[9] The plausibility of the EMH has been extensively studied; because of this it is very difficult to know whether an idea has already been developed and researched.
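To illustrate the input-selection-by-pruning idea claimed as a contribution above, here is a hedged, generic sketch of magnitude pruning. It is not JavaNNS code; the weight matrix is invented, and a real run retrains between pruning steps and stops once the error rises too far.

```python
# A generic illustration of Magnitude pruning used for input selection:
# remove the smallest input-to-hidden weights, and treat any input whose
# outgoing weights have all been removed as rejected by the network.

def prune_inputs(weights, keep_fraction=0.4):
    """weights[i][j] is the connection from input i to hidden unit j."""
    flat = sorted(
        (abs(w), i, j) for i, row in enumerate(weights) for j, w in enumerate(row)
    )
    n_remove = int(len(flat) * (1.0 - keep_fraction))
    pruned = [row[:] for row in weights]
    for _, i, j in flat[:n_remove]:
        pruned[i][j] = 0.0                       # cut the weakest connections
    surviving = [i for i, row in enumerate(pruned) if any(w != 0.0 for w in row)]
    return pruned, surviving

weights = [
    [0.02, -0.01, 0.03],   # input 0: all weak, likely to be rejected
    [0.90, -0.40, 0.10],   # input 1
    [0.05, 0.75, -0.60],   # input 2
    [0.01, 0.02, -0.02],   # input 3: all weak, likely to be rejected
]
_, kept = prune_inputs(weights)
print(kept)  # [1, 2] with these made-up weights
```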

8.4 Further Work

Throughout this project interesting extensions to the work have been identified and established. Here is a list of the main areas. Not all of them require an ANN or share price predictions, and some may be of interest to statisticians and financial analysts as opposed to computer scientists.

- When selecting the most appropriate pruning algorithm we had to assume that the BP stock was a fair representation of the entire market. However, the re-test using multiple companies suggested this was incorrect. Further work could focus on the selection of pruning algorithms and parameters for fundamental data. This would require numerous companies and trial parameters.

- The conclusion to the suitability experiment developed the idea of further segmenting the stock market. Here the results found fairly conclusively that the market can be segmented into asset intensive and non-asset intensive firms. However, it was suggested that further segmentation may be possible, or even necessary. One suggestion was to segment by market capitalisation. Further work could focus on this, developing further segmentations and analysing which data types are most appropriate for particular segments. The findings may show that for some segments technical data is also useful.

- This leads to another area for further work. The set of fundamental data is large, so it was only possible to test a sub-section in this project. Further work could trial alternative data types. Particularly interesting data types would be those that attempt to measure intangible aspects of value, such as brand, as these may allow for more accurate predictions for non-asset intensive companies.

- As discussed in the Domain pitfalls, it is extremely hard to distinguish when fundamental data is most highly correlated with the share price. In this project we assumed that it was six months after the fundamental data was published [10]. However, an interesting study would concern calculating the correct point. Results may show that the point varies between companies, that there is no definitive point, or that it is surprisingly short or long. One method would be to use an ANN and apply a window to a single data type; then, by using pruning techniques, the least correlated lags could be removed until the most appropriate one(s) remain, repeating for each data type. One method named Dynamic Time Warping was identified as a technique that may be suitable for this problem (a rough sketch of it follows this list). [Wang, 2006] is a very recent thesis which examines this technique, amongst others, and attempts to present a time-series matching framework. These techniques were found too late in the project to be incorporated, but this framework may be useful for further work in this area.

- Further investigations into pre-processing may uncover ways to improve predictions. The basic pre-processing steps were implemented in this project, but the field of data mining is ever expanding and some advanced techniques may improve predictions. If it is possible to manipulate the input data in order to smooth the learning error surface, then learning should converge on the global minimum more easily. Some techniques may even be able to remove the data quantity problem, which was identified as a limitation when fundamental data is applied, by exposing more of the correlation functions to the ANN to improve learning [11].

[10] This assumption was because of data quantity constraints.
[11] This idea is first discussed in section
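Dynamic Time Warping, mentioned in the item on the share price lag above, scores how well two series align when one may be shifted or stretched relative to the other. The sketch below is the textbook dynamic-programming formulation applied to invented series; it is not the matching framework of [Wang, 2006].

```python
# A hedged sketch of the standard Dynamic Time Warping distance, one possible
# starting point for judging how a fundamental series aligns with a lagged
# share price. The series are invented.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with an absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
    return cost[n][m]

earnings = [1.0, 1.2, 1.1, 1.5, 1.7]        # a fundamental series (made up)
price = [1.0, 1.0, 1.2, 1.1, 1.4, 1.7]      # a lagged, stretched echo of it
print(round(dtw_distance(earnings, price), 2))  # 0.1 here, despite the shift
```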

Of these, it is felt that the most interesting and potentially insightful continuation of this work would be to replace the back-propagation learning with a genetic algorithm or similar soft computing technology. This belief is due to the characteristics of the error surface, which mean that traditional ANN learning techniques may be suboptimal. [Braun and Ragg, 1996] is the user manual for a genetic learning algorithm called ENZO. ENZO has been developed as an add-on to SNNS, to be used as a replacement for the learning algorithms currently provided. This makes it ideal for further investigations, because a large portion of what has been discussed in this paper would still be relevant, allowing for faster implementation and greater depth at the experimental stages.

Concluding with an important consideration: a number of these further work activities could prevent some of the assumptions and biases, or allow the findings to be more conclusive, if this project were re-investigated. Some of them could be entire projects in their own right, as they require a great deal of time, analysis and development. As discussed in the Personal Reflections, this project became a far larger and more complex problem than initially anticipated, and therefore any extension to this work which tackles such a broad hypothesis must bear this in mind.
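As a rough illustration of the population-based search that an ENZO-style approach performs, the sketch below evolves only the weights of a tiny fixed network on a toy task. It is not ENZO and does not use its interface; ENZO also evolves topologies and is considerably more sophisticated (see [Braun and Ragg, 1996]).

```python
# A hedged sketch of evolutionary weight search as an alternative to
# back-propagation: a population of weight vectors for a fixed 1-input,
# 3-hidden, 1-output network is mutated and selected on training error.
import math
import random

random.seed(42)
DATA = [(x / 10.0, math.sin(x / 10.0)) for x in range(-30, 31)]  # toy task

def forward(weights, x):
    # weights: three hidden (w, b) pairs, then three output weights and a bias
    hidden = [math.tanh(weights[2 * i] * x + weights[2 * i + 1]) for i in range(3)]
    return sum(h * weights[6 + i] for i, h in enumerate(hidden)) + weights[9]

def mse(weights):
    return sum((forward(weights, x) - y) ** 2 for x, y in DATA) / len(DATA)

def evolve(pop_size=30, generations=200, sigma=0.2):
    population = [[random.gauss(0, 1) for _ in range(10)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=mse)
        parents = population[: pop_size // 3]            # keep the fittest third
        children = [
            [w + random.gauss(0, sigma) for w in random.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return min(population, key=mse)

best = evolve()
print(round(mse(best), 4))  # training error of the best weight vector found
```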

8.5 Personal Reflections

In its findings this project has been largely successful; in light of the complexities implicit in the broad project title and hypothesis, it could even be called extremely successful. The field of artificial neural networks is large and diverse, and full of bespoke configuration requirements. To add to this, the correlations between fundamental data and equity markets are complex and largely undefined. In hindsight, with only somewhat limited prior financial markets knowledge and no previous experience with ANN systems, the project was probably too expansive. As a consequence, however, it was challenging and insightful, and many skills have been learnt or honed.

I have developed the ability to swiftly gauge a paper's usefulness and extract specific details, because ANNs are a mature field providing a plentiful supply of information, and the project has required extensive research and analysis of findings. Many aspects of ANN configuration are problem specific, so one must be cautious when determining settings based on outside research. Also, the application of ANNs to financial predictions may be seen as a study for financial gain rather than scientific discovery. This would explain why the search for scientific research surrounding this particular topic was less fruitful than anticipated and predominantly uncovered student and PhD papers, which meant that a more cautious view of findings and conclusions had to be taken than might otherwise have been the case. As a consequence, I am most proud of the advances made in my ability to learn independently and of the development of good background research skills. As a by-product of these new-found skills, I can now critically examine existing research to identify holes where original work can make important contributions.

Strict scientific procedures were also vital to the success of this project. Without a sound scientific approach to the pre-experimental studies it would have been hard to confidently assert any conclusions. As a result, I have acquired sound critical reasoning skills and scientific practices, developing my objectivity and my experimental reasoning when devising studies, and I have also discovered a number of new techniques for determining the accuracy and significance of findings.

From a project management point of view, proficient time management skills were necessary along with flexibility, as the experiments often highlighted unanticipated pitfalls and further studies. Without these skills it would have been very difficult to keep within the time constraints and remain faithful to the overall project goal. Having said this, the time management plan was essentially accurate. The "Exploring existing ANN software" and "Gather fundamental data" activities did start late, but took less time than allowed for. As a suitable ANN system was sourced, "Build ANN skeleton" was dramatically shorter than planned. This meant that the pre-experimental phase started slightly ahead of schedule but, due to the unexpected difficulties and the time-intensive nature of the pruning studies, it ran on longer than anticipated. This meant that the suitability testing did commence a week late, but because "Document work" kept to schedule, the "Refine write-up" activity could be shortened. This aspect of the project has improved my time management and scheduling skills, as well as my problem solving and critical reasoning abilities. Throughout this project numerous difficulties and unexpected hindrances surfaced, which meant that I had to put all of these skills into practice.

From a computer science perspective, I have learnt how to use the LaTeX typesetting package, which has definitely been invaluable. This project has also given me a sound understanding of Artificial Neural Network technology, where it can be applied, and its benefits and pitfalls. As a result, I have a firm understanding of its place in the field of Artificial Intelligence. Because of the project's domain, it has given me great insights into fundamental data and its place in equity research and stock price predictions. It has also helped me understand the complexity of stock markets and the interesting and varied views that are held concerning their efficiency and the part they play in business.

In hindsight, there is not much that could have been done differently without adjusting the initial aim of the project or extending the time-frame. However, the pre-experimental investigations into pruning algorithms and their parameters could have been improved. The initial incorrect choice of Maximum Error Increase range did hamper the investigation and could have been caught more quickly. Looking back, during the first pre-experimental pruning study a number of networks were extensively pruned, almost to corruption point. If the potential implications of this had been properly evaluated, the range could have been adjusted in a timely fashion, rather than only once the suitability study had begun.

In conclusion, although this project was challenging and at times frustrating, it has been enjoyable and extremely insightful. In a relatively short period of time a lot has been achieved, both from a personal perspective and hopefully also in terms of contributions to the fields involved.

References

A. Abraham. Modeling chaotic behavior of stock indices using intelligent paradigms. Neural Parallel & Scientific Computations, Volume 11(Numbers 1 & 2).
L. Bodis. Financial time series forecasting using artificial neural networks. Master's thesis, The Pretsch Group, Babes-Bolyai University.
L. B. Booker. Adaptive value function approximation in classifier systems. In Proceedings of the Workshops on Genetic and Evolutionary Computation, Genetic and Evolutionary Computation Conference, pages 90-91.
H. Braun and T. Ragg. Evolution of Neural Networks (ENZO), Version 1.0. University of Karlsruhe, 1996.
M. Buffett and D. Clark. The New Buffettology. Simon & Schuster UK Ltd, London, 2002.
J. Campbell and R. Shiller. Valuation Ratios and the Long-Run Stock Market Outlook: An Update. Cowles Foundation Discussion Papers, Cowles Foundation, Yale University.
S. Crone. Stepwise selection of artificial neural network models for time series prediction. Journal of Intelligent Systems, Volume 14(Numbers 2-3).
Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. Advances in Neural Information Processing Systems II, Volume 2.
E. Dimson and M. Mussavian. A brief history of market efficiency. European Financial Management Journal, Volume 4(Number 1):91-193.
R. Dorsey and R. Sexton. The Use of Parsimonious Neural Networks for Forecasting Financial Time Series. Finance & Technology Publishing.
D. Dreman. Contrarian Investment Strategies: The Next Generation. Simon & Schuster, New York.
A. Elder. Come Into My Trading Room: A Complete Guide to Trading. John Wiley & Sons, New York.
B. Graham and J. Zweig. The Intelligent Investor. HarperCollins Publishers Inc., 3rd edition.
Numerical Algorithms Group. Cleaning financial data. Financial Engineering News, 2002.
B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal Brain Surgeon. Advances in Neural Information Processing Systems.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York.
S. C. Hui. A hybrid time lagged network for predicting stock prices. International Journal of Computer Integrated Manufacturing, Volume 8(Number 3).
S. V. Kartalopoulos. Understanding Neural Networks and Fuzzy Logic. IEEE Press, Wiley.
T. Kolarik and G. Rudorfer. Time series forecasting using neural networks. In Proceedings of the International Conference on APL: The Language and Its Applications. ACM Press.
J. Lakonishok, A. Shleifer, and R. Vishny. Contrarian investment, extrapolation, and risk. The Journal of Finance, Volume 49(Number 5).
K. Lane and R. Neidinger. Neural networks from idea to implementation. ACM SIGAPL APL Quote Quad, ACM Press, Volume 25(Number 2):27-37.
C-S. Lina, H. Khan, and C-C. Huang. Can the Neuro Fuzzy Model Predict Stock Indexes Better than its Rivals? CIRJE F-Series, CIRJE-F-165, Faculty of Economics, University of Tokyo.
J. Liu and T. Leung. A web-based CBR agent for financial forecasting. In Workshop on Soft Computing in Case-Based Reasoning, International Conference on Case-Based Reasoning (ICCBR 01).
T. Lundin and P. Moerland. Quantization and pruning of multilayer perceptrons: Towards compact neural networks. IDIAP Communication 97-02, Institut Dalle Molle d'Intelligence Artificielle Perceptive, Martigny, Switzerland.
W. Maas and M. Schmitt. On the complexity of learning for a spiking neuron. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 54-61.
G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Hybrid methods using evolutionary algorithms for on-line training. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2001), Washington.
J. Preece, Y. Rogers, and H. Sharp. Interaction Design: Beyond Human-Computer Interaction. John Wiley & Sons Ltd.
T-S. Quah and B. Srinivasan. Improving returns on stock investment through neural network selection. Expert Systems with Applications, Volume 17(Number 4).
T. Ragg, S. Gutjahr, and H. M. Sa. Automatic determination of optimal network topologies based on information theory and evolution. In Proceedings of the 23rd EUROMICRO Conference, IEEE, Budapest.
A-P. Refenes. Neural Networks in Capital Markets. John Wiley & Sons, New York.
D. Rey. Stock Market Predictability: Is It There? A Critical Review. PhD thesis, WWZ University of Basel, Working Paper No. 12/03.
D. Rohde. Lens Manual: Rules of Thumb. Massachusetts Institute of Technology (MIT).

G. Rudorfer. Early bankruptcy detection using neural networks. In Proceedings of the International Conference on Applied Programming Languages (APL).
W. S. Sarle. Stopped training and other remedies for overfitting. In Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics.
Y. Shachmurove. Applying Artificial Neural Networks to Business, Economics and Finance. Penn CARESS Working Papers, UCLA Department of Economics, 2002.
I. Sommerville. Software Engineering. Pearson Education Limited, 6th edition.
K. Swingler. Financial prediction, some pointers, pitfalls, and common errors. Neural Computing and Applications Journal, Volume 4(Number 4).
G. Thimm and E. Fiesler. Neural network pruning and pruning parameters. In The First Workshop on Soft Computing, Institut Dalle Molle d'Intelligence Artificielle Perceptive.
G. Thimm and E. Fiesler. Pruning of neural networks. IDIAP Report 97-03, Institut Dalle Molle d'Intelligence Artificielle Perceptive.
J. P. Thivierge, F. Rivest, and T. R. Shultz. A dual-phase technique for pruning constructive networks. In International Joint Conference on Neural Networks (IJCNN).
R. Trippi. Neural Networks in Finance and Investing. Times Mirror Higher Education Group, 2nd edition.
B. Vanstone, G. Finnie, and C. Tan. Applying fundamental analysis and neural networks in the Australian stockmarket. Bond University School of IT.
Z. Wang. Time Series Matching: A Multi-filter Approach. PhD thesis, Courant Institute of Mathematical Sciences, New York University, 2006.
A. Zell, G. Mamier, M. Vogt, and N. Mache. SNNS User Manual, Version 4.2. University of Stuttgart.
S. Zemke. On Developing a Financial Prediction System: Pitfalls and Possibilities. In Proceedings of the 19th International Conference on Machine Learning (ICML), Sydney, Australia.

Appendix A

Gantt Chart

Figure A.1: The Gantt Chart detailing the schedule of the project. This has been produced using GanttProject.


More information

Models of Cortical Maps II

Models of Cortical Maps II CN510: Principles and Methods of Cognitive and Neural Modeling Models of Cortical Maps II Lecture 19 Instructor: Anatoli Gorchetchnikov dy dt The Network of Grossberg (1976) Ay B y f (

More information

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

More information

COMBINED NEURAL NETWORKS FOR TIME SERIES ANALYSIS

COMBINED NEURAL NETWORKS FOR TIME SERIES ANALYSIS COMBINED NEURAL NETWORKS FOR TIME SERIES ANALYSIS Iris Ginzburg and David Horn School of Physics and Astronomy Raymond and Beverly Sackler Faculty of Exact Science Tel-Aviv University Tel-A viv 96678,

More information