Neural Computation - Assignment
Analysing a Neural Network trained by Backpropagation
A Statistical Analysis of various learning techniques, network topologies and activation functions, for the classification of non-linear data by an Artificial Neural Network.

Author: Course: MEng Computer Science & Software Engineering E-Mail: Student No.: 317818 Tutor: Dr. Peter Tino

Abstract
This report details various learning techniques, network topologies and activation functions (among other variables), which have been analysed in the classification of non-linear data by an Artificial Neural Network. The implementation of a Feed-Forward Neural Network trained by the Backpropagation technique is detailed, prior to the testing of this network. Supervised learning is achieved through use of a training dataset, prior to testing on a test dataset. The network has been trained and tested using a number of different strategies, outlined herein. Initially, batch learning and online learning, different numbers of hidden units and different activation functions were trialled. The alteration of learning and momentum rates is documented, prior to the application of a technique for the automated variation of a local learning rate. The results of each strategy have been recorded and analysed using a variety of statistical functions. This has enabled conclusions on the properties of each strategy and the identification of the most suitable strategies for classification on particular datasets. This report should prove useful to anyone training an ANN by Backpropagation, giving an in-depth view of the possible variables, strategies and techniques employed in the development.
Keywords: Backpropagation, Online Learning, Batch Learning, Logistic Sigmoid, tanh Activation Function, Adaptive Learning Rate, Momentum Rate, Cross Validation. 2003-04
Information
This report partially satisfies the requirements of the module 06-12412 Introduction to Neural Computation (Level M), a component of the degree of MEng in Computer Science/Software Engineering, The University of Birmingham. Name: Student ID #: 317818 E-Mail:
This report is a summary of the work conducted on the project assignment for the aforementioned module. It presents the research conducted and the results obtained. Specifically, it covers:

Section 1 - Introduction: Concepts and properties; training and testing datasets.
Section 2 - Network Design: Models, topologies and variables; network implementation.
Section 3 - Strategy: Hypotheses, strategy and variables for testing.
Section 4 - Results: Presentation of statistically organised results.
Section 5 - Recommendations: Recommendation for optimal network design.
Section 6 - Conclusions: Conclusions and future work. References.

TABLE 1: ORGANISATION OF THIS ASSIGNMENT

1 - Introduction
The aim of the assignment is to develop a feed-forward neural network trained by backpropagation, testing different learning strategies, variables and network topologies. The results from this development are reported in statistically organised formats. The different features, variables and strategies to be tested are outlined later in this section. Prior to that, brief overviews are presented of the neural network, its mathematical models and the datasets upon which it will operate. Please note: throughout this report, the terms Neural Network, Artificial Neural Network, ANN and Network are used interchangeably, as are the terms for classification units: Unit, Neuron and Perceptron.

1.1 - The ANN Concept
An Artificial Neural Network (ANN) is a system (which could be implemented in either hardware or software) which may be used to perform classifications on datasets. The ANN can access a solution space that is unavailable to computational systems based on logical inferences.
They have the ability to relate n inputs to m output classes. Furthermore, they can operate on classification problems which are non-linear; that is to say, problems whose classes cannot be separated by a single straight line when plotted on a graph. Figure 1 graphically shows the mathematical difference between linear and non-linear separability.

FIGURE 1: A GRAPHICAL REPRESENTATION OF LINEAR AND NON-LINEAR SEPARABILITY
The Artificial Neural Network is based on a mathematical model, which is itself analogous to the biological processes of neural function within the human cortex. Developments and research in this area aim to bring computational systems closer to the abilities of the human brain. A number of properties of the ANN make it a very attractive prospect when attempting to solve a number of computational problems. These properties are briefly outlined below.

Generalisation - Given a series of continuously variable input values, ANNs are able to place each case into a particular output class. If the number of input variables is represented by n and the number of output classes by m, then generally n > m, where n and m are natural numbers [1].

Adaption - ANNs have the ability to adapt to their training data sets. Their development relies on supervised learning; the network is initially trained on a training data set, τ, prior to execution on an operational or test data set. The ability to adapt to a particular data set has a number of advantages, including the ability to generalise, thus accessing a solution space unavailable to systems based on logical inferences.

Graceful Degradation - This property operates alongside generalisation. As ANNs can place a specific case into any one of m output classes given n inputs, if any of the n inputs are incorrect or unavailable, the network can still perform the classification operation. This is not possible for many systems based on logical inference.

Fundamentally, all of the previously described properties allow the ANN to learn and evolve (in broad terms). Observations and test scenarios (as later illustrated) show that as a network is trained on a data set and then executed on an operational data set, its operation becomes quicker. Continuous alterations to the weights within the network achieve this process, which is known as Hebbian Learning [Hebb, 1949].
1.2 - Training and Testing Dataset
The network developed is to be tested on the recommended Ionosphere dataset, which is available from the Johns Hopkins University Ionosphere Database [2]. The dataset contains 351 instances, of which 275 are training examples and 76 are testing examples. Each instance in the dataset is composed of 34 inputs (continuous values) and one discrete output (0 or 1). Each continuous input represents an individual antenna reading from the Goose Bay radar system in Labrador. The output pertains to the structure of the atmosphere. That is:

Positive output - the ionosphere has a prevalent structure.
Negative output - the ionosphere is lacking in structure.

The 275 training inputs should be a large enough sample to train the network for execution on the testing data; those unseen examples without outputs. The dataset has a high-dimensional input space. As such, a neural network for non-linear classification is a suitable solution to the problem. The network should be able to increment through the space of all inputs, adjusting variables as it does, to minimise error on the classification of later, unseen inputs. The dataset does not contain any missing inputs, which ensures that no pre-processing techniques are required.

2 - Network Design: Models, Topologies and Variables
2.1 - Mathematical Models, analogous with Biological Processes
An ANN can be modelled in software or electronic circuitry. The majority of contemporary research uses software to model the network. However, previous attempts have been made to use electronics (specifically printed circuit boards) to create a neural-based control system. The ANN is composed of a series of classification units, each taking a series of numerical inputs, applying a function and producing a single output value.
The activation is based upon whether the summed inputs are greater than a pre-set threshold value, T. Activation functions can be bi-polar (0 or 1), tri-polar (-1, 0, 1) or continuous (e.g. using the Logistic Sigmoid function).
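As a minimal sketch of these options (the class and method names are my own, not those of the implementation described later), the bi-polar step and the two continuous functions compared later in this report could be written as:

```java
// Illustrative sketch only: a bi-polar step activation against a
// pre-set threshold T, plus the two continuous activation functions
// (Logistic Sigmoid and tanh) trialled later in this report.
public class Activations {

    // Bi-polar step: fires (1) only if the summed input exceeds T.
    public static int step(double sum, double threshold) {
        return sum > threshold ? 1 : 0;
    }

    // Logistic sigmoid: maps any real input into the range (0, 1).
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Hyperbolic tangent: maps any real input into the range (-1, +1).
    public static double tanh(double x) {
        return Math.tanh(x);
    }
}
```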
Each classification unit (generally, a Perceptron [1]) is only capable of linear separation. However, non-linearly separable problems may be solved by combining multiple units into a neural network, and it is this type of problem solution that will be developed here. Activation values are fed to connecting units, the overall goal being to classify based on a series of inputs whilst reducing the classification error. The neural network implementation may be broken down into a unit model (the Perceptron classification unit) and the network model: a network topology with a unit model at each node. The fully interconnected multi-layer network is similar to that in the human cortex, though the latter will generally not feature full interconnection.

2.2 - Unit Model
The unit model used in this study is the Perceptron, trained under the Perceptron Learning Algorithm. This is equivalent to a single neuron. Figure 2 shows the Perceptron and its similarities with a single neuron. Input vectors take inputs from the dataset or from previous units in the network model. These inputs, (x1..xn), are weighted by randomly initialised weights (w1..wn). The summation of (x1·w1), (x2·w2), .., (xn·wn) gives a value which is compared to the activation threshold. Thus, a value is passed forward, either as a classification value or to another unit for further processing.

FIGURE 2: ANALOGY BETWEEN BIOLOGICAL AND ARTIFICIAL NEURAL MODELS

The Perceptron's activation is defined by the function:

y = g(Σ wi·xi − T) (Eq. 1)

where y is the activation result, g is the activation function, w is the weight vector, x is the input vector and T is the threshold value. During training under the Perceptron Learning Algorithm, weight updates are scaled by the learning rate, α.

2.3 - Network Model
Typically, one unit model is placed at each node of the network and interconnections run between units.
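The unit model of Section 2.2 (Eq. 1) can be sketched in code; this is a minimal illustration under my own naming, not the implementation detailed in Section 2.4:

```java
// Minimal Perceptron unit following Eq. 1: y = g(sum(w_i * x_i) - T),
// with g taken here as a bi-polar step function. Names are illustrative.
public class Perceptron {
    private final double[] weights;   // w_1 .. w_n, randomly initialised
    private final double threshold;   // the pre-set threshold, T

    public Perceptron(double[] weights, double threshold) {
        this.weights = weights;
        this.threshold = threshold;
    }

    // Compares the weighted sum of the inputs against the threshold T
    // and passes a bi-polar classification value forward.
    public int activate(double[] inputs) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum - threshold > 0.0 ? 1 : 0;
    }
}
```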
The network model is operated by a training algorithm, which itself has a number of associated variables. The network used in this development employs the backpropagation technique for training. This network is a feed-forward multi-layer network. It contains:

1 input layer of units,
at least 1 hidden layer of units,
1 output layer of units (generally feeding one common output value).

It is important to note that weights are assigned to the interconnections between each layer in the network. Wij is generally used to represent the first set of interconnection weights, Vjk the next, and so on [3]. The backpropagation technique is employed as a means of reducing the error in the network's classification, by first calculating this error and then propagating it back through the network for reduction. Inputs are propagated to the first layer of hidden units, whose output is calculated and propagated to the next hidden layer. This process is repeated until the output layer is reached. Each output layer unit calculates its activation from the sum of the weighted inputs from the previous layer. The error on the output is computed and propagated back as far as the first hidden layer, with the weight matrices updated along the way. This process is repeated until the error is minimised as far as possible. The remainder of this report discusses the additional variables, strategies and tests used in the implementation, and their results.
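The forward and backward passes just described can be sketched for a single hidden layer with sigmoid units. This is an illustrative outline under assumed names (train, w, v), not the Java implementation described in Section 2.4:

```java
// Sketch of one backpropagation step for a single hidden layer, as
// described above: forward-propagate, compute the output error, then
// propagate it back and update both weight matrices.
public class BackpropStep {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // w: input->hidden weights [hidden][inputs]; v: hidden->output weights.
    // Performs one training update on (x, target) and returns the output
    // that was produced before the update.
    public static double train(double[][] w, double[] v,
                               double[] x, double target, double rate) {
        int hidden = w.length;

        // Forward pass: input layer -> hidden layer -> output unit.
        double[] h = new double[hidden];
        double out = 0.0;
        for (int j = 0; j < hidden; j++) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) sum += w[j][i] * x[i];
            h[j] = sigmoid(sum);
            out += v[j] * h[j];
        }
        out = sigmoid(out);

        // Backward pass: output delta first, then each hidden delta,
        // updating the weight matrices as the error flows back.
        double deltaOut = (target - out) * out * (1.0 - out);
        for (int j = 0; j < hidden; j++) {
            double deltaH = deltaOut * v[j] * h[j] * (1.0 - h[j]);
            v[j] += rate * deltaOut * h[j];
            for (int i = 0; i < x.length; i++) {
                w[j][i] += rate * deltaH * x[i];
            }
        }
        return out;
    }
}
```

Repeated calls on the same example should drive the output towards the target, which is the error-minimisation loop described above.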
2.4 - Network Implementation
This section details my own implementation of the Artificial Neural Network required to perform the classification problem outlined (Section 1.2). The network has been implemented in Java, following guidelines given and examples written in C [4, 5]. I have chosen for my own implementation to be written in Java, as it is the programming language with which I am most familiar. The neural network is written as one object, Network. Additional objects inherit from this object to vary the strategy used for experimentation. The objects and the strategies they implement are:

BatchLearn - implements the batch learning technique, minimising the error over all training samples.
OnlineLearn - implements the online learning technique, minimising the error for each training example.
HiddenUnits - varies the number of hidden units in the network.
MomentumRate - varies the momentum rate for gradient descent, avoiding local minima on the error curve.
LearningRate - varies the learning rate. This is, in turn, inherited by:
  o AdaptLearningRate - implements a strategy for automated learning rate adaptation.
ActivationFunction - implements the LogisticSigmoid and TanH activation function objects.

The implementation also includes separate objects which read in the training and testing datasets, feeding them to the Network class. A main class has been implemented, which instantiates and runs the network and executes the backpropagation algorithm. Main also reports the mean and standard deviation of the error for a specified number of training epochs.

3 - Strategy
3.1 - Hypotheses
Given the experimental nature of this implementation, it is feasible to create a series of hypotheses which, in turn, will assist in the derivation of strategies for network training and testing.
The hypotheses that I have formed are:

- As the number of hidden units in the network increases, the error should decrease.
- A lower value for the learning rate should decrease error in this scenario. That is due to the relative complexity of the dataset, which has relatively few training inputs. This may lead to an uneven error curve which is plagued with local minima; as such, a lower learning rate will aid the traversal of the error curve.
- As with the learning rate, due to the potentially high number of local minima, a higher momentum rate will assist in the traversal of the error surface. As the number of training epochs increases, the error should minimise and the momentum rate could be lowered.
- Online learning calculates the weight change for each individual training example. Therefore, it can be assumed that the error will decrease as the number of training examples increases.
- The batch learning procedure calculates the weight change over the entire training set, or a subset batch. Therefore, as the number of examples increases, the error should remain reasonably stable, varying only slightly.
- The activation functions to be tested are tanh and the Logistic Sigmoid. As the tanh function has a greater range, it should be more sensitive to the learning rate, and weight changes should take greater values. This should lead to a steeper reduction of error.

These hypotheses have led to the derivation of testing strategies, which should determine the most suitable variables for this network operating on this particular dataset. The network's performance is analysed in terms of its mean-squared error (MSE): the mean of the squared differences between the actual and the desired outputs.
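The MSE measure just defined can be written directly as a small self-contained helper (the class name is my own):

```java
// Mean-squared error: the mean of the squared differences between the
// network's actual outputs and the desired (target) outputs.
public class ErrorStats {
    public static double meanSquaredError(double[] actual, double[] desired) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - desired[i];
            sum += diff * diff;
        }
        return sum / actual.length;
    }
}
```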
3.2 - Strategy and Variables for Testing
Table 2 shows the order and series of tests that were performed upon the network. Each test is measured using a series of variables, which give the required level of information to perform a statistical analysis on the data. The variables were initially sought through interpolation upon the training examples. After this process was performed, the optimal variables were not changed.

Test 1 - Batch vs. Online Learning: R (# of runs) = 5, E (# of epochs) = 0 -> 500, H (# of hidden units) = 8, M (momentum rate) = 0.0, L (learning rate) = 0.5
Test 2 - Number of Hidden Units: R = 5, E = 500, H = 1 -> 30, M = 0.0, L = 0.5
Test 3 - Activation Functions (tanh vs. Logistic Sigmoid): R = 5, E = 0 -> 500, H = 8, M = 0.0, L = 0.5
Test 4 - Momentum Rate: R = 5, E = 0 -> 500, H = 8, M = 0 -> 1, L = 0.5
Test 5 - Learning Rate: R = 5, E = 0 -> 500, H = 8, M = 0.0, L = 0 -> 1
Test 6 - Adaptive Learning Rate: R = 5, E = 0 -> 500, H = 8, M = 0.0, L = 0.5, S (learning rate scale) = 0 -> 1
Test 7 - N-Fold Cross Validation: R = 5, H = 8, M = 0.0, L = 0.5, F (folds of dataset) = 10 -> 275

TABLE 2: STRATEGY FOR NETWORK TESTING

Section 4 covers the reporting of the test results. This includes graphical representations over time (varied by the number of epochs) and of variable alteration against the mean-squared error. The test results are organised in terms of the test description, the test results and a summary which describes these results. Tests were conducted in an ad-hoc manner at first, until optimal values for the variables had been found.
Not all optimal variables were adopted, as some of them would have required significantly more computational power for little additional accuracy in the results. This tuning process yielded the most suitable values for the variables in the network.
4 - Results
4.1 - Batch Learning vs. Online Learning
Initially, the process of batch learning was trialled against a similar learning scheme, online learning. Batch learning is the process of calculating the weight change (δw) with respect to the error for the entire epoch; the new weights are calculated at the end of the training epoch. As such, batch learning allows better levels of generalisation, though in some scenarios it may decrease accuracy. Online learning calculates δw and updates the weights after each training example has been presented. The latter approach should allow the network to learn the training set closely, resulting in a falling error as the number of training examples increases. The results of this test are shown in Figure 3.

FIGURE 3: BATCH VS. ONLINE LEARNING WITH VARIABLES [R=5, H=8, M=0.0, L=0.5] (MSE against Epochs)

As anticipated, the decreasing error given through online learning crosses the relatively stable line which represents batch learning. The batch learning experiment varied more than expected, though this could be due to some examples in the dataset which may throw the experiment out slightly. As seen in the graph, the error in online learning continues to fall for as long as training continues, though for computational optimisation a value between 80 and 150 epochs would be sufficient.

4.2 - Varying the Number of Hidden Units
The variation of the number of hidden units defines the topology of the network. The previous experiment had shown that an online learning technique using 500 epochs would be preferable for convergence and error reduction. Although this takes longer, the error is significantly lower than in batch learning.
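The distinction between the batch and online schemes described in Section 4.1 comes down to where the weight update is applied. The following toy sketch (a single linear weight under my own naming, not the BatchLearn/OnlineLearn objects of Section 2.4) contrasts one epoch of each:

```java
// Contrast of the two update schemes, reduced to a single weight for
// clarity. Online learning applies each example's weight change (delta w)
// immediately; batch learning accumulates the changes and applies them
// once at the end of the epoch. The gradient function is illustrative.
public class LearningSchemes {
    // Toy per-example weight change for a linear unit y = w * x,
    // trained towards a target value.
    static double delta(double w, double x, double target, double rate) {
        return rate * (target - w * x) * x;
    }

    // examples[i] = {input x, target}; weights change after every example.
    public static double onlineEpoch(double w, double[][] examples, double rate) {
        for (double[] ex : examples) {
            w += delta(w, ex[0], ex[1], rate);
        }
        return w;
    }

    // Deltas are only accumulated here; a single update ends the epoch.
    public static double batchEpoch(double w, double[][] examples, double rate) {
        double accumulated = 0.0;
        for (double[] ex : examples) {
            accumulated += delta(w, ex[0], ex[1], rate);
        }
        return w + accumulated;
    }
}
```

With identical examples, the two schemes produce different weights after one epoch, because online learning sees the effect of its own earlier updates within the epoch.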
The testing procedures were performed using refinement and interpolation, with the results carried forward to all subsequent network tests. Figure 4 shows a graph of the mean-squared error against the number of hidden units within this network, allowing an optimal topology decision.

FIGURE 4: VARIATION OF THE NUMBER OF HIDDEN UNITS [R=5, E=500, M=0.0, L=0.5] (MSE against # of Hidden Units)
As shown in Figure 4, the decision to use 8 hidden units in the network was taken, specifically due to the local minimum at this point in the graph. This decision was a compromise between a low error and a simple network. As the number of hidden units increased, it was expected that the error would become lower still, though computations would take much longer, increasing the runtime of the network. The fewer variables the network had to deal with, the quicker it would converge upon a solution. Due to the limited number of training examples within the dataset, too low a number of hidden units would result in insufficient error reduction. Too high a number could result in overfitting, where the network incorrectly classifies unseen problems because it has been tuned too closely to the training dataset.

4.3 - Activation Functions (tanh vs. Logistic Sigmoid)
The backpropagation learning technique can use the Logistic Sigmoid activation function or the tanh activation function at each classification unit. In this experiment, each was trialled and the most suitable selected for continuation in the network. The Logistic Sigmoid activation function is mathematically defined by:

y = 1 / (1 + e^(-x)) (Eq. 2)

where y is the activation value. The tanh function is an alternative activation function, mathematically defined by:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) (Eq. 3)

where tanh(x) is the activation value. The tanh function has a greater range than the Logistic Sigmoid. In terms of real numbers, this range is double that of the Sigmoid: [-1, +1], against the Sigmoid's [0, 1]. Therefore, the tanh function is far more sensitive to a change in learning rate, which also means that weight changes have a greater range. The two functions are compared to one another in the graph in Figure 5.

FIGURE 5: LOGISTIC SIGMOID VS. TANH FUNCTION [R=5, H=8, M=0.0, L=0.5 & 0.1]

As the graph shows, after approximately 200 epochs the two functions converge and give similar results. As 500 epochs had been selected in this case, it was decided that the Logistic Sigmoid would be a suitable activation function. It is easier to implement and modify the Logistic Sigmoid function and to see how the classification units are operating. In addition, the Sigmoid gives more freedom with the learning rate: as this is larger, it allows for more significant alterations to the weights without throwing the classification out.

4.4 - Varying the Momentum Rate
The momentum rate is used to avoid the network getting stuck in local minima; that is, incorrectly converging upon a solution because it is surrounded by a positive and a negative gradient, even though this is not the minimum error. The
momentum term is useful when an error surface is jagged, as this one was believed to be. Hypothetically, an uneven error surface would be possible with this dataset due to its few training examples. However, as the graph of this experiment shows, this was not the case.

FIGURE 6: VARIATION OF MOMENTUM RATE [R=5, H=8, L=0.5] (M = 0.0, 0.25, 0.5, 0.75)

In fact, the error curve appeared to be particularly smooth. As this was the case, the momentum rate had very little effect, other than to speed up the time to convergence. As this experiment uses 500 epochs as an optimal value, this time saving is insignificant. The momentum rates seemed to converge at around 100 epochs and, although the error continued to fall, it was decided that excluding the momentum rate (setting it to 0.0) would be the simplest solution.

4.5 - Varying the Learning Rate
The learning rate is especially important in accurately classifying the testing examples, and is dependent on the dataset. If the error surface is smooth, then a low learning rate means that traversal of the curve takes an unnecessary amount of time. However, if the surface is complex, a small learning rate is required: if a high rate is applied to a complex error surface, the network will bounce in and out of local minima, spuriously altering weights and never converging on an optimal solution. Figure 7 shows the testing of three separate rates in this experiment: 0.25, 0.5 and 1.0.

FIGURE 7: VARIATION OF LEARNING RATE [R=5, H=8, M=0.0] (L = 0.25, 0.5, 1.0)

The optimal learning rate was selected as 0.5, as it has the smoothest descent. In this case, the network converged at around 150 epochs, though after this the error rates continued to vary.
It seems from testing that the learning rates would eventually converge, though this may be at over 1,000 epochs, which would take a long time to run. What is particularly noticeable is the large difference between the rates at 200 epochs. The 0.5 rate was chosen as it appears to be the median value here, finding the middle ground between a lower error and a premature convergence.
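The update rule underlying Sections 4.4 and 4.5 combines both rates: the learning rate scales the gradient step, while the momentum rate blends in the previous step so that small local minima can be rolled through. A minimal sketch under my own naming:

```java
// Weight update combining the learning rate and the momentum rate:
// the previous step is blended into the current one so the descent can
// carry through small local minima on the error surface.
public class MomentumUpdate {
    // Returns the new weight; previousStep[0] is overwritten with the
    // step just taken so it can be carried into the next update.
    public static double update(double weight, double gradient,
                                double learningRate, double momentum,
                                double[] previousStep) {
        double step = learningRate * gradient + momentum * previousStep[0];
        previousStep[0] = step;
        return weight + step;
    }
}
```

Setting the momentum rate to 0.0, as chosen in Section 4.4, reduces this to a plain gradient step.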
4.6 - Trialling an Adaptive Learning Rate
Trials were also performed with an adaptive learning rate. Adapting the local learning rate during training allows a more precise learning rate to be selected, depending upon the error on that particular training example or epoch (depending on the learning technique). A Taylor expansion would normally be performed to approximate the gradient of the error function by a quadratic term, which could be used to calculate the learning rate; that was viewed as too complex for this experiment. Instead, the adaptive rate is set much as the momentum rate is, though it is the learning rate which is varied rather than the weights. The strategy employed is outlined intuitively as:

- Calculate the difference between the current error (MSE_E) and that of the last training epoch (MSE_E-1).
  o If MSE_E-1 is greater than MSE_E: calculate the scale S = L × (MSE_E-1 − MSE_E) and add it to the learning rate, L.
  o Else, if MSE_E is greater than MSE_E-1: calculate the scale S = L × (MSE_E − MSE_E-1) and subtract it from the learning rate, L.

That is to say, as the network sees that the error curve is smooth and it is moving quickly down the error slope, convergence will speed up. Conversely, as the gradient lessens (possibly due to an uneven error surface), the learning rate will slow convergence down, becoming more accurate and preventing the bouncing problem caused by an uneven error surface. Figure 8 shows the results of this test, varying the scale for learning rate adaptation from 0 to 1.

FIGURE 8: ADAPTIVE LEARNING RATE [R=5, H=8, M=0.0, L=0.5] (S = 0.0 and 1.0)

The results show little difference between the two learning scales (S). However, this is probably due to the smooth error surface of this dataset.
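The rule above can be sketched directly. I have folded in the scale parameter S from Figure 8 as a multiplier on the adjustment, which is an assumption about how the two interact; the names are my own:

```java
// Adaptive learning-rate rule: grow the rate while the epoch error is
// falling, shrink it while the error is rising. The factor `scale`
// corresponds to the trialled learning-rate scale S (an assumption).
public class AdaptiveRate {
    public static double adapt(double rate, double scale,
                               double previousMse, double currentMse) {
        if (previousMse > currentMse) {
            // Error fell: speed up by S * L * (MSE_{E-1} - MSE_E).
            return rate + scale * rate * (previousMse - currentMse);
        } else if (currentMse > previousMse) {
            // Error rose: slow down by S * L * (MSE_E - MSE_{E-1}).
            return rate - scale * rate * (currentMse - previousMse);
        }
        return rate;   // error unchanged: keep the current rate
    }
}
```

With S = 0 the rate never changes, matching the flat baseline curve in Figure 8.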
When S is 1.0, slightly later convergence is shown, signifying that this strategy should work in practice. As with the momentum term on this dataset, the scarcity of local minima means that these functions have limited effect. Larger values of S could be trialled with other datasets to prove that the concept works.

4.7 - N-Fold Cross Validation
The process of cross validation is performed to estimate the error over the dataset. Small datasets, such as this one, are subject to the problem of overfitting, as previously mentioned; that is, where a high test error occurs because the network is specifically tuned to classify a small subset of the total data space. Cross validation produces an estimate of the test error as follows. Initially, the training dataset is split into N subsets, called folds. One fold is selected and removed from the dataset. The network is trained on the remaining N-1 folds and the error on the removed subset is calculated. This process is repeated for each fold, and an average of the error across the entire dataset is calculated. Figure 9 shows this process, as performed for the Ionosphere dataset using the network developed.
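The fold-by-fold procedure just described can be sketched as a generic skeleton. The trainAndTest function stands in for training the network on N-1 folds and measuring the error on the held-out fold; all names here are illustrative:

```java
// N-fold cross-validation skeleton: split the data into N folds, hold
// each fold out in turn, train on the remainder, and average the
// held-out error across all folds.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class CrossValidation {
    public static double estimate(List<double[]> data, int folds,
            BiFunction<List<double[]>, List<double[]>, Double> trainAndTest) {
        double totalError = 0.0;
        for (int f = 0; f < folds; f++) {
            List<double[]> train = new ArrayList<>();
            List<double[]> held = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                // Every folds-th example belongs to the held-out fold f.
                (i % folds == f ? held : train).add(data.get(i));
            }
            totalError += trainAndTest.apply(train, held);
        }
        return totalError / folds;   // average error across all folds
    }
}
```

Setting N equal to the number of training examples (275 here) gives the leave-one-out extreme at the right-hand edge of Figure 9.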
FIGURE 9: 10-FOLD TO 275-FOLD CROSS VALIDATION [R=5, H=8, M=0.0, L=0.5]

When compared with the actual error, the error given by cross-fold validation is greater than that given by the network. This is possibly due to the data being skewed, though this would have to be proven with an accurate plot of all elements in the test set. The graph shows that the lowest error is given by 10 folds. The rapid descent of this graph is probably due to a lack of transparency in this dataset. However, it does show that, to avoid overfitting, at least 100 training examples should be used. As this experiment uses 275, the assumption can be made that overfitting does not occur.

5 - Recommendations
Having conducted an exhaustive series of tests throughout this experiment, it is possible to compare the results gained with the initial hypotheses formed (Section 3.1). Initially, it was correct to assume that error would decrease as the number of hidden units increased. However, for computational speed, it was decided that the optimal value of 8 hidden units would form the network topology. Trials with batch and online learning at an early stage in the experiment allowed the decision for online learning to be made. This was selected as it changes the weights for each individual training example, rather than once per epoch; this significantly reduced the compound error throughout training. Online learning with 8 hidden units was carried forward as a strategy throughout the rest of the experiment. Trials were performed with the learning and momentum rates. The hypothesis suggested that a relatively low learning rate would be preferable, for the avoidance of the network bouncing around local minima. However, it appeared that the error surface was relatively even. As such, a medium learning rate (0.5 in this case) was selected as a compromise between classification accuracy and computational speed.
The calculation of momentum was not particularly necessary in this case, again due to the smooth error curve. Several values of the momentum rate were tried, but they had practically no effect. For ease, 0.0 was selected as a suitable momentum rate. Future work would see this increased slightly, to 0.1 or 0.2, ensuring that any small local minima could also be overcome by the network. The activation functions trialled suggested that the tanh function would yield more accurate results than the Logistic Sigmoid function. This was indeed true, though due to the greater range of the function, the learning rate had to be decreased to 1/5th of its original size (0.1 rather than 0.5). This meant that changes to the weights were more sensitive, which could lead to further inaccuracies later. Therefore, the Logistic Sigmoid function was employed instead, allowing greater freedom with the learning rate. Possibly the most suitable decision here was training over 500 epochs, which in the majority of cases would reach firm convergence. Although this took more computational time, it showed that the Hebbian Learning principle works effectively.
6 - Conclusions
This has been a particularly enjoyable task to undertake, and one which has definitely deepened my own knowledge of the neural network. Much attention had to be paid to the operation of the mathematical models and, once implemented, an understanding was gained of how these operated. Testing of the model ensured that it was iteratively and strategically validated before the actual testing could commence. This also ensured that I learned the precise details of how this commonly implemented network topology operates. The experimental procedures showed that there is no one solution or combination of variables that will work more effectively than all others; it all comes down to the testing scenario. The process of trading off accuracy (in error reduction) against complexity and computational time was an interesting tuning problem to attempt. I feel that the strategy which I executed was close to optimal for this particular dataset. However, were the network tested on a completely different dataset, it is likely that this strategy, too, would be completely different.

7 - References
The dataset chosen for this experimental work is the Ionosphere dataset, which is located at: http://www.cs.bham.ac.uk/~pxt/nc/assignment/bp.code.tar.gz

[1] Haykin, S. Neural Networks: A Comprehensive Foundation, Second Edition. Prentice Hall, 1999.
[2] IONOSPHERE Dataset, Johns Hopkins University. Found at the above link.
[3] Tino, P. Lecture 12, Introduction to Neural Computation, University of Birmingham, 2003.
[4] Bullinaria, J. Step by Step Guide to Implementing a Neural Network in C, University of Birmingham. See http://www.cs.bham.ac.uk/~jxb/nn/nn.html.
[5] Tino, P. C Implementation of a Feed-Forward Backpropagation Network, University of Birmingham. Found at the above link.