Artificial Neural Computation Systems


Artificial Neural Computation Systems, Spring 2003
Technical University of Szczecin, Department of Electrical Engineering
Lecturer: Prof. Adam Krzyzak, PS

5. Lecture 15.03.2003

1. Multilayer Perceptrons
   1.1 Structure
   1.2 Training
   1.3 Notation
   1.4 The cost function
   1.5 Adapting output neurons
   1.6 Adapting hidden neurons
   1.7 Summary so far
   1.8 Steepest Descent Training Modes
       Sequential Mode
       Batch Mode
   1.9 Activation Functions
       Logistic sigmoid
       Hyperbolic tangent function
       Softmax


1. Multilayer Perceptrons

Workhorse of ANNs. Universal approximator. Successfully applied to many difficult and diverse problems, most often function approximation.

1.1 Structure

An MLP consists of an input layer, one or more hidden layers, and an output layer (see Fig. 4.1). The activation functions of the hidden layer neurons are nonlinear (usually tanh or sigmoid); otherwise the total network would be equivalent to a single-layer linear network! Output layer activation functions can be linear or nonlinear, most often sigmoid, tanh, or softmax. The input signal propagates through the network in a forward direction, layer by layer.

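To make the layer-by-layer forward propagation concrete, here is a minimal sketch in Python/NumPy (not part of the original lecture); the layer sizes, the tanh hidden activation, and the linear output layer are illustrative assumptions.

```python
import numpy as np

def forward(x, weights, biases):
    """Propagate an input vector x through an MLP, layer by layer.

    weights[l], biases[l] parameterize layer l. Hidden layers use tanh,
    the output layer is linear; both choices are illustrative only.
    """
    y = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        v = W @ y + b                                   # local fields of layer l
        y = v if l == len(weights) - 1 else np.tanh(v)  # activation
    return y

# Example: 3 inputs -> 5 hidden (tanh) -> 2 linear outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3)), rng.standard_normal((2, 5))]
biases = [np.zeros(5), np.zeros(2)]
print(forward(np.array([0.1, -0.2, 0.3]), weights, biases))
```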

1.2 Training

Typically trained using a supervised error-correction learning algorithm with steepest descent, or one of its faster-converging friends. The trick lies in credit assignment: how to compute the effect of the weights in the hidden layers on the cost function, i.e. how to compute the gradient of the cost with respect to the weights. This is done using the so-called error back-propagation algorithm. Two passes through the layers of an MLP network:

1. In the forward pass, the response of the network to an input vector is computed and all the synaptic weights are kept fixed.

2. During the backward pass, the error signal is propagated backward through the network, and the weights are adjusted using an error-correction rule. After adjustment, the output of the network will be closer to the desired response.

1.3 Notation

The indices $i$, $j$ and $k$ refer to neurons in different layers, in that order from left to right. In iteration $n$, the $n$-th training vector is presented to the network. $E(n)$ refers to the instantaneous sum of error squares, or error energy, at iteration $n$.

- $E_{av}$ is the average of $E(n)$ over all $n$.
- $e_j(n)$ is the error signal at the output of neuron $j$ for iteration $n$.
- $d_j(n)$ is the desired response for neuron $j$.
- $y_j(n)$ is the output of neuron $j$ for iteration $n$.
- $w_{ji}(n)$ is the weight connecting the output of neuron $i$ to the input of neuron $j$ at iteration $n$. The correction applied to this weight is denoted by $\Delta w_{ji}(n)$.
- $v_j(n)$ denotes the local field of neuron $j$ at iteration $n$ (the weighted sum of inputs plus bias of that neuron).
- The activation function (nonlinearity) associated with neuron $j$ is denoted by $\varphi_j(\cdot)$.
- $b_j$ denotes the bias applied to neuron $j$, corresponding to the weight $w_{j0} = b_j$ and a fixed input $+1$.
- $x_i(n)$ denotes the $i$-th element of the input vector.
- $\eta$ denotes the learning-rate parameter.
- $m_l$ denotes the number of neurons in layer $l$. The network has $L$ layers. For the output layer, the notation $m_L = M$ is also used.

1.4 The cost function

The error signal at the output of neuron $j$ at iteration $n$ is defined by

$$e_j(n) = d_j(n) - y_j(n), \qquad (5.1)$$

when neuron $j$ is an output node of the whole network. The total instantaneous error energy $E(n)$ for all the neurons in the output layer is

$$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n), \qquad (5.2)$$

where the set $C$ contains all the neurons in the output layer. Let $N$ be the total number of training vectors (examples, patterns). Then the average squared error energy is

$$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n). \qquad (5.3)$$
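As a tiny illustration (with made-up error values, not from the lecture), Eqs. (5.2) and (5.3) can be computed directly:

```python
import numpy as np

# Error signals e_j(n) for N = 4 training vectors and 2 output neurons (made-up values)
e = np.array([[ 0.2, -0.1],
              [ 0.0,  0.3],
              [-0.4,  0.1],
              [ 0.1,  0.2]])

E_n = 0.5 * np.sum(e**2, axis=1)   # instantaneous error energy E(n), Eq. (5.2)
E_av = np.mean(E_n)                # average squared error energy E_av, Eq. (5.3)
print(E_n, E_av)
```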

The objective is to derive a learning algorithm for minimizing $E_{av}$ with respect to the free parameters. Weights are updated on a pattern-by-pattern basis during each epoch (an epoch is one complete presentation of the entire training set). In other words, an instantaneous stochastic gradient based on a single sample only. The average of these updates over one epoch estimates the gradient of $E_{av}$.

1.5 Adapting output neurons

Consider now Figure 4.3, showing neuron $j$. It is fed by a set of function signals produced by a layer of neurons to its left, indexed by $i$.

The local field $v_j(n)$ of neuron $j$ is clearly

$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n). \qquad (5.4)$$

The function signal $y_j(n)$ appearing at the output of neuron $j$ at iteration $n$ is then

$$y_j(n) = \varphi_j(v_j(n)). \qquad (5.5)$$

The correction $\Delta w_{ji}(n)$ made to the synaptic weight $w_{ji}(n)$ is proportional to the partial derivative $\partial E(n)/\partial w_{ji}(n)$ of the instantaneous error. Using the chain rule of calculus, this gradient can be expressed as follows:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}. \qquad (5.6)$$

Differentiating both sides of Eq. (5.2) with respect to $e_j(n)$, we get

$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n). \qquad (5.7)$$

Differentiating Eq. (5.1) with respect to $y_j(n)$ yields

$$\frac{\partial e_j(n)}{\partial y_j(n)} = -1. \qquad (5.8)$$

Differentiating Eq. (5.5) with respect to $v_j(n)$, we get

$$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n)), \qquad (5.9)$$

where $\varphi_j'$ denotes the derivative of $\varphi_j$. Finally, differentiating (5.4) with respect to $w_{ji}(n)$ yields

$$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n). \qquad (5.10)$$

Inserting these partial derivatives into (5.6) yields

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\,\varphi_j'(v_j(n))\, y_i(n). \qquad (5.11)$$

If we use steepest descent to optimize the weights, the correction $\Delta w_{ji}(n)$ applied to the weight $w_{ji}(n)$ is defined by the delta rule:

$$\Delta w_{ji}(n) = -\eta\, \frac{\partial E(n)}{\partial w_{ji}(n)}, \qquad (5.12)$$

where $\eta$ is the learning-rate parameter. Inserting (5.11) into (5.12) yields

$$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n), \qquad (5.13)$$

where the local gradient is defined by

$$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\,\varphi_j'(v_j(n)). \qquad (5.14)$$
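A minimal sketch of one delta-rule step for a single output neuron, following Eqs. (5.1), (5.4), (5.5), (5.13) and (5.14); the tanh activation and the numeric values are assumptions made only for this illustration.

```python
import numpy as np

def output_neuron_step(w_j, y_prev, d_j, eta=0.1):
    """One sequential-mode update of output neuron j.

    w_j    : weight vector of neuron j, with w_j[0] the bias
    y_prev : outputs of the previous layer, prepended with the fixed +1 input
    d_j    : desired response d_j(n)
    """
    v_j = w_j @ y_prev                        # local field, Eq. (5.4)
    y_j = np.tanh(v_j)                        # output, Eq. (5.5); tanh assumed here
    e_j = d_j - y_j                           # error signal, Eq. (5.1)
    delta_j = e_j * (1.0 - np.tanh(v_j)**2)   # local gradient, Eq. (5.14)
    w_j = w_j + eta * delta_j * y_prev        # delta-rule correction, Eq. (5.13)
    return w_j, delta_j

w_j, delta_j = output_neuron_step(np.array([0.0, 0.5, -0.3]),
                                  np.array([1.0, 0.2, 0.7]),   # +1 bias input first
                                  d_j=0.4)
print(w_j, delta_j)
```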

1.6 Adapting hidden neurons

There is no desired response available for neuron $j$. Question: how to compute the responsibility of this neuron for the error made at the output? The error signal for a hidden neuron can be determined recursively in terms of the error signals of all neurons connected to it. This makes back-propagation efficient. Figure 4.4 illustrates the situation where neuron $j$ is a hidden node. Using Eq. (5.14), we may rewrite the local gradient $\delta_j(n)$ for hidden neuron $j$ as follows:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi_j'(v_j(n)). \qquad (5.15)$$

The partial derivative $\partial E(n)/\partial y_j(n)$ may be calculated as follows.

From Figure 4.4 we see that

$$E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n), \quad \text{neuron } k \text{ is an output node}. \qquad (5.16)$$

Differentiating this with respect to the function signal $y_j(n)$ and using the chain rule we get

$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}. \qquad (5.17)$$

From Figure 4.4 we note that when neuron $k$ is an output node,

$$e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n)), \qquad (5.18)$$

so that

$$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'(v_k(n)). \qquad (5.19)$$

Figure 4.4 shows also that the local field of neuron $k$ is

$$v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n), \qquad (5.20)$$

where the bias term is again included as the weight $w_{k0}(n)$. Differentiating this with respect to $y_j(n)$ yields

$$\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n). \qquad (5.21)$$

Inserting these expressions into (5.17) we get the desired partial derivative

$$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k(n)\,\varphi_k'(v_k(n))\, w_{kj}(n) = -\sum_k \delta_k(n)\, w_{kj}(n). \qquad (5.22)$$

Here again $\delta_k(n)$ denotes the local gradient for neuron $k$. Finally, inserting (5.22) into (5.15) yields the back-propagation formula for the local gradient $\delta_j(n)$:

$$\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n). \qquad (5.23)$$

This holds when neuron $j$ is hidden.

Let us briefly study the factors in this formula:

$$\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$$

- $\varphi_j'(v_j(n))$ depends solely on the activation function $\varphi_j(\cdot)$ of the hidden neuron $j$.
- The local gradients $\delta_k(n)$ require knowledge of the error signals $e_k(n)$ of the neurons in the next (right-hand side) layer.
- The synaptic weights $w_{kj}(n)$ describe the connections of neuron $j$ to the neurons in the next layer to the right.

Thus propagating the errors backwards takes only about the same amount of computation as the forward pass.

1.7 Summary so far

The correction $\Delta w_{ji}(n)$ of the weight connecting neuron $i$ to neuron $j$ is described by Eq. (5.13); in Haykin, Eq. (4.25) above Figure 4.5.

The local gradient $\delta_j(n)$ is computed from Eq. (5.14) if neuron $j$ lies in the output layer. If neuron $j$ lies in a hidden layer, the local gradient is computed from Eq. (5.23). So, back-propagation is a computationally efficient way of evaluating the gradient of the cost function with respect to the weights of the neurons in the hidden layers. Once we have the gradient, any optimization method can be applied; we are not limited to steepest descent only. Thus "back-propagation network" is an incorrect term! Back-propagation ≠ steepest descent, as often erroneously appears in the literature!
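To summarize the whole procedure in code, here is a minimal sketch of one sequential-mode training step for an MLP with a single hidden layer of logistic sigmoid units, combining Eqs. (5.13), (5.14) and (5.23); the architecture, the omission of biases, and the parameter values are illustrative assumptions, not the lecture's own implementation.

```python
import numpy as np

def sigmoid(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def train_step(x, d, W1, W2, eta=0.1, a=1.0):
    """One forward + backward pass for a one-hidden-layer MLP (biases omitted).

    W1: hidden-layer weights (n_hidden x n_in)
    W2: output-layer weights (n_out x n_hidden)
    """
    # Forward pass: weights kept fixed
    v1 = W1 @ x;  y1 = sigmoid(v1, a)              # hidden layer
    v2 = W2 @ y1; y2 = sigmoid(v2, a)              # output layer
    # Backward pass: propagate the error signal
    e = d - y2                                     # error signals, Eq. (5.1)
    delta2 = e * a * y2 * (1 - y2)                 # output local gradients, Eq. (5.14) with phi' = a*y*(1-y)
    delta1 = a * y1 * (1 - y1) * (W2.T @ delta2)   # hidden local gradients, Eq. (5.23)
    # Delta-rule corrections, Eq. (5.13)
    W2 = W2 + eta * np.outer(delta2, y1)
    W1 = W1 + eta * np.outer(delta1, x)
    return W1, W2, 0.5 * np.sum(e**2)              # updated weights and E(n), Eq. (5.2)

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
W1, W2, E_n = train_step(np.array([0.5, -1.0, 0.2]), np.array([1.0, 0.0]), W1, W2)
print(E_n)
```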

1.8 Steepest Descent Training Modes

One complete presentation of the entire training set is called an epoch. The learning process is continued over several epochs. Learning could be stopped when the weight values and biases stabilize, and the average squared error converges to some minimum value (more later!). It is useful to present the training samples in a randomized order during each epoch. One may use either the sequential (on-line, stochastic) or the batch learning mode.

Sequential Mode

The weights are updated after presenting each training example (input vector). This is what we derived earlier!

Advantages:
- Simple to implement
- Requires less storage
- Perhaps less likely to get trapped in a local minimum

Batch Mode

The weights are updated after each epoch only.

All the training examples are presented once before updating the weights and biases. In batch mode, the cost function is the average squared error

$$E_{av} = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} e_j^2(n). \qquad (5.24)$$

The synaptic weight is updated using the batch delta rule

$$\Delta w_{ji} = -\eta\, \frac{\partial E_{av}}{\partial w_{ji}} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n)\, \frac{\partial e_j(n)}{\partial w_{ji}}. \qquad (5.25)$$

The partial derivative $\partial e_j(n)/\partial w_{ji}$ may be computed as in the sequential mode.

Advantages:
- Provides an accurate estimate of the gradient vector
- Convergence to a local minimum, at least, is guaranteed
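The sketch below contrasts the two modes under the assumption of a per-pattern gradient routine grad_fn(x, d, W) returning the gradient of E(n) with respect to W (a hypothetical helper, not defined in the lecture): sequential mode updates after every pattern, while batch mode applies one update per epoch from the averaged gradient, cf. Eq. (5.25).

```python
import numpy as np

def sequential_epoch(patterns, targets, W, grad_fn, eta=0.1):
    """Sequential (on-line) mode: update W after each training example,
    presenting the examples in a randomized order."""
    for n in np.random.permutation(len(patterns)):
        W = W - eta * grad_fn(patterns[n], targets[n], W)
    return W

def batch_epoch(patterns, targets, W, grad_fn, eta=0.1):
    """Batch mode: accumulate the per-pattern gradients over the whole epoch
    and apply a single update from their average, cf. Eq. (5.25)."""
    G = np.zeros_like(W)
    for x, d in zip(patterns, targets):
        G += grad_fn(x, d, W)
    return W - eta * G / len(patterns)
```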

1.9 Activation Functions

The derivative of the activation function $\varphi(\cdot)$ is needed in computing the local gradient $\delta$. Therefore, $\varphi(\cdot)$ must be continuous and differentiable. It would be computationally advantageous if $\varphi'(v)$ were expressible as a function of the output $\varphi(v)$ only. In MLP networks, two forms of sigmoidal nonlinearities are commonly used as activation functions:

1. Logistic sigmoid
2. Hyperbolic tangent

A generalization of the logistic function is:

3. Softmax

Logistic sigmoid

$$y = \varphi(v) = \frac{1}{1 + \exp(-av)}, \qquad a > 0, \quad -\infty < v < \infty$$

For clarity, we have omitted here the neuron index $j$ and the iteration number $n$. The range of $\varphi(v)$, and hence the output $y = \varphi(v)$, always lies in the interval $0 \le y \le 1$. The derivative of $y = \varphi(v)$ can be expressed in terms of the output $y$ as

$$\varphi'(v) = a\, y\, (1 - y)$$

This formula allows writing the local gradient $\delta_j(n)$ in a simple form. If neuron $j$ is an output node,

$$\delta_j(n) = e_j(n)\,\varphi_j'(v_j(n)) = a\,[d_j(n) - y_j(n)]\, y_j(n)\,[1 - y_j(n)]. \qquad (5.26)$$

And for a hidden node:

$$\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n) = a\, y_j(n)\,[1 - y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n). \qquad (5.27)$$
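A quick numeric check (with made-up values) of the computational convenience above: for the logistic sigmoid, the derivative needed in the local gradient can be formed from the already-computed output y alone.

```python
import numpy as np

a = 1.5                                  # slope parameter, illustrative value
v = np.array([-2.0, 0.0, 0.7])           # made-up local fields
y = 1.0 / (1.0 + np.exp(-a * v))         # forward-pass outputs

phi_prime_from_y = a * y * (1.0 - y)                               # uses only y
phi_prime_direct = a * np.exp(-a * v) / (1.0 + np.exp(-a * v))**2  # direct differentiation

print(np.allclose(phi_prime_from_y, phi_prime_direct))   # True
```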

Hyperbolic tangent function

$$y = \varphi(v) = a \tanh(bv),$$

where $a$ and $b$ are positive constants. In fact, the hyperbolic tangent is just the logistic function rescaled and biased. Its derivative with respect to $v$ is

$$\varphi'(v) = ab\,[1 - \tanh^2(bv)] = \frac{b}{a}\,[a - y]\,[a + y].$$

Using this, the local gradients of output neurons and hidden neurons can be simplified to Eqs. (4.37) and (4.38) in Haykin.
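Similarly, a short numeric check (the constants are illustrative) that the two forms of the tanh derivative agree, so that here too the derivative can be computed from the output y:

```python
import numpy as np

a, b = 1.7, 0.7                             # positive constants, illustrative values
v = np.linspace(-2.0, 2.0, 5)
y = a * np.tanh(b * v)

form1 = a * b * (1.0 - np.tanh(b * v)**2)   # ab[1 - tanh^2(bv)]
form2 = (b / a) * (a - y) * (a + y)         # (b/a)[a - y][a + y]

print(np.allclose(form1, form2))            # True
```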


Softmax

$$y_k = \varphi(v_k) = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})}$$

Used in the output layer for classification problems with 1-of-$C$ target value coding. (There are as many output neurons as there are classes to recognize. Only the output target for the correct class is one; the others are zeros.) Softmax provides posterior probabilities of class membership; it is a generalization of the logistic function for multiple classes. One of the output units can be left unconnected, since the outputs sum to one.
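A minimal softmax sketch (the subtraction of the maximum is a standard numerical-stability detail, not something discussed in the lecture):

```python
import numpy as np

def softmax(v):
    """Softmax over the output local fields v_k; the outputs sum to one."""
    z = np.exp(v - np.max(v))   # subtract max for numerical stability
    return z / np.sum(z)

v = np.array([1.2, 0.3, -0.8])   # made-up local fields for a 3-class problem
y = softmax(v)
print(y, y.sum())                # posterior-like class memberships summing to 1
```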