Artificial Neural Computation Systems

Spring 2003
Technical University of Szczecin
Department of Electrical Engineering
Lecturer: Prof. Adam Krzyzak, PS

5. Lecture 15.03.2003
1. Multilayer Perceptrons
1.1 Structure
1.2 Training
1.3 Notation
1.4 The cost function
1.5 Adapting output neurons
1.6 Adapting hidden neurons
1.7 Summary so far
1.8 Steepest Descent Training Modes
    Sequential Mode
    Batch Mode
1.9 Activation Functions
    Logistic sigmoid
    Hyperbolic tangent function
    Softmax

5. Lecture 15.03.2003

1. Multilayer Perceptrons

- Workhorse of ANNs
- Universal approximator
- Successfully applied to many difficult and diverse problems, most often function approximation

1.1 Structure

An MLP consists of an input layer, one or more hidden layers, and an output layer (see Fig. 4.1).

The activation functions of the hidden-layer neurons are nonlinear (usually tanh or sigmoid); otherwise the total network would be equivalent to a single-layer linear network!

Output-layer activation functions can be linear or nonlinear, most often sigmoid, tanh, or softmax.

The input signal propagates through the network in a forward direction, layer by layer.
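The remark that linear hidden units collapse the network into a single linear layer can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1` and `W2`, the input `x`, and the helper names are made up for illustration, and biases are omitted for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical weights: 2 inputs -> 3 linear hidden neurons -> 2 linear outputs.
W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]
x = [0.7, -1.2]

# Two linear layers applied one after the other ...
two_layer_output = matvec(W2, matvec(W1, x))
# ... give the same result as one linear layer with weights W2 * W1.
single_layer_output = matvec(matmul(W2, W1), x)
```

With a nonlinear hidden activation such as tanh, no single weight matrix reproduces the two-layer mapping in general, which is why the hidden nonlinearity matters.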

1.2 Training

Typically trained using a supervised error-correction learning algorithm with steepest descent, or one of its faster-converging friends.

The trick lies in credit assignment: how to compute the effect of the weights in the hidden layers on the cost function, that is, how to compute the gradient of the cost with respect to the weights. This is done using the so-called error back-propagation algorithm.

Two passes through the layers of an MLP network:

1. In the forward pass, the response of the network to an input vector is computed, and all the synaptic weights are kept fixed.
2. During the backward pass, the error signal is propagated backward through the network, and the weights are adjusted using an error-correction rule.

After adjustment, the output of the network will be closer to the desired response.

1.3 Notation

- The indices $i$, $j$ and $k$ refer to neurons in different layers, in that order from left to right.
- In iteration $n$, the $n$-th training vector is presented to the network.
- $E(n)$ refers to the instantaneous sum of error squares, or error energy, at iteration $n$.

- $E_{av}$ is the average of $E(n)$ over all $n$.
- $e_j(n)$ is the error signal at the output of neuron $j$ for iteration $n$.
- $d_j(n)$ is the desired response for neuron $j$.
- $y_j(n)$ is the output of neuron $j$ for iteration $n$.
- $w_{ji}(n)$ is the weight connecting the output of neuron $i$ to the input of neuron $j$ at iteration $n$. The correction applied to this weight is denoted by $\Delta w_{ji}(n)$.
- $v_j(n)$ denotes the local field of neuron $j$ at iteration $n$ (the weighted sum of inputs plus bias of that neuron).
- The activation function (nonlinearity) associated with neuron $j$ is denoted by $\varphi_j(\cdot)$.
- $b_j$ denotes the bias applied to neuron $j$, corresponding to the weight $w_{j0} = b_j$ and a fixed input $+1$.
- $x_i(n)$ denotes the $i$-th element of the input vector.
- $\eta$ denotes the learning-rate parameter.
- $m_l$ denotes the number of neurons in layer $l$.
  - The network has $L$ layers.
  - For the output layer, the notation $m_L = M$ is also used.

1.4 The cost function

The error signal at the output of neuron $j$ at iteration $n$ is defined by

$$ e_j(n) = d_j(n) - y_j(n) \tag{5.1} $$

when neuron $j$ is an output node of the whole network.

The total instantaneous error energy $E(n)$ for all the neurons in the output layer is

$$ E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n) \tag{5.2} $$

where the set $C$ contains all the neurons in the output layer.

Let $N$ be the total number of training vectors (examples, patterns). Then the average squared error energy is

$$ E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n) \tag{5.3} $$
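As a small worked illustration of the instantaneous error energy and its average over the training set, the sketch below evaluates both for two made-up training patterns. The function names and the numbers in `desired_set` and `output_set` are assumptions chosen only for this example.

```python
def instantaneous_error_energy(desired, output):
    """E(n): half the sum of squared error signals e_j = d_j - y_j over the output neurons."""
    return 0.5 * sum((d - y) ** 2 for d, y in zip(desired, output))

def average_error_energy(desired_set, output_set):
    """E_av: mean of E(n) over all N training patterns."""
    N = len(desired_set)
    total = sum(instantaneous_error_energy(d, y)
                for d, y in zip(desired_set, output_set))
    return total / N

# Hypothetical desired responses and network outputs for N = 2 patterns, 2 output neurons.
desired_set = [[1.0, 0.0], [0.0, 1.0]]
output_set = [[0.8, 0.1], [0.3, 0.6]]

E_values = [instantaneous_error_energy(d, y)
            for d, y in zip(desired_set, output_set)]
E_av = average_error_energy(desired_set, output_set)
```

For these numbers the two instantaneous energies are 0.5(0.04 + 0.01) and 0.5(0.09 + 0.16), and their mean is the average error energy.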

The objective is to derive a learning algorithm for minimizing $E_{av}$ with respect to the free parameters.

Weights are updated on a pattern-by-pattern basis during each epoch (an epoch is one complete presentation of the entire training set). In other words, each update uses an instantaneous stochastic gradient based on a single sample only. The average of these updates over one epoch estimates the gradient of $E_{av}$.

1.5 Adapting output neurons

Consider now Figure 4.3 showing neuron j. It is fed by a set of function signals produced by a layer of neurons to its left, indexed by i.

The local field $v_j(n)$ of neuron $j$ is clearly

$$ v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \tag{5.4} $$

The function signal $y_j(n)$ appearing at the output of neuron $j$ at iteration $n$ is then

$$ y_j(n) = \varphi_j(v_j(n)) \tag{5.5} $$

The correction $\Delta w_{ji}(n)$ made to the synaptic weight $w_{ji}(n)$ is proportional to the partial derivative $\partial E(n) / \partial w_{ji}(n)$ of the instantaneous error. Using the chain rule of calculus, this gradient can be expressed as follows:

$$ \frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)} \tag{5.6} $$

Differentiating both sides of Eq. (5.2) with respect to $e_j(n)$, we get

$$ \frac{\partial E(n)}{\partial e_j(n)} = e_j(n) \tag{5.7} $$

Differentiating Eq. (5.1) with respect to $y_j(n)$ yields

$$ \frac{\partial e_j(n)}{\partial y_j(n)} = -1 \tag{5.8} $$

Differentiating Eq. (5.5) with respect to $v_j(n)$, we get

$$ \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n)) \tag{5.9} $$

where $\varphi_j'$ denotes the derivative of $\varphi_j$. Finally, differentiating (5.4) with respect to $w_{ji}(n)$ yields

$$ \frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n) \tag{5.10} $$

Inserting these partial derivatives into (5.6) yields

$$ \frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi_j'(v_j(n))\, y_i(n) \tag{5.11} $$

If we use steepest descent to optimize the weights, the correction $\Delta w_{ji}(n)$ applied to the weight $w_{ji}(n)$ is defined by the delta rule:

$$ \Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)} \tag{5.12} $$

where $\eta$ is the learning-rate parameter. Inserting (5.11) into (5.12) yields

$$ \Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n) \tag{5.13} $$

where the local gradient is defined by

$$ \delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi_j'(v_j(n)) \tag{5.14} $$
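The delta rule for a single output neuron can be sketched in a few lines and checked against a finite-difference gradient. This is an illustrative sketch only: the logistic activation, the weights `w`, the inputs `y_in`, the target `d`, and the learning rate `eta` are assumptions chosen for the example, not values from the lecture.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def logistic_prime(v):
    y = logistic(v)
    return y * (1.0 - y)

def error_energy(w, y_in, d):
    """Instantaneous error energy of one output neuron with logistic activation."""
    v = sum(wi * yi for wi, yi in zip(w, y_in))
    return 0.5 * (d - logistic(v)) ** 2

def output_neuron_step(w, y_in, d, eta):
    """One delta-rule update; y_in[0] is the fixed +1 input, so w[0] acts as the bias."""
    v = sum(wi * yi for wi, yi in zip(w, y_in))   # local field
    e = d - logistic(v)                           # error signal
    delta = e * logistic_prime(v)                 # local gradient
    grad = [-delta * yi for yi in y_in]           # dE/dw_ji
    new_w = [wi + eta * delta * yi for wi, yi in zip(w, y_in)]
    return new_w, grad

# Hypothetical values for illustration.
w = [0.1, -0.3, 0.5]
y_in = [1.0, 0.4, -0.7]
d = 1.0
eta = 0.5

new_w, grad = output_neuron_step(w, y_in, d, eta)

# Central-difference estimate of dE/dw_ji for comparison.
h = 1e-6
numeric_grad = []
for i in range(len(w)):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[i] += h
    w_minus[i] -= h
    numeric_grad.append(
        (error_energy(w_plus, y_in, d) - error_energy(w_minus, y_in, d)) / (2 * h))

error_before = error_energy(w, y_in, d)
error_after = error_energy(new_w, y_in, d)
```

The analytic gradient from the chain rule should agree with the finite-difference estimate, and for a small enough learning rate the error energy for this pattern decreases after the update.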

1.6 Adapting hidden neurons

There is no desired response available for neuron $j$.

Question: How to compute the responsibility of this neuron for the error made at the output?

The error signal for a hidden neuron can be determined recursively in terms of the error signals of all neurons connected to it. This makes back-propagation efficient.

Figure 4.4 illustrates the situation where neuron $j$ is a hidden node.

Using Eq. (5.14), we may rewrite the local gradient $\delta_j(n)$ for hidden neuron $j$ as follows:

$$ \delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi_j'(v_j(n)) \tag{5.15} $$

The partial derivative $\partial E(n) / \partial y_j(n)$ may be calculated as follows.

From Figure 4.4 we see that

$$ E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n), \quad \text{neuron } k \text{ is an output node} \tag{5.16} $$

Differentiating this with respect to the function signal $y_j(n)$ and using the chain rule we get

$$ \frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} \tag{5.17} $$

From Figure 4.4 we note that when the neuron $k$ is an output node

$$ e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n)) \tag{5.18} $$

so that

$$ \frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'(v_k(n)) \tag{5.19} $$

Figure 4.4 shows also that the local field of neuron $k$ is

$$ v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n) \tag{5.20} $$

where the bias term is again included as the weight $w_{k0}(n)$. Differentiating this with respect to $y_j(n)$ yields

$$ \frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n) \tag{5.21} $$

Inserting these expressions into (5.17) we get the desired partial derivative

$$ \frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k(n)\, \varphi_k'(v_k(n))\, w_{kj}(n) = -\sum_k \delta_k(n)\, w_{kj}(n) \tag{5.22} $$

Here again $\delta_k(n)$ denotes the local gradient for neuron $k$.

Finally, inserting (5.22) into (5.15) yields the back-propagation formula for the local gradient $\delta_j(n)$:

$$ \delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n) \tag{5.23} $$

This holds when neuron $j$ is hidden.

Let us briefly study the factors in this formula:

$$ \delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n) $$

- $\varphi_j'(v_j(n))$ depends solely on the activation function $\varphi_j(\cdot)$ of the hidden neuron $j$.
- The local gradients $\delta_k(n)$ require knowledge of the error signals $e_k(n)$ of the neurons in the next (right-hand side) layer.
- The synaptic weights $w_{kj}(n)$ describe the connections of neuron $j$ to the neurons in the next layer to the right.

Thus propagating the errors backwards takes only about the same amount of computation as the forward pass.
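The back-propagation formula for hidden neurons can be tried out on a tiny network. The sketch below assumes one hidden layer, tanh activations in both layers, and made-up weights, input, and targets; all function and variable names are hypothetical. It computes the local gradients of the output layer first, propagates them back to the hidden layer, and compares the resulting hidden-weight gradient with a finite-difference estimate.

```python
import math

def phi(v):
    return math.tanh(v)

def phi_prime(v):
    return 1.0 - math.tanh(v) ** 2

def forward(W_hidden, W_out, x):
    """Forward pass; index 0 of each input vector is the fixed +1 bias input."""
    y0 = [1.0] + list(x)
    v_h = [sum(w * y for w, y in zip(row, y0)) for row in W_hidden]
    y_h = [1.0] + [phi(v) for v in v_h]
    v_o = [sum(w * y for w, y in zip(row, y_h)) for row in W_out]
    y_o = [phi(v) for v in v_o]
    return y0, v_h, v_o, y_o

def error_energy(W_hidden, W_out, x, d):
    y_o = forward(W_hidden, W_out, x)[3]
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y_o))

def hidden_weight_gradient(W_hidden, W_out, x, d):
    """dE/dw_ji for the hidden layer via back-propagated local gradients."""
    y0, v_h, v_o, y_o = forward(W_hidden, W_out, x)
    # Output layer: delta_k = e_k * phi'(v_k)
    delta_out = [(dk - yk) * phi_prime(vk) for dk, yk, vk in zip(d, y_o, v_o)]
    # Hidden layer: delta_j = phi'(v_j) * sum_k delta_k * w_kj
    # (W_out[k][j + 1] connects hidden neuron j to output k; W_out[k][0] is the bias)
    delta_hidden = [
        phi_prime(v_h[j]) * sum(delta_out[k] * W_out[k][j + 1]
                                for k in range(len(W_out)))
        for j in range(len(v_h))
    ]
    return [[-dj * yi for yi in y0] for dj in delta_hidden]

def numeric_hidden_gradient(W_hidden, W_out, x, d, h=1e-6):
    """Central-difference estimate of the same gradient, for checking only."""
    grad = []
    for j in range(len(W_hidden)):
        row = []
        for i in range(len(W_hidden[j])):
            plus = [r[:] for r in W_hidden]
            minus = [r[:] for r in W_hidden]
            plus[j][i] += h
            minus[j][i] -= h
            row.append((error_energy(plus, W_out, x, d)
                        - error_energy(minus, W_out, x, d)) / (2 * h))
        grad.append(row)
    return grad

# Hypothetical network: 2 inputs, 2 hidden neurons, 2 outputs.
W_hidden = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
W_out = [[0.2, -0.5, 0.3], [0.1, 0.6, -0.4]]
x = [0.5, -1.0]
d = [0.3, -0.2]

analytic = hidden_weight_gradient(W_hidden, W_out, x, d)
numeric = numeric_hidden_gradient(W_hidden, W_out, x, d)
```

Note that the finite-difference check needs two forward passes per weight, whereas the back-propagated gradient needs one forward and one backward pass in total, which illustrates the efficiency argument made above.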

1.7 Summary so far

The correction $\Delta w_{ji}(n)$ of the weight connecting neuron $i$ to neuron $j$ is described by Eq. (5.13) (in Haykin, Eq. (4.25), above Figure 4.5).

The local gradient $\delta_j(n)$ is computed from Eq. (5.14) if neuron $j$ lies in the output layer.

If neuron $j$ lies in a hidden layer, the local gradient is computed from Eq. (5.23).

So, back-propagation = a computationally efficient way of evaluating the gradient of the cost function with respect to the weights of the neurons in the hidden layers.

Once we have the gradient, any optimization method can be applied; we are not limited to steepest descent only.

Thus "back-propagation network" is an incorrect term! Back-propagation ≠ steepest descent, even though the two are often erroneously equated in the literature!

1.8 Steepest Descent Training Modes

One complete presentation of the entire training set is called an epoch. The learning process is continued over several epochs.

Learning could be stopped when the weight values and biases stabilize, and the average squared error converges to some minimum value (more later!).

It is useful to present the training samples in a randomized order during each epoch.

One may use either sequential (on-line, stochastic) or batch learning mode.

Sequential Mode

The weights are updated after presenting each training example (input vector). This is what we derived earlier!

Advantages:

- Simple to implement
- Requires less storage
- Perhaps less likely to get trapped in a local minimum

Batch Mode

The weights are updated after each epoch only.

All the training examples are presented once before updating the weights and biases.

In batch mode, the cost function is the average squared error

$$ E_{av} = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} e_j^2(n) \tag{5.24} $$

The synaptic weight is updated using the batch delta rule

$$ \Delta w_{ji} = -\eta \frac{\partial E_{av}}{\partial w_{ji}} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n) \frac{\partial e_j(n)}{\partial w_{ji}} \tag{5.25} $$

The partial derivative $\partial e_j(n) / \partial w_{ji}$ may be computed as in the sequential mode.

Advantages:

- Provides an accurate estimate of the gradient vector
- Convergence to at least a local minimum is guaranteed
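The two training modes differ only in when the weights change. The sketch below contrasts them on a deliberately simple case: a single linear neuron (identity activation) fitted to noise-free data. The data set, the learning rate, the number of epochs, and the function names are all assumptions made for this illustration.

```python
import random

def error(w, x, d):
    """Error signal e = d - y for a linear neuron y = sum_i w_i * x_i."""
    return d - sum(wi * xi for wi, xi in zip(w, x))

def sequential_epoch(w, data, eta, rng):
    """Update the weights after every pattern, in a fresh random order."""
    w = list(w)
    order = list(range(len(data)))
    rng.shuffle(order)
    for n in order:
        x, d = data[n]
        e = error(w, x, d)
        w = [wi + eta * e * xi for wi, xi in zip(w, x)]
    return w

def batch_epoch(w, data, eta):
    """Accumulate over all patterns with fixed weights, then update once."""
    N = len(data)
    accumulated = [0.0] * len(w)
    for x, d in data:
        e = error(w, x, d)
        # For a linear neuron de/dw_i = -x_i, so -e * de/dw_i = e * x_i.
        accumulated = [a + e * xi for a, xi in zip(accumulated, x)]
    return [wi + eta * a / N for wi, a in zip(w, accumulated)]

# Hypothetical data generated by d = -1 + 2 * x1; x[0] = 1 is the bias input.
data = [([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0),
        ([1.0, 2.0], 3.0), ([1.0, -1.0], -3.0)]

rng = random.Random(0)   # fixed seed so the shuffling is reproducible
w_seq = [0.0, 0.0]
w_batch = [0.0, 0.0]
for epoch in range(200):
    w_seq = sequential_epoch(w_seq, data, eta=0.1, rng=rng)
    w_batch = batch_epoch(w_batch, data, eta=0.1)
```

On this toy problem both modes approach the generating weights (-1, 2). Sequential mode makes N weight changes per epoch and needs no accumulator, while batch mode makes one change per epoch using the exact gradient of the average error.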

1.9 Activation Functions

The derivative of the activation function $\varphi(\cdot)$ is needed in computing the local gradient $\delta$. Therefore, $\varphi(\cdot)$ must be continuous and differentiable.

It would be computationally advantageous if $\varphi'(v)$ is expressible as a function of $v$ and $\varphi(v)$ only.

In MLP networks, two forms of sigmoidal nonlinearities are commonly used as activation functions:

1. Logistic sigmoid
2. Hyperbolic tangent

A generalization of the logistic function is:

3. Softmax

Logistic sigmoid

$$ y = \varphi(v) = \frac{1}{1 + \exp(-av)}, \quad a > 0, \quad -\infty < v < \infty $$

For clarity, we have omitted here the neuron index $j$ and the iteration number $n$.

The range of $\varphi(v)$, and hence the output $y = \varphi(v)$, always lies in the interval $0 \le y \le 1$.

The derivative of $y = \varphi(v)$ can be expressed in terms of the output $y$ as

$$ \varphi'(v) = a\, y\, (1 - y) $$

This formula allows writing the local gradient $\delta_j(n)$ in a simple form. If neuron $j$ is an output node,

$$ \delta_j(n) = e_j(n)\, \varphi_j'(v_j(n)) = a\, [d_j(n) - y_j(n)]\, y_j(n)\, [1 - y_j(n)] \tag{5.26} $$

And for a hidden node:

$$ \delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n) = a\, y_j(n)\, [1 - y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n) \tag{5.27} $$
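The practical benefit of the logistic derivative formula is that the derivative is obtained from the already computed output, with no extra exponential. A minimal sketch follows, with an arbitrary slope `a`, arbitrary test points, and hypothetical function names; it compares the output-based derivative with a finite-difference estimate.

```python
import math

def logistic(v, a=1.0):
    """Logistic sigmoid with slope parameter a > 0."""
    return 1.0 / (1.0 + math.exp(-a * v))

def logistic_prime_from_output(y, a=1.0):
    """Derivative expressed through the output y = logistic(v, a) only."""
    return a * y * (1.0 - y)

a = 2.0                               # illustrative slope parameter
test_points = [-1.5, 0.0, 0.3, 2.0]   # illustrative local field values
h = 1e-6

from_output = [logistic_prime_from_output(logistic(v, a), a) for v in test_points]
numeric = [(logistic(v + h, a) - logistic(v - h, a)) / (2 * h) for v in test_points]
```

At `v = 0` the output is 0.5, so the derivative takes its maximum value `a / 4`; it shrinks toward zero as the output approaches 0 or 1, which is where the local gradients become small.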

Hyperbolic tangent function

$$ y = \varphi(v) = a \tanh(bv) $$

where $a$ and $b$ are positive constants.

In fact, the hyperbolic tangent is just the logistic function rescaled and biased.

Its derivative with respect to $v$ is

$$ \varphi'(v) = ab\, [1 - \tanh^2(bv)] = \frac{b}{a}\, [a - y]\, [a + y] $$

Using this, the local gradients of output neurons and hidden neurons can be simplified to Eqs. (4.37) and (4.38) in Haykin.
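Both statements about the hyperbolic tangent can be verified numerically: the derivative written in terms of the output, and the relation to the logistic function, which in its plainest form reads tanh(v) = 2 logistic(2v) - 1. In the sketch below the constants `a` and `b` and the test point `v` are illustrative choices, not values given in these notes, and the function names are hypothetical.

```python
import math

def scaled_tanh(v, a, b):
    return a * math.tanh(b * v)

def scaled_tanh_prime_from_output(y, a, b):
    """Derivative expressed through the output y = a * tanh(b * v) only."""
    return (b / a) * (a - y) * (a + y)

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

a, b = 1.7159, 2.0 / 3.0   # illustrative constants (assumption, not from the notes)
v = 0.8                    # illustrative local field value

y = scaled_tanh(v, a, b)
from_output = scaled_tanh_prime_from_output(y, a, b)
direct = a * b * (1.0 - math.tanh(b * v) ** 2)

# tanh as a rescaled (factor 2) and biased (minus 1) logistic function.
rescaled_logistic = 2.0 * logistic(2.0 * v) - 1.0
```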

Softmax

$$ y_k = \varphi(v_k) = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})} $$

- Used in the output layer for classification problems with 1-of-C target value coding. (There are as many output neurons as there are classes to recognize. Only the output target for the correct class is one; the others are zeros.)
- Softmax provides posterior probabilities of class membership.
- It is a generalization of the logistic function for multiple classes.
- One of the output units can be left unconnected: since all outputs sum to one, its value is determined by the others.
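A short sketch of softmax in plain Python is given below. Subtracting the largest local field before exponentiating is a common numerical-stability measure added here as an assumption; it is not part of the lecture formula, and it does not change the result because the common factor cancels in the ratio. The example also illustrates the two-class case, where softmax with one local field fixed at zero reduces to the logistic function.

```python
import math

def softmax(v):
    """Softmax over a list of local fields v_k."""
    m = max(v)                               # stability shift (assumption, see lead-in)
    exps = [math.exp(vk - m) for vk in v]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical local fields of three output neurons.
probabilities = softmax([2.0, 1.0, 0.1])

# Two classes, second local field fixed at zero: first output equals logistic(v).
v = 0.7
two_class = softmax([v, 0.0])

# Large local fields do not overflow thanks to the shift.
large = softmax([1000.0, 1000.0])
```

The outputs are positive and sum to one, so they can be read as class membership probabilities, and the largest local field always receives the largest probability.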