Multilayer Perceptrons

Homer Beasley
7 years ago

1 Multilayer Perceptrons: 2nd Order Learning Algorithms
Made with OpenOffice.org

2 Why 2nd Order?
Gradient descent:
  Back-propagation is used to obtain the first derivatives w.r.t. the weights.
  It oscillates and takes a long time to converge.
2nd-order methods use curvature information to choose a better step direction.

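3 Newton's Method
Local quadratic approximation to the error function.
Use the second-order Taylor series expansion of E at the point w:
$$E(w + \Delta w) \approx E(w) + \Delta w^T \nabla E(w) + \tfrac{1}{2} \Delta w^T \nabla^2 E(w) \Delta w$$
Differentiate w.r.t. $\Delta w$ and set the result to zero to minimize:
$$\nabla E(w) + \nabla^2 E(w) \Delta w = 0$$
$$\Delta w = -\left(\nabla^2 E(w)\right)^{-1} \nabla E(w)$$
$$\Delta w = -H^{-1} \nabla E(w)$$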
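As a quick check of the update $\Delta w = -H^{-1} \nabla E(w)$, here is a minimal sketch that applies one Newton step to a two-weight quadratic error. The error function, its coefficients, the starting point, and the helper names (`grad`, `solve_2x2`, `newton_step`) are illustrative assumptions, not taken from the slides.

```python
def grad(w):
    # Gradient of the illustrative error
    # E(w) = 2*w0^2 + w0*w1 + w1^2 - 4*w0 - 3*w1
    return [4 * w[0] + w[1] - 4, w[0] + 2 * w[1] - 3]

# Hessian of E (constant, because E is quadratic)
H = [[4.0, 1.0], [1.0, 2.0]]

def solve_2x2(A, b):
    # Solve A x = b for a 2x2 matrix A using Cramer's rule
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def newton_step(w):
    # delta_w = -H^{-1} grad E(w), obtained by solving H * delta_w = -grad E(w)
    g = grad(w)
    delta = solve_2x2(H, [-g[0], -g[1]])
    return [w[0] + delta[0], w[1] + delta[1]]

w_new = newton_step([10.0, -7.0])
print(w_new, grad(w_new))
```

Because this E is exactly quadratic, the single step lands on the point where the gradient vanishes. For a real network error the quadratic model only holds locally, so the step has to be repeated.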

4 Hessian Matrix
A matrix of second-order partial derivatives of the error function.
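Written out for a weight vector $w = (w_1, \dots, w_W)$, the matrix has entries $H_{ij} = \partial^2 E / \partial w_i \partial w_j$. The symbol $W$ for the number of weights is an assumption made here for illustration; the slides do not name it.

```latex
H = \nabla^2 E(w) =
\begin{pmatrix}
\frac{\partial^2 E}{\partial w_1^2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_W} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_W \partial w_1} & \cdots & \frac{\partial^2 E}{\partial w_W^2}
\end{pmatrix}
```

For a twice continuously differentiable E the mixed partial derivatives commute, so H is symmetric.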

5 Issues & Alternatives
Main issues:
  Calculating the Hessian and its inverse is computationally expensive.
  Inverting the Hessian is not always possible.
Newton-like methods:
  Diagonal Approximation
  Levenberg-Marquardt
  Quasi-Newton Methods
  ...

6 Diagonal Approximation
Set all non-diagonal elements to 0:
$$\frac{\partial^2 E^n}{\partial a_j^2} = g'(a_j)^2 \sum_k w_{kj}^2 \frac{\partial^2 E^n}{\partial a_k^2} + g''(a_j) \sum_k w_{kj} \frac{\partial E^n}{\partial a_k}$$
$$\frac{\partial^2 E^n}{\partial w_{ji}^2} = \frac{\partial^2 E^n}{\partial a_j^2} \, g(a_i)^2$$
Easy and quick to invert.
In practice the Hessian is strongly non-diagonal.
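To illustrate why a diagonal Hessian is cheap to invert: the inverse is just the element-wise reciprocal of the diagonal, so the Newton step becomes one division per weight. The sketch below assumes the gradient and the diagonal entries are already available; the function name and the numbers are hypothetical.

```python
def diagonal_newton_step(w, grad, hess_diag):
    # With H = diag(hess_diag), H^{-1} = diag(1 / hess_diag), so
    # w_new[i] = w[i] - grad[i] / hess_diag[i]
    return [wi - gi / hi for wi, gi, hi in zip(w, grad, hess_diag)]

# Illustrative values only: two weights, their gradient, and the Hessian diagonal
w_next = diagonal_newton_step([10.0, -7.0], [29.0, -7.0], [4.0, 2.0])
print(w_next)
```

The cost is linear in the number of weights, but the step ignores every interaction between weights, which is the weakness the slide points out.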

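7 Levenberg-Marquardt (LM)
Designed specifically for minimizing the sum-of-squares error
$$E = \frac{1}{2} \sum_n (y_n - t_n)^2$$
The Hessian can be written as
$$\frac{\partial^2 E}{\partial w_{ji} \partial w_{lk}} = \sum_n \frac{\partial y_n}{\partial w_{ji}} \frac{\partial y_n}{\partial w_{lk}} + \sum_n (y_n - t_n) \frac{\partial^2 y_n}{\partial w_{ji} \partial w_{lk}}$$
Neglect the second term and get
$$H = J^T J$$
where J is the Jacobian of the outputs $y_n$ with respect to the weights.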

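8 Levenberg-Marquardt (LM)
Update rule based on the quadratic approximation:
$$w_{i+1} = w_i - H^{-1} \nabla E(w_i)$$
Inclusion of a blending factor $\lambda$:
$$w_{i+1} = w_i - (H + \lambda I)^{-1} \nabla E(w_i)$$
Behaviour:
  $\lambda \to \infty$: $w_{i+1} = w_i - \frac{1}{\lambda} \nabla E(w_i)$ (gradient descent with a small step)
  $\lambda \to 0$: $w_{i+1} = w_i - H^{-1} \nabla E(w_i)$ (the Newton-like step)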

9 Levenberg-Marquardt General Algorithm
1. Update weights.
2. Evaluate new error.
3. If error has increased, reset weights, increase λ by a large factor and go to 1.
4. If error has decreased, decrease λ by a large factor.
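The four steps above can be sketched in a few lines of Python. Everything specific here is an assumption chosen for illustration: the two-parameter model y = w0 * exp(w1 * x), the synthetic targets, the starting weights, the initial λ of 0.01, the factor of 10, and the fixed iteration count. The Hessian is approximated by J^T J as on the previous slide.

```python
import math

# Illustrative data: targets generated from t = 2 * exp(0.5 * x)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ts = [2.0 * math.exp(0.5 * x) for x in xs]

def residuals(w):
    # y_n - t_n for the hypothetical model y = w[0] * exp(w[1] * x)
    return [w[0] * math.exp(w[1] * x) - t for x, t in zip(xs, ts)]

def error(w):
    # E = 1/2 * sum_n (y_n - t_n)^2
    return 0.5 * sum(r * r for r in residuals(w))

def jacobian(w):
    # Row n holds (dy_n/dw0, dy_n/dw1)
    return [[math.exp(w[1] * x), w[0] * x * math.exp(w[1] * x)] for x in xs]

def lm_step(w, lam):
    # Solve (J^T J + lam * I) d = -J^T r for the 2-vector d (Cramer's rule)
    J, r = jacobian(w), residuals(w)
    a = sum(row[0] * row[0] for row in J) + lam
    b = sum(row[0] * row[1] for row in J)
    c = sum(row[1] * row[1] for row in J) + lam
    g0 = sum(row[0] * rn for row, rn in zip(J, r))
    g1 = sum(row[1] * rn for row, rn in zip(J, r))
    det = a * c - b * b
    d0 = (-g0 * c + b * g1) / det
    d1 = (-a * g1 + b * g0) / det
    return [w[0] + d0, w[1] + d1]

def levenberg_marquardt(w, lam=0.01, factor=10.0, iterations=50):
    err = error(w)
    for _ in range(iterations):
        trial = lm_step(w, lam)        # 1. update weights
        trial_err = error(trial)       # 2. evaluate new error
        if trial_err >= err:
            lam *= factor              # 3. no improvement: keep old weights, raise lambda
        else:
            w, err = trial, trial_err  # 4. improvement: accept step, lower lambda
            lam /= factor
    return w, err

w_init = [1.0, 0.1]
w_fit, e_fit = levenberg_marquardt(w_init)
print(w_fit, e_fit)
```

Since a trial step is only accepted when it lowers the error, the error never increases from one accepted step to the next. A practical implementation would usually add a convergence test instead of running a fixed number of iterations.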

10 Quasi-Newton Methods
Build up an approximation to the inverse of the Hessian over a number of iterations, using the first derivatives of the error function.
Start with the usual approximation, where H is initialised to the identity matrix:
$$w - w_i = -H^{-1} \nabla E(w_i)$$
There exists a direction p along which E decreases ($\nabla E^T p < 0$):
$$\nabla E(w_i)^T (w - w_i) = -(w - w_i)^T H (w - w_i) < 0$$

11 Quasi-Newton Methods
By substitution we can see how the weight vectors and gradients are related at steps i and i + 1:
$$w_{i+1} - w_i = H^{-1} \left( \nabla E(w_{i+1}) - \nabla E(w_i) \right)$$
Let $G_{i+1} \approx H^{-1}$:
$$w_{i+1} - w_i = G_{i+1} \left( \nabla E(w_{i+1}) - \nabla E(w_i) \right)$$
This is called the quasi-Newton condition.
Adjustment of G:
$$G_{i+1} = G_i + \text{corrections}$$

12 Quasi-Newton Methods
The correction term is where the various implementations differ.
Most popular are:
  Davidon-Fletcher-Powell (DFP)
  Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Example, BFGS:
$$G_{i+1} = G_i + \frac{p p^T}{p^T v} - \frac{(G_i v) v^T G_i}{v^T G_i v} + (v^T G_i v) \, u u^T$$
where
$$p = w_{i+1} - w_i, \qquad v = \nabla E_{i+1} - \nabla E_i, \qquad u = \frac{p}{p^T v} - \frac{G_i v}{v^T G_i v}$$
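The BFGS formula above translates directly into code. The sketch below is a plain-Python illustration that assumes G is symmetric (true when it starts as the identity), so that (G v) v^T G can be formed as the outer product of G v with itself. The helper names and the example vectors are hypothetical; v is generated from an illustrative quadratic with Hessian [[4, 1], [1, 2]].

```python
def matvec(A, x):
    # Matrix-vector product A x
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    # Inner product x^T y
    return sum(a * b for a, b in zip(x, y))

def outer(x, y):
    # Outer product x y^T
    return [[a * b for b in y] for a in x]

def bfgs_update(G, p, v):
    # p = w_{i+1} - w_i, v = grad E_{i+1} - grad E_i; G is assumed symmetric
    Gv = matvec(G, v)
    pv = dot(p, v)
    vGv = dot(v, Gv)
    u = [pi / pv - gi / vGv for pi, gi in zip(p, Gv)]
    ppT, GvGvT, uuT = outer(p, p), outer(Gv, Gv), outer(u, u)
    n = len(p)
    return [[G[r][c] + ppT[r][c] / pv - GvGvT[r][c] / vGv + vGv * uuT[r][c]
             for c in range(n)] for r in range(n)]

# Illustrative step: for a quadratic error, v = H p
G0 = [[1.0, 0.0], [0.0, 1.0]]   # initial approximation: identity
p = [1.0, -2.0]
v = matvec([[4.0, 1.0], [1.0, 2.0]], p)
G1 = bfgs_update(G0, p, v)
print(matvec(G1, v), p)
```

The updated matrix satisfies the quasi-Newton condition G1 v = p, because u is orthogonal to v and the two middle terms replace G0 v with p.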

13 Second Order Back-Propagation
Evaluates the exact value of the Hessian.
Extension of standard back-propagation.
Algorithm:
1. Find the activations of all hidden and output units (by standard forward propagation). Similarly, propagate forward through the network, calculating
$$h_{kj} = \sum_r g'(a_r) \, w_{kr} \, h_{rj}$$

14 Second Order Back-Propagation
2. Evaluate $\delta_k$ for the outputs.
3. Use standard back-propagation to find $\delta_j$ for all the hidden units. Similarly, use back-propagation to find
$$b_{lj} = g''(a_l) \, h_{lj} \sum_s w_{sl} \delta_s + g'(a_l) \sum_s w_{sl} b_{sj}$$
4. Evaluate the elements of the Hessian matrix using
$$\frac{\partial^2 E^n}{\partial w_{ji} \partial w_{lk}} = z_i \, \delta_l \, g'(a_k) \, h_{kj} + z_i \, z_k \, b_{lj}$$

15 Second Order Back-Propagation
5. Repeat the above steps for all inputs in the training set and sum to obtain the complete Hessian.

16 Other Possible Uses of the Hessian
Network pruning: uses the inverse Hessian to identify the least significant weights.
Comparing the relative probabilities of network models.
Calculating error bars on network outputs, etc.

17 References
Bishop, C.M. (1995), Neural Networks for Pattern Recognition
Rojas, R. (1996), Neural Networks: A Systematic Introduction

(Quasi-)Newton methods

1 Introduction
1.1 Newton method
Newton's method finds the zeros of a differentiable non-linear function g, i.e. x such that g(x) = 0, where $g : \mathbb{R}^n \to \mathbb{R}^n$. Given a starting