Artificial Neural Networks

Transcription

1 Artificial Neural Networks Oliver Schulte School of Computing Science Simon Fraser University

2 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on multi-layer perceptrons Mathematical properties rather than biological plausibility

3 Uses of Neural Networks Pros Good for continuous input variables. General continuous function approximators. Highly non-linear. Learn features. Good to use in continuous domains with little knowledge: When you don t know good features. You don t know the form of a good functional model. Cons Not interpretable, black box. Learning is slow. Good generalization can require many datapoints.

4 Applications There are many, many applications. World-Champion Backgammon Player. No Hands Across America Tour. www/nhaa/nhaa_home_page.html Digit Recognition with 99.26% accuracy. Speech Recognition features/speechrecognition aspx

5 Outline Feed-forward Networks Network Training Error Backpropagation Correctness Proof for Backpropagation Applications

7 Neurons Starting with input x = (x 1,..., x D ), construct linear combinations: D a j = w (1) ji x i + w (1) j0 i=1 These a j are known as activations Pass through an activation function h( ) to get output z j = h(a j ) Model of an individual neuron from Russell and Norvig, AIMA2e

10 Activation Functions Can use a variety of activation functions Sigmoidal (S-shaped) Logistic sigmoid 1/(1 + exp( a)) (useful for binary classification) Hyperbolic tangent tanh Radial basis function z j = i (x i w ji ) 2 Softmax Useful for multi-class classification Hard Threshold Rectified Linear Unit (deep learning)... Should be differentiable for gradient-based learning (later) Can use different activation functions in each unit

11 Activation Functions Visualized x x Left Threshold Middle Logistic sigmoid Logistic(x) = 1 1+exp( x) (useful for binary classification) Right Logistic regression Logistic(w x)

12 Feed-forward Networks Connect together a number of these units into a feed-forward network (DAG) Above shows a network with one layer of hidden units Implements function: M y k (x, w) = h w (2) kj h j=1 ( D i=1 w (1) ji x i + w (1) j0 See ) + w (2) k0

13 No Hands Across America Sharp Left Straight Ahead Sharp Right 30 Output Units 4 Hidden Units 30x32 Sensor Input Retina

14 target t output z z 1 A general network t 1 t 2 t k t c z 2 z k z c output w kj y 1 y 2 y j y nh hidden input x w ji x 1 x 2 x i x d x 1 x 2 x i x d input FIGURE 6.4. A d-n H -c fully connected three-layer network and the notation we shall

15 Function Composition Think logic circuits h W (x 1, x 2 ) x x 2 h W (x 1, x 2 ) x Two opposite-facing sigmoids = ridge. Two ridges = bump. 2 4 x 2

16 x 1 x 2 The XOR Problem x 2 1 z=-1 R 2 z= R 2 z=-1 R 1 x 1-1

17 The XOR Problem Solved 1 z x z k 0-1 x 1 1 output k y 1 y 2 x x 1 bias y 1 y 2 1 w kj hidden j w ji x 1 1 x 2 input i x 1 x 2

18 Hidden Units Compute Basis Functions red dots = network function 3 dashed lines = 3 hidden unit activation functions. blue dots = data points + target values Network function is roughly the sum of activation functions.

19 Hidden Units As Feature Extractors sample training patterns learned input-to-hidden weights FIGURE The top images represent patterns from a large training set used to train a 64 input nodes sigmoidal network for classifying three characters. The bottom figures show the input-to-hidden 2 hidden weights, unitsrepresented as patterns, at the two hidden units after training. Note that 2xthese learned learned weights vector indeedat describe hidden feature unitgroupings useful for the classification task. In large networks, such patterns of learned weights may be difficult to

20 Representational Power of Neural Nets For fixed input size n. Boolean Functions Every Boolean function can be represented by some network with 1 hidden layer. May need exponentially many hidden units (motivates deep networks). Continuous Functions Every bounded continuous function can be approximated by a network with 1 hidden layer. Hidden units can use sigmoid activations, output units linear. Arbitrary Functions Any function can be approximated with a network with 2 hidden layers (motivates deep networks). Hidden units can be sigmoids, output units linear.

22 Network Training Given a specified network structure, how do we set its parameters (weights)? As usual, we define a criterion to measure how well our network performs, optimize against it For regression, training data are (x n, t), t n R Squared error naturally arises: 1 N 2 E(w) = {y(x n, w) t n } 2 n=1

23 Network Training Given a specified network structure, how do we set its parameters (weights)? As usual, we define a criterion to measure how well our network performs, optimize against it For regression, training data are (x n, t), t n R Squared error naturally arises: 1 N 2 E(w) = {y(x n, w) t n } 2 n=1

24 Parameter Optimization E(w) w A w B wc w 1 w 2 E For either of these problems, the error function E(w) is nasty Nasty = non-convex Non-convex = has local minima

25 Descent Methods The typical strategy for optimization problems of this sort is a descent method: w (τ+1) = w (τ) + ηw (τ) These come in many flavours Gradient descent E(w (τ) ) Stochastic gradient descent E n (w (τ) ) Newton-Raphson (second order) 2 All of these can be used here, stochastic gradient descent is particularly effective Redundancy in training data, escaping local minima

28 Computing Gradients The function y(x n, w) implemented by a network is complicated It isn t obvious how to compute error function derivatives with respect to hidden weights. The credit assignment problem.

30 Error Backpropagation Backprop is an efficient method for computing error derivatives En w ji for all nodes in the network. Intuition: 1. Calculating derivatives for weights connected to output nodes is easy. 2. Treat the derivatives as virtual error, compute derivative of error for nodes in previous layer. 3. Repeat until you reach input nodes. Propagates backwards the output error signal through the network.

31 Functional Picture for NN a j # z j =#g(a j )# w kj# a k# Keep in mind this picture.

32 Error at the output nodes First, feed training example x n forward through the network, storing all activations a j Calculating derivatives for weights connected to output nodes is easy For output node with activation y k = g(a k ) and target t n, define δ k (t n y k )g (a k ). Gradient Descent Update: w kj w kj + ηδ k z j. 0 if no error or if input z j from node j is 0.

33 Error at the output nodes First, feed training example x n forward through the network, storing all activations a j Calculating derivatives for weights connected to output nodes is easy For output node with activation y k = g(a k ) and target t n, define δ k (t n y k )g (a k ). Gradient Descent Update: w kj w kj + ηδ k z j. 0 if no error or if input z j from node j is 0.

34 Error at the hidden nodes Consider a hidden node j connected to output nodes. Intuition: δ k is node activation derivative, times output error. The error signal δ j is node activation derivative, times the weighted sum of contributions to the output errors. In symbols, δ j = g (a j ) k w kj δ k. Gradient Descent Update: w ji w ji + ηδ j z i.

35 Backpropagation Picture output ω 1 ω 2 ω 3 ω k ω c δ 1 δ 2 δ 3 δ k δ c w kj hidden δ j w ij input FIGURE The error 6.5. The signal sensitivity at a at hidden a unit unit is is to the weighted sum of the sensitivities the output units: δ j = f (net j ) proportional to the error c signals at the units influences: k=1 w kjδ k.theoutputunitsensitivitiesare thus propagated back to the hidden units. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright c 2001 by John Wiley & Sons, Inc. δ j = g (a j ) w kj δ k k

36 The Backpropagation Algorithm 1. Apply input vector x n and forward propagate to find all activation levels a i and output levels z i. 2. Evaluate the error signals δ k for all output nodes. 3. Backpropagate the δ k to obtain error signals δ j for each hidden node. 4. Perform the gradient descent updates for each weight vector w ji. Demo AIspace

38 Multi-variate Chain Rule u" x" y" f " For f (x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v: f u = and f x x u + f y y u f v = f x x v + f y y v

39 Chain Rule Exercise a j # z j =#g(a j )# w kj# a k# Exercise: From this functional diagram find expressions for the following quantities: a k w kj. a k a j.

40 Backpropagation = Gradient Descent a j # z j =#g(a j )# w kj# a k# We need to show that En w kj = δ k z j. This follows easily given the following result Theorem For each node i, we have δ i = En a i. Proof given theorem: En w kj = En a k Next we prove the theorem. a k w kj = δ k z j.

41 Proof of Theorem, I We want to show that δ j = En a j. Think of the error as a function of the activation levels of the nodes after node j. Formally, we can write En a j = a j E n (a j1, a j2,..., a jm ) where {j i } are the indices of the nodes that receive input from j. Now using the multi-variate chain rule, we have E n a j = m k=1 E n a k a k a j We have seen that a k a j = w kj g (a j ). a j # z j =#g(a j )# w kj# a k#

42 Proof of Theorem, I We want to show that δ j = En a j. Think of the error as a function of the activation levels of the nodes after node j. Formally, we can write En a j = a j E n (a j1, a j2,..., a jm ) where {j i } are the indices of the nodes that receive input from j. Now using the multi-variate chain rule, we have E n a j = m k=1 E n a k a k a j We have seen that a k a j = w kj g (a j ). a j # z j =#g(a j )# w kj# a k#

43 Proof of Theorem, II We want to show that δ j = En a j. Proof by backward induction. Easy to see that the claim is true for output nodes. (Exercise). Inductive step: Consider node j and suppose that δ k = En a k for all nodes k that receive input from j. Using the multivariate chain rule, we have E n a j = = m k=1 m k=1 E n a k a k a j δ k a k a j = m δ k w kj g (a j ) = δ j. k=1 where step 1 applies the inductive hypothesis, step 2 the result from the previous slide, and step 3 the definition of δ j.

44 Other Learning Topics Regularization: L2-regularizer (weight decay). Experimenting with Network Architectures is often key. Learn Architecture Prune Weights: the Optimal Brain Damage Method. Grow Network: Tiling, Cascade-Correlation Algorithm.

46 Applications of Neural Networks Many success stories for neural networks Credit card fraud detection Hand-written digit recognition Face detection Autonomous driving (CMU ALVINN)

47 Hand-written Digit Recognition MNIST - standard dataset for hand-written digit recognition training, test images

48 LeNet-5 INPUT 32x32 C1: feature maps C3: f. maps S4: f. maps S2: f. maps C5: layer 120 F6: layer 84 OUTPUT 10 Convolutions Subsampling Convolutions Full connection Gaussian connections Subsampling Full connection LeNet developed by Yann LeCun et al. Convolutional neural network Local receptive fields (5x5 connectivity) Subsampling (2x2) Shared weights (reuse same 5x5 filter ) Breaking symmetry See Also deep learning tutorial.

49 4!>6 3!>5 8!>2 2!>1 5!>3 4!>8 2!>8 3!>5 6!>5 7!>3 9!>4 8!>0 7!>8 5!>3 8!>7 0!>6 3!>7 2!>7 8!>3 9!>4 8!>2 5!>3 4!>8 3!>9 6!>0 9!>8 4!>9 6!>1 9!>4 9!>1 9!>4 2!>0 6!>1 3!>5 3!>2 9!>5 6!>0 6!>0 6!>0 6!>8 4!>6 7!>3 9!>4 4!>6 2!>7 9!>7 4!>3 9!>4 9!>4 9!>4 8!>7 4!>2 8!>4 3!>5 8!>4 6!>5 8!>5 3!>8 3!>8 9!>8 1!>5 9!>8 6!>3 0!>2 6!>5 9!>5 0!>7 1!>6 4!>9 2!>1 2!>8 8!>5 4!>9 7!>2 7!>2 6!>5 9!>7 6!>1 5!>6 5!>0 4!>9 2!>8 The 82 errors made by LeNet5 (0.82% test error rate)

50 Conclusion Feed-forward networks can be used for predicting discrete or continuous target variables Very expressive, can approximate arbitrary continuous functions. Different activation functions possible. Learning is more difficult, error function not convex Use stochastic gradient descent, obtain (good?) local minimum Backpropagation for efficient gradient computation.