Artificial Neural Networks

Andra Holland
7 years ago
1 Artificial Neural Networks

2 Motivations
- Analogy to biological systems, which are the best examples of robust learning systems.
- Consider the human brain:
  - Neuron switching time is about 10^-3 s.
  - Scene recognition can be done in about 0.1 s.
  - That leaves time for only about a hundred serial steps to perform such tasks.
- Current AI systems typically require many more serial steps than this.
- We need to exploit massive parallelism.

3 A single neuron
[Figure: a single neuron with a bias input of 1 and a summation unit Σ]
- Receives n inputs (plus a bias term).
- Multiplies each input by its weight.
- Applies an activation function to the sum of the results.
- Outputs the result.
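The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the names `neuron_output` and `sigmoid` are hypothetical, and the sigmoid is assumed as the default activation.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, assumed here as the default activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, bias, inputs, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(total)

# The weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5.
print(neuron_output([0.5, -0.25], 0.0, [1.0, 2.0]))  # 0.5
```

Passing a different `activation` (for example the identity function) changes the unit type without touching the weighted-sum step.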

4 Activation Functions
Given the weighted sum as input, an activation function controls whether a neuron is active or inactive.
Commonly used activation functions:
- Threshold function: outputs 1 when the input is positive and 0 otherwise.
- Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$. It is differentiable, a good property for optimization.
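As a quick illustration, the two activation functions can be written as follows. The function names are chosen for this sketch only.

```python
import math

def threshold(z):
    """Outputs 1 when the input is positive and 0 otherwise."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Smooth, differentiable alternative to the threshold: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Compare the hard and soft decisions at a few weighted sums.
for z in (-2.0, 0.0, 2.0):
    print(z, threshold(z), sigmoid(z))
```

The sigmoid agrees with the threshold far from zero but changes gradually near zero, which is what makes gradient-based training possible.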

5 Multilayer Neural Network
[Figure: feed-forward network with an input layer (x_1, x_2, x_3, x_4), a hidden layer, and an output layer]
- Each layer receives its inputs from the previous layer and forwards its outputs to the next: a feed-forward structure.
- Output layer: sigmoid activation function for classification, linear activation function for regression.
- Referred to as a two-layer network (two layers of weights).
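A feed-forward pass through such a two-layer network might look like the sketch below. The weight layout (one row per unit, with the bias weight first) and the names `layer` and `forward` are assumptions made for this example, and the weight values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weight_rows, inputs, activation):
    """Apply one layer. Each row is [bias_weight, w_1, ..., w_n] for one unit."""
    extended = [1.0] + list(inputs)  # prepend the constant bias input
    return [activation(sum(w * x for w, x in zip(row, extended)))
            for row in weight_rows]

def forward(hidden_weights, output_weights, x, output_activation=sigmoid):
    """Input -> sigmoid hidden layer -> output layer."""
    hidden = layer(hidden_weights, x, sigmoid)
    return layer(output_weights, hidden, output_activation)

# 4 inputs, 3 hidden units, 1 output unit (arbitrary illustrative weights).
hidden_weights = [[0.0, 0.1, -0.2, 0.3, 0.0],
                  [0.1, 0.0, 0.2, -0.1, 0.4],
                  [-0.1, 0.3, 0.0, 0.2, -0.3]]
output_weights = [[0.0, 0.5, -0.5, 0.25]]
x = [1.0, 0.0, 1.0, 0.0]

y_class = forward(hidden_weights, output_weights, x)                  # sigmoid output
y_reg = forward(hidden_weights, output_weights, x, lambda z: z)       # linear output
print(y_class, y_reg)
```

Only the output activation differs between the classification and regression variants; the hidden layer is the same in both.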

6 Representational Power: Any Boolean Formula
[Figure: network over inputs x_1, x_2]

7 Representational Power: Any Boolean Formula
[Figure: network of OR units and AND units over inputs x_1, x_2, x_3]
- Any Boolean function can be written in conjunctive normal form, which can be represented by a neural net.
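One way to see this is to build the network by hand from threshold units: one hidden OR unit per clause and a single AND unit at the output. The sketch below is an illustration under assumptions, not the construction from the slides: the clause encoding (signed 1-based variable indices) and the function names are hypothetical, and no variable may appear twice in the same clause.

```python
from itertools import product

def threshold_unit(bias, weights, inputs):
    """Fires (1) when the weighted sum plus bias is positive."""
    return 1 if bias + sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def cnf_network(clauses, x):
    """Evaluate a CNF formula with a two-layer threshold network.

    clauses: list of clauses; each clause is a list of signed 1-based
    variable indices, e.g. -2 means NOT x_2.
    """
    hidden = []
    for clause in clauses:
        weights = [0.0] * len(x)
        negated = 0
        for literal in clause:
            index = abs(literal) - 1
            if literal > 0:
                weights[index] = 1.0    # positive literal contributes x
            else:
                weights[index] = -1.0   # negated literal contributes 1 - x
                negated += 1
        # OR unit: fires when at least one literal in the clause is true.
        hidden.append(threshold_unit(negated - 0.5, weights, x))
    # AND unit: fires only when every clause unit fires.
    return threshold_unit(0.5 - len(clauses), [1.0] * len(clauses), hidden)

# XOR in conjunctive normal form: (x1 OR x2) AND (NOT x1 OR NOT x2).
xor_clauses = [[1, 2], [-1, -2]]
for x in product([0, 1], repeat=2):
    print(x, cnf_network(xor_clauses, x))
```

XOR is a useful test case because no single threshold unit can represent it, yet the two-layer construction handles it directly.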

8 Representational Power (cont.): Continuous Functions
- Any continuous function can be approximated arbitrarily closely by a sum of (possibly infinitely many) basis functions.
- Suppose we implement the hidden units to represent the basis functions and make the output a linear combination of the hidden units: $\hat{y} = W \cdot a$, where $a$ is the vector of hidden-unit outputs.
- Any bounded continuous function can be approximated to arbitrary accuracy with enough hidden units.

9 Training a Neural Net
- Learning objective: minimize the squared error.
- Optimization method: gradient descent.

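10 Terminology
[Figure: network with inputs x_1, x_2, x_3, x_4, hidden units a_6, a_7, a_8 (weight vectors W_6, W_7, W_8), and output unit a_9 (weight vector W_9)]
- $X = [1, x_1, x_2, x_3, x_4]^T$: the input vector with the bias term.
- $A = [1, a_6, a_7, a_8]^T$: the output of the hidden layer with the bias term.
- $W_i$ represents the weight vector leading to node $i$.
- $w_{i,j}$ represents the weight connecting from the $j$-th node to the $i$-th node; $w_{9,6}$ is the weight connecting from $a_6$ to $a_9$.
- We will use $\sigma$ to represent the activation function, so $\hat{y} = a_9 = \sigma(W_9 \cdot A)$ and, for example, $a_6 = \sigma(W_6 \cdot X)$.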

11 Minimizing Squared Error
- We adjust the weights of the neural network to minimize the mean squared error (MSE) on the training set.
- Useful fact: the derivative of the sigmoid activation function is $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.
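The useful fact follows directly from the definition of the sigmoid. A short derivation, assuming the logistic form $\sigma(x) = 1/(1 + e^{-x})$:

```latex
\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
\frac{d\sigma}{dx}
  &= \frac{e^{-x}}{(1 + e^{-x})^2}
   = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
   = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\end{align*}
```

The last step uses $1 - \sigma(x) = e^{-x} / (1 + e^{-x})$. In practice this means the derivative of a unit can be computed from its output alone, with no extra exponentials.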

12 Gradient Descent: Output Unit
[Figure: the network from the Terminology slide, with weight w_{9,6} highlighted]
- $w_{9,6}$ is a component of $W_9$, connecting from $a_6$ to the output node.
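The gradient for this weight can be worked out with the chain rule. The derivation below is a sketch under assumptions that the extracted slide text does not confirm: a per-example loss $J = \frac{1}{2}(\hat{y} - y)^2$ (the factor $\frac{1}{2}$ and the sign convention $\hat{y} - y$ are choices made here), a sigmoid output unit written $\sigma$, and $A = [1, a_6, a_7, a_8]^T$ for the hidden-layer output with its bias term.

```latex
\begin{align*}
J &= \tfrac{1}{2}(\hat{y} - y)^2, \qquad \hat{y} = \sigma(W_9 \cdot A) \\
\frac{\partial J}{\partial w_{9,6}}
  &= (\hat{y} - y)\,\frac{\partial \hat{y}}{\partial w_{9,6}} \\
  &= (\hat{y} - y)\,\sigma'(W_9 \cdot A)\,
     \frac{\partial (W_9 \cdot A)}{\partial w_{9,6}} \\
  &= (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,a_6
\end{align*}
```

The three factors are the output error, the slope of the sigmoid at the output, and the activation $a_6$ that the weight multiplies.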

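13 The Delta Rule
[Figure: the network from the Terminology slide, with weight w_{9,6} highlighted]
- Define $\delta_9 = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})$; then, for the squared error $J = \frac{1}{2}(\hat{y} - y)^2$ on one training example, $\frac{\partial J}{\partial w_{9,6}} = \delta_9\, a_6$.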

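14 Gradient Descent: Hidden Units
[Figure: the network from the Terminology slide, with weight w_{6,2} highlighted]
- $w_{6,2}$ is a component of $W_6$, connecting from input $x_2$ to hidden unit $a_6$.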
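For a hidden-layer weight the chain rule has one more link, because $w_{6,2}$ influences the output only through $a_6$. The sketch below assumes the same per-example loss $J = \frac{1}{2}(\hat{y} - y)^2$, sigmoid units throughout, and a single output unit; these are assumptions of this example rather than details recovered from the slide.

```latex
\begin{align*}
\frac{\partial J}{\partial w_{6,2}}
  &= (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,
     \frac{\partial (W_9 \cdot A)}{\partial w_{6,2}} \\
  &= (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,w_{9,6}\,
     \frac{\partial a_6}{\partial w_{6,2}} \\
  &= (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,w_{9,6}\,a_6\,(1 - a_6)\,x_2
\end{align*}
```

The first three factors are exactly the output-unit term, so the hidden-unit gradient reuses work already done for the output layer.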

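15 Delta Rule for Hidden Units
[Figure: the network from the Terminology slide, with weight w_{6,2} highlighted]
- Define $\delta_6 = \delta_9\, w_{9,6}\, a_6 (1 - a_6)$, where $\delta_9 = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})$ is the delta of the output unit, and rewrite the gradient as $\frac{\partial J}{\partial w_{6,2}} = \delta_6\, x_2$.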

16 Networks with Multiple Output Units
[Figure: the network with two output units, a_9 and a_10]
- We get a separate contribution to the gradient from each output unit.
- Hence, for input-to-hidden weights, we must sum up the contributions: $\delta_6 = a_6 (1 - a_6)\,(w_{9,6}\,\delta_9 + w_{10,6}\,\delta_{10})$.

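17 The Backpropagation Algorithm
- Forward pass: given $X$, compute $a_u$ and $\hat{y}_v$ for hidden units $u$ and output units $v$.
- Compute errors: compute $(\hat{y}_v - y_v)$ for each output unit $v$.
- Compute output deltas: $\delta_v = (\hat{y}_v - y_v)\,\hat{y}_v\,(1 - \hat{y}_v)$.
- Compute hidden deltas: $\delta_u = a_u (1 - a_u) \sum_v w_{v,u}\,\delta_v$.
- Compute gradient: compute $\frac{\partial J}{\partial w_{v,u}} = \delta_v\, a_u$ for hidden-to-output weights and $\frac{\partial J}{\partial w_{u,j}} = \delta_u\, x_j$ for input-to-hidden weights, where $J$ is the squared error. The error is backpropagated from layer to layer.
- Take a gradient descent step: $W \leftarrow W - \eta\, \nabla_W J$, where $\eta$ is the learning rate.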

18 Backpropagation Training
Create a three-layer network (input, hidden, output) with N hidden units, fully connect the network, and assign small random weights.
Until all training examples produce the correct output or the MSE ceases to decrease:
  Begin epoch
    For each training example:
      Compute the network output.
      Compute the error.
      Backpropagate this error from layer to layer and adjust the weights to decrease this error.
  End epoch
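The training loop can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the course's reference implementation: it uses sigmoid units everywhere, the per-example loss $\frac{1}{2}\sum_v (\hat{y}_v - y_v)^2$, a fixed number of epochs as the stopping rule, and a toy data set (logical OR). The function names, the learning rate `eta`, and the initial weight range of ±0.1 are all choices made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W_hidden, W_out, x):
    """Return the bias-extended input X, hidden output A, and network output."""
    X = [1.0] + list(x)
    A = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, X))) for row in W_hidden]
    y_hat = [sigmoid(sum(w * a for w, a in zip(row, A))) for row in W_out]
    return X, A, y_hat

def backprop_step(W_hidden, W_out, x, y, eta):
    """One stochastic gradient descent step on a single training example."""
    X, A, y_hat = forward(W_hidden, W_out, x)
    # Output deltas: error times the sigmoid slope at each output unit.
    delta_out = [(yh - yv) * yh * (1.0 - yh) for yh, yv in zip(y_hat, y)]
    # Hidden deltas: sum the contributions from every output unit.
    # A[u + 1] is the output of hidden unit u, because A[0] is the bias term.
    delta_hid = [A[u + 1] * (1.0 - A[u + 1]) *
                 sum(W_out[v][u + 1] * delta_out[v] for v in range(len(W_out)))
                 for u in range(len(W_hidden))]
    # Gradient step on hidden-to-output weights, then input-to-hidden weights.
    for v, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] -= eta * delta_out[v] * A[j]
    for u, row in enumerate(W_hidden):
        for j in range(len(row)):
            row[j] -= eta * delta_hid[u] * X[j]

def mse(W_hidden, W_out, data):
    """Mean squared error over the data set."""
    total = 0.0
    for x, y in data:
        _, _, y_hat = forward(W_hidden, W_out, x)
        total += sum((yh - yv) ** 2 for yh, yv in zip(y_hat, y))
    return total / len(data)

def train(data, n_hidden, epochs, eta, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    # Small random weights break symmetry between hidden units.
    W_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    W_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    initial_error = mse(W_hidden, W_out, data)
    for _ in range(epochs):              # one pass over the data = one epoch
        for x, y in data:
            backprop_step(W_hidden, W_out, x, y, eta)
    return W_hidden, W_out, initial_error, mse(W_hidden, W_out, data)

# Toy problem: logical OR of two inputs.
or_data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
W_hidden, W_out, initial_error, final_error = train(or_data, n_hidden=2,
                                                    epochs=2000, eta=0.5)
print(initial_error, final_error)
```

The hidden deltas are computed before any weight is changed, so both layers are updated with gradients evaluated at the same point. A validation-based stopping rule could replace the fixed epoch count.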

19 Remarks on Training
- No guarantee of convergence: training may oscillate or reach a local minimum.
- However, in practice many large networks can be adequately trained on large amounts of data for realistic problems, e.g.:
  - Driving a car
  - Recognizing handwritten zip codes
  - Playing world-championship-level backgammon
- Many epochs (thousands) may be needed for adequate training; large data sets may require hours or days of CPU time.
- Termination criteria can be:
  - A fixed number of epochs
  - A threshold on training set error
  - Increased error on a validation set
- To avoid local minima problems, run several trials starting from different initial random weights and either:
  - Take the result with the best training or validation performance, or
  - Build a committee of networks that vote during testing, possibly weighting each vote by training or validation accuracy.

20 Notes on Proper Initialization
- Start in the linear regions: keep all weights near zero, so that all sigmoid units are in their linear regions. This makes the whole net equivalent to one linear threshold unit, a relatively simple function. It also avoids very small gradients.
- Break symmetry: if we start with all the weights equal, what would happen? Ensure that each hidden unit has different input weights so that the hidden units move in different directions.
- Set each weight to a random number in a range determined by its fan-in, where the fan-in of weight $w_{v,u}$ is the number of inputs to unit $v$.
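A fan-in-scaled initialization could be sketched as follows. The exact range did not survive in the slide text, so the interval used here, $[-\text{scale}/\text{fan-in}, +\text{scale}/\text{fan-in}]$, is an assumption made purely for illustration, as are the function name and its parameters.

```python
import random

def init_weights(n_units, fan_in, scale=1.0, seed=None):
    """Draw each weight uniformly from [-scale/fan_in, +scale/fan_in].

    ASSUMPTION: this range is illustrative; the source only says the range
    depends on the fan-in (the number of inputs to the receiving unit).
    """
    rng = random.Random(seed)
    limit = scale / fan_in
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(n_units)]

# Three hidden units, each with 5 inputs (4 features plus the bias term).
weights = init_weights(n_units=3, fan_in=5, seed=0)
for row in weights:
    print(row)
```

Every unit receives a different small weight vector, which both keeps the sigmoids near their linear region and breaks the symmetry between hidden units.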

22 Overtraining Prevention
- Running too many epochs may overtrain the network and result in overfitting.
- Keep a validation set and test accuracy after every epoch. Maintain the weights of the best-performing network on the validation set and return them once validation performance has decreased significantly beyond that point.
- To avoid losing training data to validation:
  - Use 10-fold cross-validation to determine the average number of epochs that optimizes validation performance.
  - Train on the full data set for this many epochs to produce the final result.
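The keep-the-best-weights procedure can be sketched generically. In this example "decreases significantly" is approximated by a `patience` count of consecutive epochs without improvement, which is an assumption of the sketch; the callables `train_one_epoch` and `validation_error` are hypothetical placeholders for whatever training and evaluation code is in use.

```python
import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              max_epochs, patience):
    """Return the weights with the lowest validation error seen so far.

    train_one_epoch(weights) updates the weights in place;
    validation_error(weights) returns a number to minimize.
    """
    best_error = validation_error(weights)
    best_weights = copy.deepcopy(weights)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(weights)
        error = validation_error(weights)
        if error < best_error:
            best_error = error
            best_weights = copy.deepcopy(weights)  # snapshot the best network
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                              # validation stopped improving
    return best_weights, best_error

# Toy stand-in for a network: one weight that training pushes up by 1 per epoch,
# with validation error minimized at w = 3.
def toy_epoch(w):
    w[0] += 1.0

def toy_validation_error(w):
    return (w[0] - 3.0) ** 2

best_weights, best_error = train_with_early_stopping(
    [0.0], toy_epoch, toy_validation_error, max_epochs=100, patience=2)
print(best_weights, best_error)  # [3.0] 0.0
```

Training continues for two more epochs past the optimum before stopping, but the returned weights are the snapshot taken at the best validation error.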

23 Over-fitting Prevention
- Too few hidden units prevent the system from adequately fitting the data and learning the concept.
- Too many hidden units lead to over-fitting.
- A similar cross-validation method can be used to choose an appropriate number of hidden units.
- Another approach to preventing over-fitting is weight decay, in which we multiply all weights by some fraction between 0 and 1 after each epoch.
  - Encourages smaller weights and less complex hypotheses.
  - Equivalent to including an L2 regularization term in the error function.
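The link between weight decay and L2 regularization can be made explicit. The sketch below assumes a penalty coefficient $\lambda$ and learning rate $\eta$ (symbols introduced here, not taken from the slides) and applies the decay at every gradient step.

```latex
\begin{align*}
J'(W) &= J(W) + \frac{\lambda}{2} \sum_{j} w_j^2 \\
w_j &\leftarrow w_j - \eta \left( \frac{\partial J}{\partial w_j} + \lambda\, w_j \right)
     = (1 - \eta\lambda)\, w_j - \eta\, \frac{\partial J}{\partial w_j}
\end{align*}
```

So each step first shrinks the weight by the factor $(1 - \eta\lambda)$, which lies between 0 and 1 when $0 < \eta\lambda < 1$, and then takes the usual gradient step. The correspondence is exact when the shrinkage is applied at every update; shrinking once per epoch has the same qualitative effect.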

24 Input/Output Coding
- Appropriate coding of inputs and outputs can make learning easier and improve generalization.
- Continuous inputs can be handled by a single input unit, after scaling them to between 0 and 1.
- It is best to encode discrete multi-category features using multiple input units, with one binary unit per value.
- For classification problems, have one output unit per class. Continuous output values then represent certainty in the various classes. Assign test instances to the class with the highest output.
- Use target values of 0.9 and 0.1 for binary problems rather than forcing weights to grow large to closely approximate 0/1 outputs.
- Continuous outputs (regression) can also be handled by scaling to the range between 0 and 1.
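These coding rules are simple to express in code. The helpers below are a sketch with hypothetical names; min-max scaling is assumed as the way to map continuous values into [0, 1].

```python
def scale_to_unit_interval(values):
    """Min-max scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """One binary input unit per possible value of a discrete feature."""
    return [1 if value == c else 0 for c in categories]

def predict_class(outputs, classes):
    """Assign the instance to the class whose output unit is highest."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return classes[best]

print(scale_to_unit_interval([10, 20, 30]))            # [0.0, 0.5, 1.0]
print(one_hot("green", ["red", "green", "blue"]))      # [0, 1, 0]
print(predict_class([0.1, 0.7, 0.2], ["a", "b", "c"])) # b
```

In a real pipeline the minimum and maximum would be taken from the training set and reused unchanged on test data.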

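25 Softmax for multi-class classification
[Figure: network with inputs x_1, x_2, x_3, x_4, hidden units a_6, a_7, a_8, and two output nodes (weight vectors W_9, W_10) feeding a softmax that produces ŷ_1 and ŷ_2]
- For K classes, we have K nodes in the output layer, one for each class.
- Let $z_k$ be the output of the class-$k$ node, i.e., $z_k = W_k \cdot A$, where $A$ is the output of the hidden layer and $W_k$ is the weight vector leading into the class-$k$ node.
- We define: $\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$.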
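A minimal softmax sketch in Python follows. Subtracting the largest score before exponentiating is a common numerical-stability measure added for this example; it does not change the result, because the common factor cancels between numerator and denominator.

```python
import math

def softmax(scores):
    """Map K real-valued scores to K probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

The largest score always receives the largest probability, so assigning an instance to the most probable class agrees with picking the highest-scoring output node.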

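26 Likelihood objective
- Given an instance $x$, we compute the outputs $\hat{y}_{W,k}(x)$ for $k = 1, \dots, K$. These are the probabilities computed using the softmax, and they are a function of the weights of the neural net (thus the subscript $W$).
- Log-likelihood function: $L(W) = \sum_{i} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{W,k}(x_i)$, where $y_{i,k} = 1$ if instance $i$ belongs to class $k$ and 0 otherwise.
- A similar gradient descent algorithm can be derived.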
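The starting point of that derivation is the gradient of the log-likelihood with respect to the output-node scores. The sketch below considers a single instance, writes $z_k$ for the score of the class-$k$ node and $y_k$ for the one-hot target (notation assumed for this example), and uses $\sum_k y_k = 1$.

```latex
\begin{align*}
\ell &= \sum_{k=1}^{K} y_k \log \hat{y}_k, \qquad
\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \\
\log \hat{y}_k &= z_k - \log \sum_{j=1}^{K} e^{z_j} \\
\frac{\partial \ell}{\partial z_m}
  &= \sum_{k=1}^{K} y_k \bigl( \mathbf{1}[k = m] - \hat{y}_m \bigr)
   = y_m - \hat{y}_m
\end{align*}
```

Maximizing $\ell$ is the same as minimizing $-\ell$, whose gradient with respect to $z_m$ is $\hat{y}_m - y_m$. That quantity can serve as the output delta, and the hidden deltas then follow by the same backpropagation rule as in the squared-error case.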

27 Recent Development
- A recent trend in ML is deep learning, which learns feature hierarchies from large amounts of unlabeled data.
- The feature hierarchies are expected to capture the inherent structure in the data.
- This can often lead to better classification when the learned features are used to train with labeled data.
- Neural networks provide one approach to deep learning.

28 Deep learning with Auto-encoders
- The network is trained to output its input.
- Constraining the number of units in layer 2 forces the network to learn a compressed representation.
- Alternatively, layer 2 can be constrained to be sparse.
- Multiple layers of such auto-encoder networks are used for deep learning.
- If you are interested in learning more, see Andrew Ng's ECCV 2010 tutorial: ufldl.stanford.edu/eccv10-tutorial/eccv10_tutorial_part4.ppt
