Feedforward Neural Networks and Backpropagation

- Feedforward neural networks
- Architectural issues, computational capabilities
- Sigmoidal and radial basis functions
- Gradient-based learning and backpropagation
- On-line vs batch learning
- Tricks of the trade
- Bayesian interpretation of learning
- Local minima and complexity issues
- Generalization issues
- Competitive learning and LVQ
Feedforward Neural Networks

- Sigmoidal units
- Radial units
- Directed Acyclic Graph (DAG) architecture
- Partial ordering on the nodes
- Feedforward architecture
- Multilayer architecture
Forward Propagation

- Let v_1, ..., v_n be any topological sorting of the nodes and let pa(v) be the parents of node v; the output of unit v is x_v = f(sum_{u in pa(v)} w_{vu} x_u + b_v)
- Universal approximation: given a target function and a tolerance epsilon > 0, find a network whose output is within epsilon of the target
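The forward pass over a topological sorting can be sketched as follows (a minimal Python sketch; the dictionary-based network layout and the sigmoidal activation are illustrative assumptions, not taken from the slides):

```python
import math

def forward(nodes, parents, weights, bias, x):
    """Forward propagation over a DAG given in topological order.

    nodes   : list of node ids in topological order (input nodes first)
    parents : dict node -> list of parent node ids
    weights : dict (node, parent) -> weight w_{v,u}
    bias    : dict node -> bias b_v
    x       : dict input-node -> value
    """
    a = dict(x)  # activations; input values are given
    for v in nodes:
        if v in x:  # input node: nothing to compute
            continue
        net = bias[v] + sum(weights[(v, u)] * a[u] for u in parents[v])
        a[v] = 1.0 / (1.0 + math.exp(-net))  # sigmoidal unit
    return a
```

Because the nodes are visited in topological order, every parent activation is already available when a unit is computed; this works for any feedforward DAG, not only layered networks.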
Boolean Functions by MLP

- Every Boolean function can be expressed in the first canonical form (a sum of minterms)
- Every minterm is a linearly-separable function ("on" at a single vertex of the hypercube)
- OR is linearly-separable
- Similar conclusions hold using the second canonical form
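The first canonical form maps directly onto a two-layer threshold network: one hidden unit per minterm and an OR unit on top. A minimal sketch (the weight scheme, one positive or negative weight per literal, is a standard construction added here for illustration):

```python
def step(z):
    # Hard-threshold (Heaviside) activation.
    return 1 if z >= 0 else 0

def minterm_unit(x, pattern):
    # Linear threshold unit that fires on exactly one hypercube vertex:
    # weight +1 for a true literal, -1 for a negated one.
    s = sum(xi if p else -xi for xi, p in zip(x, pattern))
    return step(s - (sum(pattern) - 0.5))

def mlp_boolean(x, minterms):
    # First canonical form: OR (output unit) of minterms (hidden layer).
    hidden = [minterm_unit(x, p) for p in minterms]
    return step(sum(hidden) - 0.5)

# Example: XOR is the sum of the minterms (1,0) and (0,1).
xor = lambda x: mlp_boolean(x, [(1, 0), (0, 1)])
```

Each hidden unit separates its vertex from the rest of the hypercube with a single hyperplane, and the OR unit is itself linearly separable, matching the two bullet points above.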
Set Functions

- A set (characteristic) function of a set S is defined by chi_S(x) = 1 if x belongs to S, 0 otherwise, for all x in the input space
- Convex sets by MLP
- Implementation by radial basis functions
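As an illustration of the radial-basis implementation, a single Gaussian unit thresholded at the value it takes on the boundary realizes the characteristic function of a ball (the width and threshold choices below are illustrative assumptions):

```python
import math

def rbf_membership(x, center, radius):
    """Characteristic function of a ball via one Gaussian radial unit.

    The unit's output is thresholded at exp(-1), its value on the
    boundary ||x - center|| = radius.
    """
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    phi = math.exp(-d2 / (radius ** 2))   # Gaussian radial unit
    return 1 if phi >= math.exp(-1.0) else 0
```

Since the Gaussian is monotone in the distance from the center, the thresholded unit is exactly 1 inside the ball and 0 outside it.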
Set Functions for Complex Domains

- Non-connected domains
- Non-convex domains
- Set functions (Lippmann, IEEE ASSP Magazine, 1987):
  - Every hidden unit is associated with a hyperplane
  - Every convex set is associated with units in the first hidden layer
  - Every non-connected or non-convex set can be represented by a proper combination (at the second hidden layer) of units representing convex sets in the first hidden layer
- Basic statement: two hidden layers suffice to approximate any set function

(Artificial Intelligence, Marco Gori, University of Siena, 2004/2005)
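Lippmann's construction can be sketched directly: hyperplane units in the first hidden layer, AND units (one per convex set) in the second hidden layer, and an OR output unit. The example regions below are hypothetical:

```python
def step(z):
    return 1 if z >= 0 else 0

def halfplane(x, w, b):
    # First hidden layer: one threshold unit per hyperplane.
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def convex_region(x, halfplanes):
    # Second hidden layer: AND of hyperplane units -> a convex set.
    h = [halfplane(x, w, b) for w, b in halfplanes]
    return step(sum(h) - (len(h) - 0.5))

def set_function(x, regions):
    # Output unit: OR over convex sets -> non-convex / non-connected sets.
    c = [convex_region(x, hp) for hp in regions]
    return step(sum(c) - 0.5)

# A non-connected set: the union of two disjoint unit squares
# (illustrative coordinates).
square1 = [((1, 0), 0.0), ((-1, 0), 1.0), ((0, 1), 0.0), ((0, -1), 1.0)]   # [0,1]x[0,1]
square2 = [((1, 0), -3.0), ((-1, 0), 4.0), ((0, 1), 0.0), ((0, -1), 1.0)]  # [3,4]x[0,1]
```

The AND and OR units are the same linear threshold units as before, so the whole construction is a two-hidden-layer MLP, matching the basic statement above.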
Supervised Learning

- Consider the triple (x_p, t_p, y_p): the error E(w) is due to the mismatch between the target t_p and the network output y_p = y(x_p, w)
- Gradient descent: the optimization may involve a huge number of parameters, even one million (Bourlard, 1997)
- The gradient heuristic is the only one that is meaningful in such huge spaces
- The trajectory ends up in local minima of the error function
- How is the gradient calculated?
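The gradient heuristic itself is independent of the network: repeat w <- w - eta * grad E(w). A minimal sketch on a toy quadratic error (the error function and step size are illustrative assumptions):

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Plain gradient descent: w <- w - eta * grad(w)."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Toy error E(w) = (w0 - 3)^2 + (w1 + 1)^2, whose gradient is 2(w - w*).
grad = lambda w: [2 * (w[0] - 3.0), 2 * (w[1] + 1.0)]
```

On this convex toy error the trajectory reaches the unique minimum; on a neural network's error surface the same iteration typically ends up in a local minimum, as noted above.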
Backpropagation

- Bryson & Ho (1969), Werbos (1974), Le Cun (1985), Rumelhart, Hinton & Williams (1986)
- Error accumulation over the training examples
- DAG hypothesis
- The deltas are computed recursively: delta_v = f'(a_v)(y_v - t_v) if v is an output unit, else delta_v = f'(a_v) sum_{u in ch(v)} w_{uv} delta_u
Backpropagation (cont'd)

- The forward pass follows any topological sorting of the nodes; the backward pass follows the inverse topological sorting

Batch Learning

- We use the true gradient descent heuristic
- The gradient is accumulated over all the examples before changing the weights
- The learning rate plays a critical role... the momentum term
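A minimal batch backpropagation sketch for a 2-2-1 sigmoidal network, with a momentum term in the update (the architecture, learning rate, and momentum values are illustrative assumptions):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w, x):
    # 2-2-1 sigmoidal MLP; w is a flat list of 9 weights, biases included:
    # hidden unit j uses w[3j], w[3j+1], w[3j+2]; the output uses w[6..8].
    h = [sigmoid(w[3*j] + w[3*j+1]*x[0] + w[3*j+2]*x[1]) for j in (0, 1)]
    y = sigmoid(w[6] + w[7]*h[0] + w[8]*h[1])
    return h, y

def gradient(w, data):
    # Batch backpropagation: deltas flow in the inverse topological order,
    # and the gradient of E = 1/2 sum_p (y_p - t_p)^2 is accumulated over
    # all examples before any weight change.
    g = [0.0] * 9
    for x, t in data:
        h, y = forward(w, x)
        do = (y - t) * y * (1 - y)                  # output-unit delta
        g[6] += do; g[7] += do * h[0]; g[8] += do * h[1]
        for j in (0, 1):
            dh = do * w[7 + j] * h[j] * (1 - h[j])  # hidden-unit delta
            g[3*j] += dh; g[3*j+1] += dh * x[0]; g[3*j+2] += dh * x[1]
    return g

def batch_step(w, v, data, eta=0.5, mu=0.9):
    # Gradient descent with a momentum term: v <- mu*v - eta*g, w <- w + v.
    g = gradient(w, data)
    v = [mu*vi - eta*gi for vi, gi in zip(v, g)]
    return [wi + vi for wi, vi in zip(w, v)], v
```

The momentum term smooths successive weight updates and lets a small learning rate still make steady progress along consistent gradient directions.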
On-line Learning

- The weights are updated after the presentation of each example
- The scheme resembles Rosenblatt's PC algorithm
- The learning trajectory does not follow the gradient descent
- On-line learning approximates batch learning for small learning rates and training sets
- On-line learning can be more efficient than batch learning
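A minimal on-line update for a single linear unit, following the per-example gradient of E_p = 1/2 (y - t)^2 rather than the full batch gradient; the shuffle makes the presentation order explicit (the LMS-style unit and learning rate are illustrative assumptions):

```python
import random

def online_epoch(w, data, eta=0.05):
    """One on-line epoch: the weights of a linear unit y = w0 + w . x
    are updated after each example (x, t)."""
    data = list(data)
    random.shuffle(data)  # presentation order matters on-line
    for x, t in data:
        y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        err = y - t
        # Per-example gradient step (not the accumulated batch gradient).
        w = [w[0] - eta * err] + [wi - eta * err * xi
                                  for wi, xi in zip(w[1:], x)]
    return w
```

Because each step follows only one example's gradient, the trajectory zig-zags around the batch gradient path, yet for a small learning rate it tracks it closely, as noted above.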