Introduction to Neural Networks

Learning framework for NNs
- What are neural networks? Nonlinear function approximators.
- How do they relate to pattern recognition/classification? Nonlinear discriminant functions: more complex decision boundaries than linear discriminant functions (e.g. Fisher, Gaussians with equal covariances).

[Figure: inputs $(x_1, \dots, x_n)$ feed both the unknown mapping $G$, which produces the desired outputs $(y_1, \dots, y_m)$, and the trainable model $\Gamma$, which produces the model outputs $(\hat{y}_1, \dots, \hat{y}_m)$.]

Learning goal
- Definitions: $y = G(x)$ is the process (e.g. the discriminant function) we want to learn, with inputs $x = (x_1, \dots, x_n)$ and process outputs $y = (y_1, \dots, y_m)$.
- Trainable model: $\hat{y} = \Gamma(x, w)$, with adjustable parameters $w$ and model outputs $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)$.
- Find $w^*$ such that $E(w^*) \le E(w)$ for all $w$, where $E(w)$ is the error between $G$ and $\Gamma$.
- What should $E(w)$ be?
Error function (ideal)
- Ideally, over the whole input space: $E(w) = \int \lVert y - \hat{y} \rVert^2 \, p(x) \, dx$
- How to compute this?

Error function (practical)
- Input/output data: $p$ input-output training patterns, $x_i = (x_{i1}, \dots, x_{in})$, $y_i = (y_{i1}, \dots, y_{im})$, $\hat{y}_i = \Gamma(x_i, w)$, $i \in \{1, \dots, p\}$.
- $E(w) = \frac{1}{2} \sum_{i=1}^{p} \lVert y_i - \hat{y}_i \rVert^2 = \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{m} (y_{ij} - \hat{y}_{ij})^2$

Artificial neural networks (ANNs)
- Neural networks are one type of parametric model $\Gamma$: nonlinear function approximators.

Biological inspiration
- Structure and function loosely based on biological neural networks (e.g. the brain).
- Relatively simple building blocks connected together in a massive and parallel network.
- Adjustable (trainable) parameters $w$ (weights).
- Map inputs to outputs.

[Figure: a biological neuron, with dendrites and axon labeled.]

Why "neural network"? What does a neuron do?
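The practical error function is simple to evaluate once the model outputs are in hand. Below is a minimal NumPy sketch of $E(w)$; the arrays are made-up placeholder data, not values from the slides:

```python
import numpy as np

def error(Y, Yhat):
    """E(w) = 1/2 * sum_i sum_j (y_ij - yhat_ij)^2 over all p patterns."""
    return 0.5 * np.sum((Y - Yhat) ** 2)

# Placeholder data: p = 4 training patterns, m = 2 outputs per pattern.
Y = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])     # desired
Yhat = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.7], [0.1, 0.1]])  # model
print(error(Y, Yhat))  # 0.11
```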
Neuron transfer function
- Rough approximation: a threshold function (axon output as a function of the net stimulus from the dendrites).

Neural networks: crude emulation of biology
- Simple basic building blocks.
- Individual units are connected massively and in parallel.
- Individual units have threshold-type activation functions.
- Learning through adjustment of the strength of connection (weights) between individual units.
- Caveat: artificial neural networks are much, much, much simpler than biological systems. Example: the human brain has on the order of $10^{11}$ neurons and $10^{14}$ connections.

Basic building blocks of neural networks
- Basic building block: the unit.
- Scalar inputs $\phi_1, \dots, \phi_q$; weights $w = (\omega_1, \dots, \omega_q)$; nonlinear activation function $\sigma$.
- Output: $\psi(w, \phi) = \sigma\!\left( \sum_{i=1}^{q} \omega_i \phi_i \right)$
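A minimal sketch of such a unit in NumPy, assuming the sigmoid activation introduced formally on a later slide; the weight and input values are arbitrary:

```python
import numpy as np

def sigma(u):
    """Sigmoid activation: one common choice of nonlinearity."""
    return 1.0 / (1.0 + np.exp(-u))

def unit(w, phi):
    """psi(w, phi) = sigma(sum_i w_i * phi_i): weighted sum, then squash."""
    return sigma(np.dot(w, phi))

print(unit(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.3, 0.7])))  # ~0.832
```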
Perceptrons: the simplest neural network
- Threshold activation function:
$$t(u) = \begin{cases} 1 & u \ge \theta \\ 0 & u < \theta \end{cases}$$
- Inputs $x_1, \dots, x_n$, weights $\omega_1, \dots, \omega_n$, output $\hat{y} = t\!\left( \sum_{i=1}^{n} \omega_i x_i \right)$.
- What is this?

Limited mapping capability
- Perceptron mapping: $\hat{y} = 1$ when $w \cdot x \ge \theta$ and $\hat{y} = 0$ when $w \cdot x < \theta$, so the decision boundary is the hyperplane $w \cdot x = \theta$.
- [Plots: the OR function is linearly separable (e.g. $\omega_1 = \omega_2 = 1$, $\theta = 0.5$), but the XOR function is not, so no perceptron realizes it.]
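A short sketch of this limitation; the weights and threshold for OR are one workable choice (an assumption, not the only option):

```python
import numpy as np

def perceptron(w, theta, x):
    """Threshold unit: output 1 if w . x >= theta, else 0."""
    return 1 if np.dot(w, x) >= theta else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# OR is linearly separable: w = (1, 1), theta = 0.5 realizes it exactly.
print([perceptron(np.array([1.0, 1.0]), 0.5, np.array(x)) for x in inputs])
# -> [0, 1, 1, 1]

# XOR's targets [0, 1, 1, 0] cannot be produced by any single (w, theta):
# the four points cannot be split by one line, so no weights exist to try.
```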
More general networks: activation function
- Sigmoid: $\sigma(u) = \dfrac{1}{1 + e^{-u}}$
- Hyperbolic tangent: $\sigma(u) = \dfrac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$
- [Plots: the sigmoid maps $u$ into $(0, 1)$; the hyperbolic tangent maps $u$ into $(-1, 1)$.]

More general networks: multilayer perceptrons (MLPs)
- Signal flow is feedforward: input layer ($n$ inputs) → hidden unit layer → output layer ($m$ outputs).
- More hidden layers are possible: input layer → hidden unit layer #1 → hidden unit layer #2 → output layer.

MLP application example: ALVINN
- ALVINN: a neural network for autonomous steering.
- 30x32 sensor input retina → 4 hidden units → 30 output units (Sharp Left ... Straight Ahead ... Sharp Right).
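A minimal sketch of one feedforward pass through a one-hidden-layer MLP, assuming tanh hidden units and a linear output layer; the layer sizes and random weights are arbitrary:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Feedforward signal flow: input -> hidden unit layer -> output layer."""
    h = np.tanh(W1 @ x + b1)   # hidden unit outputs
    return W2 @ h + b2         # linear output layer

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2])                       # n = 2 inputs
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # m = 1 output
print(mlp_forward(x, W1, b1, W2, b2))
```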
A simple example
- Derivation of the function $f(x) = c\,[\,t(x - a) - t(x - b)\,]$: a rectangular pulse of height $c$ between $a$ and $b$, built from two threshold units.
- The threshold function is a limit of sigmoids, $t(u) = \lim_{k \to \infty} \sigma(ku)$, so $f(x) \approx c\,\sigma(k(x - a)) - c\,\sigma(k(x - b))$ for large $k$.
- Network realization (one hidden layer of two sigmoid units, weights $\omega_1, \dots, \omega_7$):
$$\hat{y} = \omega_5 + \omega_6\,\sigma(\omega_1 x + \omega_2) + \omega_7\,\sigma(\omega_3 x + \omega_4)$$

Weight values for the simple example

            ω1    ω2     ω3    ω4     ω5    ω6    ω7
  set #1:   k     -ka    k     -kb    0     c     -c
  set #2:   k     -kb    k     -ka    0     -c    c

Both sets realize the same pulse; they differ only in which hidden unit carries the step at $a$ and which the step at $b$.

Some theoretical properties of NNs
- Single-input functions: what does the previous example say about single-input functions? By summing enough such pulses, a single-hidden-layer network can approximate a single-input function to arbitrary accuracy.
- [Plots: a target single-input function $f(x)$ and the NN approximation error.]
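A short sketch of the pulse construction, assuming sigmoid units and the arbitrary values $a = 0.25$, $b = 0.75$, $c = 2$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bump(x, a, b, c, k):
    """f(x) ~= c*sigma(k(x-a)) - c*sigma(k(x-b)): a smooth pulse of height c
    between a and b that sharpens toward c*[t(x-a) - t(x-b)] as k grows."""
    return c * sigmoid(k * (x - a)) - c * sigmoid(k * (x - b))

# At the center of [a, b], the value approaches the pulse height c = 2 as
# the sigmoids steepen:
for k in (1, 10, 100):
    print(k, bump(0.5, a=0.25, b=0.75, c=2.0, k=k))
```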
Multi-input functions: universal function approximator?
- Does the single-input example hold in general? Yes: results of this type show that MLPs with a single hidden layer are universal function approximators.

Neural networks in practice: 3 basic steps
1. Collect input/output training data.
2. Select an appropriate neural network architecture: number of hidden layers, number of hidden units in each layer.
3. Train (adjust) the weights of the neural network to minimize the error measure $E = \frac{1}{2} \sum_{i=1}^{p} \lVert y_i - \hat{y}_i \rVert^2$.

Neural network training
- Key problem: how to adjust $w$ to minimize $E$?
- Answer: use derivative information on the error surface.

Gradient descent (one parameter)
1. Initialize $\omega$ to some random initial value.
2. Change $\omega$ iteratively at step $t$ according to: $\omega(t+1) = \omega(t) - \eta \dfrac{dE}{d\omega(t)}$
- [Plot: an error surface $E(\omega)$ with several minima, labeled points a-f.]
- Implies a local, not global, minimum...
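A minimal sketch of the one-parameter update rule, using an assumed toy error $E(\omega) = (\omega - 3)^2$ rather than a neural network error:

```python
def gradient_descent_1d(dE_dw, w0, eta, steps):
    """Iterate w(t+1) = w(t) - eta * dE/dw(t)."""
    w = w0
    for _ in range(steps):
        w -= eta * dE_dw(w)
    return w

# E(w) = (w - 3)^2 has dE/dw = 2(w - 3) and a single minimum at w = 3.
print(gradient_descent_1d(lambda w: 2.0 * (w - 3.0), w0=0.0, eta=0.1, steps=100))
# On a surface with several minima, the starting point and step size decide
# which local minimum the iteration descends into.
```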
General gradient descent
1. Initialize $w$ to some random initial value.
2. Change $w$ iteratively at step $t$ according to: $w(t+1) = w(t) - \eta \nabla E[w(t)]$, where
$$\nabla E[w(t)] = \left( \frac{\partial E}{\partial \omega_1(t)}, \frac{\partial E}{\partial \omega_2(t)}, \dots, \frac{\partial E}{\partial \omega_q(t)} \right)$$

Simple example of gradient computation
- Compute $\partial E / \partial \omega_4$ for the neural network below (two sigmoid hidden units, weights $\omega_1, \dots, \omega_7$):
$$net_1 = \omega_1 x + \omega_2, \quad net_2 = \omega_3 x + \omega_4, \quad h_1 = \sigma(net_1), \quad h_2 = \sigma(net_2), \quad \hat{y} = \omega_5 + \omega_6 h_1 + \omega_7 h_2$$
- Single training pattern $(x, y)$: $E = \frac{1}{2}(y - \hat{y})^2$
- Generalization to multiple training patterns: $\dfrac{\partial E}{\partial \omega_j} = \sum_{i=1}^{p} \dfrac{\partial E_i}{\partial \omega_j}$, where $E_i = \frac{1}{2} \lVert y_i - \hat{y}_i \rVert^2$.

Derivation
$$\frac{\partial E}{\partial \omega_4} = -(y - \hat{y}) \frac{\partial \hat{y}}{\partial \omega_4} = -(y - \hat{y})\, \omega_7 \frac{\partial h_2}{\partial net_2} \frac{\partial net_2}{\partial \omega_4} = -(y - \hat{y})\, \omega_7\, \sigma'(net_2)$$
(using $\partial net_2 / \partial \omega_4 = 1$)
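The derivative just derived can be checked numerically. The sketch below uses arbitrary weight values (an assumption for illustration) and compares the analytic expression against a finite-difference estimate:

```python
import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(w, x):
    """yhat = w5 + w6*sigma(w1*x + w2) + w7*sigma(w3*x + w4); w[0] unused."""
    net1, net2 = w[1] * x + w[2], w[3] * x + w[4]
    return w[5] + w[6] * sigma(net1) + w[7] * sigma(net2), net2

w = np.array([0.0, 0.3, -0.1, 0.8, 0.2, 0.5, -0.7, 1.1])  # arbitrary weights
x, y = 0.6, 1.0
yhat, net2 = forward(w, x)

# Analytic: dE/dw4 = -(y - yhat) * w7 * sigma'(net2), sigma' = sigma*(1-sigma)
analytic = -(y - yhat) * w[7] * sigma(net2) * (1.0 - sigma(net2))

# Finite-difference check of the same derivative
eps = 1e-6
wp = w.copy(); wp[4] += eps
numeric = (0.5 * (y - forward(wp, x)[0])**2 - 0.5 * (y - yhat)**2) / eps
print(analytic, numeric)  # the two values should agree closely
```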
Generalization: Backpropagation
- Key problem: generalize the specific result to compute derivatives in a more general manner.
- Answer: the backpropagation algorithm [Rumelhart and McClelland, 1986].
- An efficient, algorithmic formulation for computing error derivatives.
- Gradient computation without hardcoding derivatives (allows on-the-fly adjustment of NN architectures).

Backpropagation derivation
- Consider a unit $j$ with inputs $h_i$, weights $\omega_{ij}$, and net input $net_j = \sum_i \omega_{ij} h_i$.
- By the chain rule: $\dfrac{\partial E}{\partial \omega_{ij}} = \dfrac{\partial E}{\partial net_j} \dfrac{\partial net_j}{\partial \omega_{ij}} = \delta_j h_i$, where $\delta_j \equiv \dfrac{\partial E}{\partial net_j}$.

Backpropagation derivation: output units
- For an output unit $k$: $E = \frac{1}{2} \sum_k (y_k - \hat{y}_k)^2$ with $\hat{y}_k = \sigma(net_k)$, so
$$\delta_k = \frac{\partial E}{\partial net_k} = -(y_k - \hat{y}_k)\, \sigma'(net_k), \qquad \frac{\partial E}{\partial \omega_{jk}} = \delta_k h_j$$

Backpropagation derivation: hidden units
- For a hidden unit $j$, $E$ depends on $net_j$ only through the units $k$ that $j$ feeds:
$$\delta_j = \frac{\partial E}{\partial net_j} = \sum_k \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \left( \sum_k \delta_k\, \omega_{jk} \right) \sigma'(net_j), \qquad \frac{\partial E}{\partial \omega_{ij}} = \delta_j h_i$$

Backpropagation summary
- Output units: $\delta_k = -(y_k - \hat{y}_k)\, \sigma'(net_k)$ and $\dfrac{\partial E}{\partial \omega_{jk}} = \delta_k h_j$
- Hidden units: $\delta_j = \left( \sum_k \delta_k\, \omega_{jk} \right) \sigma'(net_j)$ and $\dfrac{\partial E}{\partial \omega_{ij}} = \delta_j h_i$

Basic steps in using neural networks
1. Collect training data
2. Preprocess training data
3. Select neural network architecture
4. Select learning algorithm
5. Weight initialization
6. Forward pass
7. Backward pass
8. Repeat steps 6 and 7 until a satisfactory model is reached.
The Forward Pass
1. Apply an input vector $x_i$ to the network.
2. Compute the net input to each hidden unit ($net_j$).
3. Compute the hidden-unit outputs ($h_j$).
4. Compute the neural network outputs ($\hat{y}_k$).

The Backward Pass
1. Evaluate $\delta_k = -(y_k - \hat{y}_k)\, \sigma'(net_k)$ at the outputs, for each output unit $k$.
2. Backpropagate the $\delta$ values from the outputs backwards through the neural network.
3. Compute $\partial E / \partial \omega_{ij}$.
4. Update the weights based on the computed gradient: $w(t+1) = w(t) - \eta \nabla E[w(t)]$.

Practical issues
1. What should your training data be? Sufficient training data? Biased training data? Deterministic/stochastic task? Stationary/non-stationary?
2. What should your neural network architecture be?
3. Preprocessing of data.
4. Weight initialization: why small, random values?

Practical issues (continued)
5. Selecting the learning parameter: in gradient descent, $w(t+1) = w(t) - \eta \nabla E[w(t)]$, what should $\eta$ be? A difficult question to answer...
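Putting the passes together, here is a sketch of the full loop (steps 5 through 8) on the XOR patterns from earlier, assuming a 2-4-1 sigmoid network with bias weights and batch training. The learning rate $\eta = 0.5$ is an ad hoc choice, which is exactly the issue the next slides examine; gradient descent can also land in a local minimum, so the outputs only typically approach the targets:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

# XOR patterns: the mapping a single perceptron could not realize
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# step 5: small random initial weights; 2 inputs, 4 hidden units, 1 output
W1, b1 = rng.normal(0, 0.5, (4, 2)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (1, 4)), np.zeros(1)
eta = 0.5

for t in range(20000):                         # step 8: repeat steps 6 and 7
    G1, g1, G2, g2 = 0.0, 0.0, 0.0, 0.0        # batch: sum over all patterns
    for x, y in zip(X, Y):
        h = sigma(W1 @ x + b1)                 # step 6: forward pass
        yhat = sigma(W2 @ h + b2)
        d_o = -(y - yhat) * yhat * (1 - yhat)  # step 7: backward pass
        d_h = (W2.T @ d_o) * h * (1 - h)
        G2 += np.outer(d_o, h); g2 += d_o
        G1 += np.outer(d_h, x); g1 += d_h
    W2 -= eta * G2; b2 -= eta * g2             # w(t+1) = w(t) - eta * grad E
    W1 -= eta * G1; b1 -= eta * g1

for x in X:                                    # outputs should near 0, 1, 1, 0
    print(x, sigma(W2 @ sigma(W1 @ x + b1) + b2).round(2))
```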
Selecting the learning parameter: an example
- Sample error surface: $E = 2\omega_1^2 + \frac{1}{2}\omega_2^2$ (realistic?)
- Where is the minimum of this error surface? At $(\omega_1, \omega_2) = (0, 0)$.
- [Contour plot of $E$ over the $(\omega_1, \omega_2)$ plane.]
- How many steps to convergence ($E < 10^{-6}$)? Try different initial weights and different learning rates.

Deriving the gradient descent equations
- Gradient: $\nabla E = \left( \dfrac{\partial E}{\partial \omega_1}, \dfrac{\partial E}{\partial \omega_2} \right) = (4\omega_1, \omega_2)$
- Gradient descent: $\omega_i(t+1) = \omega_i(t) - \eta \dfrac{\partial E}{\partial \omega_i(t)}$, giving
$$\omega_1(t+1) = \omega_1(t)(1 - 4\eta), \qquad \omega_2(t+1) = \omega_2(t)(1 - \eta)$$

Convergence experiments
- Initial weights: $(\omega_1, \omega_2) = (1, 4)$.
- [Plot: number of steps to convergence vs. learning rate $\eta$.]
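A small sketch reproducing this experiment, assuming the error surface and initial weights reconstructed above:

```python
def steps_to_convergence(eta, w1=1.0, w2=4.0, tol=1e-6, max_steps=1000):
    """Iterate w1 <- w1*(1 - 4*eta), w2 <- w2*(1 - eta) until
    E = 2*w1**2 + 0.5*w2**2 < tol; return the step count."""
    for t in range(max_steps):
        if 2 * w1**2 + 0.5 * w2**2 < tol:
            return t
        w1 *= (1 - 4 * eta)
        w2 *= (1 - eta)
    return None  # diverged, or converging too slowly

for eta in (0.05, 0.1, 0.2, 0.4, 0.49, 0.6):
    print(eta, steps_to_convergence(eta))
# Convergence is fastest at intermediate eta; for eta >= 0.5, |1 - 4*eta| >= 1
# and the w1 component no longer decays, so the iteration never converges.
```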
A closer look
- [Plots: weight trajectories in the $(\omega_1, \omega_2)$ plane for increasing learning rates, ending near $\eta = 0.5$.]
- What happens at $\eta > 0.5$? The gradient descent equations
$$\omega_1(t+1) = \omega_1(t)(1 - 4\eta), \qquad \omega_2(t+1) = \omega_2(t)(1 - \eta)$$
are similar to the fixed-point iteration $\omega(t+1) = c\,\omega(t)$, which, from a given $\omega(0)$, diverges for $|c| > 1$ and converges for $|c| < 1$.
Convergence of gradient descent equations
- $\omega_1(t+1) = \omega_1(t)(1 - 4\eta)$ requires that $|1 - 4\eta| < 1 \;\Leftrightarrow\; -1 < 1 - 4\eta < 1 \;\Leftrightarrow\; 0 < \eta < 0.5$
- Why not $\eta < 2$? The equation $\omega_2(t+1) = \omega_2(t)(1 - \eta)$ on its own converges for $0 < \eta < 2$, but the steeper $\omega_1$ direction sets the tighter bound.

Learning rate discussion
- Problematic error surfaces: long, steep-sided valleys.
- If the learning rate is too small: slow convergence.
- If the learning rate is too large: possible divergence.
- Theoretical bounds are not possible in the general case (only for specific, trivial examples like this one).
- Motivation for looking at more advanced training algorithms: doing more with the gradient information. Any thoughts?

Practical issues (continued)
6. Pattern vs. batch training.
7. Good generalization: a sufficiently constrained neural network architecture, cross-validation, early stopping.

Good generalization: two data sets
- [Plot: NN error vs. training time for the training data and the cross-validation data; the training error keeps falling while the cross-validation error eventually rises, and the early stopping point is where the cross-validation error is lowest.]
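A minimal sketch of early stopping against a cross-validation set, where `train_step` and `cv_error` are hypothetical callables supplied by the surrounding training code:

```python
def train_with_early_stopping(train_step, cv_error, max_epochs=1000, patience=20):
    """Track the cross-validation error after each training epoch and stop
    once it has not improved for `patience` epochs (the early stopping point)."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()              # one epoch of weight updates on training data
        err = cv_error()          # error on the held-out cross-validation data
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                 # cv error rising: the network is overfitting
    return best_epoch, best_err
```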