Artificial Neural Networks (1)

Transcription

1 Artificial Neural Networks (1) John Kelleher & Brian Mac Namee Machine DIT

2 1 Artificial Neurons Cognitive Basis Basic Components Unthresholded linear units Activation Functions 2 Implementing logical functions 3 Network structures 4 Feed-forward networks Feed-forward networks as Parameterised Functions of Input Types of Feed-forward Networks 5 Perceptrons What is a Perceptron? Representational Power of Threshold Perceptrons Perceptron Learning (Perceptron Training Rule) Perceptron Learning (Gradient Descent Learning) Perceptron Learning Algorithm 6 Perceptrons versus Decision Trees 7 Summary

3 Cognitive Basis Figure: A diagram illustrating the structure of a neuron. Neurons of > 20 types, synapses, 1ms to 10ms cycle time. Signals are noisy spike trains of electrical potential.

4 Basic Components Neural networks are composed of nodes or units connected by directed links. A link from unit $j$ to unit $i$ serves to propagate the activation $a_j$ from $j$ to $i$. Each link also has a numeric weight $w_{j,i}$ associated with it, which determines the strength and the sign of the connection. These units are a gross oversimplification of real neurons, but their purpose is to develop an understanding of what networks of simple units can do.

5 Unthresholded linear units The simplest form of unit is an unthresholded linear unit. Given a vector of real-valued inputs $\mathbf{a} = \langle a_0, \dots, a_n \rangle$ and a vector of real-valued weights $\mathbf{w} = \langle w_0, \dots, w_n \rangle$, each of which is associated with one input, the output of an unthresholded linear unit is equal to the weighted sum of its inputs: $a_i \leftarrow \sum_j w_{j,i} a_j$, or in vector form: $a_i \leftarrow \mathbf{w} \cdot \mathbf{a}$. Recall that the dot product ($\cdot$) is defined as follows: given $\mathbf{w} = [w_1, w_2, \dots, w_n]$ and $\mathbf{a} = [a_1, a_2, \dots, a_n]$, then $\mathbf{w} \cdot \mathbf{a} = (w_1 a_1) + (w_2 a_2) + \dots + (w_n a_n)$.
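The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the function name `linear_unit` and the example numbers are assumptions, not part of the slides.

```python
def linear_unit(weights, inputs):
    """Output of an unthresholded linear unit: the dot product w . a."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * a for w, a in zip(weights, inputs))


# Made-up example values, chosen only to show the arithmetic.
w = [0.5, -1.0, 2.0]
a = [1.0, 3.0, 0.25]
output = linear_unit(w, a)  # 0.5*1.0 + (-1.0)*3.0 + 2.0*0.25
print(output)
```

Each weight is paired with exactly one input, which is why the sketch rejects vectors of different lengths.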

6 Activation Functions Usually, however, the basic unthresholded linear unit is augmented with some form of activation function. In these more complex units, the processing stages of a unit are: 1 First, each unit $i$ computes a weighted sum of its inputs: $in_i \leftarrow \sum_j w_{j,i} a_j$. 2 Then it applies an activation function $g$ to this sum to derive the output (activation) $a_i$: $a_i \leftarrow g(in_i) = g\left(\sum_j w_{j,i} a_j\right)$.

7 Activation Functions Figure: The figure above illustrates the processing stages of such an augmented unit. Note: we have included a bias weight $w_{0,i}$ connected to a fixed input $a_0 = 1$. This bias weight is used to parameterize the behaviour of the activation functions.

8 Activation Functions Figure: The outputs of two frequently used activation functions. (a) is a graph of the activation of a unit that uses a step activation function (aka threshold activation function). Units that use a threshold function as their activation function are called linear threshold units or McCulloch-Pitts units. The activation of these units is equal to: $a_i = g(in_i) = \begin{cases} 1 & \text{if } in_i > 0 \text{, i.e. } \mathbf{w} \cdot \mathbf{a} > 0 \\ 0 & \text{otherwise} \end{cases}$

9 Activation Functions Figure: The outputs of two frequently used activation functions. (b) is a graph of the activation of a unit that uses a sigmoid (aka logistic) function $1/(1 + e^{-x})$. A unit using a sigmoidal activation function is known as a sigmoid unit. The activation of a sigmoid unit is a continuous function of its inputs that ranges between 0 and 1, increasing monotonically with its inputs. It is equal to: $a_i = g(in_i) = 1/(1 + e^{-(\mathbf{w} \cdot \mathbf{a})})$
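A sigmoid unit can be sketched as follows. The weights and inputs are made-up values, and the helper names `sigmoid` and `sigmoid_unit` are assumptions introduced for illustration; the first input plays the role of the fixed bias input $a_0 = 1$.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_unit(weights, inputs):
    """Activation of a sigmoid unit: g(w . a)."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return sigmoid(in_i)


# Made-up weights; inputs[0] is the fixed bias input a_0 = 1.
weights = [-1.0, 2.0, 0.5]
low = sigmoid_unit(weights, [1.0, 0.0, 0.0])   # in_i = -1.0
mid = sigmoid_unit(weights, [1.0, 0.5, 0.0])   # in_i = 0.0
high = sigmoid_unit(weights, [1.0, 1.0, 2.0])  # in_i = 2.0
print(low, mid, high)
```

The three activations stay strictly between 0 and 1 and grow with the weighted sum, with the soft threshold at a weighted sum of zero giving an activation of 0.5.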

10 Activation Functions Figure: The outputs of two frequently used activation functions. Notice that both functions have a threshold (either hard or soft) at zero. For the linear threshold function, the bias weight $w_{0,i}$ sets the actual threshold for the unit, in the sense that the unit is activated when the weighted sum of real inputs $\sum_{j=1}^{n} w_{j,i} a_j$ (i.e. excluding the bias input) exceeds $-w_{0,i} a_0$.

11 Activation Functions Why use the sigmoidal activation function? Although more complex than the threshold function, the sigmoid function has the advantage of being differentiable, a property which is important for the weight-learning algorithm we will develop. When $g(in_i)$ is a sigmoidal function: $\frac{\partial g(in_i)}{\partial in_i} = g(in_i)\,(1 - g(in_i))$
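The derivative identity for the sigmoid can be checked numerically. The sketch below compares the closed form $g(x)(1 - g(x))$ with a central-difference estimate; the function names, the step size `h`, and the sample points are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Closed form g'(x) = g(x) * (1 - g(x))."""
    g = sigmoid(x)
    return g * (1.0 - g)


def finite_difference(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Largest disagreement between the two estimates over a few sample points.
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(
    abs(sigmoid_derivative(x) - finite_difference(sigmoid, x)) for x in points
)
print(max_gap)
```

The two estimates agree to within numerical noise, and the slope is largest at zero, where it equals 0.25.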

12 McCulloch and Pitts: every basic Boolean function can be implemented using a unit with a linear threshold activation function, given appropriate input and bias weights. This is important because it means we can use these units to build a network to compute any Boolean function of the inputs.

13 Figure: Threshold units that implement standard logical functions. Class exercise: Assuming $a_0 = 1$, show using a truth table how each of the above units models the corresponding logic function using a threshold function $g(in_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{a} > 0 \\ 0 & \text{otherwise} \end{cases}$

14 Figure: Threshold units that implement standard logical functions. Table: Truth table for AND unit. Columns: inputs $a_0$, $a_1$, $a_2$; $in_i = \sum_{j=0}^{2} w_{j,i} a_j$; $a_i = (in_i > 0)\ ?\ 1 : 0$.

15 Figure: Threshold units that implement standard logical functions. Table: Truth table for OR unit. Columns: inputs $a_0$, $a_1$, $a_2$; $in_i = \sum_{j=0}^{2} w_{j,i} a_j$; $a_i = (in_i > 0)\ ?\ 1 : 0$.

16 Figure: Threshold units that implement standard logical functions. Table: Truth table for NOT unit. Columns: inputs $a_0$, $a_1$; $in_i = \sum_{j=0}^{1} w_{j,i} a_j$; $a_i = (in_i > 0)\ ?\ 1 : 0$.
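The truth-table exercise can be worked through in code. The figure with the actual weights is not reproduced in this transcript, so the weights below are assumptions: one possible choice that works with $a_0 = 1$ and the threshold rule "output 1 if $\mathbf{w} \cdot \mathbf{a} > 0$". The helper names are also illustrative.

```python
from itertools import product


def threshold_unit(weights, inputs):
    """Linear threshold unit: 1 if w . a > 0, otherwise 0."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return 1 if in_i > 0 else 0


# Assumed weights (not taken from the slides); weights[0] is the bias
# weight on the fixed input a_0 = 1.
AND_WEIGHTS = [-1.5, 1.0, 1.0]
OR_WEIGHTS = [-0.5, 1.0, 1.0]
NOT_WEIGHTS = [0.5, -1.0]


def truth_table(weights, n_inputs):
    """List of (input bits, unit output) for every input combination."""
    rows = []
    for bits in product([0, 1], repeat=n_inputs):
        rows.append((bits, threshold_unit(weights, [1, *bits])))
    return rows


and_table = truth_table(AND_WEIGHTS, 2)
or_table = truth_table(OR_WEIGHTS, 2)
not_table = truth_table(NOT_WEIGHTS, 1)

for name, table in [("AND", and_table), ("OR", or_table), ("NOT", not_table)]:
    print(name, table)
```

With these assumed weights the bias weight acts as the threshold: AND needs both inputs on to overcome -1.5, OR needs only one to overcome -0.5, and NOT is on by default and switched off by its input.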

17 There are two main categories of neural network structures: 1 acyclic or feed-forward networks. Feed-forward networks implement functions and have no internal state, e.g. single-layer perceptrons and multi-layer perceptrons. 2 cyclic or recurrent networks. A recurrent network feeds its outputs back into its own inputs. Recurrent neural nets have directed cycles with delays, so they have internal state (like flip-flops), can oscillate etc., and can therefore support short-term memory.

18 Feed-forward networks as Parameterised Functions of Input Figure: Example feed-forward network with two input units (1, 2), two hidden units (3, 4) and 1 output unit (5). To keep things simple we have omitted the bias units in this example.

19 Feed-forward networks as Parameterised Functions of Input The output of a feed-forward network is a parameterized function of the network inputs, e.g.: $a_5 = g(w_{3,5}\, a_3 + w_{4,5}\, a_4) = g(w_{3,5}\, g(w_{1,3}\, a_1 + w_{2,3}\, a_2) + w_{4,5}\, g(w_{1,4}\, a_1 + w_{2,4}\, a_2))$. By expressing the output of each hidden unit as a function of its inputs, we can see that the output of the network as a whole, $a_5$, is a function of the network's inputs. Furthermore, we see that the weights in the network act as parameters of this function. It follows that by adjusting the weights, we can change the function that the network represents. This is how learning occurs in neural nets!
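The expression for $a_5$ can be written directly as a function of the inputs and the weights. The sketch below assumes a sigmoid for $g$ (the slide leaves $g$ unspecified) and uses made-up weight values; the dictionary keyed by `(from_unit, to_unit)` is an illustrative choice, not something the slides prescribe.

```python
import math


def g(x):
    """Assumed activation function: the sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def network_output(w, a1, a2):
    """a_5 for the 2-2-1 network; w maps (from_unit, to_unit) to a weight."""
    a3 = g(w[(1, 3)] * a1 + w[(2, 3)] * a2)
    a4 = g(w[(1, 4)] * a1 + w[(2, 4)] * a2)
    return g(w[(3, 5)] * a3 + w[(4, 5)] * a4)


# Made-up weights, for illustration only.
weights = {
    (1, 3): 1.0, (2, 3): -1.0,
    (1, 4): 0.5, (2, 4): 0.5,
    (3, 5): 2.0, (4, 5): -1.0,
}
before = network_output(weights, 1.0, 0.0)

# Changing a single weight changes the function the network computes.
changed = dict(weights)
changed[(3, 5)] = 3.0
after = network_output(changed, 1.0, 0.0)
print(before, after)
```

The same inputs give a different output once one weight is changed, which is the sense in which the weights parameterise the function.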

20 Types of Feed-forward Networks Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer. In single-layer (aka perceptron) networks there are no hidden units, i.e. all the inputs are connected directly to the outputs. In multi-layer networks there are one or more layers of hidden units.

21 What is a Perceptron? Figure: A perceptron. A perceptron is a feed-forward network with no hidden units. It takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise.

22 What is a Perceptron? Figure: A perceptron network consisting of three perceptron output units that share five inputs. Looking at a particular output (say the second one, outlined in bold), we see that the weights on its incoming links have no effect on the other output units, so the output units all operate separately, with no shared weights.

23 What is a Perceptron? Figure: A graph of the output of a two-input perceptron unit with a sigmoid activation function. Adjusting the weights moves the location, orientation, and steepness of the cliff.

24 Representational Power of Threshold Perceptrons A threshold perceptron is a single-layer network (no hidden units) with units that use a threshold activation function (i.e. McCulloch-Pitts units). We have already shown how such a network can represent the basic Boolean functions: AND, OR, NOT. However, such a network cannot be used to implement XOR. Why is this?

25 Representational Power of Threshold Perceptrons We can think of a 2-input threshold perceptron as representing a linear separator in 2D input space. In other words, the condition $\sum_j w_j a_j > 0$, or $\mathbf{w} \cdot \mathbf{a} > 0$, defines a line in the input space (the boundary $\mathbf{w} \cdot \mathbf{a} = 0$), and the perceptron outputs 1 for instances lying on one side of the line and -1 for instances lying on the other side.

26 Representational Power of Threshold Perceptrons Figure: The decision surface represented by a two-input perceptron. The + and - symbols represent training examples from two different classes that the decision surface can distinguish between. The figure above illustrates this concept. The line in this image is the boundary of the region $\mathbf{w} \cdot \mathbf{a} > 0$. The inputs into the perceptron are $x_1$ and $x_2$; i.e. $\mathbf{a} = \{x_1, x_2\}$.

27 Representational Power of Threshold Perceptrons This concept of a linear perceptron defining a line scales up into higher-dimensional input spaces. In 3D and higher input spaces (i.e. in situations where the linear perceptron has 3 or more inputs), the condition $\mathbf{w} \cdot \mathbf{a} > 0$ defines a hyperplane decision surface in the n-dimensional space of inputs. Data-sets of positive and negative examples that can be separated by a hyperplane are called linearly separable. Of course, not all data-sets of positive and negative examples are linearly separable. The XOR function is one example of a non-linearly separable function.

28 Representational Power of Threshold Perceptrons Linear separability in threshold perceptrons. Figure: The circles represent the data points to be classified. The colour of the circles indicates their correct classification: black dots represent a point in the input space where the value of the function is 1, and white dots indicate a point where the value is 0. The diagonal lines illustrate potential linear borders of demarcation between the classes. As is evident from the images, in the AND and OR input spaces it is possible to draw a line that separates the black dots from the white dots. However, in the XOR space no such line exists. As a result, a threshold perceptron cannot represent the XOR function.

29 Representational Power of Threshold Perceptrons This limitation of threshold perceptrons was first highlighted by Minsky & Papert (1969) and resulted in a lot of people turning away from neural networks. Minsky & Papert showed that, in general, threshold perceptrons can represent only linearly separable functions. Sigmoid perceptrons are similarly limited, in the sense that they represent only soft linear separators. However, it's not all doom and gloom: we can represent the XOR function using a multilayer network. Minsky, M. and Papert, S. (1969). Perceptrons. MIT Press, Cambridge.

30 Representational Power of Threshold Perceptrons Figure: A multilayer neural network that implements the XOR function. XOR is easiest to construct using step-function units. Because XOR is not linearly separable, we will need a hidden layer. It turns out that just one hidden node suffices. We can think of the XOR function as OR with the AND case (both inputs on) ruled out. Thus the hidden layer computes AND, while the output layer computes OR but weights the output of the hidden node negatively.
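The construction described above (a hidden AND unit, and an OR output unit that weights the hidden unit negatively) can be sketched with step-function units. The figure's actual weights are not in this transcript, so the values below are assumptions; the sketch also assumes that the inputs feed the output unit directly as well as through the hidden unit, and that each unit has a bias input fixed at 1.

```python
def step(x):
    """Threshold activation: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def xor_network(x1, x2):
    """XOR from one hidden AND unit plus an OR-like output unit."""
    # Hidden unit computes AND (assumed weights: bias -1.5, inputs 1 and 1).
    hidden = step(-1.5 + 1.0 * x1 + 1.0 * x2)
    # Output unit computes OR of the inputs (bias -0.5) but gives the
    # hidden unit a negative weight (-2.0), ruling out the "both on" case.
    return step(-0.5 + 1.0 * x1 + 1.0 * x2 - 2.0 * hidden)


xor_table = [(x1, x2, xor_network(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
for row in xor_table:
    print(row)
```

When both inputs are on, the hidden unit fires and its negative weight pulls the output unit's weighted sum back below zero, which is exactly the case a single threshold unit cannot handle.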

31 Perceptron Learning (Perceptron Training Rule) Because threshold perceptrons have limitations on their representational power, we will generally be interested in training multilayer networks of threshold units. However, as an introduction to training multilayer networks, we will first look at how to learn weights for a single perceptron. Here the precise learning problem is to determine a weight vector that causes the perceptron to produce the correct ±1 output for each of the given training examples.

32 Perceptron Learning (Perceptron Training Rule) Perhaps the most interesting aspect of neural networks is that the connection weights need not be set by hand or fixed in advance. Most models are born with their weights set to random values and then these weights are iteratively adjusted by a learning algorithm on the basis of a series of training examples that pair inputs with targets.

33 Perceptron Learning (Perceptron Training Rule) Perceptron Training Rule: Weights are modified at each step according to the perceptron training rule, which revises the weight $w_i$ associated with input $a_i$ according to the rule: $w_i \leftarrow w_i + \eta\,(t - o)\, a_i$, where $t$ = target output, $o$ = observed output, and $\eta$ is a positive constant called the learning rate. The learning rate moderates the degree to which weights are changed at each step; usually $\eta$ is a small value (e.g., 0.1), and it is sometimes made to decay as the number of weight-training iterations increases.

34 Perceptron Learning (Perceptron Training Rule) Why would this rule converge toward successful weight values? If the training example is correctly classified, then $(t - o) = 0$, so $\eta\,(t - o)\, a_i = 0$ and no weights are updated. In the case of a false negative ($o = 0$ and $t = 1$) we want to make the perceptron output a 1 instead of a 0, so the weights must be altered to increase the value of $\mathbf{w} \cdot \mathbf{a}$. Notice that in this case the rule will increase $w_i$ when $a_i$ is positive, because $(t - o)$, $\eta$ and $a_i$ are then all positive. On the other hand, in the case of a false positive ($o = 1$ and $t = 0$) the weights associated with positive $a_i$ will be decreased.
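The three cases above (correct, false negative, false positive) can be traced with a small sketch of the training rule. The function names and the example weights are assumptions made for illustration; the threshold convention used is "output 1 if the weighted sum is greater than 0".

```python
def predict(weights, inputs):
    """Threshold perceptron output: 1 if w . a > 0, otherwise 0."""
    return 1 if sum(w * a for w, a in zip(weights, inputs)) > 0 else 0


def train_step(weights, inputs, target, eta=0.1):
    """One application of w_i <- w_i + eta * (t - o) * a_i to every weight."""
    o = predict(weights, inputs)
    return [w + eta * (target - o) * a for w, a in zip(weights, inputs)]


# Correctly classified example: (t - o) = 0, so nothing changes.
unchanged = train_step([0.5, 0.0, 0.0], [1, 1, 0], target=1)

# False negative (o = 0, t = 1): weights on positive inputs go up.
raised = train_step([0.0, 0.0, 0.0], [1, 1, 0], target=1)

# False positive (o = 1, t = 0): weights on positive inputs go down.
lowered = train_step([0.5, 0.0, 0.0], [1, 0, 1], target=0)

print(unchanged, raised, lowered)
```

Weights attached to an input of 0 are never changed by the rule, since the update is proportional to $a_i$.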

35 Perceptron Learning (Gradient Descent Learning) The learning procedure we have just described can be proven to converge, within a finite number of applications of the perceptron training rule, to a weight vector that correctly classifies all training examples, provided the training examples are linearly separable and provided that a sufficiently small $\eta$ is used (see Minsky & Papert, 1969). However, the perceptron training rule can fail to converge if the examples are not linearly separable. A second approach is to use gradient descent to search the hypothesis space of possible weights to find the weights that best fit the training examples. To understand the gradient descent algorithm, it is helpful to visualise the entire hypothesis space of possible weight vectors and their associated error values (see next slide).

36 Perceptron Learning (Gradient Descent Learning) Figure: Graph of an error surface across a hypothesis space. $w_0$ and $w_1$ represent possible values for two weights of a simple linear unit. The $w_0$, $w_1$ plane therefore represents the entire hypothesis space. The vertical axis indicates the error relative to some fixed set of training examples.

37 Perceptron Learning (Gradient Descent Learning) Figure: Graph of an error surface across a hypothesis space. The error surface shown summarises the desirability of every weight vector in the hypothesis space. We are searching for the hypothesis (weight vector) with minimum error, i.e. the hypothesis at the global minimum in the error surface.

38 Perceptron Learning (Gradient Descent Learning) Figure: Graph of an error surface across a hypothesis space. The arrow shows the negated gradient at one particular point, indicating the direction in the $w_0$, $w_1$ plane producing steepest descent along the error surface.

39 Perceptron Learning (Gradient Descent Learning) Note that for linear units the error surface must always be parabolic with a single global minimum. Gradient descent search determines a weight vector that minimises the error $E$ by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps. At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface. This process continues until the global minimum error is reached.

40 Perceptron Learning (Gradient Descent Learning) To apply this gradient descent approach we need to: 1 define the function that computes the error of the network; 2 be able to compute the slope of the error surface at a particular point as a function of the change in the weights at a node; 3 define a weight update rule that uses the slope information to produce the steepest descent along the error surface.

41 Perceptron Learning (Gradient Descent Learning) Defining the error of the network. The classical measure of error used in gradient descent search is the mean squared network error. This is computed by summing the squared error for each node. Formally, the mean squared error for a network is:
$E = \frac{1}{2} \sum_{i=1}^{m} (tn_i - on_i)^2 = \frac{1}{2} \sum_{i=1}^{m} (tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))^2$
where the network has $m$ output units, $tn_i$ is the target output for unit $i$, $on_i$ is the actual output for unit $i$, $g$ is the activation function of the network units, $\mathbf{a}_i$ is the vector of inputs to unit $i$, $\mathbf{w}_i$ is the vector of weights in the network on the links into unit $i$, and $on_i = g(\mathbf{a}_i \cdot \mathbf{w}_i)$.
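As a small worked instance of this error measure, the sketch below computes half the sum of squared differences between target and actual outputs. The function name `network_error` and the three example output units are made up for illustration.

```python
def network_error(targets, outputs):
    """E = 1/2 * sum over output units of (target - output)^2."""
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


# Made-up targets and outputs for a network with m = 3 output units.
targets = [1.0, 0.0, 1.0]
outputs = [0.8, 0.3, 0.6]
error = network_error(targets, outputs)  # 0.5 * (0.04 + 0.09 + 0.16)
print(error)
```

The error is zero only when every output unit matches its target, and the factor of one half is there to cancel the 2 that appears when the square is differentiated.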

42 Perceptron Learning (Gradient Descent Learning) In order to compute slopes we need to know some calculus. Fundamental Calculus: In mathematics, the derivative is a measurement of how a function changes when the values of its inputs change, and a partial derivative (denoted by the symbol $\partial$) of a function of several variables is its derivative with respect to one of those variables with the others held constant.
$\frac{\partial x}{\partial x} = 1$, $\frac{\partial x^2}{\partial x} = 2x$
Chain rule: $\frac{\partial g(f(x))}{\partial x} = g'(f(x))\, \frac{\partial f(x)}{\partial x}$
Ex.1: if $f(x) = (x^2 + 1)^3$ then $f'(x) = 3(x^2 + 1)^2 (2x)$
Ex.2: if $f(x) = \frac{1}{2}(c - x)^2$ then $f'(x) = \frac{2}{2}(c - x)\, \frac{\partial (c - x)}{\partial x} = (c - x)(-1) = -(c - x)$
If the activation function $g(\mathbf{w} \cdot \mathbf{a})$ is a sigmoidal function $1/(1 + e^{-(\mathbf{w} \cdot \mathbf{a})})$, then $\frac{\partial g(in_i)}{\partial in_i} = g(in_i)\,(1 - g(in_i))$ with $in_i = \mathbf{w} \cdot \mathbf{a}$.

43 Perceptron Learning (Gradient Descent Learning) Computing the slope of the error surface as a function of the change in the weights at a node. We can compute the slope of a surface by taking the derivative of the function that defines that surface. The error surface is defined by the function:
$E = \frac{1}{2} \sum_{i=1}^{m} (tn_i - on_i)^2 = \frac{1}{2} \sum_{i=1}^{m} (tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))^2$
What we want is the rate of change of the network error $E$ as a function of the change in a particular weight $w_k$:
$\frac{\partial E}{\partial w_k} = \frac{\partial E}{\partial on_i}\, \frac{\partial on_i}{\partial w_k}$
Chain rule: $\frac{\partial g(f(x))}{\partial x} = g'(f(x))\, \frac{\partial f(x)}{\partial x}$

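44 Perceptron Learning (Gradient Descent Learning) How do we compute $\frac{\partial E}{\partial w_k} = \frac{\partial E}{\partial on_i}\, \frac{\partial on_i}{\partial w_k}$? Step 1: we compute the partial derivative of the total error $E$ with respect to each output unit, using $on_i = g(\mathbf{a}_i \cdot \mathbf{w}_i)$:
$\frac{\partial E}{\partial on_i} = \frac{\partial}{\partial g(\mathbf{a}_i \cdot \mathbf{w}_i)}\, \frac{1}{2} \sum_{i=1}^{m} (tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))^2$
$= \frac{1}{2}\, \frac{\partial (tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))^2}{\partial g(\mathbf{a}_i \cdot \mathbf{w}_i)}$
$= \frac{1}{2} \cdot 2\,(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))\, \frac{\partial (tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))}{\partial g(\mathbf{a}_i \cdot \mathbf{w}_i)}$
$= (tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))(-1) = -(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))$
We can drop the summation because we are considering a node on the output layer, where its error will not affect any other node.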

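45 Perceptron Learning (Gradient Descent Learning) How do we compute $\frac{\partial E}{\partial w_k} = \frac{\partial E}{\partial on_i}\, \frac{\partial on_i}{\partial w_k}$? Step 2: we compute the partial derivative of the actual output at the $i$th node taken with respect to each weight at that node:
$\frac{\partial on_i}{\partial w_k} = \frac{\partial g(\mathbf{a}_i \cdot \mathbf{w}_i)}{\partial w_k} = g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, \frac{\partial (\mathbf{a}_i \cdot \mathbf{w}_i)}{\partial w_k} = g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k$
where $g'()$ is the derivative of the activation function of the network units and $a_k$ is the single input component $k$ on the input whose weight $w_k$ is being updated.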

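46 Perceptron Learning (Gradient Descent Learning) Putting these two derivations together we can now define $\frac{\partial E}{\partial w_k}$ as:
$\frac{\partial E}{\partial w_k} = \underbrace{-(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))}_{\partial E / \partial on_i}\; \underbrace{g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k}_{\partial on_i / \partial w_k}$
where $g'$ is the derivative of the activation function.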

47 Perceptron Learning (Gradient Descent Learning) Gradient Descent Weight Update Rule: In the gradient descent algorithm, where we want to reduce $E$ (i.e. we want the weight to be changed in the direction of the negative gradient component), we update the weight using the following rule: $w_k \leftarrow w_k + \eta\,(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))\, g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k$, where $\eta$ is the learning rate.

48 Perceptron Learning (Gradient Descent Learning) Gradient descent weight update rule: $w_k \leftarrow w_k + \eta\,(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))\, g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k$. Intuitively, this makes a lot of sense. If the error $(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))$ is positive (i.e., we have a false negative), the network output is too small, and so the weights are increased for the positive inputs $a_k > 0$ and decreased for the negative inputs $a_k < 0$. The rule does this because $\eta$, $(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))$ and $g'(\mathbf{a}_i \cdot \mathbf{w}_i)$ are all positive, so $\eta\,(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))\, g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k$ will be positive when $a_k$ is positive and negative when $a_k$ is negative. The opposite happens when the error is negative.
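One application of the gradient descent update can be traced in code. This is a sketch under stated assumptions: a single sigmoid unit, made-up inputs and starting weights, an illustrative learning rate of 0.5, and helper names (`unit_output`, `gradient_step`) that are not from the slides.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(weights, inputs):
    """Output of a sigmoid unit: g(a . w)."""
    return sigmoid(sum(w * a for w, a in zip(weights, inputs)))


def gradient_step(weights, inputs, target, eta=0.5):
    """w_k <- w_k + eta * (t - g(a . w)) * g'(a . w) * a_k for every k."""
    out = unit_output(weights, inputs)
    err = target - out
    slope = out * (1.0 - out)  # sigmoid derivative g'(a . w)
    return [w + eta * err * slope * a_k for w, a_k in zip(weights, inputs)]


# Made-up example: output is too small (0.5 against a target of 1).
weights = [0.0, 0.0, 0.0]
inputs = [1.0, 2.0, -1.0]  # one negative input to show the sign behaviour
new_weights = gradient_step(weights, inputs, target=1.0)

old_out = unit_output(weights, inputs)
new_out = unit_output(new_weights, inputs)
print(new_weights, old_out, new_out)
```

With a positive error, the weights on the positive inputs rise and the weight on the negative input falls, and the unit's output on this example moves toward the target.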

49 Perceptron Learning (Gradient Descent Learning) Gradient descent weight update rule and Activation Functions: $w_k \leftarrow w_k + \eta\,(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))\, g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k$. For threshold perceptrons, the factor $g'(\mathbf{a}_i \cdot \mathbf{w}_i)$ is omitted from the weight update. Omitting $g'(\mathbf{a}_i \cdot \mathbf{w}_i)$ makes the weight update rule identical to the perceptron learning rule. Since $g'(\mathbf{a}_i \cdot \mathbf{w}_i)$ is the same for all weights, its omission changes only the magnitude and not the direction of the overall weight update for each example. If the units in the network use a continuous activation function, we must be able to take the derivative $g'$. If the activation function $g(\mathbf{w} \cdot \mathbf{a})$ is a sigmoidal function $1/(1 + e^{-(\mathbf{w} \cdot \mathbf{a})})$, then $\frac{\partial g(in_i)}{\partial in_i} = g(in_i)\,(1 - g(in_i))$ with $in_i = \mathbf{w} \cdot \mathbf{a}$.

50 Perceptron Learning Algorithm The perceptron learning algorithm, listed on the next slide, runs the training examples through the net one at a time, adjusting the weights slightly after each example to reduce the error. Each cycle through the examples is called an epoch. Epochs are repeated until some stopping criterion is reached, typically that the weight changes have become very small. The hypothesis returned computes the network output for any given example.

51 Perceptron Learning Algorithm
function PERCEPTRON-LEARNING(examples, network, ) returns a perceptron hypothesis
    inputs: examples, a set of examples, each with input a = a_1, ..., a_n and output target t
            network, a perceptron with weights w = w_0 ... w_n, and activation function g
    repeat
        for each e in examples do
            in ← w · a[e]
            Err ← t[e] − g(in)
            W_j ← W_j + (η × Err × g'(in) × a_j[e])
    until some stopping criterion is satisfied
    return NEURAL-NET-HYPOTHESIS(network)
The perceptron learning rule converges to a consistent function for any linearly separable data set.
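The algorithm above can be sketched in Python as follows. This is one possible reading, not the authors' implementation: it assumes a sigmoid activation, treats the first input of every example as the bias input 1, and uses an assumed stopping criterion (the largest weight change in an epoch falls below `tolerance`, or `max_epochs` is reached). The parameter names and default values are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative of the sigmoid: g(x) * (1 - g(x))."""
    g = sigmoid(x)
    return g * (1.0 - g)


def perceptron_learning(examples, weights, g=sigmoid, g_prime=sigmoid_prime,
                        eta=0.5, max_epochs=5000, tolerance=1e-3):
    """Adjust weights one example at a time; returns the learned weights.

    examples: list of (inputs, target) pairs; inputs[0] is the bias input 1.
    """
    weights = list(weights)
    for _ in range(max_epochs):
        largest_change = 0.0
        for inputs, target in examples:
            in_ = sum(w * a for w, a in zip(weights, inputs))
            err = target - g(in_)
            slope = g_prime(in_)
            for j, a_j in enumerate(inputs):
                change = eta * err * slope * a_j
                weights[j] += change
                largest_change = max(largest_change, abs(change))
        # Assumed stopping criterion: weight changes have become very small.
        if largest_change < tolerance:
            break
    return weights


# Train on the (linearly separable) AND function.
and_examples = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
learned = perceptron_learning(and_examples, [0.0, 0.0, 0.0])
predictions = [
    1 if sigmoid(sum(w * a for w, a in zip(learned, x))) > 0.5 else 0
    for x, _ in and_examples
]
print(learned, predictions)
```

Passing `g` and `g_prime` as arguments keeps the loop close to the pseudocode, where the activation function belongs to the network rather than to the learning procedure.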

52 Figure: Performance comparison of perceptrons and decision-trees on learning the majority function (which outputs a 1 only if more than half of its $n$ inputs are 1). The perceptron learns the majority function easily (because the majority function is linearly separable); DTL is hopeless (a decision tree would need $O(2^n)$ nodes to represent this function for $n$ inputs and won't learn that without a very large data set).

53 Figure: Performance comparison of perceptrons and decision-trees on learning the restaurant example. The perceptron finds this problem very difficult (the solution to this problem is not linearly separable, the best plane through the data correctly classifies only 65%). However, the problem is easily represented as a decision tree.

54 Artificial Neurons: linear threshold versus sigmoidal activation functions.
Network types: Feed-forward (connections only in one direction) versus Recurrent.
Feed-forward networks: Perceptrons (one-layer networks) versus Multi-layer Networks.
Perceptrons: cannot express non-linearly-separable functions. Learning is done by adjusting weights.
Perceptron Training Rule: $w_i \leftarrow w_i + \Delta w_i$ where $\Delta w_i = \eta\,(t - o)\, a_i$.
Gradient descent learning attempts to reduce the squared error by calculating the partial derivative of the squared error of the network with respect to each weight.
Gradient descent weight update rule: $w_k \leftarrow w_k + \eta\,(tn_i - g(\mathbf{a}_i \cdot \mathbf{w}_i))\, g'(\mathbf{a}_i \cdot \mathbf{w}_i)\, a_k$
