Neural Networks and Learning Systems


Exercise Collection, Class 9, March 2010

[Cover figure: a two-layer network with inputs x_1, x_2, ..., x_N, first-layer weights w_11, ..., w_NN, hidden units h_1, ..., h_N, second-layer weights W_11, ..., W_NN, and output y.]

(c) Medical Informatics, IMT, LiTH

Contents

Exercises
  1 Genetic Algorithms
Solutions
Formulary
  1 Activation functions
  2 Cost functions
  3 Learning rules
  4 Probability functions
  5 Miscellaneous

Exercises

1. Genetic Algorithms

1.1. (Crossover and Mutation) We start off with a number of definitions in order to make the further calculations easier. Define the order of a schema S as the number of fixed positions, o(S). The distance between the first and the last fixed position in the schema is denoted δ(S).

a) Assume a crossover between two strings of length l takes place by means of a two-step process. First a random position k is drawn from a rectangular distribution on the interval {1, ..., l - 1}. Then the strings swap the parts between and including positions k + 1 and l with each other. Derive a lower bound for the probability p_s that a schema survives a crossover in strings of length l, given the probability p_k of the crossover itself. (A code sketch of these two operators follows the exercise list.)

b) Let us also consider the possibility of mutation. The probability that a given position is affected is assumed to be p_m. What is the lower bound for the survivability of a schema now?

1.2. (The Schema Theorem) Show the Schema Theorem, i.e. that the number of copies of a schema S in a population will increase or decrease exponentially with respect to the relative fitness of the schema. Disregard crossover and mutation effects.

1.3. (The survival of the fittest) A population contains strings with the following corresponding fitness values:

No.   String   Fitness

The probability of mutation is p_m = 0.01 and the probability of crossover is p_k = 1.0. Calculate the expected number of schemata matching S_1 = 1··· and S_2 = 0···1, respectively, in the next generation. Comments?

1.4. (Live and let die) Let us in this exercise ignore the possibility of a schema being destroyed by crossover or mutation.

a) A schema S_1 with one representative in the first generation has a 25% larger fitness value than the average in the population of 100 individuals. After how many generations will this schema appear in every individual of the population?

b) A schema S_2 appearing in 60 of the 100 individuals in the first generation has a 10% lower fitness value than the average. After how many generations will this schema be extinct?

1.5. (*) (Two- and k-armed bandits) In the case of the two-armed bandit, where one arm pays out m_1 on average with variance s_1^2 while the second pays out m_2 on average with variance s_2^2, one can use the following tactic. We have N pulls at our disposal. Of these, we use 2n < N to pull each arm n times, and the remaining N - 2n to pull the arm estimated to be the best. The expected loss if we use this tactic is given by:

L(N, n) = |m_1 - m_2| \, (n + p(n)(N - 2n)),

where p(n) denotes the probability that, after the initial 2n pulls, we choose the wrong arm to pull for the remaining N - 2n pulls. Now, p(n) can be approximated by the tail of a normal distribution:

p(n) \approx \frac{e^{-x^2/2}}{\sqrt{2\pi}\, x}, \qquad x = \frac{m_1 - m_2}{\sqrt{s_1^2 + s_2^2}} \sqrt{n}.

a) If we follow a policy minimizing the loss L, how much more frequently should we pull the arm estimated to be the best compared to the arm we estimate to be the worst?

b) Assume that the optimal relation between the best arm and the other arms derived in part a) still applies in the case of the k-armed bandit. What parallel can you then draw to the behavior of genetic algorithms?
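Before turning to the solutions, the sketch promised in exercise 1.1: a minimal Python implementation of the two operators on bit strings. The function names, defaults and the use of Python are our own choices, not part of the original collection.

```python
import random

def crossover(p1, p2, p_k=1.0):
    """Two-step crossover from exercise 1.1a: with probability p_k, draw the
    cut point k uniformly from {1, ..., l-1}, then swap the parts between and
    including positions k+1 and l (positions are 1-indexed as in the text)."""
    l = len(p1)
    if random.random() >= p_k:
        return p1, p2                        # no crossover this time
    k = random.randint(1, l - 1)
    return p1[:k] + p2[k:], p2[:k] + p1[k:]

def mutate(s, p_m=0.01):
    """Flip each position independently with probability p_m (exercise 1.1b)."""
    return ''.join(('1' if b == '0' else '0') if random.random() < p_m else b
                   for b in s)

print(crossover('11111', '00000'))   # e.g. ('11000', '00111') for k = 2
```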

Solutions

Answer 1.1

a) The probability of such a crossover destroying a schema is given by the probability of the randomized position ending up somewhere between the fixed positions of the schema. The number of such positions is given by the length of the schema, δ(S). Since the crossover position is drawn from a rectangular distribution over the possible sites in the interval {1, ..., l - 1}, the probability of survival is

p_s = 1 - \frac{\delta(S)}{l - 1}.

If p_k is the probability of applying the crossover operator, the lower bound for a schema to survive a crossover becomes

p_s \ge 1 - p_k \frac{\delta(S)}{l - 1}.

This is a lower bound because the schema might live on in another individual of the population, which we have not taken into consideration.

b) In order for the schema to avoid damage from mutation, all fixed positions in the schema must come through. The number of fixed positions is given by the order of the schema, o(S). The probability of surviving mutation is then

(1 - p_m)^{o(S)} \approx 1 - o(S)\, p_m,

where the approximation applies when p_m \ll 1. Neglecting the second-order terms, the total lower bound for surviving both crossover and mutation becomes

p_s \ge 1 - p_k \frac{\delta(S)}{l - 1} - o(S)\, p_m.

Answer 1.2

At reproduction an individual is chosen with its relative fitness as probability, f_i / \sum_j f_j. A schema is therefore chosen with probability f(S) / \sum_i f_i, where f(S) is the mean fitness of all individuals in the population carrying the schema. Looking at the expected number of representatives of a schema S in the next generation, given the number m(S, t) in the current generation, we get

m(S, t+1) = m(S, t) \, \frac{f(S)}{\sum_i f_i} \, n,

because the population has size n and we consequently draw n random samples. We can rewrite this expression with the help of f_{ave}, the mean fitness of the entire population:

m(S, t+1) = m(S, t) \, \frac{f(S)}{f_{ave}}.

Now, if a schema on average has a fitness c f_{ave} greater than the population average, it will grow according to the recursive expression

m(S, t+1) = m(S, t) \, \frac{f_{ave} + c f_{ave}}{f_{ave}} = m(S, t)(1 + c) \quad \Rightarrow \quad m(S, t) = m(S, 0)(1 + c)^t,

i.e. the genetic algorithm leads to an exponential growth of such a schema. With the same line of reasoning we see that schemata with less fitness than average will die off from the population according to the same exponential function.
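As a quick sanity check on Answer 1.1, the lower bound is easy to tabulate; here is a minimal sketch, where `matches` tests schema membership and the example values (δ(S) = 2, o(S) = 2, l = 5) are chosen by us purely for illustration.

```python
def matches(s, schema):
    """A string carries a schema if it agrees on every fixed (non-'*') position."""
    return all(c == '*' or c == b for c, b in zip(schema, s))

def survival_bound(delta, o, l, p_k, p_m):
    """Lower bound from Answer 1.1: p_s >= 1 - p_k*delta/(l-1) - o*p_m."""
    return 1.0 - p_k * delta / (l - 1) - o * p_m

print(matches('01101', '0**01'))                              # True
print(survival_bound(delta=2, o=2, l=5, p_k=1.0, p_m=0.01))   # 0.48
```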

Answer 1.3

Combining the Schema Theorem with the survivability calculations from exercise 1.1, the expected number of representatives of a schema S in the next generation is

m(S, t+1) \ge m(S, t) \, \frac{f(S)}{f_{ave}} \left( 1 - \frac{\delta(S)}{l - 1} p_k - o(S)\, p_m \right).

Some of the parameters are the same for all schemata: p_k = 1.0, p_m = 0.01, l = 5 and f_{ave} = 12.5. For the remaining parameters we can set up a table:

Schema   f(S)   δ(S)   o(S)   m(S, t)

Inserting into the recursion expression gives the expected numbers m(S_1, t+1) and m(S_2, t+1) of the two schemata. We see that schema number two will be reduced drastically due to its length, its low fitness and its many fixed positions. The opposite applies to schema number one: it is not affected by crossover, it has a low probability of being affected by mutation, and it has a fitness value larger than the average.

Answer 1.4

a) Again we use the Schema Theorem, and we say that schema S_1 has taken over the population when more than 99.5% of its individuals carry the schema. According to the exercise we start with one individual carrying S_1, i.e. m(S_1, 0) = 1. In addition we know that this schema is 25% better than average, which gives us c = 0.25. Inserting these numbers gives

m(S_1, t) = m(S_1, 0)(1 + c)^t: \quad 99.5 < 1 \cdot (1.25)^t \;\Rightarrow\; t > \frac{\ln 99.5}{\ln 1.25} \approx 20.6,

i.e. the expected number of generations before all individuals in the population carry this schema is 21.

b) Since a bad schema decreases exponentially, we hold that S_2 is extinct when less than 0.5% of the individuals of the population carry it. According to the exercise we start with 60 individuals carrying this schema, i.e. m(S_2, 0) = 60. In addition we know that this schema is 10% worse than average, which gives us c = -0.1. Inserting these numbers gives

0.5 > 60 \cdot (0.9)^t \;\Rightarrow\; t > \frac{\ln(0.5/60)}{\ln 0.9} \approx 45.4,

i.e. the expected number of generations before no individual in the population carries this schema is 46.
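The generation counts in Answer 1.4 follow directly from the growth law m(S, t) = m(S, 0)(1 + c)^t, so they are easy to reproduce in a few lines of Python; the function below is our own wrapper around that formula.

```python
import math

def generations(m0, c, threshold):
    """Smallest integer t for which m0*(1+c)**t crosses the threshold,
    using the exponential law m(S,t) = m(S,0)*(1+c)**t from Answer 1.2."""
    return math.ceil(math.log(threshold / m0) / math.log(1.0 + c))

print(generations(1, 0.25, 99.5))    # takeover, Answer 1.4a: 21
print(generations(60, -0.10, 0.5))   # extinction, Answer 1.4b: 46
```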

Answer 1.5

a) Differentiate the expression and set the derivative to zero; then examine how the number of pulls of the arm estimated to be the best, N - n, depends on the number of pulls, n, of the arm estimated to be the worst. Since we can disregard the constant factor |m_1 - m_2|, we can instead differentiate the function

L(N, n) = n + p(n)(N - 2n) = (N - n)\, p(n) + n\, (1 - p(n)),

\frac{dL}{dn} = \frac{dp}{dn}(N - n) - p(n) + 1 - p(n) - n \frac{dp}{dn} = 0 \;\Rightarrow\; N - n = \Bigl( 2p(n) + n \frac{dp}{dn} - 1 \Bigr) \Big/ \frac{dp}{dn}.

We now let x^2 = a n, with a = (m_1 - m_2)^2 / (s_1^2 + s_2^2), which results in the following expressions for the density function and its derivative:

p(n) = \frac{1}{\sqrt{2\pi a n}} e^{-an/2} \qquad \text{and} \qquad \frac{dp}{dn} = -\frac{1 + an}{2n \sqrt{2\pi a n}} e^{-an/2}.

If we insert this into the expression for N - n we get

N - n = -\frac{4n}{1 + an} + n + \frac{2n \sqrt{2\pi a n}}{1 + an} e^{an/2} \approx -\frac{4}{a} + n + \sqrt{\frac{8\pi n}{a}} e^{an/2} \approx \sqrt{\frac{8\pi n}{a}} e^{an/2},

where the second-to-last step follows from the assumption an \gg 1 and the last step from the fact that the exponential function dominates both the constant and the linear term. The conclusion is that the number of times we should pull the arm we believe is the best is an exponential function of the number of times we pulled the arm we believe is the worst.

b) In the case of the k-armed bandit you should therefore pull exponentially more often on the arm you believe is best than on any of the other arms. This is exactly what a genetic algorithm achieves applied to k competing schemata: the best schema will grow exponentially compared with its competitors.
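The asymptotic result in part a) can also be checked numerically: minimize the rescaled loss n + p(n)(N - 2n) by brute force and compare N - n* with the approximation above. The concrete values of N and a below are arbitrary illustrations, not from the exercise.

```python
import math

def loss(N, n, a):
    """Rescaled loss n + p(n)*(N - 2n) with the normal-tail approximation
    p(n) = exp(-a*n/2) / sqrt(2*pi*a*n), where a = (m1-m2)^2/(s1^2+s2^2)."""
    p = math.exp(-a * n / 2.0) / math.sqrt(2.0 * math.pi * a * n)
    return n + p * (N - 2 * n)

N, a = 100_000, 0.5
n_star = min(range(1, N // 2), key=lambda n: loss(N, n, a))
approx = math.sqrt(8.0 * math.pi * n_star / a) * math.exp(a * n_star / 2.0)
print(n_star, N - n_star, approx)   # N - n_star and approx agree in magnitude
```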

Formulary

1. Activation functions

The Signum function

y = \mathrm{sign}(h) = \begin{cases} 1 & h \ge 0 \\ -1 & h < 0 \end{cases}    (1)

The Fermi function

y = \frac{1}{1 + e^{-h}}, \qquad y' = y(1 - y)    (2)

The Hyperbolic Tangent function

y = \tanh(h), \qquad y' = 1 - y^2    (3)

Stochastic activation function

y = \begin{cases} 1 & \text{with probability } P(h) \\ -1 & \text{with probability } 1 - P(h) \end{cases}, \qquad P(h) = \frac{1}{1 + e^{-2\beta h}}    (4)

2. Cost functions

Mean square error

E = \frac{1}{2} E\{ \| d(x) - y(x) \|^2 \}    (5)

Square error sum, p examples

E = \frac{1}{2} \sum_{\mu=1}^{p} \| d^\mu - y^\mu \|^2    (6)

Relative entropy for the probability functions P_\alpha and Q_\alpha over states \alpha

E = \sum_\alpha P_\alpha \ln \frac{P_\alpha}{Q_\alpha}    (7)

Relative entropy, p examples, N classes

E = \sum_{\mu=1}^{p} \sum_{i=1}^{N} \left[ d_i^\mu \ln \frac{d_i^\mu}{y_i^\mu} + (1 - d_i^\mu) \ln \frac{1 - d_i^\mu}{1 - y_i^\mu} \right]    (8)

Regularization (complexity-reducing punishment functions)

E_c = \sum_i w_i^2    (9)

E_c = \sum_i \frac{(w_i / w_0)^2}{1 + (w_i / w_0)^2}    (10)

Clustering

E = \frac{1}{2} \sum_\mu \| x^\mu - w^\mu \|^2    (11)

Yuille's cost function

E = -\frac{1}{2} w^T C w + \frac{1}{4} \| w \|^4    (12)

Entropy for the probability distribution P_\alpha

E = H(\alpha) = E\{ -\ln P_\alpha \} = -\sum_\alpha P_\alpha \ln P_\alpha    (13)

Differential entropy for a continuous distribution

E = h(y) = E\{ -\ln p(y) \} = -\int p(y) \ln p(y) \, dy    (14)

Value function for the MDP

V^f(x(t)) = \sum_{i=0}^{\infty} \gamma^i \, r\bigl( x(t+i), f(x(t+i)) \bigr)    (15)

Q-function for the MDP

Q^f(x, y) = r(x, y) + \gamma V^f(x(y))    (16)
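Before moving on to the learning rules, here is how the activation functions (1)-(4) look in NumPy; a minimal sketch where the beta default and the RNG handling are our own additions (eq. (3) is simply np.tanh).

```python
import numpy as np

def signum(h):
    """Eq. (1): +1 for h >= 0, -1 otherwise."""
    return np.where(h >= 0, 1.0, -1.0)

def fermi(h):
    """Eq. (2): y = 1/(1 + exp(-h)); the derivative is y*(1 - y)."""
    return 1.0 / (1.0 + np.exp(-h))

def stochastic(h, beta=1.0, rng=None):
    """Eq. (4): +1 with probability P(h) = 1/(1 + exp(-2*beta*h)), else -1."""
    rng = rng or np.random.default_rng()
    P = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
    return np.where(rng.random(np.shape(h)) < P, 1.0, -1.0)
```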

3. Learning rules

The Outer Product rule

w_{ij} = \sum_{\mu=1}^{p} d_i^\mu d_j^\mu, \quad \text{i.e.} \quad W = \sum_{\mu=1}^{p} d^\mu d^{\mu T}    (17)

The Perceptron rule

\Delta w(t) = \eta \, [d(t) - y(t)] \, x(t)    (18)

The LMS rule (online)

\Delta w(t) = \eta \, [d(t) - y(t)] \, x(t) = \eta \, \delta(t) \, x(t)    (19)

Backpropagation (batch) on the cost function (6)

\Delta w_{\alpha\beta} = \eta \sum_{\mu=1}^{p} \delta_\alpha^\mu V_\beta^\mu = \eta \sum_{\mu=1}^{p} \Bigl[ \sigma' \sum_\gamma W_{\gamma\alpha} \delta_\gamma^\mu \Bigr] V_\beta^\mu    (20)

The Boltzmann machine, auto-association

\Delta w_{ij} = \eta \, [ \langle S_i S_j \rangle_{\mathrm{locked}} - \langle S_i S_j \rangle_{\mathrm{free}} ]    (21)

The Boltzmann machine, hetero-association

\Delta w_{ij} = \eta \, [ \langle S_i S_j \rangle_{I,O\ \mathrm{locked}} - \langle S_i S_j \rangle_{I\ \mathrm{locked}} ]    (22)

Clustering (online)

\Delta w_\ast = \eta \, (x - w_\ast)    (23)

Kohonen's rule

\Delta w_i = \eta \, h(i, i_\ast)(x - w_i)    (24)

Oja's rule

\Delta w = \eta \, y \, [x - y \, w]    (25)

Sanger's rule

\Delta w_k = \eta \, y_k \Bigl[ x - \sum_{i=1}^{k} y_i w_i \Bigr]    (26)

Yuille's rule

\Delta w = \eta \, (y \, x - \| w \|^2 w)    (27)

Bell-Sejnowski's entropy maximization rule with the activation function (2)

\Delta W = \eta \, \bigl( [W^T]^{-1} + (1 - 2y) \, x^T \bigr), \qquad \Delta w_0 = \eta \, (1 - 2y)    (28)

Q-learning

\Delta Q = \alpha \, [ r(x, y) + \gamma \, Q^f(x(y), f(x(y))) - Q^f(x, y) ]    (29)

TD(\lambda) rule

\Delta w = \alpha \, [ r(x, f(x)) + \gamma V^f(x(f(x))) - V^f(x) ] \sum_{i=1}^{k} \lambda^{k-i} \, \nabla_w V^f(x(i))    (30)
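Two of the rules above are compact enough to demonstrate directly. The sketch below runs the online LMS rule (19) on a noiseless linear teacher, and Oja's rule (25) on zero-mean data, where the weight vector converges toward the leading eigenvector of the data covariance; learning rates, epoch counts and the synthetic data are illustrative assumptions of ours.

```python
import numpy as np

def lms_pass(w, X, d, eta=0.01):
    """One online pass of the LMS rule, eq. (19): w += eta*(d - y)*x, y = w.x."""
    for x, target in zip(X, d):
        w = w + eta * (target - w @ x) * x
    return w

def oja(X, eta=0.01, epochs=100, seed=0):
    """Oja's rule, eq. (25): dw = eta*y*(x - y*w) with y = w.x."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w = w + eta * y * (x - y * w)
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
d = X @ np.array([1.0, -2.0, 0.5])            # linear teacher weights
w = np.zeros(3)
for _ in range(20):
    w = lms_pass(w, X, d)
print(w)                                      # close to [1, -2, 0.5]

Y = rng.standard_normal((500, 2)) * np.array([3.0, 1.0])   # variances 9 and 1
print(oja(Y))                                 # roughly +/- [1, 0]
```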

4. Probability functions

N-dimensional normal distribution

p(x) = \frac{1}{(2\pi)^{N/2} \sqrt{\det C}} \exp\Bigl[ -\frac{1}{2} (x - m)^T C^{-1} (x - m) \Bigr]    (31)

Boltzmann-Gibbs distribution of states \alpha with energies E_\alpha

P_\alpha = \frac{1}{Z} \exp\Bigl[ -\frac{E_\alpha}{T} \Bigr], \qquad Z = \sum_\beta \exp\Bigl[ -\frac{E_\beta}{T} \Bigr]    (32)

Markov process of first order

a_{ij} = P( x_j(t+1) \mid x_i(t) )    (33)
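The density (31) and the distribution (32) translate directly into code; a small sketch follows (using np.linalg.solve instead of forming C^{-1} explicitly is a numerical design choice of ours, not of the formulary).

```python
import numpy as np

def normal_pdf(x, m, C):
    """N-dimensional normal density, eq. (31)."""
    N = len(m)
    diff = x - m
    norm = (2.0 * np.pi) ** (N / 2.0) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * diff @ np.linalg.solve(C, diff)) / norm

def boltzmann_gibbs(E, T):
    """Boltzmann-Gibbs distribution, eq. (32): P_a = exp(-E_a/T) / Z."""
    w = np.exp(-np.asarray(E, dtype=float) / T)
    return w / w.sum()

print(normal_pdf(np.zeros(2), np.zeros(2), np.eye(2)))   # 1/(2*pi) ~ 0.159
print(boltzmann_gibbs([0.0, 1.0, 2.0], T=1.0))           # decreasing weights
```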

5. Miscellaneous

Bellman's Equation of Optimality

V^f = \max_y \{ r(x, y) + \gamma V^f(x(y)) \}    (34)

The Schema Theorem

m(S, t+1) \ge m(S, t) \, \frac{f(S)}{\bar{f}} \left( 1 - p_k \frac{\delta(S)}{l - 1} - o(S)\, p_m \right)    (35)
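Bellman's equation (34) is what the Q-learning rule (29) solves by sampling. A tabular sketch: note that where eq. (29) writes Q^f(x(y), f(x(y))) we substitute the usual greedy max over actions, and the dictionary representation is our own choice.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step in the spirit of eq. (29):
    Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_b Q(s',b) - Q(s,a))."""
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q

Q = {}
q_update(Q, s=0, a='right', r=1.0, s_next=1, actions=['left', 'right'])
print(Q)   # {(0, 'right'): 0.1}
```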
