Wes, Delaram, and Emily. MA751.

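Exercise 4.5

Consider a two-class logistic regression problem with $x \in \mathbb{R}$. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample $x_i$ for the two classes are separated by a point $x_0 \in \mathbb{R}$. Generalize this result to (a) $x \in \mathbb{R}^p$ and (b) more than two classes.

Solution: Without loss of generality, suppose that $x_0 = 0$ and that the coding is $y_i = 1$ for $x_i > 0$ and $y_i = 0$ for $x_i < 0$. Now, suppose that

\[ p(x;\beta) = \frac{\exp\{\beta x + \beta_0\}}{1 + \exp\{\beta x + \beta_0\}}, \qquad 1 - p(x;\beta) = \frac{1}{1 + \exp\{\beta x + \beta_0\}}. \]

Since $x_0 = 0$ is the boundary, $p(x_0;\beta) = 1 - p(x_0;\beta)$, and so $\beta_0 = 0$. Therefore,

\[ p(x;\beta) = \frac{\exp\{\beta x\}}{1 + \exp\{\beta x\}}, \qquad 1 - p(x;\beta) = \frac{1}{1 + \exp\{\beta x\}}. \]

Therefore, the likelihood function is

\[ L(\beta; y, x) = \prod_{i=1}^{N} p(x_i;\beta)^{y_i}\,[1 - p(x_i;\beta)]^{1-y_i} = \prod_{i=1}^{N} \left[\frac{p(x_i;\beta)}{1 - p(x_i;\beta)}\right]^{y_i} [1 - p(x_i;\beta)], \]

and the log-likelihood function is

\[ l(\beta; y, x) = \sum_{i=1}^{N} \Big( y_i[\beta x_i] - \log[1 + \exp\{\beta x_i\}] \Big). \]

Taking the derivative with respect to $\beta$ and substituting in the proper coding of $y_i$ gives

\[ \frac{dl(\beta; x, y)}{d\beta} = \sum_{i=1}^{N} \left( y_i x_i - \frac{x_i \exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}} \right) = \sum_{x_i > 0} x_i \left( 1 - \frac{\exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}} \right) - \sum_{x_i < 0} \frac{x_i \exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}} = \sum_{x_i > 0} \frac{x_i}{1 + \exp\{\beta x_i\}} - \sum_{x_i < 0} \frac{x_i \exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}}. \]

Setting the above equal to zero gives

\[ \sum_{i=1}^{N} \frac{|x_i|}{1 + \exp\{\beta |x_i|\}} = 0. \]

Every term on the left is strictly positive for finite $\beta$. Clearly, for any data set $\{x_i\}_{i=1}^{N}$ we must have that $\beta \to \infty$ for the above equality to hold.

(b) Now, suppose that there are $K$ classes such that $x_1$ separates classes one and two, $x_2$ separates classes two and three, and so on to $x_{K-1}$ that separates classes $K-1$ and $K$, with $x_0 < x_1 < x_2 < \dots < x_{K-1} < x_K$. Now, define probabilities

\[ p_1(x;\beta) = \frac{\exp\{\beta_1 x + \beta_{01}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}}, \quad p_2(x;\beta) = \frac{\exp\{\beta_2 x + \beta_{02}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}}, \quad \dots, \quad p_K(x;\beta) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}}. \]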
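Before continuing with (b), the two-class conclusion can be checked numerically. The following is a minimal sketch, not part of the original solution: the sample `xs`, the grid `betas`, and the function name `score` are hypothetical choices for illustration. It evaluates the derivative of the log-likelihood for data separated at $x_0 = 0$ and shows that it stays positive while shrinking toward zero as $\beta$ grows.

```python
import math

def score(beta, xs):
    """Derivative of the log-likelihood for 1-D data separated at x_0 = 0.

    Coding assumed as in the solution: y = 1 for x > 0, y = 0 for x < 0,
    with beta_0 = 0.
    """
    total = 0.0
    for x in xs:
        y = 1.0 if x > 0 else 0.0
        p = 1.0 / (1.0 + math.exp(-beta * x))  # p(x; beta)
        total += x * (y - p)
    return total

# Hypothetical sample separated by x_0 = 0 (illustration only).
xs = [-2.0, -1.0, -0.5, 0.5, 1.5, 3.0]
betas = [-1.0, 0.0, 1.0, 5.0, 20.0]
scores = [score(b, xs) for b in betas]

for b, s in zip(betas, scores):
    print(f"beta = {b:6.1f}   score = {s:.3e}")
```

Because the score never reaches zero at a finite $\beta$, any ascent method keeps increasing $\beta$, which is the numerical face of $\beta \to \infty$.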

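Now, suppose that the coding is $y_{ij} = 1$ if $x_{j-1} < x_i < x_j$ and $y_{ij} = 0$ otherwise, for observation $i = 1, \dots, N$ and class $j = 1, \dots, K$. Therefore, the likelihood function is

\[ L(\beta; y, x) = \prod_{j=1}^{K} \prod_{i=1}^{N_j} [p_j(x_i;\beta)]^{y_{ij}}, \]

where $N_j$ is the number of observations in class $j$, and the log-likelihood function is

\[ l(\beta; y, x) = \sum_{j=1}^{K-1} \sum_{i=1}^{N_j} y_{ij} \log\left[ \frac{\exp\{\beta_j x_i + \beta_{0j}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}} \right] + \sum_{i=1}^{N_K} y_{iK} \log\left[ \frac{1}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}} \right] \]

\[ = \sum_{j=1}^{K-1} \sum_{i=1}^{N_j} y_{ij} [\beta_j x_i + \beta_{0j}] - \sum_{j=1}^{K} \sum_{i=1}^{N_j} y_{ij} \log\left[ 1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\} \right]. \]

Now, we determine the values of $\beta_{0j}$. First, note that $\beta_{0j}$ is a function of $\beta_j$, $x_{j-1}$, and $x_j$. So that the expression $p(x;\beta)$ maintains proper form, for $x_{j-1} < x < x_j$ we define

\[ p(x;\beta_j) = \frac{\exp\{\beta_j (x - x_{j-1})\} - \exp\{\beta_j (x - x_j)\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}} = \frac{\exp\{\beta_j x\} \left[ \exp\{-\beta_j x_{j-1}\} - \exp\{-\beta_j x_j\} \right]}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}} = \frac{\exp\{\beta_j x + \beta_{0j}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}}, \]

where

\[ \beta_{0j} = \log\left[ \exp\{-\beta_j x_{j-1}\} - \exp\{-\beta_j x_j\} \right]. \]

The reason for the beginning step of the formulation above is that, for example, when $x \in (x_1, x_2)$, so that $x$ classifies to class two, the probability function appears as in the following figure.

[Figure: the probability function $p(x;\beta_2)$ for $x \in (x_1, x_2)$. The values assumed for $x_1$ and $x_2$ are not recoverable.]

Now, taking the derivative with respect to $\beta = (\beta_1, \dots, \beta_K)$ and substituting in the proper coding of $y_{ij}$ gives

\[ \frac{\partial l(\beta; x, y)}{\partial \beta_j} = \sum_{i=1}^{N_j} \left[ x_i + c_j \right] - \sum_{i=1}^{N} \left[ x_i + c_j \right] \frac{\exp\{\beta_j x_i + \beta_{0j}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}}, \qquad c_j = \frac{\partial \beta_{0j}}{\partial \beta_j} = \frac{x_j \exp\{-\beta_j x_j\} - x_{j-1} \exp\{-\beta_j x_{j-1}\}}{\exp\{-\beta_j x_{j-1}\} - \exp\{-\beta_j x_j\}}. \]

Note that the $c_j$ term in the above is a constant in the sum over $i = 1, \dots, N_j$. Therefore, setting the above equal to zero for each $j = 1, \dots, K$ and solving for $\beta_j$ gives the maximum likelihood estimators in a similar fashion to the two-class case, namely $\beta_j \to \infty$.

(a) Now, suppose that there are two classes in which $x \in \mathbb{R}^p$. Suppose that $x_1$ and $x_2$ are two vectors that lie in the separating hyperplane. Then, we have that $\beta^T (x_1 - x_2) = 0$. Now that we are back in the two-class case,

\[ p(x;\beta) = \frac{\exp\{\beta^T x + \beta_0\}}{1 + \exp\{\beta^T x + \beta_0\}}, \]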

\[ 1 - p(x;\beta) = \frac{1}{1 + \exp\{\beta^T x + \beta_0\}}. \]

Note that if $x_0$ lies in the separating hyperplane then $\beta_0 = -\beta^T x_0$. Therefore,

\[ p(x;\beta) = \frac{\exp\{\beta^T (x - x_0)\}}{1 + \exp\{\beta^T (x - x_0)\}}, \qquad 1 - p(x;\beta) = \frac{1}{1 + \exp\{\beta^T (x - x_0)\}}. \]

Finally, note that at this point the situation is analogous to the univariate case: after taking derivatives of the log-likelihood function with respect to $\beta = (\beta_1, \dots, \beta_p)$ and setting them equal to zero, the score functions reduce in a similar manner, and it follows that the maximum likelihood estimator is such that $\|\beta\| \to \infty$.
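Both generalizations can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original solution. For (b) it uses a standard softmax with class 1 as the reference and a hand-picked separating parametrization scaled by `t`, which is a swapped-in choice and not the $\beta_{0j}$ construction above. For (a) it runs plain gradient ascent on $p(x;\beta) = \exp\{\beta^T(x - x_0)\}/(1 + \exp\{\beta^T(x - x_0)\})$ with $x_0 = 0$. All data, the step size `lr`, the checkpoints, and the function names are hypothetical.

```python
import math

# (b) Three classes on the line, separated by x1 < x2. Assumed logits, with
# class 1 as the reference: f1 = 0, f2 = t*(x - x1), f3 = t*(2x - x1 - x2).
def multiclass_loglik(t, xs, labels, x1, x2):
    total = 0.0
    for x, k in zip(xs, labels):
        f = [0.0, t * (x - x1), t * (2.0 * x - x1 - x2)]
        m = max(f)  # subtract the largest logit for numerical stability
        total += f[k] - m - math.log(sum(math.exp(v - m) for v in f))
    return total

xs = [-1.5, -0.5, 0.5, 1.5, 2.5, 4.0]   # hypothetical sample
labels = [0, 0, 1, 1, 2, 2]             # 0-based class index
logliks = [multiclass_loglik(t, xs, labels, 0.0, 2.0)
           for t in (0.5, 1.0, 5.0, 25.0)]
print("multiclass log-likelihood along t:", logliks)

# (a) Two classes in R^2, gradient ascent on the log-likelihood.
def ascent_trace(X, y, x0, checkpoints, lr=0.1):
    beta = [0.0, 0.0]
    norms = []
    for step in range(1, max(checkpoints) + 1):
        grad = [0.0, 0.0]
        for xi, yi in zip(X, y):
            d = [xi[0] - x0[0], xi[1] - x0[1]]
            p = 1.0 / (1.0 + math.exp(-(beta[0] * d[0] + beta[1] * d[1])))
            grad[0] += (yi - p) * d[0]
            grad[1] += (yi - p) * d[1]
        beta = [beta[0] + lr * grad[0], beta[1] + lr * grad[1]]
        if step in checkpoints:
            norms.append(math.hypot(beta[0], beta[1]))
    return norms

X = [(1, 2), (2, 1), (3, 3), (-1, -2), (-2, -1), (-3, -1)]  # hypothetical
y = [1, 1, 1, 0, 0, 0]
norms = ascent_trace(X, y, x0=(0.0, 0.0), checkpoints=(10, 100, 1000, 5000))
print("||beta|| at the checkpoints:", norms)
```

In the first part the log-likelihood rises toward its supremum of zero as the scale `t` grows, and in the second part the norm of the iterate keeps growing with the number of steps, so neither problem has a finite maximizer.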

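Problem 4.6

February 8, 200

Part A

Given $f(x) = \beta^T x + \beta_0 = 0$, write $f(x) = \beta^{*T} x^*$, where $x^* = [x^T \; 1]^T$ and $\beta^* = [\beta^T \; \beta_0]^T$. In what follows $\beta$ denotes the augmented vector. Because the data is separable, there is a $\beta_{sep}$ with

\[ y_i \beta_{sep}^T z_i \ge M \quad \forall x_i, \]

where $z_i = x_i^* / \|x_i^*\|$ and $M > 0$. We can then choose a separating hyperplane $\beta = \beta_{sep} / M$ such that

\[ y_i \beta^T z_i \ge 1 \quad \forall i, \]

and therefore, renaming this rescaled vector $\beta_{sep}$, we have $y_i \beta_{sep}^T z_i \ge 1$ for all $i$.

Part B

Given $\beta_{new} = \beta_{old} + y_i z_i$, then

\[ \|\beta_{old} - \beta_{sep}\|^2 = \|\beta_{new} - \beta_{sep} - y_i z_i\|^2 \]

\[ = \|\beta_{new} - \beta_{sep}\|^2 - \beta_{new}^T y_i z_i + \beta_{sep}^T y_i z_i - y_i z_i^T \beta_{new} + y_i z_i^T \beta_{sep} + \|y_i z_i\|^2 \]

\[ = \|\beta_{new} - \beta_{sep}\|^2 - \beta_{new}^T y_i z_i + \beta_{sep}^T y_i z_i - y_i z_i^T \beta_{new} + y_i z_i^T \beta_{sep} + 1, \]

since $\|z_i\| = 1$ and $y_i = \pm 1$. From Part A, $y_i \beta_{sep}^T z_i \ge 1$, and because of the iteration update, $y_i \beta_{new}^T z_i = y_i \beta_{old}^T z_i + \|z_i\|^2 = y_i \beta_{old}^T z_i + 1$.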

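Because the update is applied only to a misclassified point, $y_i \beta_{old}^T z_i \le 0$, so

\[ y_i \beta_{new}^T z_i = y_i \beta_{old}^T z_i + 1 \le 1 \le y_i \beta_{sep}^T z_i \]

must be true, so therefore

\[ y_i \beta_{sep}^T z_i - y_i \beta_{new}^T z_i \ge 0, \]

and, from the expansion of $\|\beta_{old} - \beta_{sep}\|^2$,

\[ \|\beta_{old} - \beta_{sep}\|^2 \ge \|\beta_{new} - \beta_{sep}\|^2 + 1 \]

is true. Each update therefore reduces the squared distance to $\beta_{sep}$ by at least 1, and the perceptron algorithm must converge in at most $\|\beta_{old} - \beta_{sep}\|^2$ steps, where $\beta_{old}$ is taken to be the starting value.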
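The bound can be illustrated with a small run of the update rule. This is a minimal sketch under stated assumptions: the data `X`, `ys`, the known separator `w`, the starting value zero, and the helper names `normalize` and `perceptron` are all hypothetical and chosen only for illustration. It counts the updates $\beta_{new} = \beta_{old} + y_i z_i$ and compares the count with $\|\beta_{start} - \beta_{sep}\|^2$.

```python
import math

def normalize(x):
    """Map x to z = x*/||x*|| with x* = (x, 1)."""
    x_star = list(x) + [1.0]
    norm = math.sqrt(sum(v * v for v in x_star))
    return [v / norm for v in x_star]

def perceptron(zs, ys, max_updates=1000):
    """Apply beta <- beta + y_i z_i until no point has y_i beta^T z_i <= 0."""
    beta = [0.0] * len(zs[0])  # starting value, assumed to be zero
    updates = 0
    while updates < max_updates:
        bad = [i for i, (z, y) in enumerate(zip(zs, ys))
               if y * sum(b * v for b, v in zip(beta, z)) <= 0]
        if not bad:
            break
        i = bad[0]  # update on the first misclassified point
        beta = [b + ys[i] * v for b, v in zip(beta, zs[i])]
        updates += 1
    return beta, updates

# Hypothetical separable sample with labels in {-1, +1}.
X = [(2, 1), (1, 3), (3, 3), (-1, -2), (-2, -1), (-3, -2)]
ys = [1, 1, 1, -1, -1, -1]
zs = [normalize(x) for x in X]

# A known separator (assumed), rescaled as in Part A so that
# min_i y_i beta_sep^T z_i = 1.
w = [1.0, 1.0, 0.0]
M = min(y * sum(a * v for a, v in zip(w, z)) for z, y in zip(zs, ys))
beta_sep = [a / M for a in w]

# ||beta_start - beta_sep||^2 with beta_start = 0.
bound = sum(a * a for a in beta_sep)

beta, updates = perceptron(zs, ys)
print("updates:", updates, " bound:", bound)
```

Any valid $\beta_{sep}$ gives a bound, so a separator with a larger margin $M$ yields a tighter guarantee on the number of updates.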
