Logistic Regression. Jia Li. Department of Statistics, The Pennsylvania State University.


1 Logistic Regression. Jia Li. Department of Statistics, The Pennsylvania State University.

2 Logistic Regression. Preserve linear classification boundaries. By the Bayes rule: $\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$. The decision boundary between class $k$ and class $l$ is determined by the equation $\Pr(G = k \mid X = x) = \Pr(G = l \mid X = x)$. Dividing both sides by $\Pr(G = l \mid X = x)$ and taking the log, the above equation is equivalent to
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0.$$

3 Since we enforce a linear boundary, we can assume
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j.$$
For logistic regression, there are restrictive relations among the $a^{(k,l)}$ for different pairs $(k, l)$.

4 Assumptions:
$$\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x$$
$$\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x$$
$$\vdots$$
$$\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$$

5 For any pair $(k, l)$:
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x.$$
Number of parameters: $(K-1)(p+1)$. Denote the entire parameter set by $\theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\}$. The log-ratios of posterior probabilities are called log-odds or logit transformations.

6 Under the assumptions, the posterior probabilities are given by:
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, \ldots, K-1,$$
$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}.$$
For $\Pr(G = k \mid X = x)$ given above, the probabilities obviously sum to 1: $\sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1$. A simple calculation shows that the assumptions are satisfied.
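As a quick numerical check, here is a minimal Python sketch of these posteriors (the coefficient values below are made up for illustration):

```python
import numpy as np

def posteriors(x, betas):
    """Class posteriors under the logistic model.

    betas: (K-1) x (p+1) array; row l holds (beta_l0, beta_l^T).
    x: length-p input vector.
    Returns the length-K vector (Pr(G=1|x), ..., Pr(G=K|x)).
    """
    x1 = np.concatenate(([1.0], x))      # prepend 1 for the intercept
    logits = betas @ x1                  # beta_l0 + beta_l^T x, l = 1..K-1
    denom = 1.0 + np.exp(logits).sum()
    return np.concatenate((np.exp(logits), [1.0])) / denom

# Toy example with K = 3 classes and p = 2 inputs.
betas = np.array([[0.5, 1.0, -2.0],
                  [-0.3, 0.4, 0.7]])
p = posteriors(np.array([1.2, -0.5]), betas)
print(p, p.sum())   # the K posteriors sum to 1
```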

7 Comparison with LR on Indicators. Similarities: both attempt to estimate $\Pr(G = k \mid X = x)$; both have linear classification boundaries. Differences: linear regression on the indicator matrix approximates $\Pr(G = k \mid X = x)$ by a linear function of $x$, which is not guaranteed to fall between 0 and 1 or to sum to 1; in logistic regression, $\Pr(G = k \mid X = x)$ is a nonlinear function of $x$ that is guaranteed to range from 0 to 1 and to sum to 1.

8 Fitting Logistic Regression Models. Criterion: find the parameters that maximize the conditional likelihood of $G$ given $X$ using the training data. Denote $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$. Given the first input $x_1$, the posterior probability of its class being $g_1$ is $\Pr(G = g_1 \mid X = x_1)$. Since the samples in the training data set are independent, the posterior probability of the $N$ samples having classes $g_i$, $i = 1, 2, \ldots, N$, given their inputs $x_1, x_2, \ldots, x_N$, is
$$\prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i).$$

9 The conditional log-likelihood of the class labels in the training data set is
$$L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta).$$

10 Binary Classification. For binary classification, if $g_i = 1$, denote $y_i = 1$; if $g_i = 2$, denote $y_i = 0$. Let $p_1(x; \theta) = p(x; \theta)$; then $p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta)$. Since $K = 2$, the parameters are $\theta = \{\beta_{10}, \beta_1\}$, which we denote by $\beta = (\beta_{10}, \beta_1)^T$.

11 If $y_i = 1$, i.e., $g_i = 1$, then
$$\log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta).$$
If $y_i = 0$, i.e., $g_i = 2$, then
$$\log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)).$$
Since either $y_i = 0$ or $1 - y_i = 0$, we have
$$\log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)).$$

12 The conditional log-likelihood is
$$L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right].$$
There are $p + 1$ parameters in $\beta = (\beta_{10}, \beta_1)^T$. Assume a column vector form for $\beta$:
$$\beta = (\beta_{10}, \beta_{11}, \beta_{12}, \ldots, \beta_{1,p})^T.$$

13 Here we add the constant term 1 to $x$ to accommodate the intercept:
$$x = (1, x_1, x_2, \ldots, x_p)^T.$$

14 By the assumption of the logistic regression model:
$$p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)},$$
$$1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)}.$$
Substituting the above into $L(\beta)$:
$$L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right].$$

15 To maximize $L(\beta)$, we set the first-order partial derivatives of $L(\beta)$ to zero:
$$\frac{\partial L(\beta)}{\partial \beta_{1j}} = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij} = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta))$$
for all $j = 0, 1, \ldots, p$.
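The log-likelihood and its gradient translate directly into code. A minimal sketch, assuming a design matrix whose first column is all ones (function names are my own):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    eta = X @ beta                  # X: N x (p+1), first column all ones
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def gradient(beta, X, y):
    """dL/dbeta = sum_i x_i (y_i - p(x_i; beta)) = X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)
```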

16 In matrix form, we write
$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)).$$
To solve the set of $p + 1$ nonlinear equations $\frac{\partial L(\beta)}{\partial \beta_{1j}} = 0$, $j = 0, 1, \ldots, p$, use the Newton-Raphson algorithm. The Newton-Raphson algorithm requires the second derivatives, or Hessian matrix:
$$\frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)).$$

17 The element on the $j$th row and $n$th column is (counting from 0):
$$\frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}} = -\sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2} = -\sum_{i=1}^{N} \left( x_{ij} x_{in} \, p(x_i; \beta) - x_{ij} x_{in} \, p(x_i; \beta)^2 \right) = -\sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta)(1 - p(x_i; \beta)).$$

18 Starting with $\beta^{\text{old}}$, a single Newton-Raphson update is
$$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta},$$
where the derivatives are evaluated at $\beta^{\text{old}}$.

19 The iteration can be expressed compactly in matrix form. Let $y$ be the column vector of the $y_i$. Let $X$ be the $N \times (p+1)$ input matrix. Let $p$ be the $N$-vector of fitted probabilities with $i$th element $p(x_i; \beta^{\text{old}})$. Let $W$ be an $N \times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i; \beta^{\text{old}})(1 - p(x_i; \beta^{\text{old}}))$. Then
$$\frac{\partial L(\beta)}{\partial \beta} = X^T (y - p), \qquad \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -X^T W X.$$

20 The Newton-Raphson step is
$$\beta^{\text{new}} = \beta^{\text{old}} + (X^T W X)^{-1} X^T (y - p) = (X^T W X)^{-1} X^T W \left( X \beta^{\text{old}} + W^{-1}(y - p) \right) = (X^T W X)^{-1} X^T W z,$$
where $z \triangleq X \beta^{\text{old}} + W^{-1}(y - p)$. If $z$ is viewed as a response and $X$ as the input matrix, $\beta^{\text{new}}$ is the solution to a weighted least squares problem:
$$\beta^{\text{new}} \leftarrow \arg\min_\beta \, (z - X\beta)^T W (z - X\beta).$$
Recall that linear regression by least squares solves $\arg\min_\beta (z - X\beta)^T (z - X\beta)$. $z$ is referred to as the adjusted response, and the algorithm is referred to as iteratively reweighted least squares, or IRLS.

21 Pseudo Code
1. $\beta \leftarrow 0$.
2. Compute $y$ by setting its elements to $y_i = 1$ if $g_i = 1$ and $y_i = 0$ if $g_i = 2$, for $i = 1, 2, \ldots, N$.
3. Compute $p$ by setting its elements to $p(x_i; \beta) = \dfrac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$, $i = 1, 2, \ldots, N$.
4. Compute the diagonal matrix $W$; its $i$th diagonal element is $p(x_i; \beta)(1 - p(x_i; \beta))$, $i = 1, 2, \ldots, N$.
5. $z \leftarrow X\beta + W^{-1}(y - p)$.
6. $\beta \leftarrow (X^T W X)^{-1} X^T W z$.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
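A direct Python transcription of this pseudo code might look as follows (a sketch; the names and the change-in-$\beta$ stopping rule are my own). Note that it forms $W$ as a full $N \times N$ matrix, which motivates the efficiency discussion on the next slide:

```python
import numpy as np

def irls_logistic(X, g, tol=1e-8, max_iter=100):
    """Binary logistic regression fitted by IRLS.

    X: N x (p+1) design matrix with a leading column of ones.
    g: length-N array of class labels in {1, 2}.
    """
    y = (g == 1).astype(float)                        # step 2
    beta = np.zeros(X.shape[1])                       # step 1
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))         # step 3
        W = np.diag(p * (1.0 - p))                    # step 4: full N x N diagonal
        z = X @ beta + np.linalg.solve(W, y - p)      # step 5: adjusted response
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)  # step 6
        if np.max(np.abs(beta_new - beta)) < tol:     # step 7
            return beta_new
        beta = beta_new
    return beta
```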

22 Computational Efficiency. Since $W$ is an $N \times N$ diagonal matrix, direct matrix operations with it may be very inefficient. Modified pseudo code is provided next.

23
1. $\beta \leftarrow 0$.
2. Compute $y$ by setting its elements to $y_i = 1$ if $g_i = 1$ and $y_i = 0$ if $g_i = 2$, for $i = 1, 2, \ldots, N$.
3. Compute $p$ by setting its elements to $p(x_i; \beta) = \dfrac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$, $i = 1, 2, \ldots, N$.
4. Compute the $N \times (p+1)$ matrix $\tilde{X}$ by multiplying the $i$th row of $X$ by $p(x_i; \beta)(1 - p(x_i; \beta))$, $i = 1, 2, \ldots, N$:
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix}$$
5. $\beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p)$.
6. If the stopping criterion is met, stop; otherwise go back to step 3.
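In code, step 4 amounts to row scaling, so the $N \times N$ matrix $W$ never needs to be formed. A sketch of the update, under the same assumed names as above:

```python
import numpy as np

def efficient_update(beta, X, y):
    """One IRLS update without forming the N x N matrix W.

    Scales row i of X by w_i = p_i (1 - p_i); then X^T W X = X^T X_tilde.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))         # step 3
    X_tilde = (p * (1.0 - p))[:, None] * X        # step 4: row-scaled X
    return beta + np.linalg.solve(X.T @ X_tilde, X.T @ (y - p))  # step 5
```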

24 Example: Diabetes data set. The input $X$ is two dimensional: $X_1$ and $X_2$ are the two principal components of the original 8 variables. Class 1: without diabetes; Class 2: with diabetes. Applying logistic regression, we obtain $\hat{\beta} = (0.7679, \ \cdot\ , \ \cdot\ )^T$.

25 The posterior probabilities are:
$$\Pr(G = 1 \mid X = x) = \frac{e^{0.7679 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2}}{1 + e^{0.7679 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2}},$$
$$\Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2}}.$$
The classification rule is:
$$\hat{G}(x) = \begin{cases} 1 & 0.7679 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 \geq 0 \\ 2 & 0.7679 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 < 0 \end{cases}$$
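The rule is a sign test on the linear score. A small sketch (the slope values here are hypothetical, since only the intercept survives above):

```python
import numpy as np

# Intercept from the slide; the two slopes are placeholders for illustration.
beta_hat = np.array([0.7679, -0.5, -0.3])

def classify(x1, x2):
    """Return class 1 if the linear score is nonnegative, else class 2."""
    score = beta_hat @ np.array([1.0, x1, x2])
    return 1 if score >= 0 else 2
```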

26 Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA. Classification error rate within the training data set: 28.12%. Sensitivity: 45.9%. Specificity: 85.8%.

27 Multiclass Case ($K \geq 3$). When $K \geq 3$, $\beta$ is a $(K-1)(p+1)$-vector:
$$\beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}$$

28 Let $\bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix}$, with the convention $\bar{\beta}_K = 0$. The log-likelihood becomes
$$L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta) = \sum_{i=1}^{N} \log \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right].$$

29 Note: the indicator function $I(\cdot)$ equals 1 when its argument is true and 0 otherwise. First-order derivatives:
$$\frac{\partial L(\beta)}{\partial \beta_{kj}} = \sum_{i=1}^{N} \left[ I(g_i = k) \, x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right] = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right)$$

30 Second-order derivatives:
$$\frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}} = \sum_{i=1}^{N} x_{ij} \frac{1}{\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right)^2} \left[ -e^{\bar{\beta}_k^T x_i} I(k = m) \, x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right]$$
$$= \sum_{i=1}^{N} x_{ij} x_{in} \left( -p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) \, p_m(x_i; \beta) \right) = -\sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right].$$
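These per-class gradients vectorize naturally. A minimal sketch, assuming labels in $\{1, \ldots, K\}$ and a design matrix with a leading column of ones (names are my own):

```python
import numpy as np

def multiclass_grad(beta_bar, X, g, K):
    """Gradient of L for the multiclass logistic model.

    beta_bar: (K-1) x (p+1) array; row k holds (beta_k0, beta_k^T).
    Returns a (K-1) x (p+1) array whose row k is
    sum_i x_i (I(g_i = k) - p_k(x_i; beta)).
    """
    eta = X @ beta_bar.T                              # N x (K-1) logits
    P = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))
    Y = np.stack([(g == k + 1).astype(float) for k in range(K - 1)], axis=1)
    return (Y - P).T @ X
```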

31 Matrix form. $y$ is the concatenated indicator vector of dimension $N(K-1)$:
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{K-1} \end{pmatrix}, \qquad y_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix}, \quad 1 \leq k \leq K-1.$$
$p$ is the concatenated vector of fitted probabilities of dimension $N(K-1)$:
$$p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_{K-1} \end{pmatrix}, \qquad p_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix}, \quad 1 \leq k \leq K-1.$$

32 $\tilde{X}$ is an $N(K-1) \times (p+1)(K-1)$ block-diagonal matrix:
$$\tilde{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix}$$

33 Matrix $W$ is an $N(K-1) \times N(K-1)$ square matrix:
$$W = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1(K-1)} \\ W_{21} & W_{22} & \cdots & W_{2(K-1)} \\ \vdots & \vdots & \ddots & \vdots \\ W_{(K-1),1} & W_{(K-1),2} & \cdots & W_{(K-1),(K-1)} \end{pmatrix}$$
Each submatrix $W_{km}$, $1 \leq k, m \leq K-1$, is an $N \times N$ diagonal matrix. When $k = m$, the $i$th diagonal element of $W_{kk}$ is $p_k(x_i; \beta^{\text{old}})(1 - p_k(x_i; \beta^{\text{old}}))$. When $k \neq m$, the $i$th diagonal element of $W_{km}$ is $-p_k(x_i; \beta^{\text{old}}) \, p_m(x_i; \beta^{\text{old}})$.

34 As in the binary classification case,
$$\frac{\partial L(\beta)}{\partial \beta} = \tilde{X}^T (y - p), \qquad \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\tilde{X}^T W \tilde{X}.$$
The formula for updating $\beta^{\text{new}}$ in the binary classification case holds for the multiclass case:
$$\beta^{\text{new}} = (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T W z, \quad \text{where } z \triangleq \tilde{X} \beta^{\text{old}} + W^{-1} (y - p),$$
or simply
$$\beta^{\text{new}} = \beta^{\text{old}} + (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T (y - p).$$
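The second form of the update can be implemented with the blocks of $W$ built directly from the fitted probabilities. A sketch under the same assumptions as the earlier multiclass gradient:

```python
import numpy as np

def multiclass_newton_step(beta, X, g, K):
    """One update beta <- beta + (Xt^T W Xt)^{-1} Xt^T (y - p).

    beta: flat vector of length (K-1)(p+1), ordered class by class.
    X: N x (p+1) design matrix; g: length-N labels in {1, ..., K}.
    """
    N, q = X.shape
    B = beta.reshape(K - 1, q)
    eta = X @ B.T
    P = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))
    Y = np.stack([(g == k + 1).astype(float) for k in range(K - 1)], axis=1)

    grad = ((Y - P).T @ X).ravel()                    # Xt^T (y - p)
    H = np.zeros(((K - 1) * q, (K - 1) * q))          # Xt^T W Xt, block by block
    for k in range(K - 1):
        for m in range(K - 1):
            w = P[:, k] * ((k == m) - P[:, m])        # diagonal of W_km
            H[k*q:(k+1)*q, m*q:(m+1)*q] = X.T @ (w[:, None] * X)
    return beta + np.linalg.solve(H, grad)
```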

35 Computation Issues. Initialization: one option is to use $\beta = 0$. Convergence is not guaranteed, but it usually occurs. The log-likelihood usually increases after each iteration, but overshooting can occur. In the rare cases where the log-likelihood decreases, cut the step size by half.
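The step-halving safeguard is a few lines of code, reusing the log_likelihood helper sketched earlier (a sketch, with a cap on the number of halvings):

```python
def newton_step_with_halving(beta, beta_new, X, y, max_halvings=30):
    """Accept the Newton step, halving it while the log-likelihood decreases."""
    step = beta_new - beta
    for _ in range(max_halvings):
        if log_likelihood(beta + step, X, y) >= log_likelihood(beta, X, y):
            break
        step = step / 2.0
    return beta + step
```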

36 Connection with LDA. Under the model of LDA:
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) = a_{k0} + a_k^T x.$$
The model of LDA satisfies the assumption of the linear logistic model. The linear logistic model only specifies the conditional distribution $\Pr(G = k \mid X = x)$. No assumption is made about $\Pr(X)$.

37 The LDA model specifies the joint distribution of $X$ and $G$. $\Pr(X)$ is a mixture of Gaussians:
$$\Pr(X) = \sum_{k=1}^{K} \pi_k \, \phi(X; \mu_k, \Sigma),$$
where $\phi$ is the Gaussian density function. Linear logistic regression maximizes the conditional likelihood of $G$ given $X$: $\Pr(G = k \mid X = x)$. LDA maximizes the joint likelihood of $G$ and $X$: $\Pr(X = x, G = k)$.

38 If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data; samples without class labels can also be used under the LDA model. On the other hand, LDA is not robust to gross outliers. Because logistic regression relies on fewer assumptions, it appears to be more robust. In practice, logistic regression and LDA often give similar results.

39 Simulation. Assume the input $X$ is one dimensional. The two classes have equal priors, and the class-conditional densities of $X$ are shifted versions of each other. Each conditional density is a mixture of two normals: Class 1 (red): $0.6\,N(-2, \tfrac{1}{4}) + 0.4\,N(0, 1)$. Class 2 (blue): $0.6\,N(0, \tfrac{1}{4}) + 0.4\,N(2, 1)$. The class-conditional densities are shown below.
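Data matching this specification can be generated with a short sketch (the sample sizes follow slide 41; the seed is arbitrary):

```python
import numpy as np
rng = np.random.default_rng(0)

def sample_class(n, shift):
    """Draw n points from 0.6 N(shift - 2, 1/4) + 0.4 N(shift, 1).

    shift = 0 gives class 1 and shift = 2 gives class 2, so the two
    class-conditional densities are shifted versions of each other.
    """
    comp = rng.random(n) < 0.6
    return np.where(comp,
                    rng.normal(shift - 2.0, 0.5, n),   # sd 0.5, variance 1/4
                    rng.normal(shift, 1.0, n))

x1 = sample_class(2000, shift=0.0)   # class 1: 0.6 N(-2, 1/4) + 0.4 N(0, 1)
x2 = sample_class(2000, shift=2.0)   # class 2: 0.6 N(0, 1/4) + 0.4 N(2, 1)
```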

40 [Figure: the class-conditional densities of the two classes.]

41 LDA Result. Training data set: 2000 samples for each class. Test data set: 1000 samples for each class. LDA estimates $\hat{\mu}_1$, $\hat{\mu}_2$, and $\hat{\sigma}^2$ from the training data; the boundary value between the two classes is $(\hat{\mu}_1 + \hat{\mu}_2)/2$. The classification error rate on the test data is then compared with the Bayes rule: based on the true distribution, the Bayes (optimal) boundary value between the two classes and its error rate can be computed.

42 [Figure: LDA result.]

43 Logistic Regression Result. Linear logistic regression obtains the fit $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)^T$. The boundary value satisfies $\hat{\beta}_0 + \hat{\beta}_1 X = 0$, hence equals $-\hat{\beta}_0 / \hat{\beta}_1$. The error rate on the test data set is computed for comparison. The estimated posterior probability is:
$$\Pr(G = 1 \mid X = x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}.$$

44 The estimated posterior probability $\Pr(G = 1 \mid X = x)$ and its true value based on the true distribution are compared in the graph below. [Figure: estimated vs. true posterior probability.]
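For reference, the true posterior under this mixture can be computed directly (a sketch; equal priors as specified on slide 39):

```python
import numpy as np

def phi(x, mu, var):
    """Normal density N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def true_posterior(x):
    """Pr(G = 1 | X = x) under the true mixture model with equal priors."""
    f1 = 0.6 * phi(x, -2.0, 0.25) + 0.4 * phi(x, 0.0, 1.0)   # class 1 density
    f2 = 0.6 * phi(x, 0.0, 0.25) + 0.4 * phi(x, 2.0, 1.0)    # class 2 density
    return f1 / (f1 + f2)
```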
