Pattern Analysis. Logistic Regression. 12 May 2009. Joachim Hornegger. Chair of Pattern Recognition, Erlangen University

Outline

1 Logistic Regression
  Posteriors and the Logistic Function
  Decision Boundary
  Learning in Logistic Regression
  Log-Likelihood Function
  Gradient
  Perceptron and Logistic Regression
  Lessons Learned
  Further Readings

Logistic Regression

Logistic regression is a discriminative model, because it models the posterior probabilities directly.

Posteriors and the Logistic Function

For two classes $y \in \{0, 1\}$ we get:

$$p(y=0 \mid x) = \frac{p(y=0)\,p(x \mid y=0)}{p(x)} = \frac{p(y=0)\,p(x \mid y=0)}{p(y=0)\,p(x \mid y=0) + p(y=1)\,p(x \mid y=1)} = \frac{1}{1 + \dfrac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}}$$

$$p(y=0 \mid x) = \frac{1}{1 + e^{\log \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}}} = \frac{1}{1 + e^{-\left(\log \frac{p(y=0)}{p(y=1)} + \log \frac{p(x \mid y=0)}{p(x \mid y=1)}\right)}}$$

We see that the posterior can be written in terms of a logistic function:

$$p(y=0 \mid x) = \frac{1}{1 + e^{-F(x)}}, \qquad F(x) = \log \frac{p(y=0)}{p(y=1)} + \log \frac{p(x \mid y=0)}{p(x \mid y=1)}$$

and thus for the other posterior:

$$p(y=1 \mid x) = 1 - p(y=0 \mid x) = \frac{e^{-F(x)}}{1 + e^{-F(x)}} = \frac{1}{1 + e^{F(x)}}$$
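As a minimal sketch of this identity, the following Python snippet evaluates p(y = 0 | x) once via Bayes' rule and once as a logistic function of the log-odds F(x). The one-dimensional Gaussian class-conditionals, the prior, and all function names are illustrative assumptions and do not come from the lecture.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_bayes(x, prior0, mu0, mu1, sigma):
    """p(y=0 | x) computed directly from Bayes' rule."""
    num = prior0 * gauss_pdf(x, mu0, sigma)
    den = num + (1.0 - prior0) * gauss_pdf(x, mu1, sigma)
    return num / den

def log_odds(x, prior0, mu0, mu1, sigma):
    """F(x) = log p(y=0)/p(y=1) + log p(x|y=0)/p(x|y=1)."""
    return (math.log(prior0 / (1.0 - prior0))
            + math.log(gauss_pdf(x, mu0, sigma) / gauss_pdf(x, mu1, sigma)))

def posterior_logistic(x, prior0, mu0, mu1, sigma):
    """p(y=0 | x) written as 1 / (1 + exp(-F(x)))."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, prior0, mu0, mu1, sigma)))

# Hypothetical parameters, chosen only for illustration.
params = dict(prior0=0.3, mu0=-1.0, mu1=1.0, sigma=1.0)
for x in (-2.0, 0.0, 0.5, 2.0):
    a = posterior_bayes(x, **params)
    b = posterior_logistic(x, **params)
    print(f"x={x:+.1f}  Bayes={a:.6f}  logistic={b:.6f}")
```

Both columns should agree up to floating-point rounding, since the two functions are algebraically the same expression.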

Definition: The logistic function (also called sigmoid function) is defined by

$$g(x) = \frac{1}{1 + e^{-x}}$$

where $x \in \mathbb{R}$.

The derivative of the sigmoid function has a convenient property:

$$g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \cdot \frac{1}{1 + e^{x}} = g(x)\,g(-x) = g(x)\,(1 - g(x)).$$
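The derivative identity can be checked numerically. The sketch below compares the closed form g(x)(1 - g(x)) with a central finite difference; the function names and the step size h are assumptions made for this illustration.

```python
import math

def g(x):
    """Logistic (sigmoid) function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def g_prime_closed(x):
    """Derivative via the identity g'(x) = g(x) * (1 - g(x))."""
    return g(x) * (1.0 - g(x))

def g_prime_numeric(x, h=1e-6):
    """Central finite-difference approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 4.0):
    print(x, g_prime_closed(x), g_prime_numeric(x))
```

The two derivative columns should match to several digits; the check also relies on g(-x) = 1 - g(x), which is the step used in the derivation.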

Figure: Sigmoid function $g(ax) = 1/(1 + e^{-ax})$ for $a = 1, 2, 3, 4$.

Decision Boundary

The decision boundary $\delta(x) = 0$ (zero level set) in feature space separates the two classes. Points $x$ on the decision boundary satisfy

$$p(y=0 \mid x) = p(y=1 \mid x)$$

and thus

$$\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = \log 1 = 0.$$

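Lemma: The decision boundary is given by $F(x) = 0$.

Proof: Write the log-ratio of the posteriors as $F(x)$, so that on the decision boundary

$$\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = F(x) = 0.$$

Solving the log-ratio for the posterior gives

$$\frac{p(y=0 \mid x)}{p(y=1 \mid x)} = e^{F(x)}$$

$$p(y=0 \mid x) = e^{F(x)}\, p(y=1 \mid x)$$

$$p(y=0 \mid x) = e^{F(x)}\,(1 - p(y=0 \mid x))$$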

Now we use that the posteriors sum to one:

$$p(y=0 \mid x) = e^{F(x)}\,(1 - p(y=0 \mid x))$$

$$p(y=0 \mid x) = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}$$

Figure: Two Gaussians and their posteriors: $\sigma_0 = \sigma_1 = 0.2$, $\mu_0 = 2$, $\mu_1 = 1$.

Example

Let us assume both classes have normally distributed $d$-dimensional feature vectors:

$$p(x \mid y) = \frac{1}{\sqrt{\det 2\pi\Sigma_y}}\; e^{-\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}$$

Then we can write the posterior of $y = 0$ in terms of a logistic function:

$$p(y=0 \mid x) = \frac{1}{1 + e^{-(x^T A x + \alpha^T x + \alpha_0)}}$$

$$\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = \log \frac{p(y=0)}{p(y=1)} + \log \frac{\frac{1}{\sqrt{\det 2\pi\Sigma_0}}\; e^{-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det 2\pi\Sigma_1}}\; e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}$$

This function has the constant component:

$$c = \log \frac{p(y=0)}{p(y=1)} + \frac{1}{2} \log \frac{\det 2\pi\Sigma_1}{\det 2\pi\Sigma_0}$$

We observe:
Priors imply a constant offset of the decision boundary.
If priors and covariance matrices of both classes are identical, this offset is $c = 0$.

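Furthermore we have:

$$\log \frac{e^{-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}} = \frac{1}{2}\left( (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right)$$

$$= \frac{1}{2}\left( x^T (\Sigma_1^{-1} - \Sigma_0^{-1})\, x - 2\,(\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1})\, x + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right)$$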

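Now we have:

$$A = \frac{1}{2}\,(\Sigma_1^{-1} - \Sigma_0^{-1})$$

$$\alpha^T = \mu_0^T \Sigma_0^{-1} - \mu_1^T \Sigma_1^{-1}$$

$$\alpha_0 = \log \frac{p(y=0)}{p(y=1)} + \frac{1}{2}\left( \log \frac{\det 2\pi\Sigma_1}{\det 2\pi\Sigma_0} + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right)$$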
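The following NumPy sketch computes A, α and α_0 for two Gaussian classes and checks that the resulting logistic posterior agrees with Bayes' rule at one test point. The means, covariances, prior, test point and function names are hypothetical values chosen for illustration only.

```python
import numpy as np

def gaussian_logistic_params(prior0, mu0, cov0, mu1, cov1):
    """Return (A, alpha, alpha0) with F(x) = x^T A x + alpha^T x + alpha0."""
    P0, P1 = np.linalg.inv(cov0), np.linalg.inv(cov1)
    A = 0.5 * (P1 - P0)
    alpha = P0 @ mu0 - P1 @ mu1
    # det(2*pi*Sigma) = (2*pi)^d * det(Sigma); the (2*pi)^d factor cancels in the ratio.
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    alpha0 = (np.log(prior0 / (1.0 - prior0))
              + 0.5 * (logdet1 - logdet0 + mu1 @ P1 @ mu1 - mu0 @ P0 @ mu0))
    return A, alpha, alpha0

def gauss_pdf(x, mu, cov):
    """Density of a d-dimensional normal distribution."""
    d = x - mu
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

# Hypothetical class parameters (illustration only).
prior0 = 0.4
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
cov0 = np.array([[1.0, 0.3], [0.3, 0.5]])
cov1 = np.array([[0.7, -0.2], [-0.2, 1.5]])

A, alpha, alpha0 = gaussian_logistic_params(prior0, mu0, cov0, mu1, cov1)

x = np.array([1.0, -0.5])
F = x @ A @ x + alpha @ x + alpha0
p_logistic = 1.0 / (1.0 + np.exp(-F))

num = prior0 * gauss_pdf(x, mu0, cov0)
p_bayes = num / (num + (1.0 - prior0) * gauss_pdf(x, mu1, cov1))
print(p_logistic, p_bayes)
```

The two printed values should coincide up to rounding. Because A is nonzero whenever the covariances differ, the set F(x) = 0 is a quadric in this setting.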

Figure: Two sample sets and the Gaussian decision boundary.

Figure: Shift of the decision boundary by setting identical priors: $p(y) = 1/2$.

Example (cont.)

If both classes share the same covariance, i.e. $\Sigma = \Sigma_0 = \Sigma_1$, then the argument of the sigmoid function is linear in the components of $x$:

$$A = 0$$

$$\alpha^T = (\mu_0 - \mu_1)^T \Sigma^{-1}$$

$$\alpha_0 = \log \frac{p(y=0)}{p(y=1)} + \frac{1}{2}\,(\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_1 - \mu_0)$$
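A short sketch of the shared-covariance case: with equal priors, the midpoint between the two means lies exactly on the linear boundary F(x) = 0, and each mean falls on its own side. The numbers and function names below are hypothetical.

```python
import numpy as np

def linear_logistic_params(prior0, mu0, mu1, cov):
    """Return (alpha, alpha0) with F(x) = alpha^T x + alpha0 for a shared covariance."""
    P = np.linalg.inv(cov)
    alpha = P @ (mu0 - mu1)
    alpha0 = (np.log(prior0 / (1.0 - prior0))
              + 0.5 * (mu0 + mu1) @ P @ (mu1 - mu0))
    return alpha, alpha0

def F(x, alpha, alpha0):
    """Argument of the sigmoid; F(x) > 0 favours class y = 0."""
    return alpha @ x + alpha0

# Hypothetical parameters with equal priors (illustration only).
mu0, mu1 = np.array([-1.0, 0.5]), np.array([2.0, 1.0])
cov = np.array([[1.0, 0.4], [0.4, 0.8]])
alpha, alpha0 = linear_logistic_params(0.5, mu0, mu1, cov)

midpoint = 0.5 * (mu0 + mu1)
print(F(midpoint, alpha, alpha0))                    # on the boundary, up to rounding
print(F(mu0, alpha, alpha0), F(mu1, alpha, alpha0))  # opposite signs
```

Changing `prior0` away from 0.5 only changes `alpha0`, which shifts the boundary parallel to itself without rotating it.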

Figure: Identical covariances lead to a linear decision boundary.

Figure: Quadratic and linear decision boundary in comparison.

Note: If the class conditionals are Gaussians and share the same covariance, the argument of the exponential function is affine in $x$. This result also holds for a more general family of pdfs and is not limited to Gaussians.

Definition: The exponential family is a class of pdfs that can be written in the following canonical form

$$p(x; \theta, \phi) = e^{\frac{\theta^T x - b(\theta)}{a(\phi)} + c(x, \phi)}$$

where $\theta \in \mathbb{R}^d$ is the location parameter vector and $\phi$ the dispersion parameter.
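As one concrete instance of the canonical form, the Poisson distribution with rate λ can be written with θ = log λ, b(θ) = e^θ, a(φ) = 1 and c(x, φ) = -log x!. The sketch below checks this numerically; the function names and the rate 2.5 are assumptions made for illustration.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson probability mass function."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_canonical(x, lam):
    """Same pmf written as exp((theta * x - b(theta)) / a(phi) + c(x, phi))."""
    theta = math.log(lam)      # natural (location) parameter
    b = math.exp(theta)        # b(theta) = e^theta = lam
    a = 1.0                    # no free dispersion for the Poisson
    c = -math.lgamma(x + 1)    # c(x, phi) = -log(x!)
    return math.exp((theta * x - b) / a + c)

for x in range(6):
    print(x, poisson_pmf(x, 2.5), poisson_canonical(x, 2.5))
```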

Example: Binomial, Poisson, hypergeometric, and exponential distributions as well as Gaussians belong to the exponential family.

Lemma: If all class-conditional densities are members of the same exponential family distribution with equal dispersion $\phi$, the decision boundary $F(x) = 0$ is linear in the components of $x$.
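A sketch of why the lemma holds for two classes, assuming class-specific location parameters $\theta_0$ and $\theta_1$ (symbols introduced here for illustration) and a shared dispersion $\phi$. Inserting the canonical form into the log-odds, the term $c(x, \phi)$ cancels:

```latex
\begin{align*}
F(x) &= \log\frac{p(y=0)}{p(y=1)} + \log\frac{p(x;\theta_0,\phi)}{p(x;\theta_1,\phi)} \\
     &= \log\frac{p(y=0)}{p(y=1)}
        + \frac{\theta_0^T x - b(\theta_0)}{a(\phi)} + c(x,\phi)
        - \frac{\theta_1^T x - b(\theta_1)}{a(\phi)} - c(x,\phi) \\
     &= \frac{(\theta_0 - \theta_1)^T x}{a(\phi)}
        + \log\frac{p(y=0)}{p(y=1)}
        - \frac{b(\theta_0) - b(\theta_1)}{a(\phi)}
\end{align*}
```

The last line is affine in $x$, so the set $F(x) = 0$ is a hyperplane. With different dispersions the $c(x, \phi)$ terms would not cancel in general.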

Log-Likelihood Function

Let us assume the posteriors are given by

$$p(y=0 \mid x) = 1 - g(\theta^T x)$$

$$p(y=1 \mid x) = g(\theta^T x)$$

where $g(\theta^T x)$ is the sigmoid function parameterized in $\theta$. The parameter vector $\theta$ has to be estimated from a set $S$ of $m$ training samples:

$$S = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_m, y_m)\}.$$

Method of choice: Maximum Likelihood Estimation

Before we work on the formulas of the ML estimator, we rewrite the posteriors using the Bernoulli probability:

$$p(y \mid x) = g(\theta^T x)^{y}\,(1 - g(\theta^T x))^{1 - y}$$

which shows the great benefit of the chosen notation for class numbers.

Now we can compute the log-likelihood function (assuming that the training samples are mutually independent):

$$l(\theta) = \sum_{i=1}^{m} \log p(y_i \mid x_i) = \sum_{i=1}^{m} \log\left( g(\theta^T x_i)^{y_i}\,(1 - g(\theta^T x_i))^{1 - y_i} \right) = \sum_{i=1}^{m} \left( y_i \log g(\theta^T x_i) + (1 - y_i) \log(1 - g(\theta^T x_i)) \right)$$
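The log-likelihood can be evaluated in a few lines of NumPy. The sketch below gives a direct transcription of the sum and a variant that avoids log(0) for large |θᵀx|. The toy data, the constant first feature acting as a bias term, and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_naive(theta, X, y):
    """Direct transcription of the sum over the m samples."""
    g = sigmoid(X @ theta)
    return float(np.sum(y * np.log(g) + (1.0 - y) * np.log(1.0 - g)))

def log_likelihood_stable(theta, X, y):
    """Same quantity, using log g(z) = -log(1 + e^{-z}) and
    log(1 - g(z)) = -log(1 + e^{z}); np.logaddexp(0, t) = log(1 + e^t)."""
    z = X @ theta
    return float(-np.sum(y * np.logaddexp(0.0, -z)
                         + (1.0 - y) * np.logaddexp(0.0, z)))

# Hypothetical toy data: rows of X are samples, first column is a constant 1.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.3], [1.0, 1.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([0.2, 0.8])

print(log_likelihood_naive(theta, X, y))
print(log_likelihood_stable(theta, X, y))
```

For θ = 0 every sample has g = 1/2, so the log-likelihood equals -m log 2, which is a handy sanity check.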

Notes for the expert:
The negative of the log-likelihood function is the cross entropy of $y$ and $g(\theta^T x)$.
The negative of the log-likelihood function is a convex function.

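Gradient of the Log-Likelihood Function

The gradient:

$$\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( \frac{y_i}{g(\theta^T x_i)} - \frac{1 - y_i}{1 - g(\theta^T x_i)} \right) \frac{\partial}{\partial \theta_j}\, g(\theta^T x_i)$$

Now we use the derivative of the sigmoid function and get

$$\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( \frac{y_i}{g(\theta^T x_i)} - \frac{1 - y_i}{1 - g(\theta^T x_i)} \right) g(\theta^T x_i)\,(1 - g(\theta^T x_i))\, x_{i,j} = \sum_{i=1}^{m} \left( y_i\,(1 - g(\theta^T x_i)) - (1 - y_i)\, g(\theta^T x_i) \right) x_{i,j}$$

where $x_{i,j}$ is the $j$-th component of the $i$-th training feature vector.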

Finally we have a quite simple gradient:

$$\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( y_i - g(\theta^T x_i) \right) x_{i,j}$$

where $x_{i,j}$ is the $j$-th component of the $i$-th training feature vector. Or in vector notation:

$$\frac{\partial}{\partial \theta} l(\theta) = \sum_{i=1}^{m} \left( y_i - g(\theta^T x_i) \right) x_i$$
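The vector form of the gradient maps directly to one matrix product. The sketch below compares it with a central finite-difference gradient of the log-likelihood. The random data, the seed, the step size and the function names are assumptions made for this check.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i))."""
    z = X @ theta
    return float(-np.sum(y * np.logaddexp(0.0, -z)
                         + (1.0 - y) * np.logaddexp(0.0, z)))

def gradient(theta, X, y):
    """Vector form: sum_i (y_i - g(theta^T x_i)) x_i."""
    return X.T @ (y - sigmoid(X @ theta))

def numeric_gradient(theta, X, y, h=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (log_likelihood(theta + e, X, y)
                   - log_likelihood(theta - e, X, y)) / (2.0 * h)
    return grad

# Hypothetical random data: 20 samples, a constant bias feature plus 2 features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = (rng.random(20) < 0.5).astype(float)
theta = np.array([0.1, -0.4, 0.7])

analytic = gradient(theta, X, y)
numeric = numeric_gradient(theta, X, y)
print(analytic)
print(numeric)
```

Since the log-likelihood is maximized, a plain gradient method would step in the direction of `analytic` (gradient ascent), for example `theta + eta * analytic` with a small hypothetical step size `eta`.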

Hessian of the Log-Likelihood Function

The log-likelihood function is concave. We use the Newton-Raphson algorithm to solve the unconstrained optimization problem. For that purpose the Hessian is required (remember the derivative of the sigmoid function!):

$$\frac{\partial^2}{\partial \theta\, \partial \theta^T} l(\theta) = -\sum_{i=1}^{m} g(\theta^T x_i)\,\left(1 - g(\theta^T x_i)\right) x_i x_i^T$$

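Newton-Raphson Iteration

For the $(k+1)$-st iteration step, we get:

$$\theta^{(k+1)} = \theta^{(k)} - \left( \frac{\partial^2}{\partial \theta\, \partial \theta^T} l(\theta) \right)^{-1} \frac{\partial}{\partial \theta} l(\theta)$$

where both derivatives are evaluated at $\theta = \theta^{(k)}$.

Note: If you write the Newton-Raphson iteration in matrix form, you will end up with a weighted least squares iteration scheme.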
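A minimal sketch of the Newton-Raphson iteration in matrix form, where the Hessian appears as -Xᵀ W X with W = diag(g(1 - g)), which is where the weighted least squares structure comes from. The toy data set, the iteration cap, the tolerance and the function names are assumptions. The sketch has no step-size control or regularization, so it is only suitable for data that are not linearly separable.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson(X, y, iterations=25, tol=1e-10):
    """Maximize the log-likelihood with plain Newton-Raphson steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                      # sum_i (y_i - g_i) x_i
        hessian = -(X.T * (p * (1.0 - p))) @ X    # -sum_i g_i (1 - g_i) x_i x_i^T
        step = np.linalg.solve(hessian, grad)     # H^{-1} grad without forming H^{-1}
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Hypothetical 1-D data with overlapping classes; first column is a bias term.
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

theta_hat = newton_raphson(X, y)
print(theta_hat)
print(X.T @ (y - sigmoid(X @ theta_hat)))  # gradient at the estimate, close to zero
```

For linearly separable data the maximum likelihood estimate does not exist (the norm of θ grows without bound), so practical implementations usually add a penalty term or a line search.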

Perceptron and Logistic Regression

Lessons Learned

Posteriors can be rewritten in terms of a logistic function.
Given the decision boundary $F(x) = 0$, we can write down the posterior $p(y \mid x)$ right away.
The decision boundary for normally distributed feature vectors in each class is a quadratic function.
If the Gaussians share the same covariance, the decision boundary is a linear function.

Further Readings

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
D. W. Hosmer and S. Lemeshow. Applied Logistic Regression, 2nd Edition. John Wiley & Sons, Hoboken, 2000.
