Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University
Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
Logistic Regression 3 / 43 Logistic Regression is a generative model, because it models the posterior probabilites directly.
Pattern Analysis 4 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
5 / 43 Posteriors and the Logistic Function For two classes y {0, 1} we get: p(y = 0 x) = p(y = 0) p(x y = 0) p(x) = p(y = 0) p(x y = 0) p(y = 0)p(x y = 0) + p(y = 1)p(x y = 1) = 1 1 + p(y=1)p(x y=1) p(y=0)p(x y=0)
Posteriors and the Logistic Function 6 / 43 p(y = 0 x) = 1 p(y=1)p(x y=1) log 1 + e p(y=0)p(x y=0) = 1 + e 1 p(y=0) p(x y=0) log log p(y=1) p(x y=1)
Posteriors and the Logistic Function 7 / 43 We see that the posterior can be written in terms of a logistic function: and thus for the other prior p(y = 0 x) = 1 1 + e F (x) p(y = 1 x) = 1 p(y = 0 x) = = e F (x) 1 + e F (x) 1 1 + e F (x)
Posteriors and the Logistic Function 8 / 43 Definition The logistic function (also called sigmoid function) is defined by where x IR. g(x) = 1 1 + e x
Posteriors and the Logistic Function 9 / 43 The derivative of the sigmoid function fulfills the nice property: g (x) = = = 1 (1 + e x ) 2 e x 1 (1 + e x ) e x (1 + e x ) 1 (1 + e x ) 1 (1 + e x ) = g(x)g( x) = g(x)(1 g(x)).
Posteriors and the Logistic Function 10 / 43 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 5 4 3 2 1 0 1 2 3 4 5 Abbildung: Sigmoid function: g(ax) = 1/(1 + e ax ) for a = 1, 2, 3, 4
Pattern Analysis 11 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
Decision Boundary 12 / 43 The decision boundary δ(x) = 0 (zero level set) in feature space separates the two classes. Points x on the decision boundary satisfy: and thus p(y = 0 x) = p(y = 1 x) log p(y = 0 x) p(y = 1 x) = log 1 = 0.
Decision Boundary 13 / 43 Lemma The decision boundary is given by F(x) = 0. Proof: log p(y = 0 x) p(y = 1 x) p(y = 0 x) p(y = 1 x) = F(x) = 0 = e F (x) p(y = 0 x) = e F (x) p(y = 1 x) p(y = 0 x) = e F (x) (1 p(y = 0 x))
Decision Boundary 14 / 43 Now we use that the posteriors sum up to one: p(y = 0 x) = e F (x) (1 p(y = 0 x)) p(y = 0 x) = p(y = 0 x) = e F (x) 1 + e F (x) 1 1 + e F (x)
Decision Boundary 15 / 43 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 5 4 3 2 1 0 1 2 3 4 5 Abbildung: Two Gaussians and its posteriors: σ 0 =σ 1 = 0.2, µ 0 = 2, µ 1 = 1
16 / 43 Decision Boundary Example Let us assume both classes have normally distributed d-dimensional feature vectors: p(x y) = 1 det 2πΣ e 1 2 (x µy )T Σ 1 y (x µ y ) then we can write the posterior of y = 0 in terms of a logistic function: p(y = 0 x) = 1 1 + e xt Ax+α T x+α 0
17 / 43 Decision Boundary Example log p(y = 0 x) p(y = 1 x) = log p(y = 0) p(y = 1) + log 1 e 1 2 (x µ 0) T Σ 1 0 (x µ 0) det 2πΣ0 1 e 1 2 (x µ 1) T Σ 1 1 (x µ 1) det 2πΣ1 This function has the constant component: We observe: c = log p(y = 0) p(y = 1) + 1 2 log det 2πΣ 1 det 2πΣ 0 Priors imply a constant offset of the decision boundary. If priors and covariance matrices of both classes are identical, this offset is c = 0.
Decision Boundary 18 / 43 Example Furthermore we have: log e 1 2 (x µ 0) T Σ 1 0 (x µ 0) = 1 2 = 1 2 e 1 2 (x µ 1) T Σ 1 1 (x µ 1) = ( (x µ 1 ) T Σ 1 1 (x µ 1) (x µ 0 ) T Σ 1 0 (x µ 0) ( x T (Σ 1 1 Σ 1 0 )x 2(µT 1 Σ 1 1 µ T 0 Σ 1 0 )x+ +µ T 1 Σ 1 1 µ 1 µ T 0 Σ 1 0 µ 0 ) )
Decision Boundary 19 / 43 Example Now we have: A = 1 2 (Σ 1 1 Σ 1 0 ) α T = µ T 0 Σ 1 0 µ T 1 Σ 1 1 α 0 = log p(y = 0) p(y = 1) + 1 ( log det 2πΣ ) 1 + µ T 1 2 det 2πΣ Σ 1 1 µ 1 µ T 0 Σ 1 0 µ 0 0
Decision Boundary 20 / 43 9 8 7 6 5 x 2 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 x 1 Abbildung: Two sample sets and the Gaussian decision boundary.
Decision Boundary 21 / 43 9 8 7 6 5 x 2 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 x 1 Abbildung: Shift of decision boundary by setting identical priors: p(y) = 1/2
Decision Boundary 22 / 43 Example (cont.) If both classes share the same covariances i.e. Σ = Σ 0 = Σ 1, then the argument of the sigmoid function is linear in the components of x. A = 0 α T = (µ 0 µ 1 ) T Σ 1 α 0 = log p(y = 0) p(y = 1) + 1 2 (µ 0 + µ 1 ) T Σ 1 (µ 1 µ 0 )
Decision Boundary 23 / 43 9 8 7 6 5 x 2 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 x 1 Abbildung: Identical covariances lead to linear decision boundary
Decision Boundary 24 / 43 9 8 7 6 5 x 2 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 x 1 Abbildung: Quadratic and linear decision boundary in comparison
25 / 43 Decision Boundary Note: If the class conditionals are Gaussians and share the same covariance, the argument of the exponential function is affine in x. This result is even true for a more general family of pdfs and not limited to Gaussian.
Decision Boundary 26 / 43 Definition The exponential family is a class of pdf s that can be written in the following canonical form p(x; θ, φ) = e θ T x b(θ) +c(x,φ) a(φ) where θ IR d is the location parameter vector, φ the dispersion parameter.
Decision Boundary 27 / 43 Example Binomial, Poisson, hypergeometric, exponential distributions or Gaussians belong to the the exponential family.
Decision Boundary 28 / 43 Lemma If all class-conditional densities are members of the same exponential family distribution with equal dispersion φ, the decision boundary F(x) = 0 is linear in the components of x.
Pattern Analysis 29 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
30 / 43 Log-Likelihood Function Let us assume the posteriors are given by p(y = 0 x) = 1 g(θ T x) p(y = 1 x) = g(θ T x) where g(θ T x) is the sigmoid function parameterized in θ. The parameter vector θ has to be estimated from a set S of m training samples: S = {(x 1, y 1 ), (x 2, y 2 ), (x 3, y 3 ),..., (x m, y m )}. Method of choice: Maximum Likelihood Estimation
Log-Likelihood Function 31 / 43 Before we work on the formulas of the ML-estimator, we rewrite the posteriors using Bernoulli probability: p(y x) = g(θ T x) y (1 g(θ T x)) 1 y which shows the great benefit of the chosen notation for class numbers.
Log-Likelihood Function 32 / 43 Now we can compute the log-likelihood function (assuming that the training samples are mutually independent): m l(θ) = log p(y i x i ) = = i=1 m log g(θ T x i ) y i (1 g(θ T x i )) 1 y i i=1 m y i log g(θ T x i ) + (1 y i ) log(1 g(θ T x i )) i=1
33 / 43 Log-Likelihood Function Notes for the expert: The negative of the log-likelihood function is the cross entropy of y and g(θ T x). The negative of the log-likelihood function is a convex function.
Gradient of log-likelihood Function 34 / 43 The gradient: θ j l(θ) = m i=1 ( ) yi g(θ T x i ) 1 y i 1 g(θ T g(θ T x i ) x i ) θ j now we use the derivative of the sigmoid function and get θ j l(θ) = = m i=1 m i=1 ( ) yi g(θ T x i ) 1 y i 1 g(θ T g(θ T x i )(1 g(θ T x i ))x i,j x i ) ( ) y i (1 g(θ T x i )) (1 y i )g(θ T x i ) x i,j where x i,j is the j th component of the i th training feature vector.
Gradient of log-likelihood Function 35 / 43 Finally we have a quite simple gradient: θ j l(θ) = m i=1 ( ) y i g(θ T x i ) x i,j where x i,j is the j th component of the i th training feature vector. Or in vector notation: m θ l(θ) = ( ) y i g(θ T x i ) x i i=1
Hessian of log-likelihood Function 36 / 43 The log-likelihood function is concave. We use the Newton-Raphson algorithm to solve the unconstrained optimization problem. For that purpose the Hessian is required (remember the derivative of the sigmoid function!): 2 m θ θ T l(θ) = i=1 ( ) g(θ T x i ) 1 g(θ T x i ) x i x T i
Newton-Raphson Iteration 37 / 43 For the (k + 1)-st iteration step, we get: ( ) θ (k+1) = θ (k) 2 1 θ θ T l(θ) θ l(θ) Note: If you write the Newton-Raphson iteration in matrix form, you will end up with a weighted least squares iteration scheme.
Pattern Analysis 38 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
Perceptron and Logistic Regression 39 / 43
Pattern Analysis 40 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
41 / 43 Lessons Learned Posteriors can be rewritten in terms of a logistic function. Given the decision boundary F (x) = 0, we can write down the posterior p(y x) right away. Decision boundary for normally distributed feature vectors for each class is a quadratic function. If Gaussians share the same covariances, the decision boundary is a linear function.
Pattern Analysis 42 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision Boundary Learning in Logistic Regression Log-Likelihood Function Gradient Perceptron and Logistic Regression Lessons Learned Further Readings
43 / 43 Further Readings T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer, 2001. David W. Hosmer, Stanley Lemeshow: Applied Logistic Regression, 2nd Edition, John Wiley & Sons, Hoboken 2000.