Pattern Analysis. Logistic Regression. 12 May 2009. Joachim Hornegger. Chair of Pattern Recognition, Erlangen University

1 Logistic Regression
    Posteriors and the Logistic Function
    Decision Boundary
    Learning in Logistic Regression
    Log-Likelihood Function
    Gradient
    Perceptron and Logistic Regression
    Lessons Learned
    Further Readings

Logistic Regression

Logistic regression is a discriminative model, because it models the posterior probabilities directly.

Posteriors and the Logistic Function

For two classes $y \in \{0, 1\}$ we get:

\[
p(y=0 \mid x) = \frac{p(y=0)\,p(x \mid y=0)}{p(x)}
= \frac{p(y=0)\,p(x \mid y=0)}{p(y=0)\,p(x \mid y=0) + p(y=1)\,p(x \mid y=1)}
= \frac{1}{1 + \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}}
\]

\[
p(y=0 \mid x) = \frac{1}{1 + e^{\log \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}}}
= \frac{1}{1 + e^{-\left(\log \frac{p(y=0)}{p(y=1)} + \log \frac{p(x \mid y=0)}{p(x \mid y=1)}\right)}}
\]

We see that the posterior can be written in terms of a logistic function. With $F(x) = \log \frac{p(y=0)}{p(y=1)} + \log \frac{p(x \mid y=0)}{p(x \mid y=1)}$ we have

\[
p(y=0 \mid x) = \frac{1}{1 + e^{-F(x)}}
\]

and thus for the other posterior

\[
p(y=1 \mid x) = 1 - p(y=0 \mid x) = \frac{e^{-F(x)}}{1 + e^{-F(x)}} = \frac{1}{1 + e^{F(x)}}
\]
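The identity above can be checked numerically. The following is a minimal sketch, assuming two one-dimensional Gaussian class conditionals with a shared standard deviation; the function names and all parameter values are made up for illustration and do not come from the slides.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the 1-D normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_bayes(x, prior0, mu0, mu1, sigma):
    """p(y = 0 | x) computed directly with Bayes' rule."""
    joint0 = prior0 * gaussian_pdf(x, mu0, sigma)
    joint1 = (1.0 - prior0) * gaussian_pdf(x, mu1, sigma)
    return joint0 / (joint0 + joint1)

def log_odds(x, prior0, mu0, mu1, sigma):
    """F(x) = log p(y=0)/p(y=1) + log p(x|y=0)/p(x|y=1)."""
    prior_term = math.log(prior0 / (1.0 - prior0))
    likelihood_term = math.log(
        gaussian_pdf(x, mu0, sigma) / gaussian_pdf(x, mu1, sigma)
    )
    return prior_term + likelihood_term

def posterior_logistic(x, prior0, mu0, mu1, sigma):
    """p(y = 0 | x) written as 1 / (1 + exp(-F(x)))."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, prior0, mu0, mu1, sigma)))

# Hypothetical parameters, chosen only for this illustration.
PARAMS = dict(prior0=0.3, mu0=-1.0, mu1=2.0, sigma=1.5)
for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, posterior_bayes(x, **PARAMS), posterior_logistic(x, **PARAMS))
```

Both columns agree up to floating-point rounding, which is what the algebraic rewriting predicts.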

Definition: The logistic function (also called sigmoid function) is defined by

\[
g(x) = \frac{1}{1 + e^{-x}}
\]

where $x \in \mathbb{R}$.

The derivative of the sigmoid function fulfills the nice property:

\[
g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}
= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
= \frac{1}{1 + e^{-x}} \cdot \frac{1}{1 + e^{x}}
= g(x)\,g(-x) = g(x)\,(1 - g(x)).
\]
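As a quick sanity check of the derivative identity, the sketch below compares $g(x)(1 - g(x))$ with a central finite difference. The helper names and the step size are assumptions made for this illustration.

```python
import math

def g(x):
    """Logistic (sigmoid) function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    """Derivative via the identity g'(x) = g(x) * (1 - g(x))."""
    return g(x) * (1.0 - g(x))

def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x; h is an arbitrary small step."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, g_prime(x), central_difference(g, x))
```

The same identity is what keeps the gradient and Hessian of the log-likelihood simple later on.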

Figure: Sigmoid function $g(ax) = 1/(1 + e^{-ax})$ for $a = 1, 2, 3, 4$.

Decision Boundary

The decision boundary $\delta(x) = 0$ (zero level set) in feature space separates the two classes. Points $x$ on the decision boundary satisfy

\[
p(y=0 \mid x) = p(y=1 \mid x)
\]

and thus

\[
\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = \log 1 = 0.
\]

Lemma: The decision boundary is given by $F(x) = 0$.

Proof: With $F(x)$ denoting the log-ratio of the posteriors, the boundary condition reads

\[
\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = F(x) = 0.
\]

For general $x$, solving $\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = F(x)$ for the posterior gives

\[
\frac{p(y=0 \mid x)}{p(y=1 \mid x)} = e^{F(x)}
\]

\[
p(y=0 \mid x) = e^{F(x)}\, p(y=1 \mid x) = e^{F(x)}\, (1 - p(y=0 \mid x)),
\]

where we used that the posteriors sum up to one. Hence

\[
p(y=0 \mid x) = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}.
\]

Figure: Two Gaussians and their posteriors: $\sigma_0 = \sigma_1 = 0.2$, $\mu_0 = 2$, $\mu_1 = 1$.

Example: Let us assume both classes have normally distributed $d$-dimensional feature vectors:

\[
p(x \mid y) = \frac{1}{\sqrt{\det(2\pi\Sigma_y)}}\, e^{-\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}
\]

Then we can write the posterior of $y = 0$ in terms of a logistic function:

\[
p(y=0 \mid x) = \frac{1}{1 + e^{-(x^T A x + \alpha^T x + \alpha_0)}}
\]

\[
\log \frac{p(y=0 \mid x)}{p(y=1 \mid x)} = \log \frac{p(y=0)}{p(y=1)}
+ \log \frac{\frac{1}{\sqrt{\det(2\pi\Sigma_0)}}\, e^{-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det(2\pi\Sigma_1)}}\, e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}
\]

This function has the constant component:

\[
c = \log \frac{p(y=0)}{p(y=1)} + \frac{1}{2} \log \frac{\det(2\pi\Sigma_1)}{\det(2\pi\Sigma_0)}
\]

We observe:
Priors imply a constant offset of the decision boundary.
If priors and covariance matrices of both classes are identical, this offset is $c = 0$.

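Furthermore we have:

\[
\log \frac{e^{-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}
= \frac{1}{2} \left( (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right)
\]

\[
= \frac{1}{2} \left( x^T (\Sigma_1^{-1} - \Sigma_0^{-1})\, x - 2 (\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1})\, x + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right)
\]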

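Now we have:

\[
A = \frac{1}{2} (\Sigma_1^{-1} - \Sigma_0^{-1})
\]

\[
\alpha^T = \mu_0^T \Sigma_0^{-1} - \mu_1^T \Sigma_1^{-1}
\]

\[
\alpha_0 = \log \frac{p(y=0)}{p(y=1)} + \frac{1}{2} \left( \log \frac{\det(2\pi\Sigma_1)}{\det(2\pi\Sigma_0)} + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right)
\]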
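To make the coefficients concrete, here is a small sketch for the one-dimensional case ($d = 1$), where $\Sigma_y$ reduces to a variance and $A$, $\alpha$, $\alpha_0$ are scalars. It compares the quadratic form with the log-odds computed directly from the densities. Function names and parameter values are assumptions for illustration only.

```python
import math

def quadratic_coefficients(prior0, mu0, var0, mu1, var1):
    """Return (A, alpha, alpha0) for 1-D Gaussians so that
    F(x) = A*x**2 + alpha*x + alpha0 is the log-odds of class 0."""
    A = 0.5 * (1.0 / var1 - 1.0 / var0)
    alpha = mu0 / var0 - mu1 / var1
    alpha0 = math.log(prior0 / (1.0 - prior0)) + 0.5 * (
        math.log(var1 / var0) + mu1 ** 2 / var1 - mu0 ** 2 / var0
    )
    return A, alpha, alpha0

def log_gaussian(x, mu, var):
    """Log-density of N(mu, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def direct_log_odds(x, prior0, mu0, var0, mu1, var1):
    """log p(y=0|x)/p(y=1|x), computed from priors and densities."""
    return (math.log(prior0 / (1.0 - prior0))
            + log_gaussian(x, mu0, var0) - log_gaussian(x, mu1, var1))

# Hypothetical class parameters with different variances.
ARGS = (0.4, -1.0, 0.5, 2.0, 2.0)  # prior0, mu0, var0, mu1, var1
A, alpha, alpha0 = quadratic_coefficients(*ARGS)
for x in (-2.0, 0.0, 1.5):
    print(x, A * x * x + alpha * x + alpha0, direct_log_odds(x, *ARGS))
```

With equal variances the quadratic coefficient `A` vanishes, which anticipates the linear case.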

Figure: Two sample sets and the Gaussian decision boundary (axes $x_1$, $x_2$).

Figure: Shift of the decision boundary by setting identical priors: $p(y) = 1/2$ (axes $x_1$, $x_2$).

Example (cont.): If both classes share the same covariance, i.e. $\Sigma = \Sigma_0 = \Sigma_1$, then the argument of the sigmoid function is linear in the components of $x$:

\[
A = 0
\]

\[
\alpha^T = (\mu_0 - \mu_1)^T \Sigma^{-1}
\]

\[
\alpha_0 = \log \frac{p(y=0)}{p(y=1)} + \frac{1}{2} (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_1 - \mu_0)
\]
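The shared-covariance case can be sketched for $d = 2$ with plain lists. The example below is illustrative only: the helper functions, means, and covariance matrix are assumptions. It computes $\alpha$ and $\alpha_0$ and evaluates $F(x) = \alpha^T x + \alpha_0$; with equal priors the midpoint between the two means should lie on the boundary.

```python
import math

def inverse_2x2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def linear_coefficients(prior0, mu0, mu1, Sigma):
    """Return (alpha, alpha0) for a shared, symmetric covariance Sigma."""
    Sinv = inverse_2x2(Sigma)
    diff = [mu0[0] - mu1[0], mu0[1] - mu1[1]]      # mu0 - mu1
    total = [mu0[0] + mu1[0], mu0[1] + mu1[1]]     # mu0 + mu1
    alpha = matvec(Sinv, diff)                     # Sigma^{-1} (mu0 - mu1)
    neg_diff = [-diff[0], -diff[1]]                # mu1 - mu0
    alpha0 = (math.log(prior0 / (1.0 - prior0))
              + 0.5 * dot(total, matvec(Sinv, neg_diff)))
    return alpha, alpha0

def F(x, alpha, alpha0):
    """Log-odds of class 0, affine in x."""
    return dot(alpha, x) + alpha0

# Hypothetical parameters for illustration.
MU0, MU1 = [1.0, 2.0], [4.0, 3.0]
SIGMA = [[2.0, 0.5], [0.5, 1.0]]
alpha, alpha0 = linear_coefficients(0.5, MU0, MU1, SIGMA)
midpoint = [(MU0[0] + MU1[0]) / 2.0, (MU0[1] + MU1[1]) / 2.0]
print(F(midpoint, alpha, alpha0), F(MU0, alpha, alpha0), F(MU1, alpha, alpha0))
```

Changing the prior only changes `alpha0`, so the boundary shifts in parallel without rotating.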

Figure: Identical covariances lead to a linear decision boundary (axes $x_1$, $x_2$).

Figure: Quadratic and linear decision boundary in comparison (axes $x_1$, $x_2$).

Note: If the class conditionals are Gaussians and share the same covariance, the argument of the exponential function is affine in $x$. This result also holds for a more general family of pdfs and is not limited to Gaussians.

Definition: The exponential family is a class of pdfs that can be written in the following canonical form

\[
p(x; \theta, \phi) = e^{\frac{\theta^T x - b(\theta)}{a(\phi)} + c(x, \phi)}
\]

where $\theta \in \mathbb{R}^d$ is the location parameter vector and $\phi$ the dispersion parameter.

Example: Binomial, Poisson, hypergeometric, and exponential distributions as well as Gaussians belong to the exponential family.

Lemma: If all class-conditional densities are members of the same exponential family distribution with equal dispersion $\phi$, the decision boundary $F(x) = 0$ is linear in the components of $x$.
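A short derivation sketch of why the lemma holds, assuming the canonical form $p(x; \theta, \phi) = \exp\big((\theta^T x - b(\theta))/a(\phi) + c(x, \phi)\big)$ and class-specific parameters $\theta_0$ and $\theta_1$ (the subscripts are notation introduced here for illustration, not taken from the slides):

```latex
\begin{align*}
F(x) &= \log\frac{p(y=0)}{p(y=1)} + \log\frac{p(x;\theta_0,\phi)}{p(x;\theta_1,\phi)} \\
     &= \log\frac{p(y=0)}{p(y=1)}
        + \frac{\theta_0^T x - b(\theta_0)}{a(\phi)} + c(x,\phi)
        - \frac{\theta_1^T x - b(\theta_1)}{a(\phi)} - c(x,\phi) \\
     &= \frac{(\theta_0 - \theta_1)^T x}{a(\phi)}
        + \log\frac{p(y=0)}{p(y=1)} - \frac{b(\theta_0) - b(\theta_1)}{a(\phi)}
\end{align*}
```

The term $c(x, \phi)$ cancels because both classes share the dispersion $\phi$, so only an affine function of $x$ remains.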

Log-Likelihood Function

Let us assume the posteriors are given by

\[
p(y=0 \mid x) = 1 - g(\theta^T x)
\]

\[
p(y=1 \mid x) = g(\theta^T x)
\]

where $g(\theta^T x)$ is the sigmoid function parameterized in $\theta$. The parameter vector $\theta$ has to be estimated from a set $S$ of $m$ training samples:

\[
S = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_m, y_m)\}.
\]

Method of choice: Maximum Likelihood Estimation

Before we work on the formulas of the ML-estimator, we rewrite the posteriors using Bernoulli probability:

\[
p(y \mid x) = g(\theta^T x)^{y}\, (1 - g(\theta^T x))^{1 - y}
\]

which shows the great benefit of the chosen notation for class numbers.

Now we can compute the log-likelihood function (assuming that the training samples are mutually independent):

\[
l(\theta) = \sum_{i=1}^{m} \log p(y_i \mid x_i)
= \sum_{i=1}^{m} \log \left( g(\theta^T x_i)^{y_i}\, (1 - g(\theta^T x_i))^{1 - y_i} \right)
= \sum_{i=1}^{m} y_i \log g(\theta^T x_i) + (1 - y_i) \log (1 - g(\theta^T x_i))
\]
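The log-likelihood translates almost literally into code. This is a minimal sketch using plain lists; the toy data set and the convention that the first feature is a constant 1 (bias term) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

# Made-up toy data: first component is a constant 1 acting as bias feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

print(log_likelihood([0.0, 0.0], X, y))  # every sample has probability 0.5
print(log_likelihood([0.0, 1.0], X, y))  # a parameter vector that fits better
```

For $\theta = 0$ every sample has probability $1/2$, so the value is $m \log \tfrac{1}{2}$; a $\theta$ that matches the labels gives a larger (less negative) value.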

Notes for the expert:
The negative of the log-likelihood function is the cross entropy of $y$ and $g(\theta^T x)$.
The negative of the log-likelihood function is a convex function.

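Gradient of the Log-Likelihood Function

The gradient:

\[
\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( \frac{y_i}{g(\theta^T x_i)} - \frac{1 - y_i}{1 - g(\theta^T x_i)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x_i)
\]

Now we use the derivative of the sigmoid function and get

\[
\frac{\partial}{\partial \theta_j} l(\theta)
= \sum_{i=1}^{m} \left( \frac{y_i}{g(\theta^T x_i)} - \frac{1 - y_i}{1 - g(\theta^T x_i)} \right) g(\theta^T x_i)\,(1 - g(\theta^T x_i))\, x_{i,j}
= \sum_{i=1}^{m} \left( y_i (1 - g(\theta^T x_i)) - (1 - y_i)\, g(\theta^T x_i) \right) x_{i,j}
\]

where $x_{i,j}$ is the $j$-th component of the $i$-th training feature vector.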

Finally we have a quite simple gradient:

\[
\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( y_i - g(\theta^T x_i) \right) x_{i,j}
\]

Or in vector notation:

\[
\frac{\partial}{\partial \theta} l(\theta) = \sum_{i=1}^{m} \left( y_i - g(\theta^T x_i) \right) x_i
\]
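One way to gain confidence in the closed-form gradient is to compare it with finite differences of the log-likelihood. The sketch below does this on a made-up data set; the data, the test point `THETA`, and the step size are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood of logistic regression."""
    return sum(
        y_i * math.log(sigmoid(dot(theta, x_i)))
        + (1 - y_i) * math.log(1.0 - sigmoid(dot(theta, x_i)))
        for x_i, y_i in zip(X, y)
    )

def gradient(theta, X, y):
    """Closed form: sum_i (y_i - g(theta^T x_i)) * x_i."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        residual = y_i - sigmoid(dot(theta, x_i))
        for j, x_ij in enumerate(x_i):
            grad[j] += residual * x_ij
    return grad

def numerical_gradient(theta, X, y, h=1e-6):
    """Central differences of the log-likelihood, one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += h
        minus[j] -= h
        grad.append((log_likelihood(plus, X, y) - log_likelihood(minus, X, y)) / (2.0 * h))
    return grad

# Made-up, non-separable toy data; first component is a constant bias feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5]]
y = [0, 1, 0, 1, 1]
THETA = [0.3, -0.2]

print(gradient(THETA, X, y))
print(numerical_gradient(THETA, X, y))
```

Each sample contributes its feature vector weighted by the prediction error $y_i - g(\theta^T x_i)$, so well-predicted samples barely move $\theta$.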

Hessian of the Log-Likelihood Function

The log-likelihood function is concave. We use the Newton-Raphson algorithm to solve the unconstrained optimization problem. For that purpose the Hessian is required (remember the derivative of the sigmoid function!):

\[
\frac{\partial^2}{\partial \theta\, \partial \theta^T} l(\theta) = -\sum_{i=1}^{m} g(\theta^T x_i) \left( 1 - g(\theta^T x_i) \right) x_i x_i^T
\]

Newton-Raphson Iteration

For the $(k+1)$-st iteration step, we get:

\[
\theta^{(k+1)} = \theta^{(k)} - \left( \frac{\partial^2}{\partial \theta\, \partial \theta^T} l(\theta) \right)^{-1} \frac{\partial}{\partial \theta} l(\theta)
\]

Note: If you write the Newton-Raphson iteration in matrix form, you will end up with a weighted least squares iteration scheme.
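The iteration can be sketched for the smallest interesting case: two parameters (a bias and one feature), so that the 2x2 Hessian can be inverted by hand. The data set, the fixed iteration count, and the zero start vector are assumptions for illustration. No step-size control is included, so this is not a robust solver, and it presumes non-separable data so that a finite maximizer exists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - g(theta^T x_i)) x_i."""
    g0 = g1 = 0.0
    for (x0, x1), y_i in zip(X, y):
        r = y_i - sigmoid(theta[0] * x0 + theta[1] * x1)
        g0 += r * x0
        g1 += r * x1
    return [g0, g1]

def hessian(theta, X):
    """Hessian: -sum_i g_i (1 - g_i) x_i x_i^T, returned as (h00, h01, h11)."""
    h00 = h01 = h11 = 0.0
    for x0, x1 in X:
        p = sigmoid(theta[0] * x0 + theta[1] * x1)
        w = p * (1.0 - p)
        h00 -= w * x0 * x0
        h01 -= w * x0 * x1
        h11 -= w * x1 * x1
    return h00, h01, h11

def newton_raphson(X, y, iterations=25):
    """Plain Newton-Raphson: theta <- theta - H^{-1} grad, starting at zero."""
    theta = [0.0, 0.0]
    for _ in range(iterations):
        g0, g1 = gradient(theta, X, y)
        h00, h01, h11 = hessian(theta, X)
        det = h00 * h11 - h01 * h01
        # Solve H * step = grad with the explicit inverse of a symmetric 2x2 matrix.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (-h01 * g0 + h00 * g1) / det
        theta = [theta[0] - step0, theta[1] - step1]
    return theta

# Made-up, non-separable toy data; first component is a constant bias feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5]]
y = [0, 1, 0, 1, 1]

theta_hat = newton_raphson(X, y)
print(theta_hat, gradient(theta_hat, X, y))
```

At the maximizer the gradient is numerically zero. For larger problems one would typically solve the linear system with a numerical library rather than invert the Hessian explicitly.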

Perceptron and Logistic Regression

Lessons Learned

Posteriors can be rewritten in terms of a logistic function.
Given the decision boundary $F(x) = 0$, we can write down the posterior $p(y \mid x)$ right away.
The decision boundary for normally distributed feature vectors for each class is a quadratic function.
If the Gaussians share the same covariance, the decision boundary is a linear function.

Further Readings

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
D. W. Hosmer and S. Lemeshow. Applied Logistic Regression, 2nd Edition. John Wiley & Sons, Hoboken, 2000.