
Chapter 12
Logistic Regression

12.1 Modeling Conditional Probabilities

So far, we either looked at estimating the conditional expectations of continuous variables (as in regression), or at estimating distributions. There are many situations, however, where we are interested in input-output relationships, as in regression, but the output variable is discrete rather than continuous. In particular there are many situations where we have binary outcomes (it snows in Pittsburgh on a given day, or it doesn't; this squirrel carries plague, or it doesn't; this loan will be paid back, or it won't; this person will get heart disease in the next five years, or they won't). In addition to the binary outcome, we have some input variables, which may or may not be continuous. How could we model and analyze such data?

We could try to come up with a rule which guesses the binary output from the input variables. This is called classification, and is an important topic in statistics and machine learning. However, simply guessing "yes" or "no" is pretty crude, especially if there is no perfect rule. (Why should there be?) Something which takes noise into account, and doesn't just give a binary answer, will often be useful. In short, we want probabilities, which means we need to fit a stochastic model.

What would be nice, in fact, would be to have the conditional distribution of the response Y, given the input variables, Pr(Y | X). This would tell us about how precise our predictions are. If our model says that there's a 51% chance of snow and it doesn't snow, that's better than if it had said there was a 99% chance of snow (though even a 99% chance is not a sure thing). We have seen how to estimate conditional probabilities nonparametrically, and could do this using the kernels for discrete variables from lecture 6. While there are a lot of merits to this approach, it does involve coming up with a model for the joint distribution of outputs Y and inputs X, which can be quite time-consuming.

Let's pick one of the classes and call it "1" and the other "0". (It doesn't matter which is which.) Then Y becomes an indicator variable, and you can convince yourself that Pr(Y = 1) = E[Y]. Similarly, Pr(Y = 1 | X = x) = E[Y | X = x]. (In a phrase, conditional probability is the conditional expectation of the indicator.)
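The last identity is easy to check by simulation; here is a minimal R sketch (the probability 0.3 and the sample size are made up for illustration, not taken from the text).

# Minimal sketch: for a 0/1 indicator variable, the sample mean estimates
# the probability of the event, since E[Y] = Pr(Y = 1).
y <- rbinom(1e5, size = 1, prob = 0.3)   # indicator draws with Pr(Y = 1) = 0.3
mean(y)                                  # will be close to 0.3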

This helps us because by this point we know all about estimating conditional expectations. The most straightforward thing for us to do at this point would be to pick out our favorite smoother and estimate the regression function for the indicator variable; this will be an estimate of the conditional probability function.

There are two reasons not to just plunge ahead with that idea. One is that probabilities must be between 0 and 1, but our smoothers will not necessarily respect that, even if all the observed y_i they get are either 0 or 1. The other is that we might be better off making more use of the fact that we are trying to estimate probabilities, by more explicitly modeling the probability.

Assume that Pr(Y = 1 | X = x) = p(x; θ), for some function p parameterized by θ, and further assume that observations are independent of each other. Then the (conditional) likelihood function is

\prod_{i=1}^{n} \Pr(Y = y_i \mid X = x_i) = \prod_{i=1}^{n} p(x_i;\theta)^{y_i} (1 - p(x_i;\theta))^{1-y_i}    (12.1)

Recall that in a sequence of Bernoulli trials y_1, ..., y_n, where there is a constant probability of success p, the likelihood is

\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}    (12.2)

As you learned in intro stats, this likelihood is maximized when p = \hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i. If each trial had its own success probability p_i, this likelihood becomes

\prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}    (12.3)

Without some constraints, estimating the inhomogeneous Bernoulli model by maximum likelihood doesn't work; we'd get \hat{p}_i = 1 when y_i = 1, \hat{p}_i = 0 when y_i = 0, and learn nothing. If on the other hand we assume that the p_i aren't just arbitrary numbers but are linked together, those constraints give nontrivial parameter estimates, and let us generalize. In the kind of model we are talking about, the constraint, p_i = p(x_i; θ), tells us that p_i must be the same whenever x_i is the same, and if p is a continuous function, then similar values of x_i must lead to similar values of p_i. Assuming p is known (up to parameters), the likelihood is a function of θ, and we can estimate θ by maximizing the likelihood. This lecture will be about this approach.

12.2 Logistic Regression

To sum up: we have a binary output variable Y, and we want to model the conditional probability Pr(Y = 1 | X = x) as a function of x; any unknown parameters in the function are to be estimated by maximum likelihood. By now, it will not surprise you to learn that statisticians have approached this problem by asking themselves "how can we use linear regression to solve this?"

1. The most obvious idea is to let p(x) be a linear function of x. Every increment of a component of x would add or subtract so much to the probability. The conceptual problem here is that p must be between 0 and 1, and linear functions are unbounded. Moreover, in many situations we empirically see "diminishing returns": changing p by the same amount requires a bigger change in x when p is already large (or small) than when p is close to 1/2. Linear models can't do this.

2. The next most obvious idea is to let log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount. The problem is that logarithms are unbounded in only one direction, and linear functions are not.

3. Finally, the easiest modification of log p which has an unbounded range is the logistic (or logit) transformation, log p/(1 − p). We can make this a linear function of x without fear of nonsensical results. (Of course the results could still happen to be wrong, but they're not guaranteed to be wrong.)

This last alternative is logistic regression. Formally, the logistic regression model is that

\log \frac{p(x)}{1-p(x)} = \beta_0 + x \cdot \beta    (12.4)

Solving for p, this gives

p(x;\beta_0,\beta) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}}    (12.5)

Notice that the overall specification is a lot easier to grasp in terms of the transformed probability than in terms of the untransformed probability.¹

To minimize the misclassification rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This means guessing 1 whenever β_0 + x·β is non-negative, and 0 otherwise. So logistic regression gives us a linear classifier. The decision boundary separating the two predicted classes is the solution of β_0 + x·β = 0, which is a point if x is one-dimensional, a line if it is two-dimensional, etc. One can show (exercise!) that the distance from the decision boundary is β_0/‖β‖ + x·β/‖β‖. Logistic regression not only says where the boundary between the classes is, but also says (via Eq. 12.5) that the class probabilities depend on distance from the boundary, in a particular way, and that they go towards the extremes (0 and 1) more rapidly when ‖β‖ is larger. It's these statements about probabilities which make logistic regression more than just a classifier. It makes stronger, more detailed predictions, and can be fit in a different way; but those strong predictions could be wrong. Using logistic regression to predict class probabilities is a modeling choice, just like it's a modeling choice to predict quantitative variables with linear regression.

¹ Unless you've taken statistical mechanics, in which case you recognize that this is the Boltzmann distribution for a system with two states, which differ in energy by β_0 + x·β.
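As a quick illustration of Eq. 12.5 and the classifier it implies, here is a minimal R sketch; the function name and the coefficient values are made up for illustration, not taken from the text.

logistic.prob <- function(x, beta0, beta) {
  # Eq. 12.5: Pr(Y = 1 | X = x) under the logistic regression model
  1 / (1 + exp(-(beta0 + x %*% beta)))
}
x <- matrix(runif(100 * 2, min = -1, max = 1), ncol = 2)   # two input variables
p <- logistic.prob(x, beta0 = -0.5, beta = c(1, -1))       # illustrative coefficients
y.hat <- ifelse(p >= 0.5, 1, 0)   # predict 1 exactly when beta0 + x.beta >= 0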

[Figure 12.1 here: a 2 x 2 grid of scatterplots of x[,2] against x[,1], one panel for each of the parameter settings listed in the caption; the panel titles read "Logistic regression with b=..., w=..." and "Linear classifier with b=...".]

Figure 12.1: Effects of scaling logistic regression parameters. Values of x1 and x2 are the same in all plots (~ Unif(−1, 1) for both coordinates), but labels were generated randomly from logistic regressions with β_0 = 0.1, β = (−0.2, 0.2) (top left); from β_0 = , β = (−1, 1) (top right); from β_0 = 2.5, β = (−5, 5) (bottom left); and from a perfect linear classifier with the same boundary. The large black dot is the origin.
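For concreteness, labels like those in Figure 12.1 could be generated along the following lines; this is a sketch rather than the author's code, and it uses the bottom-left parameter setting from the caption.

n <- 50
x <- matrix(runif(2 * n, min = -1, max = 1), ncol = 2)   # coordinates ~ Unif(-1, 1)
beta0 <- 2.5
beta <- c(-5, 5)                                         # bottom-left setting in the caption
p <- 1 / (1 + exp(-(beta0 + x %*% beta)))                # Eq. 12.5
labels <- rbinom(n, size = 1, prob = p)                  # random labels from the model
plot(x[, 1], x[, 2], col = ifelse(labels == 1, "black", "grey"),
     xlab = "x[,1]", ylab = "x[,2]")
points(0, 0, pch = 19, cex = 2)                          # the large black dot is the origin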

In neither case is the appropriateness of the model guaranteed by the gods, nature, mathematical necessity, etc. We begin by positing the model, to get something to work with, and we end (if we know what we're doing) by checking whether it really does match the data, or whether it has systematic flaws.

Logistic regression is one of the most commonly used tools for applied statistics and discrete data analysis. There are basically four reasons for this.

1. Tradition.

2. In addition to the heuristic approach above, the quantity log p/(1 − p) plays an important role in the analysis of contingency tables (the "log odds"). Classification is a bit like having a contingency table with two columns (classes) and infinitely many rows (values of x). With a finite contingency table, we can estimate the log-odds for each row empirically, by just taking counts in the table. With infinitely many rows, we need some sort of interpolation scheme; logistic regression is linear interpolation for the log-odds.

3. It's closely related to exponential family distributions, where the probability of some vector v is proportional to \exp\left(\beta_0 + \sum_{j=1}^{m} f_j(v)\beta_j\right). If one of the components of v is binary, and the functions f_j are all the identity function, then we get a logistic regression. Exponential families arise in many contexts in statistical theory (and in physics!), so there are lots of problems which can be turned into logistic regression.

4. It often works surprisingly well as a classifier. But, many simple techniques often work surprisingly well as classifiers, and this doesn't really testify to logistic regression getting the probabilities right.

12.2.1 Likelihood Function for Logistic Regression

Because logistic regression predicts probabilities, rather than just classes, we can fit it using likelihood. For each training data point, we have a vector of features, x_i, and an observed class, y_i. The probability of that class was either p(x_i), if y_i = 1, or 1 − p(x_i), if y_i = 0. The likelihood is then

L(\beta_0,\beta) = \prod_{i=1}^{n} p(x_i)^{y_i} (1 - p(x_i))^{1-y_i}    (12.6)

(I could substitute in the actual equation for p, but things will be clearer in a moment if I don't.) The log-likelihood turns products into sums:

\ell(\beta_0,\beta) = \sum_{i=1}^{n} y_i \log p(x_i) + (1-y_i)\log(1-p(x_i))    (12.7)
 = \sum_{i=1}^{n} \log(1-p(x_i)) + \sum_{i=1}^{n} y_i \log\frac{p(x_i)}{1-p(x_i)}    (12.8)
 = \sum_{i=1}^{n} \log(1-p(x_i)) + \sum_{i=1}^{n} y_i (\beta_0 + x_i \cdot \beta)    (12.9)
 = \sum_{i=1}^{n} -\log(1 + e^{\beta_0 + x_i \cdot \beta}) + \sum_{i=1}^{n} y_i (\beta_0 + x_i \cdot \beta)    (12.10)

where in the next-to-last step we finally use equation 12.4.

Typically, to find the maximum likelihood estimates we'd differentiate the log-likelihood with respect to the parameters, set the derivatives equal to zero, and solve. To start that, take the derivative with respect to one component of β, say β_j:

\frac{\partial \ell}{\partial \beta_j} = -\sum_{i=1}^{n} \frac{e^{\beta_0 + x_i \cdot \beta}}{1 + e^{\beta_0 + x_i \cdot \beta}} x_{ij} + \sum_{i=1}^{n} y_i x_{ij}    (12.11)
 = \sum_{i=1}^{n} \left(y_i - p(x_i;\beta_0,\beta)\right) x_{ij}    (12.12)

We are not going to be able to set this to zero and solve exactly. (That's a transcendental equation, and there is no closed-form solution.) We can however approximately solve it numerically.

12.2.2 Logistic Regression with More Than Two Classes

If Y can take on more than two values, say k of them, we can still use logistic regression. Instead of having one set of parameters β_0, β, each class c in 0 : (k − 1) will have its own offset β_0^{(c)} and vector β^{(c)}, and the predicted conditional probabilities will be

\Pr(Y = c \mid X = x) = \frac{e^{\beta_0^{(c)} + x \cdot \beta^{(c)}}}{\sum_{c'} e^{\beta_0^{(c')} + x \cdot \beta^{(c')}}}    (12.13)

You can check that when there are only two classes (say, 0 and 1), equation 12.13 reduces to equation 12.5, with β_0 = β_0^{(1)} − β_0^{(0)} and β = β^{(1)} − β^{(0)}. In fact, no matter how many classes there are, we can always pick one of them, say c = 0, and fix its parameters at exactly zero, without any loss of generality.² Calculation of the likelihood now proceeds as before (only with more bookkeeping), and so does maximum likelihood estimation.

² Since we can arbitrarily choose which class's parameters to zero out without affecting the predicted probabilities, strictly speaking the model in Eq. 12.13 is unidentified. That is, different parameter settings lead to exactly the same outcomes, so we can't use the data to tell which one is right. The usual response here is to deal with this by a convention: we decide to zero out the parameters of the first class, and then estimate the contrasting parameters for the others.
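Returning to the two-class case, here is a minimal R sketch (not from the text) of how the log-likelihood of Eq. 12.10 and the gradient of Eq. 12.12 might be coded; the function names are made up, and the commented line shows how they could be handed to a general-purpose optimizer such as optim.

neg.loglik <- function(theta, x, y) {
  # theta = (beta_0, beta); x is an n-by-p matrix, y a 0/1 vector
  eta <- theta[1] + x %*% theta[-1]        # beta_0 + x_i . beta for each i
  -sum(y * eta - log(1 + exp(eta)))        # minus the log-likelihood of Eq. 12.10
}
neg.grad <- function(theta, x, y) {
  eta <- theta[1] + x %*% theta[-1]
  p <- 1 / (1 + exp(-eta))                 # Eq. 12.5
  -c(sum(y - p), t(x) %*% (y - p))         # minus Eq. 12.12, plus the beta_0 component
}
# fit <- optim(par = rep(0, ncol(x) + 1), fn = neg.loglik, gr = neg.grad,
#              x = x, y = y, method = "BFGS")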

12.3 Newton's Method for Numerical Optimization

There are a huge number of methods for numerical optimization; we can't cover all the bases, and there is no magical method which will always work better than anything else. However, there are some methods which work very well on an awful lot of the problems which keep coming up, and it's worth spending a moment to sketch how they work. One of the most ancient yet important of them is Newton's method (alias "Newton-Raphson").

Let's start with the simplest case of minimizing a function of one scalar variable, say f(β). We want to find the location of the global minimum, β*. We suppose that f is smooth, and that β* is a regular interior minimum, meaning that the derivative at β* is zero and the second derivative is positive. Near the minimum we could make a Taylor expansion:

f(\beta) \approx f(\beta^*) + \frac{1}{2}(\beta - \beta^*)^2 \left.\frac{d^2 f}{d\beta^2}\right|_{\beta=\beta^*}    (12.14)

(We can see here that the second derivative has to be positive to ensure that f(β) > f(β*).) In words, f(β) is close to quadratic near the minimum.

Newton's method uses this fact, and minimizes a quadratic approximation to the function we are really interested in. (In other words, Newton's method is to replace the problem we want to solve with a problem which we can solve.) Guess an initial point β^(0). If this is close to the minimum, we can take a second-order Taylor expansion around β^(0) and it will still be accurate:

f(\beta) \approx f(\beta^{(0)}) + (\beta - \beta^{(0)}) \left.\frac{df}{d\beta}\right|_{\beta=\beta^{(0)}} + \frac{1}{2}(\beta - \beta^{(0)})^2 \left.\frac{d^2 f}{d\beta^2}\right|_{\beta=\beta^{(0)}}    (12.15)

Now it's easy to minimize the right-hand side of equation 12.15. Let's abbreviate the derivatives, because they get tiresome to keep writing out: write f'(β^(0)) for the first derivative at β^(0), and f''(β^(0)) for the second derivative there. We just take the derivative with respect to β, and set it equal to zero at a point we'll call β^(1):

0 = f'(\beta^{(0)}) + \frac{1}{2} f''(\beta^{(0)}) \, 2(\beta^{(1)} - \beta^{(0)})    (12.16)

\beta^{(1)} = \beta^{(0)} - \frac{f'(\beta^{(0)})}{f''(\beta^{(0)})}    (12.17)

The value β^(1) should be a better guess at the minimum β* than the initial one β^(0) was. So if we use it to make a quadratic approximation to f, we'll get a better approximation, and so we can iterate this procedure, minimizing one approximation and then using that to get a new approximation:

\beta^{(n+1)} = \beta^{(n)} - \frac{f'(\beta^{(n)})}{f''(\beta^{(n)})}    (12.18)

Notice that the true minimum β* is a fixed point of equation 12.18: if we happen to land on it, we'll stay there (since f'(β*) = 0). We won't show it, but it can be proved that if β^(0) is close enough to β*, then β^(n) → β*, and that in general |β^(n) − β*| = O(n^{-2}), a very rapid rate of convergence. (Doubling the number of iterations we use doesn't reduce the error by a factor of two, but by a factor of four.)

Let's put this together in an algorithm.

my.newton = function(f, f.prime, f.prime2, beta0, tolerance = 1e-3, max.iter = 50) {
  beta = beta0
  old.f = f(beta)
  iterations = 0
  made.changes = TRUE
  while (made.changes & (iterations < max.iter)) {
    iterations <- iterations + 1
    made.changes <- FALSE
    new.beta = beta - f.prime(beta)/f.prime2(beta)    # the Newton update (Eq. 12.18)
    new.f = f(new.beta)
    relative.change = abs(new.f - old.f)/abs(old.f)   # relative change in the objective
    made.changes = (relative.change > tolerance)
    beta = new.beta
    old.f = new.f
  }
  if (made.changes) {
    warning("Newton's method terminated before convergence")
  }
  return(list(minimum = beta, value = f(beta), deriv = f.prime(beta),
              deriv2 = f.prime2(beta), iterations = iterations,
              converged = !made.changes))
}

The first three arguments here all have to be functions. The fourth argument is our initial guess for the minimum, β^(0). The last arguments keep Newton's method from cycling forever: tolerance tells it to stop when the function stops changing very much (the relative difference between f(β^(n)) and f(β^(n+1)) is small), and max.iter tells it never to do more than a certain number of steps no matter what. The return value includes the estimated minimum, the value of the function there, and some diagnostics: the derivative should be very small, the second derivative should be positive, etc.

You may have noticed some potential problems: what if we land on a point where f'' is zero? What if f(β^(n+1)) > f(β^(n))? Etc. There are ways of handling these issues, and more, which are incorporated into real optimization algorithms from numerical analysis, such as the optim function in R; I strongly recommend you use that, or something like that, rather than trying to roll your own optimization code.³

³ optim actually is a wrapper for several different optimization methods; method="BFGS" selects a Newtonian method; BFGS is an acronym for the names of the algorithm's inventors.

12.3.1 Newton's Method in More than One Dimension

Suppose that the objective f is a function of multiple arguments, f(β_1, β_2, ..., β_p). Let's bundle the parameters into a single vector, β. Then the Newton update is

\beta^{(n+1)} = \beta^{(n)} - H^{-1}(\beta^{(n)}) \, \nabla f(\beta^{(n)})    (12.19)

where ∇f is the gradient of f, its vector of partial derivatives [∂f/∂β_1, ∂f/∂β_2, ..., ∂f/∂β_p], and H is the Hessian of f, its matrix of second partial derivatives, H_ij = ∂²f/∂β_i ∂β_j.

Calculating H and ∇f isn't usually very time-consuming, but taking the inverse of H is, unless it happens to be a diagonal matrix. This leads to various quasi-Newton methods, which either approximate H by a diagonal matrix, or take a proper inverse of H only rarely (maybe just once), and then try to update an estimate of H^{-1}(β^{(n)}) as β^{(n)} changes.

12.3.2 Iteratively Re-Weighted Least Squares

This discussion of Newton's method is quite general, and therefore abstract. In the particular case of logistic regression, we can make everything look much more statistical. Logistic regression, after all, is a linear model for a transformation of the probability. Let's call this transformation g:

g(p) \equiv \log \frac{p}{1-p}    (12.20)

So the model is

g(p) = \beta_0 + x \cdot \beta    (12.21)

and Y | X = x ~ Binom(1, g^{-1}(β_0 + x·β)). It seems that what we should want to do is take g(y) and regress it linearly on x. Of course, the variance of Y, according to the model, is going to change depending on x; it will be (g^{-1}(β_0 + x·β))(1 − g^{-1}(β_0 + x·β)), so we really ought to do a weighted linear regression, with weights inversely proportional to that variance. Since writing β_0 + x·β is getting annoying, let's abbreviate it by µ (for "mean"), and let's abbreviate that variance as V(µ). The problem is that y is either 0 or 1, so g(y) is either −∞ or +∞. We will evade this by using a Taylor expansion:

g(y) \approx g(\mu) + (y - \mu) g'(\mu) \equiv z    (12.22)

The right-hand side, z, will be our effective response variable. To regress it, we need its variance, which by propagation of error will be (g'(µ))² V(µ).

Notice that both the weights and z depend on the parameters of our logistic regression, through µ. So having done this once, we should really use the new parameters to update z and the weights, and do it again. Eventually, we come to a fixed point, where the parameter estimates no longer change.

The treatment above is rather heuristic⁴, but it turns out to be equivalent to using Newton's method, with the expected second derivative of the log-likelihood instead of its actual value.⁵ Since, with a large number of observations, the observed second derivative should be close to the expected second derivative, this is only a small approximation.

12.4 Generalized Linear Models and Generalized Additive Models

Logistic regression is part of a broader family of generalized linear models (GLMs), where the conditional distribution of the response falls in some parametric family, and the parameters are set by the linear predictor. Ordinary least-squares regression is the case where the response is Gaussian, with mean equal to the linear predictor, and constant variance. Logistic regression is the case where the response is binomial, with n equal to the number of data points with the given x (often but not always 1), and p given by Equation 12.5. Changing the relationship between the parameters and the linear predictor is called changing the link function. For computational reasons, the link function is actually the function you apply to the mean response to get back the linear predictor, rather than the other way around: (12.4) rather than (12.5). There are thus other forms of binomial regression besides logistic regression.⁶ There is also Poisson regression (appropriate when the data are counts without any upper limit), gamma regression, etc.; we will say more about these in Chapter 13.

In R, any standard GLM can be fit using the (base) glm function, whose syntax is very similar to that of lm. The major wrinkle is that, of course, you need to specify the family of probability distributions to use, by the family option; family=binomial defaults to logistic regression. (See help(glm) for the gory details on how to do, say, probit regression.) All of these are fit by the same sort of numerical likelihood maximization.

One caution about using maximum likelihood to fit logistic regression is that it can seem to work badly when the training data can be linearly separated. The reason is that, to make the likelihood large, p(x_i) should be large when y_i = 1, and small when y_i = 0. If β_0, β is a set of parameters which perfectly classifies the training data, then cβ_0, cβ is too, for any c > 1, but in a logistic regression the second set of parameters will have more extreme probabilities, and so a higher likelihood.

⁴ That is, mathematically incorrect.
⁵ This takes a reasonable amount of algebra to show, so we'll skip it. The key point, however, is the following. Take a single Bernoulli observation with success probability p. The log-likelihood is Y log p + (1 − Y) log(1 − p). The first derivative with respect to p is Y/p − (1 − Y)/(1 − p), and the second derivative is −Y/p² − (1 − Y)/(1 − p)². Taking expectations of the second derivative gives −1/p − 1/(1 − p) = −1/(p(1 − p)). In other words, V(p) = −1/E[second derivative]. Using weights inversely proportional to the variance thus turns out to be equivalent to dividing by the expected second derivative.
⁶ My experience is that these tend to give similar error rates as classifiers, but have rather different guesses about the underlying probabilities.

For linearly separable data, then, there is no parameter vector which maximizes the likelihood, since the likelihood can always be increased by making the vector larger but keeping it pointed in the same direction. You should, of course, be so lucky as to have this problem.

12.4.1 Generalized Additive Models

A natural step beyond generalized linear models is generalized additive models (GAMs), where instead of making the transformed mean response a linear function of the inputs, we make it an additive function of the inputs. This means combining a function for fitting additive models with likelihood maximization. The R function here is gam, from the CRAN package of the same name. (Alternately, use the function gam in the package mgcv, which is part of the default R installation.) We will look at how this works in some detail in Chapter 13.

GAMs can be used to check GLMs in much the same way that smoothers can be used to check parametric regressions: fit a GAM and a GLM to the same data, then simulate from the GLM, and refit both models to the simulated data. Repeated many times, this gives a distribution for how much better the GAM will seem to fit than the GLM does, even when the GLM is true. You can then read a p-value off of this distribution.

12.4.2 An Example (Including Model Checking)

Here's a worked R example, using the data from the upper right panel of Figure 12.1. The 50 x 2 matrix x holds the input variables (the coordinates are independently and uniformly distributed on [−1, 1]), and y.1 the corresponding class labels, themselves generated from a logistic regression with β_0 = , β = (−1, 1).

> logr = glm(y.1 ~ x[,1] + x[,2], family=binomial)
> logr

Call:  glm(formula = y.1 ~ x[, 1] + x[, 2], family = binomial)

Coefficients:
(Intercept)       x[, 1]       x[, 2]
      0.410           50        1.366

Degrees of Freedom: 49 Total (i.e. Null);  47 Residual
Null Deviance:     68.59
Residual Deviance: 58.81        AIC: 64.81

> sum(ifelse(logr$fitted.values < 0.5, 0, 1) != y.1)/length(y.1)
[1] 0.32

The deviance of a model fitted by maximum likelihood is (twice) the difference between its log-likelihood and the maximum log-likelihood for a saturated model, i.e., a model with one parameter per observation.

Hopefully, the saturated model can give a perfect fit.⁷ Here the saturated model would assign probability 1 to the observed outcomes⁸, and the logarithm of 1 is zero, so D = −2ℓ(β̂_0, β̂). The null deviance is what's achievable by using just a constant bias β_0 and setting β = 0. The fitted model definitely improves on that.⁹ The fitted values of the logistic regression are the class probabilities; this shows that the error rate of the logistic regression, if you force it to predict actual classes, is 32%. This sounds bad, but notice from the contour lines in the figure that lots of the probabilities are near 0.5, meaning that the classes are just genuinely hard to predict.

To see how well the logistic regression assumption holds up, let's compare this to a GAM.¹⁰

> library(gam)
> gam.1 = gam(y.1 ~ lo(x[,1]) + lo(x[,2]), family="binomial")
> gam.1
Call:
gam(formula = y.1 ~ lo(x[, 1]) + lo(x[, 2]), family = "binomial")

Degrees of Freedom: 49 total; 41.39957 Residual
Residual Deviance: 49.17522

This fits a GAM to the same data, using lowess smoothing of both input variables. Notice that the residual deviance is lower. That is, the GAM fits better. We expect this; the question is whether the difference is significant, or within the range of what we should expect when logistic regression is valid. To test this, we need to simulate from the logistic regression model.

simulate.from.logr = function(x, coefs) {
  require(faraway)  # for accessible logit and inverse-logit functions
  n = nrow(x)
  linear.part = coefs[1] + x %*% coefs[-1]
  probs = ilogit(linear.part)  # inverse logit
  y = rbinom(n, size = 1, prob = probs)
  return(y)
}

Now we simulate from our fitted model, and refit both the logistic regression and the GAM.

⁷ The factor of two is so that the deviance will have a χ² distribution. Specifically, if the model with p parameters is right, the deviance will have a χ² distribution with n − p degrees of freedom.
⁸ This is not possible when there are multiple observations with the same input features, but different classes.
⁹ AIC is of course the Akaike information criterion, −2ℓ + 2q, with q being the number of parameters (here, q = 3). AIC has some truly devoted adherents, especially among non-statisticians, but I have been deliberately ignoring it and will continue to do so. Basically, to the extent AIC succeeds, it works as a fast, large-sample approximation to doing leave-one-out cross-validation. Claeskens and Hjort (2008) is a thorough, modern treatment of AIC and related model-selection criteria from a statistical viewpoint.
¹⁰ Previous examples of using GAMs have mostly used the mgcv package and spline smoothing. There is no particular reason to switch to the gam library and lowess smoothing here, but there's also no real reason not to.

delta.deviance.sim = function(x, logistic.model) {
  y.new = simulate.from.logr(x, logistic.model$coefficients)
  GLM.dev = glm(y.new ~ x[,1] + x[,2], family="binomial")$deviance
  GAM.dev = gam(y.new ~ lo(x[,1]) + lo(x[,2]), family="binomial")$deviance
  return(GLM.dev - GAM.dev)
}

Notice that in this simulation we are not generating new X values. The logistic regression and the GAM are both models for the response conditional on the inputs, and are agnostic about how the inputs are distributed, or even whether it's meaningful to talk about their distribution.

Finally, we repeat the simulation a bunch of times, and see where the observed difference in deviances falls in the sampling distribution.

> delta.dev = replicate(1000, delta.deviance.sim(x, logr))
> delta.dev.observed = logr$deviance - gam.1$deviance   # 9.64
> sum(delta.dev.observed > delta.dev)/1000
[1] 0.685

In other words, the amount by which a GAM fits the data better than logistic regression is pretty near the middle of the null distribution. Since the example data really did come from a logistic regression, this is a relief.

[Figure 12.2 here: a density plot titled "Sampling distribution under logistic regression", with x-axis "Amount by which GAM fits better than logistic regression" and y-axis "Density"; the plot annotation reports N = 1000, Bandwidth = 0.8386.]

Figure 12.2: Sampling distribution for the difference in deviance between a GAM and a logistic regression, on data generated from a logistic regression. The observed difference in deviances is shown by the dashed horizontal line.
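A plot along the lines of Figure 12.2 can be drawn from the simulation output above; this is a minimal sketch, not the author's plotting code.

plot(density(delta.dev),
     main = "Sampling distribution under logistic regression",
     xlab = "Amount by which GAM fits better than logistic regression")
abline(v = delta.dev.observed, lty = "dashed")   # mark the observed difference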

12.5 Exercises

To think through, not to hand in.

1. A multiclass logistic regression, as in Eq. 12.13, has parameters β_0^{(c)} and β^{(c)} for each class c. Show that we can always get the same predicted probabilities by setting β_0^{(c)} = 0, β^{(c)} = 0 for any one class c, and adjusting the parameters for the other classes appropriately.

2. Find the first and second derivatives of the log-likelihood for logistic regression with one predictor variable. Explicitly write out the formula for doing one step of Newton's method. Explain how this relates to re-weighted least squares.