Logistic Regression (a type of Generalized Linear Model)


Amberly Burke
8 years ago

Today
- Review of GLMs
- Logistic Regression

How do we find patterns in data?
- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this error generating process, assuming our model of the data generating process is accurate

We started Linear. Why?
- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit


We started Normal. Why?
- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error
- Y ~ N(mean = a + BX, sd = σ)
[Figure: Histogram of rnorm(100); x-axis: Deviation from Mean, y-axis: Frequency]

Example: Pufferfish Mimics & Predator Approaches
- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?


Example: A Linear Data Generating Process and Gaussian Error Generating Process
[Figure: scatterplot; x-axis: resemblance, y-axis: predators]

What if We Have More Information about the Data Generating Process?
- We often have real biological models of a phenomenon! For example?
- Even if we do not, we often know something from theory, or we know the shape of the data. For example?


Example: Michaelis-Menten Enzyme Kinetics
- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology
[Figure: saturating curve; x-axis: Concentration, y-axis: Rate]

Example: Michaelis-Menten Enzyme Kinetics
[Figure: saturating curve; x-axis: Concentration, y-axis: Rate]
- Even if we had no biological model, saturating data is striking
- We may have fit some other curve - examples?
- We will discuss model selection later

Many Data Types Cannot Have a Normal Error Generating Process
- Count data (discrete, cannot be < 0, variance increases with mean): Poisson
- Overdispersed count data (discrete, cannot be < 0, variance increases faster than mean): Negative Binomial or Quasipoisson
- Multiplicative error (many errors, typically small, but the biological process is multiplicative): Log-Normal
- Data describing the distribution of properties of multiple events (cannot be < 0, variance increases faster than mean): Gamma


Example: Wolf Inbreeding and Litter Size
- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear
[Figure: scatterplot; x-axis: inbreeding.coefficient, y-axis: pups]


So what is with this Generalized Linear Modeling Thing?
- Many models have data generating processes that can be linearized
  E.g., Y = e^(a + BX) becomes log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using Likelihood and IWLS (iteratively reweighted least squares) - the glm framework
- We can use other Likelihood functions, or Bayesian methods
- Or Least Squares fits for normal linear models

Can I Stop Now? Are GLMs All I Need?
NO!
- Many models have data generating processes that cannot be linearized
  E.g., Y = e^(a + sin(BX))
- Many possible error generating processes
  My favorite - the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, autocorrelation, etc...
- For these, we use other Likelihood or Bayesian methods
- Some problems have shortcuts, others do not
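The linearization idea on the "Generalized Linear Modeling" slide can be sketched in R. This is only an illustration on simulated data: the values of `a` and `B`, the sample size, and the choice of a Poisson error generating process are assumptions made for the example, not values from the slides.

```r
# Simulate Y with mean e^(a + BX) and Poisson error (all values hypothetical)
set.seed(42)
a <- 0.5
B <- 0.3
x <- runif(200, min = 0, max = 10)
y <- rpois(200, lambda = exp(a + B * x))

# The log link linearizes the mean: log(E[Y]) = a + BX
fit <- glm(y ~ x, family = poisson(link = "log"))

# The estimates are on the link (log) scale and should land near a and B
est <- coef(fit)
print(est)
```

Because the coefficients live on the link scale, `exp(est)` gives multiplicative effects on the expected count.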

Logistic Regression!!!

The Logistic Curve (for Probabilities)
[Figure: S-shaped curve; x-axis: X, y-axis: Probability]

Binomial Error Generating Process
- Possible values bounded by probability
[Figure: four histograms of Frequency, for Probability = 0.01, 0.3, 0.7, and 0.99]

The Logistic Function
p = e^(a + BX) / (1 + e^(a + BX))
logit(p) = a + BX
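As a quick check on the two formulas above, the sketch below writes the logistic function and the logit by hand and confirms they are inverses. The values of `a` and `B` are made up for illustration; `plogis()` is base R's built-in logistic function.

```r
# Logistic function and its inverse (the logit), written out explicitly
logistic <- function(eta) exp(eta) / (1 + exp(eta))
logit <- function(p) log(p / (1 - p))

a <- -2    # hypothetical intercept
B <- 0.8   # hypothetical slope
x <- seq(-5, 10, by = 0.5)

# Probabilities stay strictly between 0 and 1 for any value of a + BX
p <- logistic(a + B * x)
print(range(p))

# Applying the logit recovers the linear predictor a + BX
print(all.equal(logit(p), a + B * x))

# Base R's plogis() computes the same curve
print(all.equal(p, plogis(a + B * x)))
```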


Generalized Linear Model with a Logit Link
logit(p) = a + BX
Y ~ Binom(Trials, p)

Cryptosporidium

Drug Trial with Mice

Fraction of Mice Infected = Probability of Infection
[Figure: scatterplot; x-axis: Dose, y-axis: Fraction of Mice Infected]

Two Different Ways of Writing the Model

# 1) using Heads, Tails
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as size parameter for Binomial
glm(Y/N ~ Dose, weights = N, data = crypto, family = binomial)

The Fit Model
[Figure: fitted logistic curve; x-axis: Dose, y-axis: Fraction of Mice Infected]
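The two ways of writing the binomial model give the same fit. The sketch below checks this on simulated data, since the `crypto` data set itself is not reproduced here: the data frame `crypto_sim`, its doses, group sizes, and the true coefficients are all hypothetical.

```r
# Simulated stand-in for the mouse trial (all values hypothetical)
set.seed(1)
crypto_sim <- data.frame(Dose = rep(c(0, 50, 100, 200, 400), each = 4), N = 20)
p_true <- plogis(-1.5 + 0.013 * crypto_sim$Dose)
crypto_sim$Y <- rbinom(nrow(crypto_sim), size = crypto_sim$N, prob = p_true)

# 1) successes and failures as a two-column response
fit_cbind <- glm(cbind(Y, N - Y) ~ Dose, data = crypto_sim, family = binomial)

# 2) observed fraction as the response, number of trials as weights
fit_wts <- glm(Y / N ~ Dose, weights = N, data = crypto_sim, family = binomial)

# Both forms maximize the same likelihood, so the coefficients agree
print(coef(fit_cbind))
print(all.equal(coef(fit_cbind), coef(fit_wts)))
```

The weights form is convenient when the data already arrive as proportions, but the number of trials must still be supplied or the fit treats every proportion as a single trial.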


The Fit Model

# Call:
# glm(formula = cbind(Y, N - Y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
#
# Coefficients:
#              Estimate  Std. Error  z value  Pr(>|z|)
# (Intercept)                                   <2e-16
# Dose         0.013468    0.001046   12.871    <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.…  on 67 degrees of freedom
# Residual deviance:        on 66 degrees of freedom
# AIC:
#
# Number of Fisher Scoring iterations: 4

The Odds
Odds = p / (1 - p)
Log Odds = log(p / (1 - p)) = logit(p)

The Meaning of a Logit Coefficient
Logit Coefficient: A 1 unit increase in a predictor = an increase of β in the log-odds of the response.
β = logit(p2) - logit(p1)
β = log(p2 / (1 - p2)) - log(p1 / (1 - p1))
We need to know both p1 and β to interpret this.
If p1 = 0.5, β = [value missing in source], then p2 = [value missing in source]
If p1 = 0.7, β = [value missing in source], then p2 = [value missing in source]

What if we Only Have 1s and 0s?
[Figure: scatterplot of binary outcomes; x-axis: log.seed.weight, y-axis: Predation]
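To see why both p1 and β are needed to interpret a logit coefficient, the relation β = logit(p2) - logit(p1) can be solved for p2. The sketch below uses a made-up β of 0.5, which is an assumption for illustration and not the value from the slides; `qlogis()` and `plogis()` are base R's logit and inverse logit.

```r
# Hypothetical coefficient on the log-odds scale (not the slide's value)
beta <- 0.5

# Two starting probabilities, as in the slide
p1 <- c(0.5, 0.7)

# Rearranging beta = logit(p2) - logit(p1) gives p2 = inverse-logit(logit(p1) + beta)
p2 <- plogis(qlogis(p1) + beta)

# The same beta moves the probability by different amounts
change <- p2 - p1
print(round(p2, 3))
print(round(change, 3))
```

The shift in probability is largest near p1 = 0.5 and shrinks toward 0 or 1, which is why a logit coefficient has no single interpretation on the probability scale.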


Seed Predators

The GLM
seed.glm <- glm(Predation ~ log.seed.weight, data = seeds, family = binomial)

Fitted Seed Predation Plot
[Figure: fitted curve; x-axis: log.seed.weight, y-axis: Predation]

Diagnostics Look Odd Due to Binned Nature of the Data
[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]


Creating Binned Residuals
residuals(seed.glm, type = "deviance")
fitted(seed.glm)

Binned Residuals Should Look Spread Out
[Figure: binned residuals, 200 bins; x-axis: Fitted, y-axis: Residual]
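One way to build binned residuals by hand in base R is sketched below. The slides do not show the binning code, so everything here is an assumption: the simulated data frame `seeds_sim`, the true coefficients, the choice of 20 bins, and the use of roughly equal-sized bins ordered by fitted value.

```r
# Simulated stand-in for the seed predation data (all values hypothetical)
set.seed(7)
n <- 1000
seeds_sim <- data.frame(log.seed.weight = rnorm(n, mean = 3, sd = 1.5))
seeds_sim$Predation <- rbinom(n, size = 1,
                              prob = plogis(-3 + 0.5 * seeds_sim$log.seed.weight))

fit <- glm(Predation ~ log.seed.weight, data = seeds_sim, family = binomial)

fv <- fitted(fit)
res <- residuals(fit, type = "deviance")

# Group observations into bins of roughly equal size, ordered by fitted value
n_bins <- 20
bin <- cut(rank(fv, ties.method = "first"), breaks = n_bins, labels = FALSE)

# Average the fitted values and residuals within each bin
binned <- data.frame(fitted = as.vector(tapply(fv, bin, mean)),
                     residual = as.vector(tapply(res, bin, mean)))

plot(binned$fitted, binned$residual, xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
```

Averaging within bins removes the two-band pattern that raw residuals from 0/1 data always show. Response residuals (`type = "response"`) are a common alternative to deviance residuals for this plot.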
