Logistic Regression (a type of Generalized Linear Model)
1/36

Today
- Review of GLMs
- Logistic Regression
2/36
How do we find patterns in data?
- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this error generating process, assuming our model of the data generating process is accurate
3/36

We started Linear. Why?
- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit (see the sketch below)
[Figure: scatterplot of y against x with a fitted straight line]
4/36
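A minimal sketch of that fit (variable names and coefficient values are illustrative, not from the slides), simulating Y = a + BX + e with Gaussian error and recovering the coefficients:

# Simulate a linear data generating process with additive normal error
set.seed(42)
x <- runif(100, 0, 4)                 # hypothetical predictor
y <- 1 + 2 * x + rnorm(100, sd = 1)   # Y = a + BX + e, with e ~ N(0, 1)

# Least squares fit of the linear model
fit <- lm(y ~ x)
coef(fit)  # estimates should land near a = 1, B = 2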
We started Normal. Why?
- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error: Y ~ N(mean = a + BX, sd = σ)
[Figure: histogram of rnorm(100), deviations from the mean centered on zero]
5/36

Example: Pufferfish Mimics & Predator Approaches
- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?
6/36
Example: A Linear Data Generating Process and Gaussian Error Generating Process
[Figure: number of predators vs. resemblance (1 to 4), with a fitted line]
7/36

What if We Have More Information about the Data Generating Process?
- We often have real biological models of a phenomenon! For example?
- Even if we do not, we often know something about theory; we know the shape of the data. For example?
8/36
Example: Michaelis-Menten Enzyme Kinetics
- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology
9/36

Example: Michaelis-Menten Enzyme Kinetics
- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology
[Figure: Rate (0 to 0.8) vs. Concentration (0 to 2), a saturating curve]
10/36
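A sketch of such a fit (the kinetics data frame and starting values are assumptions for illustration): the Michaelis-Menten curve with normal error can be fit directly by nonlinear least squares.

# Michaelis-Menten: Rate = Vmax * Conc / (Km + Conc), additive normal error
mm_fit <- nls(Rate ~ Vmax * Conc / (Km + Conc),
              data  = kinetics,
              start = list(Vmax = 1, Km = 0.5))  # rough starting guesses
summary(mm_fit)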
Example: Michaelis-Menten Enzyme Kinetics
[Figure: the same saturating Rate vs. Concentration curve, with fit]
- Even if we had no biological model, saturating data is striking
- We may have fit some other curve (examples?)
- We will discuss model selection later
11/36

Many Data Types Cannot Have a Normal Error Generating Process
- Count data (discrete, cannot be < 0, variance increases with the mean): Poisson
- Overdispersed count data (discrete, cannot be < 0, variance increases faster than the mean): Negative Binomial or Quasipoisson
- Multiplicative error (many errors, typically small, but the biological process is multiplicative): Log-Normal
- Data describing the distribution of properties of multiple events (cannot be < 0, variance increases faster than the mean): Gamma
12/36
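For reference, a sketch of how these error structures map onto model calls in R (the data frame and variable names are hypothetical):

library(MASS)  # for glm.nb

glm(count ~ x, family = poisson, data = dat)           # Poisson counts
glm(count ~ x, family = quasipoisson, data = dat)      # overdispersed counts
glm.nb(count ~ x, data = dat)                          # Negative Binomial
lm(log(y) ~ x, data = dat)                             # Log-Normal via log transform
glm(y ~ x, family = Gamma(link = "log"), data = dat)   # Gamma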
Example: Wolf Inbreeding and Litter Size
- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear
13/36

Example: Wolf Inbreeding and Litter Size
- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear
[Figure: pups (0 to 7.5) vs. inbreeding.coefficient (0.1 to 0.4)]
14/36
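A sketch of one model consistent with those points (the wolves data frame is an assumption, and the identity link, which keeps the relationship linear on the scale of the counts, is one option among several):

# Counts suggest a Poisson error generating process; the identity link
# keeps the data generating process linear, per the slide
wolf_glm <- glm(pups ~ inbreeding.coefficient,
                family = poisson(link = "identity"),
                data   = wolves)
summary(wolf_glm)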
So what is with this Generalized Linear Modeling Thing?
- Many models have data generating processes that can be linearized. E.g., Y = e^(a + BX), so log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using Likelihood and IWLS (iteratively weighted least squares): the glm framework
- We can use other Likelihood functions, or Bayesian methods
- Or Least Squares fits for normal linear models
15/36

Can I Stop Now? Are GLMs All I Need?
NO!
- Many models have data generating processes that cannot be linearized. E.g., Y = e^(a + sin(BX))
- Many possible error generating processes. My favorite: the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, autocorrelation, etc.
- For these, we use other Likelihood or Bayesian methods
- Some problems have shortcuts, others do not
16/36
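For instance, a sketch of fitting the linearizable model above within the glm framework (names are illustrative; the log link avoids transforming Y itself):

# Y = e^(a + BX) with exponential-family error, fit by IWLS
exp_fit <- glm(y ~ x, family = gaussian(link = "log"), data = dat)
coef(exp_fit)  # a and B on the log scale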
Logistic Regression!!!
17/36

The Logistic Curve (for Probabilities)
[Figure: an S-shaped curve of Probability (0 to 1) against X (-4 to 4)]
18/36
Binomial Error Generating Process
- Possible values bounded by probability
[Figure: four histograms of binomial draws, for Probability = 0.01, 0.3, 0.7, and 0.99; the spread of outcomes shrinks near 0 and 1]
19/36

The Logistic Function
p = e^(a + BX) / (1 + e^(a + BX))
logit(p) = a + BX
20/36
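In R the logistic function and its inverse are built in; a small sketch (coefficient values are illustrative) connecting them to the equations above:

a <- -1.4; B <- 0.013; x <- 100  # illustrative coefficients and predictor
eta <- a + B * x                 # linear predictor: logit(p) = a + BX
p   <- plogis(eta)               # inverse logit: e^eta / (1 + e^eta)
all.equal(qlogis(p), eta)        # TRUE: logit(p) recovers the linear predictor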
Generalized Linear Model with a Logit Link
logit(p) = a + BX
Y ~ Binom(Trials, p)
21/36

Cryptosporidium
22/36
Drug Trial with Mice
23/36

Fraction of Mice Infected = Probability of Infection
[Figure: Fraction of Mice Infected (0 to 1) vs. Dose (0 to 400)]
24/36
Two Different Ways of Writing the Model

# 1) using Heads, Tails
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as the size parameter for the Binomial
glm(Y/N ~ Dose, weights = N, data = crypto, family = binomial)
25/36

The Fit Model
[Figure: Fraction of Mice Infected vs. Dose, with the fitted logistic curve]
26/36
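Both parameterizations above maximize the same likelihood; a quick check (a sketch, assuming the crypto data frame from the slides) is that their coefficients match:

fit1 <- glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)
fit2 <- glm(Y/N ~ Dose, weights = N, data = crypto, family = binomial)
all.equal(coef(fit1), coef(fit2))  # TRUE: identical estimates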
The Fit Model

# Call:
# glm(formula = cbind(Y, N - Y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -3.9532  -1.2442   0.2327   1.5531   3.6013
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.407769   0.148479  -9.481   <2e-16
# Dose         0.013468   0.001046  12.871   <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.34  on 67  degrees of freedom
# Residual deviance: 200.51  on 66  degrees of freedom
# AIC: 327.03
#
# Number of Fisher Scoring iterations: 4
27/36

The Odds
Odds = p / (1 - p)
Log-Odds = log(p / (1 - p)) = logit(p)
28/36
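Because the model is linear on the log-odds scale, exponentiating a coefficient gives a multiplicative change in the odds. A sketch (crypto_glm is an assumed name for the fit above):

crypto_glm <- glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)
exp(coef(crypto_glm))  # exp(B) for Dose is about 1.0136: each unit of Dose
                       # increases the odds of infection by about 1.4%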
The Meaning of a Logit Coefficient
- Logit coefficient: a 1-unit increase in a predictor yields a β increase in the log-odds of the response.
- β = logit(p2) - logit(p1)
- β = log(p2 / (1 - p2)) - log(p1 / (1 - p1))
- We need to know both p1 and β to interpret this.
- If p1 = 0.5 and β = 0.01347, then p2 = 0.503
- If p1 = 0.7 and β = 0.01347, then p2 = 0.702
29/36

What if We Only Have 1s and 0s?
[Figure: Predation (0 or 1) vs. log.seed.weight (0 to 8)]
30/36
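The dependence of p2 on the baseline p1 shown above is easy to verify in base R (a sketch using plogis/qlogis):

beta <- 0.01347
plogis(qlogis(0.5) + beta)  # 0.5034: a small shift in p near p1 = 0.5
plogis(qlogis(0.7) + beta)  # 0.7028: an even smaller shift near p1 = 0.7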
Seed Predators
http://denimandtweed.com
31/36

The GLM

seed.glm <- glm(predation ~ log.seed.weight, data = seeds, family = binomial)
32/36
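To draw a fitted curve like the one on the next slide, predictions can be made on the response (probability) scale; a sketch assuming the seeds data frame from the slides:

new_dat <- data.frame(log.seed.weight = seq(0, 7.5, length.out = 100))
new_dat$p_hat <- predict(seed.glm, newdata = new_dat, type = "response")
plot(predation ~ log.seed.weight, data = seeds)
lines(new_dat$log.seed.weight, new_dat$p_hat)  # fitted logistic curve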
Fitted Seed Predation Plot
[Figure: Predation (0 or 1) vs. log.seed.weight, with the fitted logistic curve]
33/36

Diagnostics Look Odd Due to Binned Nature of the Data
[Figure: the four standard diagnostic panels (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage), each showing banded stripes of points; observations 554, 572, and 1113 are flagged]
34/36
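Those are the default diagnostics for a fitted model object; a short sketch to reproduce them:

par(mfrow = c(2, 2))  # 2x2 grid for the four default diagnostic panels
plot(seed.glm)        # with 0/1 data these show bands rather than clouds
par(mfrow = c(1, 1))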
Creating Binned Residuals
[Figure: deviance residuals, residuals(seed.glm, type = "deviance"), plotted against fitted values, fitted(seed.glm)]
35/36

Binned Residuals Should Look Spread Out
200 Bins
[Figure: binned residuals vs. fitted values, scattered around zero]
36/36
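A minimal sketch of computing binned residuals by hand (the binning approach is an assumption, not the slides' exact code; arm::binnedplot is a common alternative):

# Bin observations by fitted value, then average residuals within bins
res  <- residuals(seed.glm, type = "deviance")
fits <- fitted(seed.glm)
bins <- cut(fits, breaks = 200)  # 200 bins, as on the slide
binned <- data.frame(
  Fitted   = tapply(fits, bins, mean),
  Residual = tapply(res,  bins, mean)
)
plot(Residual ~ Fitted, data = binned)  # should scatter evenly around zero
abline(h = 0, lty = 2)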