Logistic Regression (a type of Generalized Linear Model)




Logistic Regression (a type of Generalized Linear Model) 1/36

Today: Review of GLMs; Logistic Regression 2/36

How do we find patterns in data? We begin with a model of how the world works. We use our knowledge of a system to create a model of a Data Generating Process. We know that there is variation in any relationship due to an Error Generating Process. We build hypothesis tests on top of this error generating process, based on assuming our model of the data generating process is accurate. 3/36

We started Linear. Why? Often, our first stab at a hypothesis is that two variables are associated. Linearity is a naive, but reasonable, first assumption. Y = a + BX is straightforward to fit. [Figure: scatterplot of y against x with a fitted straight line] 4/36

We started Normal. Why? It is reasonable to assume that small errors are common. It is reasonable to assume that large errors are rare. It is reasonable to assume that error is additive for many phenomena. Many processes we measure are continuous. Y = a + BX + e implies additive error: Y ~ N(mean = a + BX, sd = σ). [Figure: histogram of rnorm(100); x-axis: Deviation from Mean, y-axis: Frequency] 5/36

Example: Pufferfish Mimics & Predator Approaches. What assumptions would you make about similarity and predator response? How might predators vary in response? What kinds of error might we have in measuring predator responses? 6/36

Example: A Linear Data Generating Process and a Gaussian Error Generating Process. [Figure: predators vs. resemblance, with a fitted line] 7/36

What if We Have More Information about the Data Generating Process? We often have real biological models of a phenomenon! For example? Even if we do not, we often know something from theory, or we know the shape of the data. For example? 8/36

Example: Michaelis-Menten Enzyme Kinetics. We know how enzymes work. We have no reason to suspect non-normal error. We build a model that fits the biology. 9/36

Example: Michaelis-Menten Enzyme Kinetics. We know how enzymes work. We have no reason to suspect non-normal error. We build a model that fits the biology. [Figure: Rate vs. Concentration, saturating curve] 10/36

Example: Michaelis-Menten Enzyme Kinetics. [Figure: Rate vs. Concentration, saturating curve] Even if we had no biological model, saturating data is striking. We may have fit some other curve - examples? We will discuss model selection later. 11/36

Many Data Types Cannot Have a Normal Error Generating Process. Count data: discrete, cannot be < 0, variance increases with mean - Poisson. Overdispersed count data: discrete, cannot be < 0, variance increases faster than mean - Negative Binomial or Quasipoisson. Multiplicative error: many errors, typically small, but the biological process is multiplicative - Log-Normal. Data describing the distribution of properties of multiple events: cannot be < 0, variance increases faster than mean - Gamma. 12/36

Example: Wolf Inbreeding and Litter Size. The Number of Pups is a Count! The Number of Pups is Additive! No a priori reason to think the relationship nonlinear. 13/36

Example: Wolf Inbreeding and Litter Size. The Number of Pups is a Count! The Number of Pups is Additive! No a priori reason to think the relationship nonlinear. [Figure: pups vs. inbreeding.coefficient] 14/36

So what is with this Generalized Linear Modeling Thing? Many models have data generating processes that can be linearized. E.g., Y = e^(a + BX) implies log(Y) = a + BX. Many error generating processes are in the exponential family. This is *easy* to fit using Likelihood and IWLS - the glm framework. We can use other Likelihood functions, or Bayesian methods, or Least Squares fits for normal linear models. 15/36

Can I Stop Now? Are GLMs All I Need? NO! Many models have data generating processes that cannot be linearized. E.g., Y = e^(a + sin(BX)). Many possible error generating processes. My favorite - the Gumbel distribution, for maximum values. And we haven't even started with mixed models, autocorrelation, etc... For these, we use other Likelihood or Bayesian methods. Some problems have shortcuts, others do not. 16/36
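The linearization trick on slide 15 is easy to verify numerically. Below is a minimal sketch in Python (rather than the lecture's R); the values of a and B are made up for illustration:

```python
import math
import random

# True data generating process: Y = e^(a + BX), with made-up a and B
a, B = 1.0, 0.5
random.seed(1)
xs = [random.uniform(0, 4) for _ in range(200)]
ys = [math.exp(a + B * x) for x in xs]  # no noise: just checking the algebra

# Taking logs linearizes the model: log(Y) = a + BX,
# so ordinary least squares on (x, log y) should recover a and B exactly.
logy = [math.log(y) for y in ys]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(logy) / n
B_hat = sum((x - xbar) * (ly - ybar) for x, ly in zip(xs, logy)) / \
        sum((x - xbar) ** 2 for x in xs)
a_hat = ybar - B_hat * xbar

print(round(a_hat, 6), round(B_hat, 6))  # recovers a = 1.0, B = 0.5
```

With real data the log-scale errors would also need to make sense (e.g., log-normal multiplicative error), which is exactly the point of choosing an error generating process deliberately.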

Logistic Regression!!! 17/36

The Logistic Curve (for Probabilities). [Figure: Probability (0 to 1) vs. X, sigmoid curve] 18/36

Binomial Error Generating Process. Possible values are bounded by the probability. [Figure: four histograms of binomial outcomes for Probability = 0.01, 0.3, 0.7, and 0.99] 19/36

The Logistic Function:

p = e^(a + BX) / (1 + e^(a + BX))

logit(p) = a + BX

20/36
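The two formulas on slide 20 are inverses of one another; the sketch below (in Python rather than the lecture's R) checks this numerically:

```python
import math

def logistic(x):
    """p = e^x / (1 + e^x): maps the real line to (0, 1)."""
    return math.exp(x) / (1.0 + math.exp(x))

def logit(p):
    """log(p / (1 - p)): maps (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

# logit and logistic compose to the identity, so a model that is
# linear on the logit scale stays a valid probability on the p scale.
for x in [-4.0, -1.0, 0.0, 2.5]:
    assert abs(logit(logistic(x)) - x) < 1e-9

print(logistic(0.0))  # 0.5: the curve crosses 0.5 where a + BX = 0
```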

Generalized Linear Model with a Logit Link:

logit(p) = a + BX
Y ~ Binom(Trials, p)

21/36

Cryptosporidium 22/36
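The model on slide 21 is generative: given coefficients, you can simulate data from it. A small Python sketch (the coefficients and group size are made up, loosely echoing a dose-response setup):

```python
import math
import random

random.seed(42)
a, B = -1.4, 0.013   # made-up coefficients on the logit scale
trials = 10          # made-up number of subjects per dose group

results = []
for dose in range(0, 401, 50):
    p = 1.0 / (1.0 + math.exp(-(a + B * dose)))  # inverse logit of a + B*dose
    # one Binom(trials, p) draw, as a sum of Bernoulli trials
    infected = sum(random.random() < p for _ in range(trials))
    results.append((dose, infected, p))

for dose, infected, p in results:
    print(dose, infected, round(p, 2))
```

Note that the randomness enters only through the binomial draw; the linear predictor itself is deterministic.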

Drug Trial with Mice 23/36

Fraction of Mice Infected = Probability of Infection. [Figure: Fraction of Mice Infected vs. Dose, 0 to 400] 24/36

Two Different Ways of Writing the Model:

# 1) using Heads, Tails
glm(cbind(y, N - y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as size parameter for Binomial
glm(y/N ~ Dose, weights = N, data = crypto, family = binomial)

25/36

The Fit Model. [Figure: fitted logistic curve of Fraction of Mice Infected vs. Dose] 26/36
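R's glm fits these models by iteratively weighted least squares (the IWLS mentioned on slide 15, reported as "Fisher Scoring iterations" in glm output). The crypto data isn't available here, so below is a minimal pure-Python sketch of that algorithm on simulated 0/1 data; the variable names and true coefficients are made up:

```python
import math
import random

def fit_logistic_irls(xs, ys, n_iter=25):
    """Newton-Raphson / IWLS for logit(p) = a + B*x with 0/1 responses."""
    a = B = 0.0
    for _ in range(n_iter):
        ps = [1.0 / (1.0 + math.exp(-(a + B * x))) for x in xs]
        # score vector: X^T (y - p)
        s0 = sum(y - p for y, p in zip(ys, ps))
        s1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # information matrix X^T W X, with weights w = p(1 - p)
        ws = [p * (1.0 - p) for p in ps]
        i00 = sum(ws)
        i01 = sum(w * x for w, x in zip(ws, xs))
        i11 = sum(w * x * x for w, x in zip(ws, xs))
        det = i00 * i11 - i01 * i01
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p)
        a += (i11 * s0 - i01 * s1) / det
        B += (-i01 * s0 + i00 * s1) / det
    return a, B

random.seed(7)
true_a, true_B = -1.0, 0.8  # made up for the simulation
xs = [random.uniform(-3, 3) for _ in range(1000)]
ys = [int(random.random() < 1.0 / (1.0 + math.exp(-(true_a + true_B * x))))
      for x in xs]

a_hat, B_hat = fit_logistic_irls(xs, ys)
print(round(a_hat, 2), round(B_hat, 2))  # should land near the true values
```

This is a sketch only; in practice you would use glm (or an equivalent library routine), which also handles convergence checks, binomial totals, and standard errors.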

The Fit Model:

# Call:
# glm(formula = cbind(y, N - y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -3.9532  -1.2442   0.2327   1.5531   3.6013
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.407769   0.148479  -9.481   <2e-16
# Dose         0.013468   0.001046  12.871   <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.34  on 67  degrees of freedom
# Residual deviance: 200.51  on 66  degrees of freedom
# AIC: 327.03
#
# Number of Fisher Scoring iterations: 4

27/36

The Odds:

Odds = p / (1 - p)

Log-Odds = log(p / (1 - p)) = logit(p)

28/36

The Meaning of a Logit Coefficient. Logit Coefficient: a 1 unit increase in a predictor = an increase of β in the log-odds of the response.

β = logit(p2) - logit(p1) = log(p2 / (1 - p2)) - log(p1 / (1 - p1))

We need to know both p1 and β to interpret this. If p1 = 0.5 and β = 0.01347, then p2 = 0.503. If p1 = 0.7 and β = 0.01347, then p2 = 0.702. 29/36

What if we Only Have 1s and 0s? [Figure: Predation (0 or 1) vs. log.seed.weight] 30/36
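The arithmetic on slide 29 is easy to reproduce; a quick check in Python, where beta is the Dose coefficient from the fitted model above:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

beta = 0.01347  # the Dose coefficient from the crypto fit

# A 1-unit increase in Dose adds beta to the log-odds, so the resulting
# change in probability depends on the starting probability p1.
p2_from_half = inv_logit(logit(0.5) + beta)  # p1 = 0.5
p2_from_07 = inv_logit(logit(0.7) + beta)    # p1 = 0.7
print(round(p2_from_half, 3), round(p2_from_07, 3))
```

Starting from p1 = 0.5 this gives about 0.503, matching the slide: the same coefficient moves the probability by different amounts at different baselines.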

Seed Predators. http://denimandtweed.com 31/36

The GLM:

seed.glm <- glm(Predation ~ log.seed.weight, data = seeds, family = binomial)

32/36

Fitted Seed Predation Plot. [Figure: fitted logistic curve of Predation vs. log.seed.weight] 33/36

Diagnostics Look Odd Due to the Binned Nature of the Data. [Figure: standard glm diagnostic panels - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage - with observations 554, 572, and 1113 flagged] 34/36

Creating Binned Residuals. [Figure: residuals(seed.glm, type = "deviance") vs. fitted(seed.glm)] 35/36

Binned Residuals Should Look Spread Out. [Figure: binned residuals, 200 bins, vs. fitted values] 36/36
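A binned residual plot just groups observations by fitted value and averages the residuals within each bin. A minimal Python sketch of that computation (the seed predation data isn't available here, so the fitted values and residuals are made-up toy numbers):

```python
import random

def binned_residuals(fitted, resid, n_bins=10):
    """Sort observations by fitted value, cut into roughly equal-size bins,
    and average fitted values and residuals within each bin
    (the slide used 200 bins on the full seed data)."""
    pairs = sorted(zip(fitted, resid))
    size = max(1, len(pairs) // n_bins)
    out = []
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        mean_fit = sum(f for f, _ in chunk) / len(chunk)
        mean_res = sum(r for _, r in chunk) / len(chunk)
        out.append((mean_fit, mean_res))
    return out

# Toy data: made-up fitted probabilities and residuals
random.seed(0)
fitted = [random.uniform(0.1, 0.5) for _ in range(40)]
resid = [random.gauss(0, 1) for _ in range(40)]

bins = binned_residuals(fitted, resid, n_bins=4)
for f, r in bins:
    print(round(f, 2), round(r, 2))
```

Plotting the per-bin averages against the per-bin fitted values gives the binned residual plot; with a well-specified model the averages should scatter around zero with no trend.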