III. INTRODUCTION TO LOGISTIC REGRESSION. a) Example: APACHE II Score and Mortality in Sepsis




III. INTRODUCTION TO LOGISTIC REGRESSION

1. Simple Logistic Regression

a) Example: APACHE II Score and Mortality in Sepsis

The following figure shows 30-day mortality in a sample of septic patients as a function of their baseline APACHE II score. Patients are coded as 1 or 0 depending on whether they are dead or alive at 30 days, respectively.

[Figure: 30 Day Mortality in Patients with Sepsis: died (1) or survived (0) plotted against APACHE II Score at Baseline, 0 to 45]

We wish to predict death from baseline APACHE II score in these patients. Let π(x) be the probability that a patient with score x will die.

Note that linear regression would not work well here, since it could produce probabilities less than zero or greater than one.

b) Sigmoidal family of logistic regression curves

Logistic regression fits probability functions of the following form:

    π(x) = exp(α + βx) / (1 + exp(α + βx))

This equation describes a family of sigmoidal curves, examples of which are given below.

[Figure: π(x) plotted against x for (α, β) = (-4, 0.4), (-8, 0.4), (-12, 0.6), and (-20, 1.0)]

c) Parameter values and the shape of the regression curve

For now, assume that β > 0.

For very negative values of x, exp(α + βx) → 0 as x → -∞, and hence π(x) → 0 / (1 + 0) = 0.

For very large values of x, exp(α + βx) → ∞, and hence π(x) → 1.

[Figure: the same family of curves π(x) = exp(α + βx)/(1 + exp(α + βx)), for (α, β) = (-4, 0.4), (-8, 0.4), (-12, 0.6), and (-20, 1.0)]

When x = -α/β, α + βx = 0, and hence

    π(x) = 1 / (1 + 1) = 0.5

The slope of π(x) when π(x) = 0.5 is β/4. Thus β controls how fast π(x) rises from 0 to 1.

For given β, α controls where the 50% survival point is located.

We wish to choose the best curve to fit the data. Data with a sharp survival cutoff point between patients who live and die should have a large value of β.

[Figure: data with an abrupt transition from survived (0) to died (1) as x increases]

Data with a lengthy transition from survival to death should have a low value of β.

[Figure: data with a gradual, overlapping transition from survived (0) to died (1) as x increases]
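These two facts are easy to check numerically. The sketch below, in plain Python (this chapter's analyses use Stata, so this is purely illustrative), evaluates π at x = -α/β and estimates the slope there by a central difference, using one of the (α, β) pairs from the figure.

```python
import math

def pi(x, a, b):
    """Logistic probability pi(x) = exp(a + b*x) / (1 + exp(a + b*x))."""
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))

# One of the illustrative (alpha, beta) pairs from the figure.
a, b = -8.0, 0.4
x_mid = -a / b  # the score at which pi(x) = 0.5

# Numerical derivative of pi at the midpoint.
h = 1e-6
slope = (pi(x_mid + h, a, b) - pi(x_mid - h, a, b)) / (2 * h)

print(pi(x_mid, a, b))  # 0.5
print(slope)            # approximately b / 4 = 0.1
```

The slope at the midpoint follows from dπ/dx = βπ(1 - π), which equals β/4 when π = 0.5.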

d) The probability of death under the logistic model

This probability is

    π(x) = exp(α + βx) / (1 + exp(α + βx))        {3.1}

Hence the probability of survival is

    1 - π(x) = [1 + exp(α + βx) - exp(α + βx)] / (1 + exp(α + βx)) = 1 / (1 + exp(α + βx)),

and the odds of death are

    π(x) / (1 - π(x)) = exp(α + βx)

The log odds of death equal

    log( π(x) / (1 - π(x)) ) = α + βx        {3.2}

e) The logit function

For any number π between 0 and 1, the logit function is defined by

    logit(π) = log( π / (1 - π) )

Let

    d_i = 1 if the i-th patient dies, 0 if the i-th patient lives, and
    x_i = the APACHE II score of the i-th patient.

Then the expected value of d_i is

    E(d_i) = π(x_i) = Pr[d_i = 1]

Thus we can rewrite the logistic regression equation {3.2} as

    logit(E(d_i)) = logit(π(x_i)) = α + βx_i        {3.3}
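The logit and its inverse (the logistic function of equation {3.1}) can be sketched in a few lines of Python; applying one after the other recovers the original probability.

```python
import math

def logit(p):
    """Log odds: logit(p) = log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of the logit: the logistic function exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1 + math.exp(z))

print(logit(0.5))              # 0.0, i.e. even odds
print(inv_logit(logit(0.25)))  # 0.25, the round trip recovers p
```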

2. Contrast Between Logistic and Linear Regression

In linear regression, the expected value of y_i given x_i is

    E(y_i) = α + βx_i   for i = 1, 2, ..., n

y_i has a normal distribution with standard deviation σ; it is the random component of the model. α + βx_i is the linear predictor.

In logistic regression, the expected value of d_i given x_i satisfies

    logit(E(d_i)) = α + βx_i   for i = 1, 2, ..., n

d_i is dichotomous with probability of event π_i = π(x_i); it is the random component of the model. The logit is the link function that relates the expected value of the random component to the linear predictor.

3. Maximum Likelihood Estimation

In linear regression we used the method of least squares to estimate regression coefficients. In generalized linear models we use another approach, called maximum likelihood estimation. The maximum likelihood estimate of a parameter is the value that maximizes the probability of the observed data. We estimate α and β by the values α̂ and β̂ that maximize the probability of the observed data under the logistic regression model.
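To make the idea concrete, here is a bare-bones Newton-Raphson sketch of maximum likelihood estimation for simple logistic regression, in plain Python with a small made-up data set (the data and function name are illustrative assumptions, not the sepsis data); in practice one would simply use Stata's logit command, which also reports standard errors and fit statistics.

```python
import math

def fit_logistic(xs, ds, iters=25):
    """Maximum likelihood fit of logit(E(d)) = a + b*x by Newton-Raphson.

    A bare-bones illustrative sketch: at each step we compute the score
    (gradient of the log likelihood) and the information matrix, then
    solve the 2x2 Newton system.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in zip(xs, ds):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))  # fitted probability
            w = p * (1.0 - p)
            g0 += d - p            # score for a
            g1 += (d - p) * x      # score for b
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Made-up data for illustration: outcome d (0/1) at eight covariate values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ds = [0, 0, 1, 0, 1, 1, 0, 1]
a_hat, b_hat = fit_logistic(xs, ds)
```

At the maximum, the score equations sum((d_i - p_i)) = 0 and sum((d_i - p_i) x_i) = 0 both hold, which is how convergence can be checked.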

Baseline          Number of   Number of  |  Baseline          Number of   Number of
APACHE II Score   Patients    Deaths     |  APACHE II Score   Patients    Deaths
      0               1          0       |       20              13          6
      2               1          0       |       21              17          9
      3               4          1       |       22              14         12
      4              11          0       |       23              13          7
      5               9          3       |       24              11          8
      6              14          3       |       25              12          8
      7              12          4       |       26               6          2
      8              22          5       |       27               7          5
      9              33          3       |       28               3          1
     10              19          6       |       29               7          4
     11              31          5       |       30               5          4
     12              17          5       |       31               3          3
     13              32         13       |       32               3          3
     14              25          7       |       33               1          1
     15              18          7       |       34               1          1
     16              24          8       |       35               1          1
     17              27          8       |       36               1          1
     18              19         13       |       37               1          1
     19              15          7       |       41               1          0

These data are analyzed as follows.

. tabulate fate

   Death by |
    30 days |      Freq.     Percent        Cum.
------------+-----------------------------------
      alive |        279       61.45       61.45
       dead |        175       38.55      100.00
------------+-----------------------------------
      Total |        454      100.00
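As a quick consistency check, the table can be totalled in a few lines of Python; the results match the tabulate output above.

```python
# (score, patients, deaths) for each baseline APACHE II score, from the table above.
data = [
    (0, 1, 0), (2, 1, 0), (3, 4, 1), (4, 11, 0), (5, 9, 3), (6, 14, 3),
    (7, 12, 4), (8, 22, 5), (9, 33, 3), (10, 19, 6), (11, 31, 5), (12, 17, 5),
    (13, 32, 13), (14, 25, 7), (15, 18, 7), (16, 24, 8), (17, 27, 8),
    (18, 19, 13), (19, 15, 7), (20, 13, 6), (21, 17, 9), (22, 14, 12),
    (23, 13, 7), (24, 11, 8), (25, 12, 8), (26, 6, 2), (27, 7, 5), (28, 3, 1),
    (29, 7, 4), (30, 5, 4), (31, 3, 3), (32, 3, 3), (33, 1, 1), (34, 1, 1),
    (35, 1, 1), (36, 1, 1), (37, 1, 1), (41, 1, 0),
]

patients = sum(n for _, n, _ in data)
deaths = sum(d for _, _, d in data)
print(patients, deaths)                   # 454 175
print(round(100 * deaths / patients, 2))  # 38.55
```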

. histogram apache [fweight=freq], discrete
(start=0, width=1)

[Figure: histogram of the baseline APACHE score distribution (density against APACHE Score at Baseline)]

. scatter proportion apache

[Figure: Proportion of Deaths with Indicated Score plotted against APACHE Score at Baseline]

Logit estimates                                   Number of obs   =        454
                                                  LR chi2(1)      =      62.01
                                                  Prob > chi2     =     0.0000
Log likelihood = -271.66534                       Pseudo R2       =     0.1024

------------------------------------------------------------------------------
        fate |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      apache |   .1156272   .0159997     7.23   0.000     .0842684     .146986
       _cons |  -2.290327   .2765283    -8.28   0.000    -2.832313   -1.748342
------------------------------------------------------------------------------

The model is logit(E(d_i)) = logit(π(x_i)) = α + βx_i, with estimates

    β̂ = 0.1156272
    α̂ = -2.290327

Hence

    π̂(x) = exp(-2.290327 + 0.1156272 x) / (1 + exp(-2.290327 + 0.1156272 x))

For example,

    π̂(20) = exp(-2.290327 + 0.1156272 × 20) / (1 + exp(-2.290327 + 0.1156272 × 20)) = 0.50555402

[Figure: Probability of Death by 30 days against APACHE Score at Baseline, showing the fitted curve Pr(fate) and the Proportion of Deaths with Indicated Score]
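The fitted probability can be verified by direct computation: the sketch below plugs the coefficient estimates from the Stata output above into equation {3.1}.

```python
import math

# Coefficient estimates from the Stata output above.
a_hat = -2.290327   # _cons
b_hat = 0.1156272   # apache

def pi_hat(x):
    """Estimated probability of death by 30 days at APACHE II score x."""
    z = a_hat + b_hat * x
    return math.exp(z) / (1 + math.exp(z))

print(round(pi_hat(20), 8))  # 0.50555402, matching the value in the text
```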

4. Odds Ratios and the Logistic Regression Model

a) Odds ratio associated with a unit increase in x

The log odds that patients with APACHE II scores of x and x + 1 will die are

    logit(π(x)) = α + βx                              {3.5}

and

    logit(π(x + 1)) = α + β(x + 1) = α + βx + β       {3.6}

respectively. Subtracting {3.5} from {3.6} gives

    β = logit(π(x + 1)) - logit(π(x))
      = log( π(x + 1) / (1 - π(x + 1)) ) - log( π(x) / (1 - π(x)) )
      = log{ [π(x + 1) / (1 - π(x + 1))] / [π(x) / (1 - π(x))] }

and hence exp(β) is the odds ratio for death associated with a unit increase in x. A property of logistic regression is that this ratio remains constant for all values of x.
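This constancy is easy to verify numerically. The sketch below, using the α̂ and β̂ estimates from the sepsis model, shows that the odds ratio for a one-point increase is exp(β̂) at every starting score.

```python
import math

# Estimates from the sepsis example.
a_hat = -2.290327
b_hat = 0.1156272

def odds(x):
    """Odds of death at score x: pi(x) / (1 - pi(x)) = exp(a + b*x)."""
    p = math.exp(a_hat + b_hat * x) / (1 + math.exp(a_hat + b_hat * x))
    return p / (1 - p)

# The odds ratio for a unit increase is the same at every score.
for x in (0.0, 10.0, 25.0):
    print(odds(x + 1) / odds(x))  # exp(b_hat), about 1.1226, at every x
```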

5. 95% Confidence Intervals for Odds Ratio Estimates

In our sepsis example the parameter estimate for apache (β̂) was 0.1156272, with a standard error of 0.0159997. Therefore, the odds ratio for death associated with a unit rise in APACHE II score is

    exp(0.1156272) = 1.123

with a 95% confidence interval of

    (exp(0.1156 - 1.96 × 0.0160), exp(0.1156 + 1.96 × 0.0160)) = (1.09, 1.16)

6. Quality of Model Fit

If our model is correct, then

    logit(observed proportion at x_i) ≈ α̂ + β̂x_i

It can be helpful to plot the observed log odds against α̂ + β̂x_i.
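This calculation can be reproduced directly from the coefficient and its standard error:

```python
import math

b_hat = 0.1156272  # apache coefficient from the Stata output
se_b = 0.0159997   # its standard error

odds_ratio = math.exp(b_hat)
lo = math.exp(b_hat - 1.96 * se_b)
hi = math.exp(b_hat + 1.96 * se_b)

print(round(odds_ratio, 3))        # 1.123
print(round(lo, 2), round(hi, 2))  # 1.09 1.16
```

Note that the upper limit is 1.158, which rounds to 1.16; exponentiating Stata's coefficient-scale interval [.0842684, .146986] gives the same limits.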

[Figure: observed log odds of death in 30 days (obs_logodds) and the linear prediction α̂ + β̂x plotted against APACHE Score at Baseline]

7. 95% Confidence Interval for π(x)

Let s²_α̂ and s²_β̂ denote the variances of α̂ and β̂, and let s_α̂β̂ denote the covariance between α̂ and β̂. Then it can be shown that the standard error of α̂ + β̂x is

    se[α̂ + β̂x] = sqrt( s²_α̂ + 2x s_α̂β̂ + x² s²_β̂ )

A 95% confidence interval for α + βx is

    α̂ + β̂x ± 1.96 se[α̂ + β̂x]

[Figure: 95% confidence band for the linear prediction (lb_logodds/ub_logodds), the linear prediction itself, and obs_logodds plotted against APACHE Score at Baseline]

Since

    π(x_i) = exp(α + βx_i) / (1 + exp(α + βx_i)),

applying this transformation to the confidence limits for α + βx gives a 95% confidence interval for π(x) of (π̂_L(x), π̂_U(x)), where

    π̂_L(x) = exp( α̂ + β̂x - 1.96 se[α̂ + β̂x] ) / ( 1 + exp( α̂ + β̂x - 1.96 se[α̂ + β̂x] ) )

and

    π̂_U(x) = exp( α̂ + β̂x + 1.96 se[α̂ + β̂x] ) / ( 1 + exp( α̂ + β̂x + 1.96 se[α̂ + β̂x] ) )
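The sketch below implements these formulas as a small Python function. The covariance s_α̂β̂ is not shown in the Stata output above (after estimation it can be recovered from e(V)), so the demonstration call uses a hypothetical covariance value purely for illustration.

```python
import math

def ci_pi(x, a, b, var_a, var_b, cov_ab):
    """95% CI for pi(x), by transforming the CI for a + b*x.

    var_a and var_b are the variances of the estimates of alpha and beta;
    cov_ab is their covariance.
    """
    z = a + b * x
    se = math.sqrt(var_a + 2 * x * cov_ab + x * x * var_b)

    def expit(u):
        return math.exp(u) / (1 + math.exp(u))

    return expit(z - 1.96 * se), expit(z + 1.96 * se)

# Point estimates and standard errors from the sepsis model; the covariance
# -0.004 is a hypothetical stand-in, since it is not shown in the output.
lo, hi = ci_pi(20, -2.290327, 0.1156272, 0.2765283**2, 0.0159997**2, -0.004)
print(lo < 0.50555402 < hi)  # True: the band contains the point estimate
```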

[Figure: 95% confidence band (lb_prob/ub_prob), fitted probability Pr(fate), and Proportion Dead by 30 Days plotted against APACHE Score at Baseline]

It is common to recode continuous variables into categorical variables in order to calculate odds ratios for, say, the highest quartile compared to the lowest.

. centile apache, centile(25 50 75)

                                                       -- Binom. Interp. --
    Variable |       Obs  Percentile    Centile       [95% Conf. Interval]
-------------+-------------------------------------------------------------
      apache |       454         25          10              9          11
             |                   50          14       13.60845          15
             |                   75          20             19          21

. generate float upper_q = apache >= 20

. tabulate upper_q

    upper_q |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        334       73.57       73.57
          1 |        120       26.43      100.00
------------+-----------------------------------
      Total |        454      100.00

. cc fate upper_q if apache >= 20 | apache <= 10

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        77          25  |        102       0.7549
        Controls |        43         101  |        144       0.2986
-----------------+------------------------+------------------------
           Total |       120         126  |        246       0.4878
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         7.234419       |    3.924571    13.44099 (exact)
 Attr. frac. ex. |         .8617719       |    .7451951    .9256007 (exact)
 Attr. frac. pop |         .6505533       |
                 +-------------------------------------------------
                               chi2(1) =    49.75  Pr>chi2 = 0.0000

This approach discards potentially valuable information and may not be as clinically relevant as an odds ratio at two specific values. Alternatively, we can calculate the odds ratio for death for patients at the 75th percentile of APACHE scores compared to patients at the 25th percentile:

    logit(π(20)) = α + β × 20
    logit(π(10)) = α + β × 10

Subtracting gives

    log{ [π(20) / (1 - π(20))] / [π(10) / (1 - π(10))] } = β × 10 = 0.1156 × 10 = 1.156

Hence, the odds ratio equals exp(1.156) = 3.18.

A problem with this estimate is that it is strongly dependent on the accuracy of the logistic regression model.
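The point estimate is a one-line computation from the apache coefficient:

```python
import math

b_hat = 0.1156272  # apache coefficient from the simple model

# Log odds ratio for a 10-point rise in APACHE score (75th vs. 25th percentile).
log_or = 10 * b_hat
print(round(log_or, 4))  # 1.1563
print(math.exp(log_or))  # approximately 3.178, matching Stata's lincom result below
```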

With Stata we can calculate the 95% confidence interval for this odds ratio as follows:

. lincom 10*apache, eform

 ( 1)  10 apache = 0

------------------------------------------------------------------------------
        fate |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   3.178064   .5084803     7.23   0.000     2.322593    4.348628
------------------------------------------------------------------------------

Simple logistic regression generalizes to allow multiple covariates:

    logit(E(d_i)) = α + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik

where

    x_i1, x_i2, ..., x_ik are covariates from the i-th patient,
    α and β_1, ..., β_k are unknown parameters, and
    d_i = 1 if the i-th patient suffers the event of interest, 0 otherwise.

Multiple logistic regression can be used for many purposes. One of these is to weaken the logit-linear assumption of simple logistic regression using restricted cubic splines.

8. Restricted Cubic Splines

These curves have k knots, located at t_1, t_2, ..., t_k. They are linear before t_1 and after t_k, piecewise cubic polynomials between adjacent knots (i.e. of the form ax³ + bx² + cx + d), and continuous and smooth.

[Figure: a restricted cubic spline with three knots, at t_1, t_2 and t_3]

Given x and k knots, a restricted cubic spline can be defined by

    y = α + x_1 β_1 + x_2 β_2 + ... + x_{k-1} β_{k-1}

for suitably defined values of x_1, ..., x_{k-1}. These covariates are functions of x and the knots, but are independent of y. Here x_1 = x, and hence the hypothesis

    β_2 = β_3 = ... = β_{k-1} = 0

tests the linear hypothesis.

In logistic regression we use restricted cubic splines by modeling

    logit(E(d_i)) = α + x_1 β_1 + x_2 β_2 + ... + x_{k-1} β_{k-1}

Programs to calculate x_1, ..., x_{k-1} are available in Stata, R and other statistical software packages.
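For concreteness, here is a Python sketch of one common construction of these covariates, Harrell's truncated-power basis. The normalization by (t_k - t_1)² is an assumption chosen to match the _Sapache covariates that Stata's rc_spline produces in the example that follows.

```python
def rcs_basis(x, knots):
    """Spline covariates x_1, ..., x_{k-1} for a single value x.

    A sketch of Harrell's truncated-power basis for restricted cubic
    splines; dividing by (t_k - t_1)^2 is an assumed normalization.
    """
    t = list(knots)
    k = len(t)
    norm = (t[-1] - t[0]) ** 2

    def pcube(u):
        # Truncated cube (u)_+^3: zero to the left of the knot.
        return max(u, 0.0) ** 3

    covs = [x]  # x_1 = x, the linear term
    for j in range(k - 2):
        term = (pcube(x - t[j])
                - pcube(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + pcube(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        covs.append(term / norm)
    return covs

# Knots 7, 14, 25: the values rc_spline selects for the APACHE data below.
print(rcs_basis(10, [7.0, 14.0, 25.0]))  # [10, 0.0833...]
print(rcs_basis(20, [7.0, 14.0, 25.0]))  # [20, 5.6899...]
```

With these knots the nonlinear covariate reproduces the _Sapache2 values 0.083333 and 5.689955 tabulated later in this section.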

We fit a logistic regression model using a three-knot restricted cubic spline, with knots at the default locations: the 10th, 50th and 90th percentiles of the APACHE score.

. rc_spline apache, nknots(3)

number of knots = 3
value of knot 1 = 7
value of knot 2 = 14
value of knot 3 = 25

. logit fate _Sapache1 _Sapache2

Logit estimates                                   Number of obs   =        454
                                                  LR chi2(2)      =      62.05
                                                  Prob > chi2     =     0.0000
Log likelihood = -271.64615                       Pseudo R2       =     0.1025

------------------------------------------------------------------------------
        fate |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Sapache1 |   .1237794   .0447174     2.77   0.006      .036135    .2114238
   _Sapache2 |  -.0116944   .0596984    -0.20   0.845     -.128701    .1053123
       _cons |  -2.375381   .5171971    -4.59   0.000    -3.389069   -1.361694
------------------------------------------------------------------------------

Note that the coefficient for _Sapache2 is small and not significant, indicating an excellent fit for the simple logistic regression model.

Giving commands analogous to those used for simple logistic regression produces the following plot of predicted mortality as a function of baseline APACHE score.

. drop prob logodds se lb_logodds ub_logodds ub_prob lb_prob
. rename prob_rcs prob
. rename logodds_rcs logodds
. predict se, stdp
. generate float lb_logodds = logodds - 1.96*se
. generate float ub_logodds = logodds + 1.96*se
. generate float ub_prob = exp(ub_logodds)/(1 + exp(ub_logodds))
. generate float lb_prob = exp(lb_logodds)/(1 + exp(lb_logodds))
. twoway (rarea lb_prob ub_prob apache, blcolor(yellow) bfcolor(yellow))
>        (scatter proportion apache)
>        (line prob apache, clcolor(red) clwidth(medthick))

This plot is very similar to the one for simple logistic regression, except that the 95% confidence band is a little wider.

[Figure: Three Knot Restricted Cubic Spline Model: 95% confidence band (lb_prob/ub_prob), fitted probability Pr(fate), and Proportion Dead by 30 Days against APACHE Score at Baseline]

[Figure: Simple Logistic Regression: the same quantities plotted for the simple model, for comparison]

This regression gives the following table of values:

Percentile   Apache   _Sapache1   _Sapache2
    25         10        10        0.083333
    75         20        20        5.689955

We calculate the odds ratio of death for patients at the 75th vs. 25th percentile of APACHE score as follows. The log odds at the 75th percentile equal

    logit(π(20)) = α + β_1 × 20 + β_2 × 5.689955

The log odds at the 25th percentile equal

    logit(π(10)) = α + β_1 × 10 + β_2 × 0.083333

Subtracting the second equation from the first gives the log odds ratio for patients at the 75th vs. 25th percentile of APACHE score:

    log{ [π(20) / (1 - π(20))] / [π(10) / (1 - π(10))] } = β_1 × 10 + β_2 × 5.606622

Stata calculates this odds ratio to be 3.22, with a 95% confidence interval of 2.3 to 4.6:

. lincom _Sapache1*10 + 5.606622*_Sapache2, eform

 ( 1)  10 _Sapache1 + 5.606622 _Sapache2 = 0

------------------------------------------------------------------------------
        fate |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    3.22918   .5815912     6.51   0.000      2.26875    4.596188
------------------------------------------------------------------------------

Recall that for the simple model we had the following odds ratio and confidence interval:

------------------------------------------------------------------------------
        fate |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   3.178064   .5084803     7.23   0.000     2.322593    4.348628
------------------------------------------------------------------------------

The close agreement of these results supports the use of the simple logistic regression model for these data.
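The lincom point estimate can be reproduced from the spline coefficients and the covariate differences in the table:

```python
import math

# Coefficients from the spline model fit above.
b1 = 0.1237794   # _Sapache1
b2 = -0.0116944  # _Sapache2

# Differences in the spline covariates between the 75th and 25th percentiles.
log_or = b1 * (20 - 10) + b2 * (5.689955 - 0.083333)
print(math.exp(log_or))  # approximately 3.23, matching lincom's 3.22918
```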

9. Simple 2x2 Case-Control Studies

a) Example: Esophageal Cancer and Alcohol

Breslow & Day (Vol. I) give the following results from the Ille-et-Vilaine case-control study of esophageal cancer and alcohol. Cases were 200 men diagnosed with esophageal cancer in regional hospitals between 1/1/1972 and 4/30/1974. Controls were 775 men drawn from electoral lists in each commune.

                      Daily Alcohol Consumption
Esophageal Cancer       > 80g      < 80g      Total
Yes (Cases)               96        104         200
No (Controls)            109        666         775
Total                    205        770         975

b) Review of Classical Case-Control Theory

Let

    x_i = 1 for cases and 0 for controls,
    m_i = the number of cases (i = 1) or controls (i = 0), and
    d_i = the number of cases (i = 1) or controls (i = 0) who are heavy drinkers.

Then the observed prevalence of heavy drinkers is d_0/m_0 = 109/775 for controls and d_1/m_1 = 96/200 for cases. The observed prevalence of moderate or non-drinkers is (m_0 - d_0)/m_0 = 666/775 for controls and (m_1 - d_1)/m_1 = 104/200 for cases.

The observed odds that a case or control will be a heavy drinker are

    (d_i/m_i) / [(m_i - d_i)/m_i] = d_i / (m_i - d_i)

= 109/666 for controls and 96/104 for cases, respectively.

The observed odds ratio for heavy drinking in cases relative to controls is

    ψ̂ = [d_1 / (m_1 - d_1)] / [d_0 / (m_0 - d_0)] = (96/104) / (109/666) = 5.64

If the cases and controls are a representative sample from their respective underlying populations, then:

1. ψ̂ is an unbiased estimate of the true odds ratio for heavy drinking in cases relative to controls in the underlying population.

2. This true odds ratio also equals the true odds ratio for esophageal cancer in heavy drinkers relative to moderate drinkers. Case-control studies would be pointless if this were not true.

Since esophageal cancer is rare, ψ̂ also estimates the relative risk of esophageal cancer in heavy drinkers relative to moderate drinkers.

Woolf's estimate of the standard error of the log odds ratio is

    se[log(ψ̂)] = sqrt( 1/d_0 + 1/(m_0 - d_0) + 1/d_1 + 1/(m_1 - d_1) )

and the distribution of log(ψ̂) is approximately normal. Hence, if we let

    ψ̂_L = ψ̂ exp(-1.96 se[log(ψ̂)])   and   ψ̂_U = ψ̂ exp(1.96 se[log(ψ̂)]),

then (ψ̂_L, ψ̂_U) is a 95% confidence interval for ψ.
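The sketch below reproduces ψ̂ and Woolf's confidence limits from the 2x2 table in plain Python.

```python
import math

# Ille-et-Vilaine data: heavy drinkers among cases and controls.
d1, m1 = 96, 200    # cases
d0, m0 = 109, 775   # controls

psi_hat = (d1 / (m1 - d1)) / (d0 / (m0 - d0))
se_log = math.sqrt(1/d1 + 1/(m1 - d1) + 1/d0 + 1/(m0 - d0))

psi_L = psi_hat * math.exp(-1.96 * se_log)
psi_U = psi_hat * math.exp(1.96 * se_log)

print(round(psi_hat, 2))                 # 5.64
print(round(psi_L, 2), round(psi_U, 2))  # 4.0 7.95
```

These match the Woolf limits (4.000589, 7.951467) that Stata's cc command reports for these data.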

10. Logistic Regression Models for 2x2 Contingency Tables

Consider the logistic regression model

    logit(E(d_i/m_i)) = α + βx_i        {3.9}

where E(d_i/m_i) = π_i is the probability of being a heavy drinker for cases (i = 1) and controls (i = 0). Then {3.9} can be rewritten as

    logit(π_i) = log( π_i / (1 - π_i) ) = α + βx_i

Hence

    log( π_1 / (1 - π_1) ) = α + βx_1 = α + β

and

    log( π_0 / (1 - π_0) ) = α + βx_0 = α,

since x_1 = 1 and x_0 = 0. Subtracting these two equations gives

    log( π_1 / (1 - π_1) ) - log( π_0 / (1 - π_0) ) = log{ [π_1 / (1 - π_1)] / [π_0 / (1 - π_0)] } = log(ψ) = β,

and hence the true odds ratio is ψ = e^β.

a) Estimating relative risks from the model coefficients

Our primary interest is in β. Given an estimate b of β, ψ̂ = e^b.

b) Nuisance parameters

α is called a nuisance parameter: one that is required by the model but is not used to calculate interesting statistics.

11. Analyzing Case-Control Data with Stata

Consider the following data on esophageal cancer and heavy drinking:

cancer   alcohol   patients
No       < 80g        666
Yes      < 80g        104
No       >= 80g       109
Yes      >= 80g        96

. cc cancer alcohol [freq=patients], woolf

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        96         104  |        200       0.4800
        Controls |       109         666  |        775       0.1406
-----------------+------------------------+------------------------
           Total |       205         770  |        975       0.2103
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         5.640085       |    4.000589    7.951467 (Woolf)
 Attr. frac. ex. |         .8226977       |    .7500368    .8742371 (Woolf)
 Attr. frac. pop |         .3948949       |
                 +-------------------------------------------------
                               chi2(1) =   110.26  Pr>chi2 = 0.0000

We now calculate the same odds ratio using logistic regression. The estimated odds ratio is

    (96/104) / (109/666) = 5.64

. logistic alcohol cancer [freq=patients]

Logistic regression                               No. of obs      =        975
                                                  LR chi2(1)      =      96.43
                                                  Prob > chi2     =     0.0000
Log likelihood = -453.2224                        Pseudo R2       =     0.0962

------------------------------------------------------------------------------
     alcohol | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cancer |   5.640085   .9883491     9.87   0.000     4.000589    7.951467
------------------------------------------------------------------------------

This is the analogous logistic command for simple logistic regression. We could equally have entered the data in grouped form as

cancer   heavy   patients
0          109        775
1           96        200

This command fits the model

    logit(E(alcohol)) = α + cancer × β,

giving b = 1.73, the log odds ratio of being a heavy drinker in cancer patients relative to controls. The odds ratio is exp(1.73) = 5.64.

a) Logistic and classical estimates of the 95% CI of the OR

The 95% confidence interval from the logistic regression is

    (5.64 exp(-1.96 × 0.1752), 5.64 exp(1.96 × 0.1752)) = (4.00, 7.95)

The classical limits using Woolf's method are

    (5.64 exp(-1.96 s), 5.64 exp(1.96 s)) = (4.00, 7.95),

where s² = 1/96 + 1/109 + 1/104 + 1/666 = 0.0307 = (0.1752)².

Hence logistic regression is in exact agreement with classical methods in this simple case.

In Stata, the command

    cc cancer alcohol [freq=patients], woolf

gives Woolf's 95% confidence interval for the odds ratio. We will cover how to calculate confidence intervals using glm in the next chapter.