BIOSTAT640 R-Solution for HW7 Logistic Minming Li & Steele H. Valenzuela Mar.10, 2016


Required Libraries

If there is an error in loading a library, you must first install the package with install.packages("package name"), with the package name in quotes, and then run the library() command, where the package name is not in quotes.

    library(foreign)
    # install.packages("readstata13")
    library(readstata13)
    # install.packages(c("pROC", "dplyr", "lmtest"))
    library(pROC)    # use roc{pROC} and plot.roc{pROC}
    library(dplyr)   # use glimpse{dplyr}
    library(lmtest)  # use lrtest{lmtest}
    # install.packages("ResourceSelection")
    library(ResourceSelection)  # use hoslem.test{ResourceSelection}

Upload Data

We will be downloading the following data set from the 640 course website:

    link <- "
    dat <- read.dta13(link)

    Warning in read.dta13(link): agegp: Factor codes of type double or float detected - no labels assigned. Set option nonint.factors to TRUE to assign labels anyway.
    Warning in read.dta13(link): tobgp: Factor codes of type double or float detected - no labels assigned. Set option nonint.factors to TRUE to assign labels anyway.
    Warning in read.dta13(link): alcgp: Factor codes of type double or float detected - no labels assigned. Set option nonint.factors to TRUE to assign labels anyway.

    glimpse(dat)

    Observations: 975
    Variables: 7
    $ case (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    $ age (dbl) 42, 45, 35, 78, 45, 64, 76, 42, 48, 42, 37, 45, 60, 54,...

    $ agegp (dbl) 2, 3, 2, 6, 3, 4, 6, 2, 3, 2, 2, 3, 4, 3, 5, 4, 3, 4, 3,...
    $ tob (dbl) 0.0, 7.5, 0.0, 0.0, 7.5, 17.5, 2.5, 0.0, 12.5, 25.0,
    $ tobgp (dbl) 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 2, 2, 1, 3, 1, 1, 3, 1, 1,...
    $ alc (dbl) 139, 66, 24, 39, 64, 49, 1, 33, 55, 62, 77, 84, 4, 32, 2...
    $ alcgp (dbl) 4, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 3, 1, 1, 1, 1, 1, 2, 1,...

    # str(dat)

1. Global Chi Square test VS. Likelihood Ratio Test of reduced model versus current model

Firstly, fit a Logistic Regression Model

The command glm(y ~ x_1 + x_2 + ... + x_n, family = binomial) fits a logistic regression model. Here is our first logistic model, with the outcome (dependent) variable case and the independent variables Tobacco Use/tobgp and Age/agegp.

    model1 <- glm(case ~ tobgp + agegp, data = dat, family = binomial)
    summary(model1)

    Call:
    glm(formula = case ~ tobgp + agegp, family = binomial, data = dat)

    Deviance Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) < 2e-16 ***
    tobgp e-09 ***
    agegp < 2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 974 degrees of freedom
    Residual deviance: on 972 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

    # Odds Ratios and 95% CI:
    exp(cbind(OR = coef(model1), confint(model1)))

    Waiting for profiling to be done...

    OR 2.5 % 97.5 %
    (Intercept)
    tobgp
    agegp

The Likelihood Ratio test and the Global Chi-Square test (also called the Score test) usually give compatible results. If the significance results differ, use the p-value from the Likelihood Ratio Test.

    # fit the intercept-only model
    model2 <- glm(case ~ 1, data = dat, family = binomial)
    summary(model2)

    Call:
    glm(formula = case ~ 1, family = binomial, data = dat)

    Deviance Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 974 degrees of freedom
    Residual deviance: on 974 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4

    # Likelihood Ratio Test of reduced model versus current model
    # lrtest(model2, model1)
    # L.R.Chisq is

Summary

It is of interest to know whether adding extra predictors to a model is statistically significant. The smaller model ("reduced") contains the control variables. The larger model ("full") contains the control variables plus the extra variables in question.

First, the easy way. The package lmtest provides a function called lrtest(...), which performs the Likelihood Ratio test.

    lrtest(model2, model1)

    Likelihood ratio test

    Model 1: case ~ 1
    Model 2: case ~ tobgp + agegp
    #Df LogLik Df Chisq Pr(>Chisq)
    < 2.2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the order is simple: the reduced model followed by the full model. The output displays the degrees of freedom for each model (#Df), the value of the log-likelihood for each model, the difference in degrees of freedom (Df), the LR statistic (in this case the Chi-square statistic), and lastly, the p-value, which determines whether the extra variables in the full model are significant. That is a lot to absorb, but it's extremely valuable to understand.

Now, for the long way. First, let's find the LR statistic. There are two ways, both of which compute LR = [-2 logLik(reduced)] - [-2 logLik(full)], that is, the deviance of the reduced model minus the deviance of the full model. We must first be aware of the objects that are contained in model2. If you run names(model2), you'll see that deviance is one of them. You pull out the value of that object as shown in the code below. Additionally, R provides the function logLik(...), which you may also use to obtain the LR statistic.

    LR.statistic.1 <- model2$deviance - model1$deviance
    LR.statistic.1
    [1]
    LR.statistic.2 <- -2*logLik(model2)[1] - (-2*logLik(model1)[1])
    LR.statistic.2
    [1]

After finding the LR test statistic two different ways, we find the p-value under a chi-square distribution with degrees of freedom = k, where k is the number of extra variables the full model has over the reduced model. In this case, k = 2.

    pchisq(LR.statistic.1, df = 2, lower.tail = FALSE)
    [1] e

2. Illustration of Hosmer-Lemeshow Goodness of Fit Test (NULL: model is a good fit)

    library(ResourceSelection)
    hoslem.test(model1$y, fitted(model1), g = 9)

    Hosmer and Lemeshow goodness of fit (GOF) test

    data: model1$y, fitted(model1)
    X-squared = , df = 7, p-value =

    # g: number of bins to use to calculate quantiles.
    pchisq(9.0807, df = 7, lower.tail = FALSE)
    [1]

WHAT TO LOOK FOR: Evidence of OVERALL GOODNESS OF FIT is reflected in a NON-SIGNIFICANT p-value. Here, the Hosmer-Lemeshow test p-value is non-significant, which suggests a good overall fit.

3. Illustration of Link Test (same as linktest in Stata)

Here is a great description from UCLA's statistical computation website:

"The Stata command linktest can be used to detect a specification error, and it is issued after the logit or logistic command. The idea behind linktest is that if the model is properly specified, one should not be able to find any additional predictors that are statistically significant except by chance. After the regression command (in our case, logit or logistic), linktest uses the linear predicted value (_hat) and linear predicted value squared (_hatsq) as the predictors to rebuild the model. The variable _hat should be a statistically significant predictor, since it is the predicted value from the model. This will be the case unless the model is completely misspecified. On the other hand, if our model is properly specified, variable _hatsq shouldn't have much predictive power except by chance. Therefore, if _hatsq is significant, then the linktest is significant. This usually means that either we have omitted relevant variable(s) or our link function is not correctly specified."

    model3 <- glm(case ~ agegp, data = dat, family = binomial)
    summary(model3)

    Call:
    glm(formula = case ~ agegp, family = binomial, data = dat)

    Deviance Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) <2e-16 ***
    agegp <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 974 degrees of freedom
    Residual deviance: on 973 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4

    hat <- predict(model3)
    # hat
    hatsq <- hat^2
    # hatsq
    linktest <- summary(glm(case ~ hat + hatsq, data = dat, family = binomial))
    linktest

    Call:
    glm(formula = case ~ hat + hatsq, family = binomial, data = dat)

    Deviance Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) ***
    hat **
    hatsq e-07 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 974 degrees of freedom
    Residual deviance: on 972 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 6

    # try another model: model1
    hat <- predict(model1)
    # hat
    hatsq <- hat^2
    # hatsq
    linktest <- summary(glm(case ~ hat + hatsq, data = dat, family = binomial))
    linktest

    Call:
    glm(formula = case ~ hat + hatsq, family = binomial, data = dat)

    Deviance Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept)
    hat
    hatsq **
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 974 degrees of freedom
    Residual deviance: on 972 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 6

4. Illustration of Classification Table of current model

    # observed vs. predicted
    # Classified as case if predicted Pr(D) >= .5
    # Note: predict() returns the linear predictor (log-odds) by default;
    # use predict(model1, type = "response") to obtain predicted probabilities.
    pred <- predict(model1)
    # pred
    # if pred > 0.5, set it to 1, otherwise set it to 0
    pred_cut <- ifelse(pred > 0.5, 1, 0)
    # pred_cut
    table(pred_cut)

    pred_cut

    # dat$case
    mydata <- as.data.frame(cbind(case = dat$case, pred_cut))
    # is.data.frame(mydata)
    # head(mydata)

    # 2-Way Cross Tabulation
    library(gmodels)

    Attaching package: 'gmodels'
    The following object is masked from 'package:pROC':
    ci
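Because the classification rule above is stated in terms of Pr(D), it helps to see how the two prediction scales of a binomial glm relate. The following is a minimal sketch on simulated data, not part of the original solution; the data frame sim and its columns x and y are hypothetical stand-ins for the course data set.

```r
set.seed(640)

# Simulated stand-in data; 'sim', 'x' and 'y' are hypothetical names
sim <- data.frame(x = rnorm(200))
sim$y <- rbinom(200, size = 1, prob = plogis(-1 + 0.8 * sim$x))

fit <- glm(y ~ x, data = sim, family = binomial)

eta <- predict(fit)                     # linear predictor (log-odds), the default
p   <- predict(fit, type = "response")  # predicted probabilities

# The inverse logit maps the log-odds scale onto the probability scale
all.equal(unname(plogis(eta)), unname(p))

# A probability cutoff of 0.5 is the same rule as a log-odds cutoff of 0
all((p > 0.5) == (eta > 0))
```

Under this sketch, cutting the default predict() output at 0.5 is a stricter rule than cutting the predicted probability at 0.5, since a log-odds value of 0.5 corresponds to a probability of plogis(0.5), which is above 0.5.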

    # The CrossTable() function in the gmodels package produces crosstabulations, similar to PROC FREQ in SAS
    CrossTable(mydata$case, mydata$pred_cut)

    Cell Contents
    N
    Chi-square contribution
    N / Row Total
    N / Col Total
    N / Table Total

    Total Observations in Table: 975

    mydata$pred_cut
    mydata$case 0 1 Row Total
    Column Total

    mydat <- as.table(matrix(c(4, 196, 7, 768), nrow = 2))
    mydat

        A   B
    A   4   7
    B 196 768

    colnames(mydat) <- c("Case", "Control")
    rownames(mydat) <- c("Pred+", "Pred-")
    mydat

          Case Control
    Pred+    4       7
    Pred-  196     768
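The measures that epi.tests() reports in the next step can be checked by hand from the four cell counts above. This is a minimal base R sketch, not part of the original solution; the names tab, tp, fp, fn and tn are introduced here for illustration only.

```r
# Cell counts taken from the 2x2 table above
# (rows: predicted +/-, columns: case/control)
tab <- matrix(c(4, 196, 7, 768), nrow = 2,
              dimnames = list(c("Pred+", "Pred-"), c("Case", "Control")))

tp <- tab["Pred+", "Case"]     # cases classified as cases
fp <- tab["Pred+", "Control"]  # controls classified as cases
fn <- tab["Pred-", "Case"]     # cases classified as controls
tn <- tab["Pred-", "Control"]  # controls classified as controls

sensitivity <- tp / (tp + fn)  # 4 / 200
specificity <- tn / (tn + fp)  # 768 / 775
ppv <- tp / (tp + fp)          # 4 / 11
npv <- tn / (tn + fn)          # 768 / 964

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 2)
```

Working from the raw counts makes it easy to see that the very low sensitivity comes from only 4 of the 200 cases being classified as cases.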

    library(epiR)

    Loading required package: survival
    Package epiR is loaded
    Type help(epi.about) for summary information

    # epi.tests(): Sensitivity, specificity and predictive value (PPV, NPV) of a diagnostic test
    epi.tests(mydat)

           Outcome + Outcome - Total
    Test +         4         7    11
    Test -       196       768   964
    Total        200       775   975

    Point estimates and 95 % CIs:
    Apparent prevalence        0.01 (0.01, 0.02)
    True prevalence            0.21 (0.18, 0.23)
    Sensitivity                0.02 (0.01, 0.05)
    Specificity                0.99 (0.98, 1.00)
    Positive predictive value  0.36 (0.11, 0.69)
    Negative predictive value  0.80 (0.77, 0.82)
    Positive likelihood ratio  2.21 (0.65, 7.49)
    Negative likelihood ratio  0.99 (0.97, 1.01)

    is.table(mydat) # TRUE
    [1] TRUE
    is.data.frame(mydat) # FALSE
    [1] FALSE
    # epi.tests() requires its input to be a TABLE, NOT A DATA.FRAME!

5. Illustration of ROC Curve

An ROC (Receiver Operating Characteristic) curve is a visual display of the overall performance of a fitted logistic model and its associated equation for predicted probabilities.

    library(pROC)
    roc1 <- roc(dat$case, fitted(model1))
    plot.roc(roc1, print.auc = TRUE, legacy.axes = TRUE)

[Figure: ROC curve for model1; axes: Sensitivity, Specificity; AUC printed on the plot]

    Call:
    roc.default(response = dat$case, predictor = fitted(model1))

    Data: fitted(model1) in 775 controls (dat$case 0) < 200 cases (dat$case 1).
    Area under the curve:

WHAT TO LOOK FOR: Classification that is no better than a coin toss is represented by the 45 degree reference line. Evidence of a GOOD FIT is reflected in an ROC curve that lies above the 45 degree reference line. The area under the ROC curve is approximately 0.746, which says that a randomly chosen case has a higher predicted probability than a randomly chosen control about 74.6% of the time. It is not the percentage of observations that are correctly classified.
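The area under the ROC curve equals the proportion of (case, control) pairs in which the case has the higher fitted value, counting ties as one half. A minimal base R sketch of that pairwise calculation on made-up scores follows; the helper auc_by_pairs and the toy vectors are illustrative and do not come from the assignment.

```r
# AUC as the proportion of (case, control) pairs ranked correctly;
# ties count as one half. 'auc_by_pairs' is a hypothetical helper.
auc_by_pairs <- function(response, score) {
  cases    <- score[response == 1]
  controls <- score[response == 0]
  diffs <- outer(cases, controls, "-")  # every case score minus every control score
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Toy example: 3 cases and 4 controls, with one tied pair
response <- c(1, 1, 1, 0, 0, 0, 0)
score    <- c(0.9, 0.6, 0.3, 0.7, 0.3, 0.2, 0.1)

auc_by_pairs(response, score)
```

Assuming case is coded 0/1 as in this data set, auc_by_pairs(dat$case, fitted(model1)) should agree with the area reported by roc().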