BSTA 6651 Cat. Data Anal. Homework #3 Fall, 2011


Problem 5.1

Table 5.11 shows the statistical output of logistic regression results for modeling the probability of remission of cancer using a labeling index (LI) explanatory variable. The following optional SAS code can be used to reproduce the results shown in Table 5.11:

DATA PROB_5_1;
   Input LI N Remiss @@;
   datalines;
 8 2 0  10 2 0  12 3 0  14 3 0  16 3 0  18 1 1  20 3 2
22 2 1  24 1 0  26 1 1  28 1 1  32 1 0  34 1 1  38 3 2
;
Run;

PROC LOGISTIC Data=PROB_5_1;
   Model Remiss/N = LI / COVB;      * The COVB option displays the covariance matrix;
   Output Out=Logit5 p=pi_hat Lower=Lower Upper=Upper;
Run;                                * Lower and Upper are the confidence limits for Pi;

PROC PRINT Data=Logit5(where=(LI in (8 10)));
Run;                                * Display the Pi_Hat values for LI = 8 and LI = 10;

5.1a. The model being fit is Logit(π) = α + β*LI, where α = -3.7771 and β = 0.1449 are listed under "Estimate" in Table 5.11. To extract the values for π, first rewrite the model in terms of the odds,

   θ = e^(α + β*LI),   and then   π = θ / (1 + θ)

For LI = 8:
   θ(LI = 8) = e^(-3.777 + 0.1449*8) = 0.07296
   π = 0.07296 / (1 + 0.07296) = 0.06797

5.1b. Using the equations above we can find π(LI = 26) as:
   θ(LI = 26) = e^(-3.777 + 0.1449*26) = 0.9903
   π = 0.9903 / (1 + 0.9903) = 0.4973 ≈ 0.50

5.1c. Using the formula from section 5.1.1 we can calculate the rate of change in π as:
   ∂π(LI)/∂LI = β * π(LI) * [1 - π(LI)]
   ∂π/∂LI (LI = 8) = 0.1449 * 0.06797 * [1 - 0.06797] = 0.009
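The hand computations in parts a-c can be cross-checked numerically. A minimal Python sketch, using only the estimates reported in Table 5.11 (α̂ = -3.7771, β̂ = 0.1449); the helper function names are illustrative, not part of the original solution:

```python
import math

ALPHA, BETA = -3.7771, 0.1449  # estimates reported in Table 5.11

def odds(li):
    """theta = exp(alpha + beta*LI), the fitted odds of remission."""
    return math.exp(ALPHA + BETA * li)

def prob(li):
    """pi = theta / (1 + theta), the fitted probability of remission."""
    t = odds(li)
    return t / (1 + t)

def slope(li):
    """Instantaneous rate of change d(pi)/d(LI) = beta * pi * (1 - pi)."""
    p = prob(li)
    return BETA * p * (1 - p)

print(round(prob(8), 3))    # ~0.068  (part a)
print(round(prob(26), 2))   # ~0.50   (part b)
print(round(slope(8), 3))   # ~0.009  (part c)
```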

   ∂π/∂LI (LI = 26) = 0.1449 * 0.5 * [1 - 0.5] = 0.036

5.1d. We can calculate π at the lower and upper quartiles of LI as:
   θ(LI = 14) = e^(-3.777 + 0.1449*14) = 0.1740
   π = 0.1740 / (1 + 0.1740) = 0.1482 ≈ 0.15
   θ(LI = 28) = e^(-3.777 + 0.1449*28) = 1.3233
   π = 1.3233 / (1 + 1.3233) = 0.5696 ≈ 0.57

This allows us to calculate the change in π over the middle half of the range of LI values as:
   Δπ = 0.57 - 0.15 = 0.42

5.1e. From part a, we can rewrite the model as:
   θ = e^(α + β*LI) = e^α * e^(0.1449*LI)

For a unit change in LI, write LI* = LI + 1 and note that e^0.1449 ≈ 1.16:
   θ* = e^α * e^(0.1449*(LI+1)) = e^0.1449 * e^α * e^(0.1449*LI) = 1.16 * θ

which shows that for a unit change in LI the odds of remission change by a multiplicative factor of e^β = 1.16.

5.1f. The 95% C.I. of β is 0.1449 ± 1.96(0.0593) = (0.0287, 0.2611), and so the 95% C.I. of exp(β), the odds ratio θ, is (exp(0.0287), exp(0.2611)) = (1.029, 1.298).

5.1g. The Wald test statistic for LI is χ²_W = (β̂ / S.E.(β̂))² = (0.1449 / 0.0593)² = 5.96, and the upper-tail probability of a chi-square with 1 d.f. at 5.96 is 0.0146, which is smaller than 0.05. We can therefore conclude that LI has a significant effect on the remission rate (at the 5% significance level).

5.1h. Using the output at the top of Table 5.11 we can construct the likelihood-ratio statistic from the -2LogL values listed under "Intercept Only" (L0) and "Intercept and Covariates" (L1):
   -2(L0 - L1) = 34.372 - 26.073 = 8.299

This value agrees with the 8.2988 value shown in Table 5.11 for "Likelihood Ratio" under the section titled "Testing Global Null Hypothesis: BETA=0". The test statistic is χ² with 1 df, since the fitted model adds a single term (LI) to the intercept-only model. The p-value of 0.004 (listed in Table 5.11 for the parameter LI) is highly significant, so we reject H0: β = 0.

5.1i. Find the 95% C.I. for logit(π) first.
The MLE of logit(π) is logit(π̂) = α̂ + β̂*LI, so its (asymptotic) variance at LI = 8 is
   var(α̂) + 2*LI*cov(α̂, β̂) + LI²*var(β̂) = 1.90 + 2(8)(-0.077) + 8²(0.004) = 0.925
Therefore, the 95% C.I. for logit(π) at LI = 8 is

   (α̂ + β̂*LI) ± 1.96*S.E.(α̂ + β̂*LI) = -3.78 + 0.145(8) ± 1.96*√0.925 = (-4.505, -0.735)

Converting back from the logit scale, the 95% C.I. for π is
   (logit⁻¹(-4.505), logit⁻¹(-0.735)) = ( e^(-4.505)/(1 + e^(-4.505)), e^(-0.735)/(1 + e^(-0.735)) ) = (0.01, 0.32)

Problem 5.2

The data in Table 5.12 show the flight number (Ft), temperature (Temp, °F), and o-ring thermal distress response (TD: 1 = yes; 0 = no) for 23 space shuttle flights prior to the Challenger disaster in 1986. The data are based on Table 1 of S. R. Dalal, E. B. Fowlkes, and B. Hoadley, J. Amer. Statist. Assoc., 84: 945-957, 1989.

5.2a. (The SAS code used to produce the following results can be found in the appendix.) PROC LOGISTIC was used to model the effect of temperature on the probability of thermal distress in O-rings. The fitted model obtained was:

   Logit(π) = 15.0429 - 0.2322 * Temperature

A plot of π across the range of temperatures is given in Figure 1, which shows that the probability of thermal distress decreases as the temperature increases. The steepest rate of decrease in the probability of thermal distress occurs at π = 0.5, which corresponds to a temperature of -α̂/β̂ = 15.0429/0.2322 ≈ 64.8 °F.

Figure 1. Plot for problem 5.2 showing the predicted probability of thermal distress for a range of temperatures.
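Figure 1 itself is not reproduced in this transcription, but the plotted curve can be regenerated from the fitted coefficients. A minimal Python sketch (α̂ = 15.0429 and β̂ = -0.2322 are taken from part a; the function name is ours):

```python
import math

alpha, beta = 15.0429, -0.2322   # fitted coefficients from part a

def pi_hat(temp):
    """Predicted probability of O-ring thermal distress at a temperature (F)."""
    return 1 / (1 + math.exp(-(alpha + beta * temp)))

# A few points along the curve plotted in Figure 1 (the 53-81 F data range);
# the probabilities decrease monotonically as temperature rises
for t in (55, 60, 65, 70, 75, 80):
    print(t, round(pi_hat(t), 3))
```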

5.2b. At 31 °F the fitted logit is 15.0429 - 0.2322(31) = 7.845, so the probability of thermal distress at 31 °F is e^7.845 / (1 + e^7.845) = 99.96%. Note that this value (π at 31 °F) is an extrapolation beyond the data range of temperature (53-81 °F), which is not recommended.

5.2c. Shown below is the SAS output for PROC GENMOD detailing the parameter estimates.

                    Analysis Of Parameter Estimates
                         Standard   Wald 95% Confidence    Chi-
  Parameter  DF Estimate    Error        Limits           Square  Pr > ChiSq
  Intercept   1  15.0429   7.3786    0.5810   29.5048      4.16     0.0415
  Temp        1  -0.2322   0.1082   -0.4443   -0.0200      4.60     0.0320

The confidence interval for the effect of temperature on the odds of thermal distress can be obtained from the Wald 95% confidence limits for β: (e^(-0.4443), e^(-0.0200)) = (0.6413, 0.9801). With a chi-square value of 4.60 and df = 1, the p-value is 0.032; therefore the null hypothesis H0: β = 0 is rejected at the 5% significance level.

If you use PROC LOGISTIC with the CLODDS=PL model option, we obtain a point estimate of 0.793 with a 95% profile-likelihood confidence interval of (0.597, 0.941) for the effect on the odds of TD per 1 °F change in temperature. Thus a 1 °F temperature increase will reduce the odds of TD to between 59.7% and 94.1% of the original odds. The CLPARM=PL option gives the 95% profile-likelihood confidence intervals for the model parameters.

5.2d. If we re-run the analysis using a more complex model,

   model d:  Logit(π) = α + β1*Temperature + β2*Temperature²

we obtain a -2LogL value for this more complex model of 19.389, so its log likelihood is -9.6944; from the earlier result (part a) the log likelihood for the model with the linear term only (model a) is -10.1576. The likelihood-ratio statistic is

   -2(L_a - L_d) = 2[(-9.6944) - (-10.1576)] ≈ 0.93

This follows a chi-square distribution with 1 df, and its p-value from the table is 0.3378. Based on this p-value, we conclude that adding the quadratic term does not significantly improve the goodness of fit.
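The numbers in parts b-d reduce to a few arithmetic steps that are easy to verify; a short Python check, where every input below (the coefficients, the Wald limits, and the two log likelihoods) is taken from the SAS output quoted above:

```python
import math

alpha, beta = 15.0429, -0.2322   # fitted O-ring model coefficients

# 5.2b: extrapolated probability of thermal distress at 31 F
logit31 = alpha + beta * 31
p31 = math.exp(logit31) / (1 + math.exp(logit31))
print(round(p31, 4))                        # ~0.9996

# 5.2c: Wald 95% CI for the multiplicative effect on the odds per 1 F
print(round(math.exp(-0.4443), 3), round(math.exp(-0.0200), 3))  # ~0.641, 0.980

# 5.2d: likelihood-ratio statistic for adding the quadratic term
lr = 2 * (-9.6944 - (-10.1576))
print(round(lr, 3))                         # ~0.926
```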
Problem 5.9

The output in Table 5.14 shows the result of fitting a logit model to the death-penalty data in Table 2.6. Let def be the defendant's race and vic be the victim's race (1 = white, 0 = black). The fitted model is then:

   Logit(π) = -3.5961 - 0.8678*def + 2.4044*vic

5.9a. Since the def coefficient is negative and the vic coefficient is larger and positive, we conclude that cases with a white victim (vic = 1) and a black defendant (def = 0) will have the highest probability of the death penalty, which is
   π = e^(-3.5961 + 2.4044) / (1 + e^(-3.5961 + 2.4044)) = e^(-1.1917) / (1 + e^(-1.1917)) ≈ 0.23

Changing the defendant from black to white (0 to 1) changes the odds of the death penalty by a multiplicative factor of e^(-0.8678) = 0.42. Similarly, changing the victim from black to white (0 to 1) changes the odds of the death penalty by a multiplicative factor of e^(2.4044) = 11.07.

5.9b. Since there is no interaction between def and vic in the model, the conditional odds ratios for def are the same for black and white victims. The 95% confidence interval for the conditional odds ratio for def is (e^(-1.5633), e^(-0.1140)) = (0.21, 0.89), and for the conditional odds ratio for vic it is (e^(1.3068), e^(3.7175)) = (3.69, 41.16). The confidence interval for the victim effect is substantially wider than the CI for the defendant effect. Due to the logarithmic relationship, the CIs are not centered about the estimates. These C.I.'s can be interpreted as follows: controlling for victim's race, the odds of the death penalty when the defendant was white are between exp(-1.5633) = 0.209 and exp(-0.114) = 0.892 times the odds when the defendant was black. Likewise, controlling for defendant's race, the odds of the death penalty when the victim was white are between exp(1.3068) = 3.69 and exp(3.7175) = 41.16 times the odds when the victim was black.

5.9c. The hypothesis to test for conditional independence of defendant's race and the death penalty, controlling for victim's race, is H0: β1 = 0.
   (i) Wald test: (β̂1/SE)² = (-0.8678/0.3671)² = 5.59
   (ii) The chi-square for the LR test is given as 5.01 and is comparable to the Wald test value; both give small p-values (< 0.05), so we reject the null hypothesis and conclude there is a significant effect of defendant's race on the death penalty.

5.9d. The deviance G² = 0.3798 and Pearson χ² = 0.1978, both with df = 1, have p-values of 0.54 and 0.66, respectively; both show that we fail to reject H0, so the fit is reasonable.

Problem 5.15

Table 5.17, repeated below, shows the parameter estimates for the logistic regression model for esophageal cancer.
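The odds-ratio and test computations for Problem 5.9 are again simple transformations of the fitted quantities; a Python sketch using the coefficients, SE, and CI limits reported in Table 5.14:

```python
import math

b_def, b_vic = -0.8678, 2.4044   # coefficients from Table 5.14
se_def = 0.3671                  # SE of the def coefficient

# Conditional odds ratios (no interaction term, so each holds at both
# levels of the other predictor)
print(round(math.exp(b_def), 2))   # ~0.42
print(round(math.exp(b_vic), 2))   # ~11.07

# 95% CIs for the conditional odds ratios (exponentiated log-scale limits)
print(round(math.exp(-1.5633), 2), round(math.exp(-0.1140), 2))  # ~0.21, 0.89
print(round(math.exp(1.3068), 2), round(math.exp(3.7175), 2))    # ~3.69, 41.16

# 5.9c (i): Wald statistic for the defendant's-race effect
wald = (b_def / se_def) ** 2
print(round(wald, 2))              # ~5.59
```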
The model is:
   Logit(π) = α + β1*A + β2*S + β3*R + β4*RS

   Variable              Effect   P-value
   Intercept              -7.0     <0.01
   Alcohol use (A)         0.1      0.03
   Smoking (S)             1.2     <0.01
   Race (R)                0.3      0.02
   Race x Smoking (RS)     0.2      0.04

Based on the parameter estimates, the fitted model is:
   Logit(π) = -7.0 + 0.1*A + 1.2*S + 0.3*R + 0.2*RS

For blacks (R = 1), substituting R = 1 (so RS = S) gives:
   Logit(π) = -6.7 + 0.1*A + 1.4*S

When R = 0:
   Logit(π) = -7.0 + 0.1*A + 1.2*S

The YS conditional odds ratio is therefore exp(1.4) = 4.055 for blacks and exp(1.2) = 3.32 for whites.

The model equation when S = 1 (so RS = R) is:
   Logit(π) = -5.8 + 0.1*A + 0.5*R

The model equation when S = 0 is:
   Logit(π) = -7.0 + 0.1*A + 0.3*R

The YR conditional odds ratio is exp(0.5) = 1.65 for S = 1 and exp(0.3) = 1.35 for S = 0.

When R = 0, RxS is also zero, so the coefficient of S in the fitted equation is the log odds ratio between Y and S for whites (R = 0). Similarly, the coefficient of R is the log odds ratio between Y and R for S = 0. Therefore, the p-values for R and S test the null hypotheses of no effect of R on Y given S = 0, and of no effect of S on Y given whites, respectively.

Problem 5.18

(a) From the computer output in Table 5.1, the fitted model is:
   log(odds) = -12.3508 + 0.4972*width

   i) At a width of 26.3 cm, the estimated odds are exp(-12.3508 + 0.4972*26.3) = 2.07.
   ii) At a width of 27.3 cm, the estimated odds are exp(-12.3508 + 0.4972*27.3) = 3.40.
   iii) The odds ratio of 27.3 cm to 26.3 cm is therefore 3.40/2.07 = 1.64. That is, the odds increase by 64% as the width increases from 26.3 cm to 27.3 cm.

(b) The 95% confidence interval (C.I.) for the slope parameter β is (0.3084, 0.7090). The instantaneous rate of change of the probability of having satellites is βπ(1 - π), which equals 0.25β at π = 0.5. Therefore, the 95% C.I. for the instantaneous rate of change of π when π = 0.5 is 0.25*(0.3084, 0.7090) = (0.07, 0.17).
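The odds computations in Problem 5.18(a) can be verified the same way (α̂ = -12.3508 and β̂ = 0.4972 are from Table 5.1; the helper name is ours):

```python
import math

alpha, beta = -12.3508, 0.4972   # horseshoe-crab model from Table 5.1

def odds(width):
    """Fitted odds of having at least one satellite at a given width (cm)."""
    return math.exp(alpha + beta * width)

o1, o2 = odds(26.3), odds(27.3)
print(round(o1, 2), round(o2, 2))        # ~2.07, ~3.40
print(round(o2 / o1, 2))                 # ~1.64, i.e. exp(beta)

# 5.18b: 95% CI for the instantaneous rate of change at pi = 0.5
ci = (0.25 * 0.3084, 0.25 * 0.7090)
print(ci)   # ~(0.077, 0.177); the text truncates this to (0.07, 0.17)
```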

BSTA 6651 Cat. Data Anal. Homework #3 Fall, 2011   Dr. Fan

Appendix

************************ SAS code for problem 5.2 ***************************;

DATA PROB_5_2;
   Input Ft Temp TD @@;
   label TD = 'Thermal Distress (1=Yes, 0=No)';
   datalines;
 1 66 0   2 70 1   3 69 0   4 68 0   5 67 0   6 72 0   7 73 0   8 70 0
 9 57 1  10 63 1  11 70 1  12 78 0  13 67 0  14 53 1  15 67 0  16 75 0
17 70 0  18 81 0  19 76 0  20 79 0  21 75 1  22 76 0  23 58 1  24 31 .
;
* NOTE: Obs #24 was added to produce the Pi_Hat value for Temp = 31 F;
Run;

PROC LOGISTIC Data=PROB_5_2 Descending;
   Model TD = Temp / CLODDS=PL CLPARM=PL;
   Output Out=Logit2 p=pi_hat Lower=Lower Upper=Upper;
Run;                                * Lower and Upper are the confidence limits for Pi;

PROC PRINT Data=Logit2(where=(Temp in (31)));
Run;                                * Display the Pi_Hat value for Temp = 31 F;

PROC SORT Data=Logit2;
   By Temp Ft;
Run;                                * Sort data for plotting;

* Set up plotting symbols and axis definitions to improve plot appearance;
*************************************************************************;
Symbol1 value=dot h=1.2 i=spline w=2 c=blue l=1;   * Dot symbol with spline through points;
Symbol2 value=+ h=1.5 i=join w=1.5 c=red l=42;     * Plus symbol with dashed line;
Axis1 label=(angle=90 f=swissb height=1.5 'Estimated Probability') value=(height=1.5);
Axis2 label=(f=swissb height=1.5 "Temperature (F)") value=(height=1.5);
Legend1 label=(height=1.5 "Key:") value=(height=1.5);

PROC GPLOT Data=Logit2(Where=(Temp GT 50));
   Plot (Pi_Hat TD) * Temp / GRID vaxis=axis1 haxis=axis2 Overlay Legend=Legend1;
Run;
Quit;
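As a cross-check on this appendix, the same logistic fit can be reproduced without SAS. A pure-Python Newton-Raphson sketch (the data are transcribed from the datalines above, excluding the artificial observation #24; the fit_logistic helper is ours, and the guard at ±30 is an assumption to avoid overflow, not part of the original analysis):

```python
import math

# (temp, TD) pairs transcribed from the DATA PROB_5_2 datalines above
data = [(66,0),(70,1),(69,0),(68,0),(67,0),(72,0),(73,0),(70,0),
        (57,1),(63,1),(70,1),(78,0),(67,0),(53,1),(67,0),(75,0),
        (70,0),(81,0),(76,0),(79,0),(75,1),(76,0),(58,1)]

def fit_logistic(data, iters=30):
    """Newton-Raphson fit of logit P(TD=1) = a + b*temp."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            eta = max(-30.0, min(30.0, a + b * x))  # guard against exp overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += y - p          # score vector X'(y - p)
            g1 += (y - p) * x
            h00 += w             # information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: theta += inv(H) * g
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_logistic(data)
print(round(a, 4), round(b, 4))   # ~15.0429 and ~-0.2322, matching part 5.2a
```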