BSTA 6651 Cat. Data Anal. Homework #3 Fall, 2011


Problem 5.1

Table 5.11 shows the statistical output of logistic regression results for modeling the probability of remission of cancer using a labeling index (LI) explanatory variable. The following optional SAS code can be used to reproduce the results shown in Table 5.11:

DATA PROB_5_1;
   Input LI N Remiss @@;
   datalines;
 8 2 0  10 2 0  12 3 0  14 3 0  16 3 0  18 1 1  20 3 2
22 2 1  24 1 0  26 1 1  28 1 1  32 1 0  34 1 1  38 3 2
;
Run;

PROC LOGISTIC Data=PROB_5_1;
   Model Remiss/N = LI / COVB;      * The COVB option displays the covariance matrix;
   Output Out=Logit5 p=pi_hat Lower=Lower Upper=Upper;
Run;                                * Lower and Upper are the confidence limits for Pi;

PROC PRINT Data=Logit5(where=(LI in (8 10)));
Run;                                * Display the Pi_Hat values for LI = 8 and LI = 10;

5.1a. The model being fit is Logit(π) = α + β*LI, where α = -3.7771 and β = 0.1449 are listed under "Estimate" in Table 5.11. To extract the values for π, first rewrite the model in terms of the odds,

   θ = e^(α + β*LI),   and then   π = θ / (1 + θ)

For LI = 8:
   θ(LI = 8) = e^(-3.777 + 0.1449*8) = 0.07296
   π = 0.07296 / (1 + 0.07296) = 0.06797

5.1b. Using the equations above we can find π(LI = 26) as:
   θ(LI = 26) = e^(-3.777 + 0.1449*26) = 0.9903
   π = 0.9903 / (1 + 0.9903) = 0.4973 ≈ 0.50

5.1c. Using the formula from section 5.1.1 we can calculate the rate of change in π as:
   ∂π(LI)/∂LI = β * π(LI) * [1 - π(LI)]
   ∂π/∂LI (LI = 8) = 0.1449 * 0.06797 * [1 - 0.06797] = 0.009
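The hand computations in parts a-c can be cross-checked numerically. A minimal Python sketch, using only the estimates reported in Table 5.11 (α̂ = -3.7771, β̂ = 0.1449); the helper function names are illustrative, not part of the original solution:

```python
import math

ALPHA, BETA = -3.7771, 0.1449  # estimates reported in Table 5.11

def odds(li):
    """theta = exp(alpha + beta*LI), the fitted odds of remission."""
    return math.exp(ALPHA + BETA * li)

def prob(li):
    """pi = theta / (1 + theta), the fitted probability of remission."""
    t = odds(li)
    return t / (1 + t)

def slope(li):
    """Instantaneous rate of change d(pi)/d(LI) = beta * pi * (1 - pi)."""
    p = prob(li)
    return BETA * p * (1 - p)

print(round(prob(8), 3))    # ~0.068  (part a)
print(round(prob(26), 2))   # ~0.50   (part b)
print(round(slope(8), 3))   # ~0.009  (part c)
```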

   ∂π/∂LI (LI = 26) = 0.1449 * 0.5 * [1 - 0.5] = 0.036

5.1d. We can calculate π at the lower and upper quartiles of LI as:
   θ(LI = 14) = e^(-3.777 + 0.1449*14) = 0.1740
   π = 0.1740 / (1 + 0.1740) = 0.1482 ≈ 0.15
   θ(LI = 28) = e^(-3.777 + 0.1449*28) = 1.3233
   π = 1.3233 / (1 + 1.3233) = 0.5696 ≈ 0.57

This allows us to calculate the change in π over the middle half of the range of LI values as:
   Δπ = 0.57 - 0.15 = 0.42

5.1e. From part a, we can rewrite the model as:
   θ = e^(α + β*LI) = e^α * e^(0.1449*LI)

For a unit change in LI, write LI* = LI + 1 and note that e^0.1449 ≈ 1.16:
   θ* = e^α * e^(0.1449*(LI+1)) = e^0.1449 * e^α * e^(0.1449*LI) = 1.16 * θ

which shows that for a unit change in LI the odds of remission change by a multiplicative factor of e^β = 1.16.

5.1f. The 95% C.I. of β is 0.1449 ± 1.96(0.0593) = (0.0287, 0.2611), and so the 95% C.I. of exp(β), the odds ratio θ, is (exp(0.0287), exp(0.2611)) = (1.029, 1.298).

5.1g. The Wald test statistic for LI is χ²_W = (β̂ / S.E.(β̂))² = (0.1449 / 0.0593)² = 5.96, and the upper-tail probability of a chi-square with 1 d.f. at 5.96 is 0.0146, which is smaller than 0.05. We can therefore conclude that LI has a significant effect on the remission rate (at the 5% significance level).

5.1h. Using the output at the top of Table 5.11 we can construct the likelihood-ratio statistic from the -2LogL values listed under "Intercept Only" (L0) and "Intercept and Covariates" (L1):
   -2(L0 - L1) = 34.372 - 26.073 = 8.299

This value agrees with the 8.2988 value shown in Table 5.11 for "Likelihood Ratio" under the section titled "Testing Global Null Hypothesis: BETA=0". The test statistic is χ² with 1 df, since the fitted model adds a single term (LI) to the intercept-only model. The p-value of 0.004 (listed in Table 5.11 for the parameter LI) is highly significant, so we reject H0: β = 0.

5.1i. Find the 95% C.I. for logit(π) first.
The MLE of logit(π) is logit(π̂) = α̂ + β̂*LI, so its (asymptotic) variance at LI = 8 is
   var(α̂) + 2*LI*cov(α̂, β̂) + LI²*var(β̂) = 1.90 + 2(8)(-0.077) + 8²(0.004) = 0.925
Therefore, the 95% C.I. for logit(π) at LI = 8 is

   (α̂ + β̂*LI) ± 1.96*S.E.(α̂ + β̂*LI) = -3.78 + 0.145(8) ± 1.96*√0.925 = (-4.505, -0.735)

Converting back from the logit scale, the 95% C.I. for π is
   (logit⁻¹(-4.505), logit⁻¹(-0.735)) = ( e^(-4.505)/(1 + e^(-4.505)), e^(-0.735)/(1 + e^(-0.735)) ) = (0.01, 0.32)

Problem 5.2

The data in Table 5.12 show the flight number (Ft), temperature (Temp, °F), and o-ring thermal distress response (TD: 1 = yes; 0 = no) for 23 space shuttle flights prior to the Challenger disaster in 1986. The data are based on Table 1 of S. R. Dalal, E. B. Fowlkes, and B. Hoadley, J. Amer. Statist. Assoc., 84: 945-957, 1989.

5.2a. (The SAS code used to produce the following results can be found in the appendix.) PROC LOGISTIC was used to model the effect of temperature on the probability of thermal distress in O-rings. The fitted model obtained was:

   Logit(π) = 15.0429 - 0.2322 * Temperature

A plot of π across the range of temperatures is given in Figure 1, which shows that the probability of thermal distress decreases as the temperature increases. The steepest rate of decrease in the probability of thermal distress occurs at π = 0.5, which corresponds to a temperature of -α̂/β̂ = 15.0429/0.2322 ≈ 64.8 °F.

Figure 1. Plot for problem 5.2 showing the predicted probability of thermal distress for a range of temperatures.
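Figure 1 itself is not reproduced in this transcription, but the plotted curve can be regenerated from the fitted coefficients. A minimal Python sketch (α̂ = 15.0429 and β̂ = -0.2322 are taken from part a; the function name is ours):

```python
import math

alpha, beta = 15.0429, -0.2322   # fitted coefficients from part a

def pi_hat(temp):
    """Predicted probability of O-ring thermal distress at a temperature (F)."""
    return 1 / (1 + math.exp(-(alpha + beta * temp)))

# A few points along the curve plotted in Figure 1 (the 53-81 F data range);
# the probabilities decrease monotonically as temperature rises
for t in (55, 60, 65, 70, 75, 80):
    print(t, round(pi_hat(t), 3))
```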

5.2b. At 31 °F the fitted logit is 15.0429 - 0.2322(31) = 7.845, so the probability of thermal distress at 31 °F is e^7.845 / (1 + e^7.845) = 99.96%. Note that this value (π at 31 °F) is an extrapolation beyond the data range of temperature (53-81 °F), which is not recommended.

5.2c. Shown below is the SAS output for PROC GENMOD detailing the parameter estimates.

                    Analysis Of Parameter Estimates
                         Standard   Wald 95% Confidence    Chi-
  Parameter  DF Estimate    Error        Limits           Square  Pr > ChiSq
  Intercept   1  15.0429   7.3786    0.5810   29.5048      4.16     0.0415
  Temp        1  -0.2322   0.1082   -0.4443   -0.0200      4.60     0.0320

The confidence interval for the effect of temperature on the odds of thermal distress can be obtained from the Wald 95% confidence limits for β: (e^(-0.4443), e^(-0.0200)) = (0.6413, 0.9801). With a chi-square value of 4.60 and df = 1, the p-value is 0.032; therefore the null hypothesis H0: β = 0 is rejected at the 5% significance level.

If you use PROC LOGISTIC with the CLODDS=PL model option, we obtain a point estimate of 0.793 with a 95% profile-likelihood confidence interval of (0.597, 0.941) for the effect on the odds of TD per 1 °F change in temperature. Thus a 1 °F temperature increase will reduce the odds of TD to between 59.7% and 94.1% of the original odds. The CLPARM=PL option gives the 95% profile-likelihood confidence intervals for the model parameters.

5.2d. If we re-run the analysis using a more complex model,

   model d:  Logit(π) = α + β1*Temperature + β2*Temperature²

we obtain a -2LogL value for this more complex model of 19.389, so its log likelihood is -9.6944; from the earlier result (part a) the log likelihood for the model with the linear term only (model a) is -10.1576. The likelihood-ratio statistic is

   -2(L_a - L_d) = 2[(-9.6944) - (-10.1576)] ≈ 0.93

This follows a chi-square distribution with 1 df, and its p-value from the table is 0.3378. Based on this p-value, we conclude that adding the quadratic term does not significantly improve the goodness of fit.
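The numbers in parts b-d reduce to a few arithmetic steps that are easy to verify; a short Python check, where every input below (the coefficients, the Wald limits, and the two log likelihoods) is taken from the SAS output quoted above:

```python
import math

alpha, beta = 15.0429, -0.2322   # fitted O-ring model coefficients

# 5.2b: extrapolated probability of thermal distress at 31 F
logit31 = alpha + beta * 31
p31 = math.exp(logit31) / (1 + math.exp(logit31))
print(round(p31, 4))                        # ~0.9996

# 5.2c: Wald 95% CI for the multiplicative effect on the odds per 1 F
print(round(math.exp(-0.4443), 3), round(math.exp(-0.0200), 3))  # ~0.641, 0.980

# 5.2d: likelihood-ratio statistic for adding the quadratic term
lr = 2 * (-9.6944 - (-10.1576))
print(round(lr, 3))                         # ~0.926
```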
Problem 5.9

The output in Table 5.14 shows the result of fitting a logit model to the death-penalty data in Table 2.6. Let def be the defendant's race and vic be the victim's race (1 = white, 0 = black). The fitted model is then:

   Logit(π) = -3.5961 - 0.8678*def + 2.4044*vic

5.9a. Since the def coefficient is negative and the vic coefficient is larger and positive, we conclude that cases with a white victim (vic = 1) and a black defendant (def = 0) will have the highest probability of the death penalty, which is
   π = e^(-3.5961 + 2.4044) / (1 + e^(-3.5961 + 2.4044)) = e^(-1.1917) / (1 + e^(-1.1917)) ≈ 0.23

Changing the defendant from black to white (0 to 1) changes the odds of the death penalty by a multiplicative factor of e^(-0.8678) = 0.42. Similarly, changing the victim from black to white (0 to 1) changes the odds of the death penalty by a multiplicative factor of e^(2.4044) = 11.07.

5.9b. Since there is no interaction between def and vic in the model, the conditional odds ratios for def are the same for black and white victims. The 95% confidence interval for the conditional odds ratio for def is (e^(-1.5633), e^(-0.1140)) = (0.21, 0.89), and for the conditional odds ratio for vic it is (e^(1.3068), e^(3.7175)) = (3.69, 41.16). The confidence interval for the victim effect is substantially wider than the CI for the defendant effect. Due to the logarithmic relationship, the CIs are not centered about the estimates. These C.I.'s can be interpreted as follows: controlling for victim's race, the odds of the death penalty when the defendant was white are between exp(-1.5633) = 0.209 and exp(-0.114) = 0.892 times the odds when the defendant was black. Likewise, controlling for defendant's race, the odds of the death penalty when the victim was white are between exp(1.3068) = 3.69 and exp(3.7175) = 41.16 times the odds when the victim was black.

5.9c. The hypothesis to test for conditional independence of defendant's race and the death penalty, controlling for victim's race, is H0: β1 = 0.
   (i) Wald test: (β̂1/SE)² = (-0.8678/0.3671)² = 5.59
   (ii) The chi-square for the LR test is given as 5.01 and is comparable to the Wald test value; both give small p-values (< 0.05), so we reject the null hypothesis and conclude there is a significant effect of defendant's race on the death penalty.

5.9d. The deviance G² = 0.3798 and Pearson χ² = 0.1978, both with df = 1, have p-values of 0.54 and 0.66, respectively; both show that we fail to reject H0, so the fit is reasonable.

Problem 5.15

Table 5.17, repeated below, shows the parameter estimates for the logistic regression model for esophageal cancer.
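The odds-ratio and test computations for Problem 5.9 are again simple transformations of the fitted quantities; a Python sketch using the coefficients, SE, and CI limits reported in Table 5.14:

```python
import math

b_def, b_vic = -0.8678, 2.4044   # coefficients from Table 5.14
se_def = 0.3671                  # SE of the def coefficient

# Conditional odds ratios (no interaction term, so each holds at both
# levels of the other predictor)
print(round(math.exp(b_def), 2))   # ~0.42
print(round(math.exp(b_vic), 2))   # ~11.07

# 95% CIs for the conditional odds ratios (exponentiated log-scale limits)
print(round(math.exp(-1.5633), 2), round(math.exp(-0.1140), 2))  # ~0.21, 0.89
print(round(math.exp(1.3068), 2), round(math.exp(3.7175), 2))    # ~3.69, 41.16

# 5.9c (i): Wald statistic for the defendant's-race effect
wald = (b_def / se_def) ** 2
print(round(wald, 2))              # ~5.59
```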
The model is:
   Logit(π) = α + β1*A + β2*S + β3*R + β4*RS

   Variable              Effect   P-value
   Intercept              -7.0     <0.01
   Alcohol use (A)         0.1      0.03
   Smoking (S)             1.2     <0.01
   Race (R)                0.3      0.02
   Race x Smoking (RS)     0.2      0.04

Based on the parameter estimates, the fitted model is:
   Logit(π) = -7.0 + 0.1*A + 1.2*S + 0.3*R + 0.2*RS

For blacks (R = 1), substituting R = 1 (so RS = S) gives:
   Logit(π) = -6.7 + 0.1*A + 1.4*S

When R = 0:
   Logit(π) = -7.0 + 0.1*A + 1.2*S

The YS conditional odds ratio is therefore exp(1.4) = 4.055 for blacks and exp(1.2) = 3.32 for whites.

The model equation when S = 1 (so RS = R) is:
   Logit(π) = -5.8 + 0.1*A + 0.5*R

The model equation when S = 0 is:
   Logit(π) = -7.0 + 0.1*A + 0.3*R

The YR conditional odds ratio is exp(0.5) = 1.65 for S = 1 and exp(0.3) = 1.35 for S = 0.

When R = 0, RxS is also zero, so the coefficient of S in the fitted equation is the log odds ratio between Y and S for whites (R = 0). Similarly, the coefficient of R is the log odds ratio between Y and R for S = 0. Therefore, the p-values for R and S test the null hypotheses of no effect of R on Y given S = 0, and of no effect of S on Y given whites, respectively.

Problem 5.18

(a) From the computer output in Table 5.1, the fitted model is:
   log(odds) = -12.3508 + 0.4972*width

   i) At a width of 26.3 cm, the estimated odds are exp(-12.3508 + 0.4972*26.3) = 2.07.
   ii) At a width of 27.3 cm, the estimated odds are exp(-12.3508 + 0.4972*27.3) = 3.40.
   iii) The odds ratio of 27.3 cm to 26.3 cm is therefore 3.40/2.07 = 1.64. That is, the odds increase by 64% as the width increases from 26.3 cm to 27.3 cm.

(b) The 95% confidence interval (C.I.) for the slope parameter β is (0.3084, 0.7090). The instantaneous rate of change of the probability of having satellites is βπ(1 - π), which equals 0.25β at π = 0.5. Therefore, the 95% C.I. for the instantaneous rate of change of π when π = 0.5 is 0.25*(0.3084, 0.7090) = (0.07, 0.17).
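The odds computations in Problem 5.18(a) can be verified the same way (α̂ = -12.3508 and β̂ = 0.4972 are from Table 5.1; the helper name is ours):

```python
import math

alpha, beta = -12.3508, 0.4972   # horseshoe-crab model from Table 5.1

def odds(width):
    """Fitted odds of having at least one satellite at a given width (cm)."""
    return math.exp(alpha + beta * width)

o1, o2 = odds(26.3), odds(27.3)
print(round(o1, 2), round(o2, 2))        # ~2.07, ~3.40
print(round(o2 / o1, 2))                 # ~1.64, i.e. exp(beta)

# 5.18b: 95% CI for the instantaneous rate of change at pi = 0.5
ci = (0.25 * 0.3084, 0.25 * 0.7090)
print(ci)   # ~(0.077, 0.177); the text truncates this to (0.07, 0.17)
```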

BSTA 6651 Cat. Data Anal. Homework #3 Fall, 2011   Dr. Fan

Appendix

************************ SAS code for problem 5.2 ***************************;

DATA PROB_5_2;
   Input Ft Temp TD @@;
   label TD = 'Thermal Distress (1=Yes, 0=No)';
   datalines;
 1 66 0   2 70 1   3 69 0   4 68 0   5 67 0   6 72 0   7 73 0   8 70 0
 9 57 1  10 63 1  11 70 1  12 78 0  13 67 0  14 53 1  15 67 0  16 75 0
17 70 0  18 81 0  19 76 0  20 79 0  21 75 1  22 76 0  23 58 1  24 31 .
;
* NOTE: Obs #24 was added to produce the Pi_Hat value for Temp = 31 F;
Run;

PROC LOGISTIC Data=PROB_5_2 Descending;
   Model TD = Temp / CLODDS=PL CLPARM=PL;
   Output Out=Logit2 p=pi_hat Lower=Lower Upper=Upper;
Run;                                * Lower and Upper are the confidence limits for Pi;

PROC PRINT Data=Logit2(where=(Temp in (31)));
Run;                                * Display the Pi_Hat value for Temp = 31 F;

PROC SORT Data=Logit2;
   By Temp Ft;
Run;                                * Sort data for plotting;

* Set up plotting symbols and axis definitions to improve plot appearance;
*************************************************************************;
Symbol1 value=dot h=1.2 i=spline w=2 c=blue l=1;   * Dot symbol with spline through points;
Symbol2 value=+ h=1.5 i=join w=1.5 c=red l=42;     * Plus symbol with dashed line;
Axis1 label=(angle=90 f=swissb height=1.5 'Estimated Probability') value=(height=1.5);
Axis2 label=(f=swissb height=1.5 "Temperature (F)") value=(height=1.5);
Legend1 label=(height=1.5 "Key:") value=(height=1.5);

PROC GPLOT Data=Logit2(Where=(Temp GT 50));
   Plot (Pi_Hat TD) * Temp / GRID vaxis=axis1 haxis=axis2 Overlay Legend=Legend1;
Run;
Quit;
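As a cross-check on this appendix, the same logistic fit can be reproduced without SAS. A pure-Python Newton-Raphson sketch (the data are transcribed from the datalines above, excluding the artificial observation #24; the fit_logistic helper is ours, and the guard at ±30 is an assumption to avoid overflow, not part of the original analysis):

```python
import math

# (temp, TD) pairs transcribed from the DATA PROB_5_2 datalines above
data = [(66,0),(70,1),(69,0),(68,0),(67,0),(72,0),(73,0),(70,0),
        (57,1),(63,1),(70,1),(78,0),(67,0),(53,1),(67,0),(75,0),
        (70,0),(81,0),(76,0),(79,0),(75,1),(76,0),(58,1)]

def fit_logistic(data, iters=30):
    """Newton-Raphson fit of logit P(TD=1) = a + b*temp."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            eta = max(-30.0, min(30.0, a + b * x))  # guard against exp overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += y - p          # score vector X'(y - p)
            g1 += (y - p) * x
            h00 += w             # information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: theta += inv(H) * g
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_logistic(data)
print(round(a, 4), round(b, 4))   # ~15.0429 and ~-0.2322, matching part 5.2a
```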