2. Comparison results on MCAR, MAR, MNAR versions of HAMD study data


Lecture

1. Common methods for missing data
2. Comparison results on MCAR, MAR, MNAR versions of HAMD study data
3. Risk and odds
4. Smoothing binary data
5. Logistic regression with binary response

Common methods for MAR data

MAR property: missingness is related only to observed data.

1. Complete case analysis. Omit observations that are missing any part of the data. This is the SAS default for many procedures. Requires MCAR to be unbiased.

2. Last observation carried forward (LOCF). Applies to longitudinal data collection where early measurements are not missing but final measurements are. Use each subject's last non-missing measurement to fill in later missing values. Requires strong assumptions about the response; does not account for the uncertainty of the missing data.

3. Imputation. This means filling in each missing value with a guess. There are many ways to impute:

   - Use the mean of the individual's other values.
   - Replace a missing value in a group with the group mean.
   - Predict missing values of a variable V from a regression of V on other variables.

   Requires strong assumptions about the response; does not account for the uncertainty of the missing data.

Multiple imputation:

   (a) Impute observations for all missing values of a variable V: use random samples from a normal distribution with the mean and SD of V. (Or use regression to predict the mean and SD, then sample from this normal distribution.)
   (b) Do the imputation M times, creating M complete data sets.
   (c) Analyze each of the M complete data sets.
   (d) Combine the results of the M analyses to draw conclusions.

   Requires MAR to be unbiased. Partially accounts for the uncertainty of the missing data.
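
Steps (a) to (d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are made up, the function name `multiply_impute` is hypothetical, the "analysis" in step (c) is just a sample mean, and step (d) shows only the pooled point estimate (the pooled standard error needs the within- and between-imputation variances).

```python
import random
import statistics


def multiply_impute(values, m=5, seed=0):
    """Return m completed copies of `values`; None entries are replaced by
    draws from a normal distribution with the mean and SD of the observed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    completed = []
    for _ in range(m):
        # Steps (a) and (b): every copy gets its own random draw for each gap
        completed.append([rng.gauss(mean, sd) if v is None else v for v in values])
    return completed


# Hypothetical scores; None marks a missing value
scores = [12.0, 15.0, None, 9.0, None, 14.0]
datasets = multiply_impute(scores, m=5, seed=1)

# Step (c): analyze each completed data set (here, just its mean)
estimates = [statistics.mean(d) for d in datasets]

# Step (d): the pooled point estimate is the average of the m estimates
pooled = statistics.mean(estimates)
print(round(pooled, 2))
```

Because each copy uses different draws, the spread of the estimates across copies reflects the uncertainty caused by the missing values, which single imputation ignores.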

Plan

Estimate treatment means and test the treatment*center interaction from the full data and from the MCAR, MAR, and MNAR versions. For MCAR, MAR, and MNAR, apply:

1. complete case analysis
2. last observation carried forward (LOCF)
3. multiple imputation

Full data analysis

Test the interaction between treatments and centers:

    Proc GLM data=ph6470.hamd2;
      class drug center;
      model final = baseline drug center drug*center;
    run;

Estimate treatment means using the main-effects model:

    Proc GLM data=ph6470.hamd2;
      class drug center;
      model final = baseline drug center;
      LSmeans drug / stderr;
    run;

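[SAS output, interaction model: Type III tests (Source, DF, Type III SS, Mean Square, F Value, Pr > F) for baseline, drug, center, and drug*center. Surviving p-values: baseline <.0001, drug <.0001. Other values were not preserved.]

From the main-effects model:

[SAS output: parameter estimates (Estimate, Standard Error, t Value, Pr > |t|) for Intercept, baseline, drug D, drug P, and the center levels. Surviving p-values: baseline <.0001, drug D <.0001. Other values were not preserved.]

[SAS output: Least Squares Means of final by drug, with standard errors. Pr > |t| for H0: LSMEAN=0 is <.0001 for both D and P; Pr > |t| for H0: LSMean1=LSMean2 is <.0001.]

Where is the estimate of the treatment difference?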

Results so far ("..." marks a value that was not preserved):

    Data   Method   Interaction P-value   Drug effect (Pbo - Drug) ± SE   Drug effect P-value
    Full            ...                   ... ± 1                         <.0001
    MCAR
    MAR
    MNAR

Complete Case (CC)

Proc GLM omits any observation with a missing value for the response or for any predictor in the MODEL or CLASS statement. Apply the interaction and main-effects Proc GLM runs to the MCAR, MAR, and MNAR data sets.

MCAR complete case:

    The GLM Procedure
    Class Level Information
    Class    Levels   Values
    drug     2        D P
    center   ...      ...
    Number of Observations Read   100
    Number of Observations Used    67

Results so far ("..." marks a value that was not preserved):

    Data   Method   Interaction P-value   Drug effect (Pbo - Drug) ± SE   Drug effect P-value
    Full            ...                   ... ± 1                         <.0001
    MCAR   CC       ...                   ... ± 1                         <.0001
    MAR    CC       ...                   ... ± 1                         <.0001
    MNAR   CC       ...                   ... ± 1                         <.0001

Last observation carried forward (LOCF)

Fill in the missing final values with the baseline value in a data step.

    data MCAR_lcf;
      set MCAR;
      final_lcf = final;                        * create a new response variable;
      if final=. then final_lcf = baseline;
    run;

    data MAR_lcf;
      set MAR;
      final_lcf = final;
      if final=. then final_lcf = baseline;
    run;

    data MNAR_lcf;
      set MNAR;
      final_lcf = final;
      if final=. then final_lcf = baseline;
    run;
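
The data steps above cover the two-measurement case (baseline and final). With several visits per subject, the same idea carries the most recent non-missing value forward. A minimal Python sketch, assuming a subject's visits are stored in time order with `None` for a missing visit (the function name `locf` and the scores are hypothetical):

```python
def locf(visits):
    """Carry the last non-missing value forward; None marks a missing visit."""
    filled = []
    last = None
    for value in visits:
        if value is not None:
            last = value          # remember the most recent observed value
        filled.append(last)       # a gap is filled with that value
    return filled


# Hypothetical scores at baseline and three follow-up visits
subject = [24, 19, None, None]
print(locf(subject))  # [24, 19, 19, 19]
```

A value missing before any observation stays missing, since there is nothing to carry forward.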

MCAR last value carried forward

    The GLM Procedure
    Class    Levels   Values
    drug     2        D P
    center   ...      ...
    Dependent Variable: final_lcf
    Number of Observations Read   100
    Number of Observations Used   100

No missing data now, because we have filled all the holes.

Results so far ("..." marks a value that was not preserved):

    Data   Method   Interaction P-value   Drug effect (Pbo - Drug) ± SE   Drug effect P-value
    Full            ...                   ... ± 1                         <.0001
    MCAR   CC       ...                   ... ± 1                         <.0001
           LOCF     ...                   ... ± ...                       ...
    MAR    CC       ...                   ... ± 1                         <.0001
           LOCF     ...                   ... ± ...                       ...
    MNAR   CC       ...                   ... ± 1                         <.0001
           LOCF     ...                   ... ± 2                         <.0001

Multiple Imputation: Proc MI + Proc MIanalyze

We want to estimate a parameter $\mu$ (e.g. an adjusted mean or a regression coefficient) from data with missing values.

1. Proc MI. For each missing value $Y_i$, generate $M$ estimates $y_{im}$, $m = 1, \ldots, M$, using the distribution of the observed values. This uses the MAR property: missingness is related only to observed data. Fill in the missing values in the data using each set $\{y_{im}\}$ to produce $M$ complete data sets.

2. Analyze each of the $M$ complete data sets to get a parameter estimate $\hat{\mu}_m$ with variance $W_m$ (the squared standard error).

3. Proc MIanalyze. Combine the results of the $M$ analyses.

The combined estimate of $\mu$ is the average of the $M$ estimates $\{\hat{\mu}_m\}$:

$$\bar{\mu}_M = \frac{1}{M} \sum_{m=1}^{M} \hat{\mu}_m .$$

The variance of this estimate comes from the within-imputation variance, estimated by the mean $\bar{W}_M$ of the variances $\{W_m\}$, and the between-imputation variance

$$B_M = \frac{1}{M-1} \sum_{m=1}^{M} \left( \hat{\mu}_m - \bar{\mu}_M \right)^2 ,$$

and so its standard error is

$$SE(\bar{\mu}_M) = \sqrt{\bar{W}_M + \frac{M+1}{M}\, B_M} .$$

Little & Rubin (2002) Statistical Analysis with Missing Data, Second Edition
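
The combining step can be sketched directly from these formulas. The Python function below is an illustration, not a substitute for Proc MIanalyze: the name `combine_mi` and the three estimates and variances are hypothetical, and it returns only the pooled estimate and its standard error (no degrees of freedom or confidence limits).

```python
import math


def combine_mi(estimates, variances):
    """Pool M estimates and their squared standard errors from M imputed data sets."""
    m = len(estimates)
    pooled = sum(estimates) / m                      # average of the M estimates
    within = sum(variances) / m                      # mean within-imputation variance
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)
    total = within + (m + 1) / m * between           # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical results from M = 3 imputed data sets
est, se = combine_mi([4.0, 5.0, 6.0], [1.0, 1.0, 1.0])
print(est, round(se, 3))  # 5.0 1.528
```

In this example each analysis reports a variance of 1, but the pooled variance is 1 + (4/3)(1) = 7/3, because the estimates disagree across imputations.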

For the Depression Study example, the imputation code has 3 steps:

1. Proc MI generates M complete data sets, indexed by _Imputation_.
2. Proc GLM fits the model BY _Imputation_ and outputs the results as a data set (use ODS listing close to prevent writing them to the output window).
3. Proc MIanalyze reads the output data set and produces the combined estimate.

An additional problem is that drug and center are CLASS variables, and MIanalyze has problems with these, so indicator variables for them must be added to the data.

Make indicators for the CLASS variables in the MCAR, MAR, and MNAR data sets:

    data ph6470.hamd_mcar;
      set mcar;
      drugd = (drug="D");          * logical expressions make 0/1 indicators;
      center1 = (center=1);
      center2 = (center=2);
      center3 = (center=3);
      center4 = (center=4);
      drugcenter_1 = drugd * center1;
      drugcenter_2 = drugd * center2;
      drugcenter_3 = drugd * center3;
      drugcenter_4 = drugd * center4;
    run;

Multiple Imputation SAS code

Step 1. Make 20 complete data sets using imputation. (The seed value was not preserved and is left blank.)

    Proc MI data=ph6470.hamd_mcar
            out=c                   /* output data set */
            nimpute=20              /* number of filled-in data sets */
            seed=
            minimum=0 maximum=40    /* reject values outside 0-40, the range of HAMD */
            round=1.0;              /* round to integer */
      var final;                    /* variables to fill in */
    run;

    The MI Procedure
    Model Information
    Data Set                           PH6470.HAMD_MCAR
    Method                             MCMC
    Multiple Imputation Chain          Single Chain
    Initial Estimates for MCMC         EM Posterior Mode
    Start                              Starting Value
    Prior                              Jeffreys
    Number of Imputations              20
    Number of Burn-in Iterations       200
    Number of Iterations               100
    Seed for random number generator   ...

[SAS output: Missing Data Patterns table for baseline and final, with Freq, Percent, and Group Means columns. Values were not preserved.]

Step 2. Fit the model in Proc GLM to each of the 20 imputed data sets. Write the results to output data sets; see the examples in the Help documentation for Proc MIanalyze.

    ODS listing close;
    Proc GLM data=c;
      model final = baseline drugd center1 center2 center3 center4
                    drugcenter_1 drugcenter_2 drugcenter_3 drugcenter_4
                    / inverse solution;
      by _Imputation_;
      ODS output ParameterEstimates=glmparms InvXPX=glmxpxi;
    run;
    ODS listing;

Step 3. Combine estimates.

    Proc MIanalyze parms=glmparms xpxi=glmxpxi;
      modeleffects Intercept baseline drugd center1 center2 center3 center4
                   drugcenter_1 drugcenter_2 drugcenter_3 drugcenter_4;
    run;

It is very difficult to figure out what output should be passed from the procedures in step 2 to Proc MIanalyze. Follow the examples given in the documentation for MIanalyze, or use Google to look for examples.

MIanalyze: interaction model

[SAS output, The MIANALYZE Procedure: Parameter Estimates (Estimate, Std Error, 95% Confidence Limits, DF, t for H0: Parameter=Theta0, Pr > |t|) for drugd, the four center indicators, and the four drugcenter indicators. Values were not preserved.]

Interaction significant?

From the main-effects model:

[SAS output, The MIANALYZE Procedure: Parameter Estimates for Intercept, baseline, drugd, and the four center indicators. Surviving p-value: baseline <.0001. Other values were not preserved.]

Final results ("..." marks a value that was not preserved):

    Data   Method   Interaction P-value   Drug effect (Pbo - Drug) ± SE   Drug effect P-value
    Full            ...                   ... ± 1                         <.0001
    MCAR   CC       ...                   ... ± 1                         <.0001
           LOCF     ...                   ... ± ...                       ...
           MI       NS                    4.5 ± ...                       ...
    MAR    CC       ...                   ... ± 1                         <.0001
           LOCF     ...                   ... ± ...                       ...
           MI       NS                    4.8 ± ...                       ...
    MNAR   CC       ...                   ... ± 1                         <.0001
           LOCF     ...                   ... ± 2                         <.0001
           MI       NS                    3.7 ± ...                       ...

Imputing values when data are not missing at random can lead to severe bias.

Difficult questions with missing data imputation:

1. Do you have missing at random? How do you know?
2. How do you choose an imputation method? How can you use what you know to improve the process of imputation?

References:

Dmitrienko et al. (2005) Analysis of Clinical Trials Using SAS, Chapter 5
R. Little and D. Rubin (2002) Statistical Analysis with Missing Data, Second Edition

2 × 2 Tables: Relative Risk, Odds Ratio

[Proc Freq table: group by event (0, 1), showing Frequency and Row Pct with row and column totals. Cell values were not preserved.]

Row percent = rate of events = risk of events

Comparisons:

- risk difference
- risk ratio (relative risk)
- odds ratio

Two different null hypotheses for risks

1. $H_0: p_a - p_c = 0$, the risk difference is zero. The Z-test, based on $(\hat{p}_a - \hat{p}_c)$, is equivalent to the chi-square test. An alternative test of risk differences when some cells have small counts is Fisher's exact test. Risk differences are usually relevant for individuals.

2. Risk ratio, or relative risk, equals 1: $H_0: p_a / p_c = 1$.

The null value for differences is 0. The null value for ratios is 1. Relative risk applies to groups.
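
The equivalence of the Z-test and the chi-square test can be checked numerically. The sketch below assumes the Z statistic uses the pooled event rate in its standard error and that the chi-square statistic is Pearson's without a continuity correction; under those assumptions the squared Z statistic equals the chi-square statistic. The function name `risk_difference_tests` and the counts are hypothetical.

```python
import math


def risk_difference_tests(a, b, c, d):
    """2x2 table: row 1 has a events and b non-events, row 2 has c and d.
    Returns the pooled-variance Z statistic and the Pearson chi-square statistic."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    pooled = (a + c) / (n1 + n2)                     # event rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    n = n1 + n2
    chi2 = n * (a * d - b * c) ** 2 / (n1 * n2 * (a + c) * (b + d))
    return z, chi2


# Hypothetical counts: 10 of 100 events in group a, 5 of 100 in group c
z, chi2 = risk_difference_tests(10, 90, 5, 95)
print(round(z ** 2, 4), round(chi2, 4))  # the two numbers agree
```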

Odds

odds = (number of events) / (number without event) in the sample

               Event   No Event   odds   risk
    Group 1    A       B          A/B    A/(A + B)
    Group 2    C       D          C/D    C/(C + D)

Relating odds to risk:

$$\text{odds} = \frac{\text{number with event}}{\text{number without event}} = \frac{\text{number with event} / n}{\text{number without event} / n} = \frac{\hat{p}}{1 - \hat{p}}$$

For rare events, $\hat{p} \approx 0$, so the denominator is almost 1 and odds $\approx$ risk.

Odds are compared only by ratio, never by difference. With the same 2 × 2 layout, the odds ratio is the odds in the top row divided by the odds in the bottom row, which simplifies to $AD / BC$.

Test whether the population odds ratio is one, $H_0: OR = 1$, by checking whether the 95% confidence interval covers 1.
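
These quantities are easy to compute by hand or in code. The Python sketch below uses hypothetical counts and a hypothetical function name, `two_by_two`. The confidence interval is an assumption: the text does not say how the 95% interval is obtained, so the sketch uses the common large-sample interval built on the log odds ratio with standard error sqrt(1/A + 1/B + 1/C + 1/D).

```python
import math


def two_by_two(a, b, c, d):
    """Row 1: a events, b non-events. Row 2: c events, d non-events."""
    risk1, risk2 = a / (a + b), c / (c + d)
    odds1, odds2 = a / b, c / d
    odds_ratio = odds1 / odds2                       # same as (a*d) / (b*c)
    # Assumed large-sample 95% interval, computed on the log scale
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return {
        "risk": (risk1, risk2),
        "odds": (odds1, odds2),
        "relative_risk": risk1 / risk2,
        "odds_ratio": odds_ratio,
        "ci95": (lower, upper),
    }


# Hypothetical counts
result = two_by_two(10, 90, 5, 95)
print(round(result["relative_risk"], 3), round(result["odds_ratio"], 3))
print(result["ci95"][0] < 1 < result["ci95"][1])  # True: the interval covers 1
```

With these counts the odds ratio (about 2.11) is close to the relative risk (2.0) because the event is fairly rare in both groups.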

Comparing risks and odds in Proc Freq

[Proc Freq table: group by event (0, 1), showing Frequency and Row Pct with row and column totals. Cell values were not preserved.]

[Proc Freq output: Chi-Square statistic (DF, Value, Prob); Fisher's Exact Test with Cell (1,1) Frequency (F) = 141 and left-sided, right-sided, and two-sided probabilities; Estimates of the Relative Risk (Row1/Row2) with 95% confidence limits for Case-Control (Odds Ratio), Cohort (Col1 Risk), and Cohort (Col2 Risk); Sample Size. Other values were not preserved.]

Binary responses

Event/no-event, or 0/1, responses are binary responses.

Set-up: one trial in which $Y$ takes the values 1 = event or 0 = no event.

Chance of event = $P[Y = 1] = \pi$ = population event rate
Chance of no event = $P[Y = 0] = 1 - \pi$

$Y$ has a Bernoulli distribution (after Jakob Bernoulli).

mean of $Y$ = $\pi$, standard deviation of $Y$ = $\sqrt{\pi(1 - \pi)}$.

The SD is a function of the mean, unlike in the Normal distribution.

Example: Obesity in NHANES 2004

NHANES 2004 data for children and adults: people under age 50 (n = 6116)

Event = obesity, defined as BMI ≥ 30, or ≥ 95th percentile for children

Is there an association between age and the rate of obesity?

$$P[\text{obese} \mid \text{age}] = \pi(\text{age})$$

Graph the data and use LOESS (local linear regression) to estimate $\pi(\text{age})$ without assuming a shape:

    Proc SGplot data=under50;
      loess y = obese x = age / smooth=0.4;
    run;

To see the data, jitter the 0s and 1s. (The seed passed to ranuni and the leading terms of the first y_jitter expression were not preserved; they are shown as "...".)

    data under50;
      set pubh.obesity_2004;
      if (10.0 < age < 50.0);                    * too many zeros below age 10;
      gender = "F";
      if (female=0) then gender = "M";
      if obese=1 then y_jitter = ...*ranuni(...) - 0.2;
      if obese=0 then y_jitter = 0.15*ranuni(...) + .002;
    run;

    Proc SGplot data=under50;
      loess y = obese x = age / smooth=0.3;
      scatter y = y_jitter x = age;
    run;

Plot the smoother from the original data, then add the jittered data.

[Figure: LOESS curve of obese against age with jittered data points.] The plotting characters are too large for the density of the data.

    Proc SGplot data=under50;
      loess y = obese x = age / smooth=0.3
            MARKERATTRS=(symbol="circlefilled" size=1);
      scatter y = y_jitter x = age /
            MARKERATTRS=(symbol="circlefilled" size=1);
    run;

Both statements plot their data, so small plotting characters need to be set for both.

Smooth separately for each gender: is there an age × gender interaction?

    Proc SGplot data=under50;
      loess y = obese x = age / group = gender smooth=0.3
            MARKERATTRS=(symbol="circlefilled" size=1);
    run;

Regression with binary responses

Continuous response $y$: regression models the mean of $y$ as a function of predictors $x$:

$$\mu_Y(x) = \beta_0 + \beta_1 x$$

Binary (0/1) response $y$: regression models the mean of $y$ as a function of predictors $x$:

$$\pi(x) = f(\beta_0 + \beta_1 x)$$

There are many choices for $f$: logistic, probit, log-binomial, Poisson, but not the identity function, which is linear regression.

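Logistic link between mean and predictors

$$\text{mean} = P[\text{obese} \mid \text{age}] = \pi(\text{age}) = \frac{\exp(\beta_0 + \beta_1\,\text{age})}{1 + \exp(\beta_0 + \beta_1\,\text{age})} = \frac{1}{1 + \exp(-\beta_0 - \beta_1\,\text{age})}$$

This is the logistic curve on the probability scale. Equivalently:

$$\log\left( \frac{\pi(\text{age})}{1 - \pi(\text{age})} \right) = \beta_0 + \beta_1\,\text{age}$$

The function on the left is the log odds, or logit, because

$$\frac{\hat{\pi}}{1 - \hat{\pi}} = \frac{\text{number with event} / n}{\text{number without event} / n} = \frac{\text{number with event}}{\text{number without event}} = \text{odds}$$

The model is linear on the log odds (logit) scale.

Logistic curve (probability scale) for obesity and age:

$$\pi(\text{age}) = \frac{\exp(\beta_0 + \beta_1\,\text{age})}{1 + \exp(\beta_0 + \beta_1\,\text{age})}$$

[Figure: fitted probability of obesity (vertical axis, up to 1.0) against age in years.]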
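
The two forms of the logistic curve, and the fact that the logit undoes it, can be verified with a short Python sketch. The coefficient values below are hypothetical and are not the fitted NHANES values; the function names are illustrative.

```python
import math


def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def logistic(eta):
    """Probability for a linear predictor eta: 1 / (1 + exp(-eta))."""
    return 1 / (1 + math.exp(-eta))


def logistic_exp_form(eta):
    """The same curve written as exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1 + math.exp(eta))


# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -2.0, 0.03

for age in (10, 30, 50):
    eta = beta0 + beta1 * age            # linear on the log-odds scale
    p = logistic(eta)                    # curved on the probability scale
    print(age, round(p, 3), round(logit(p), 3))
```

The last column printed is the log odds, which reproduces beta0 + beta1 * age at every age.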

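[Figure: rate of obesity (vertical axis, up to 1) against age in years, with the range of the data marked.]

Logistic curves on probability scale

$$\text{logit}\,\pi(x) = \text{intercept} + (\text{slope})\,x = \beta_0 + \beta_1 x$$

[Figure: mean probability of event P(x) against predictor X for several logistic curves, including slope = +1, slope = +0.5, and intercept = +3 with slope = +1. The remaining intercept and slope labels were not preserved.]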

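Logistic curve (log-odds scale) for obesity and age:

$$\log\left( \frac{\pi(\text{age})}{1 - \pi(\text{age})} \right) = \beta_0 + \beta_1\,\text{age}$$

[Figure: log odds of obesity against age in years.]

Interpreting slope in logistic regression

On the log-odds scale, the logistic function is a line:

$$\text{log odds at } x = \beta_0 + \beta_1 x$$

[Figure: the logistic regression line on the log-odds scale, marking the log odds at x and at x + 1; the rise between them is the slope $\beta_1$. Horizontal axis: Predictor.]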

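The slope is the change in log(odds) for a unit change in $x$:

$$\text{slope } \beta_1 = \log(\text{odds at } x + 1) - \log(\text{odds at } x)$$

On the log scale, $\log A - \log B = \log(A / B)$, so

$$\text{slope } \beta_1 = \log(\text{odds at } x + 1) - \log(\text{odds at } x) = \log\left( \frac{\text{odds at } x + 1}{\text{odds at } x} \right)$$

Apply the exponential function as the inverse of the log:

$$\exp(\text{slope } \beta_1) = \frac{\text{odds at } x + 1}{\text{odds at } x}$$

exponential of slope = odds ratio for a unit increase in $x$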
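
A quick numerical check of this result, using hypothetical coefficients (not fitted values): the ratio of the odds at x + 1 to the odds at x is the same at every x and equals the exponential of the slope.

```python
import math

# Hypothetical intercept and slope on the log-odds scale
beta0, beta1 = -2.0, 0.03


def odds(x):
    """Odds of the event at predictor value x under the logistic model."""
    return math.exp(beta0 + beta1 * x)


ratios = [odds(x + 1) / odds(x) for x in (10, 25, 40)]
print([round(r, 4) for r in ratios], round(math.exp(beta1), 4))
```

The same argument gives the odds ratio for a k-unit increase as exp(k × slope); for example, a 10-unit increase here multiplies the odds by exp(0.3).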