Arjas plots with Stata

Similar documents
Introduction. Survival Analysis. Censoring. Plan of Talk

Competing-risks regression

Relative survival an introduction and recent developments

Correlation and Regression

especially with continuous

Ordinal Regression. Chapter

Gamma Distribution Fitting

Lecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods

Comparing Means in Two Populations

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

Development of the nomolog program and its evolution

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

Module 2 Basic Data Management, Graphs, and Log-Files

Missing data and net survival analysis Bernard Rachet

Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, Last revised February 21, 2015

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

SUMAN DUVVURU STAT 567 PROJECT REPORT

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Distribution (Weibull) Fitting

Regression III: Advanced Methods

Nonlinear Regression Functions. SW Ch 8 1/54/

III. INTRODUCTION TO LOGISTIC REGRESSION. a) Example: APACHE II Score and Mortality in Sepsis

Interaction effects and group comparisons Richard Williams, University of Notre Dame, Last revised February 20, 2015

Nonlinear Regression:

Chapter 5 Analysis of variance SPSS Analysis of variance

How to set the main menu of STATA to default factory settings standards

Fairfield Public Schools

MULTIPLE REGRESSION EXAMPLE

Organizing Your Approach to a Data Analysis

Survival analysis methods in Insurance Applications in car insurance contracts

Review of Fundamental Mathematics

Comparison of Survival Curves

Scatter Plots with Error Bars

Slope-Intercept Equation. Example

Session 7 Bivariate Data and Analysis

The Central Idea CHAPTER 1 CHAPTER OVERVIEW CHAPTER REVIEW

PLOTTING DATA AND INTERPRETING GRAPHS

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 6: Constructing and Interpreting Graphic Displays of Behavioral Data

Confidence Intervals for Exponential Reliability

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

Comparing the predictive power of survival models using Harrell s c or Somers D

Reflection and Refraction

13. Poisson Regression Analysis

11. Analysis of Case-control Studies Logistic Regression

Simple linear regression

Using R for Linear Regression

Topic 3 - Survival Analysis

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Nonlinear relationships Richard Williams, University of Notre Dame, Last revised February 20, 2015

The Cox Proportional Hazards Model

Updates to Graphing with Excel

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

The first three steps in a logistic regression analysis with examples in IBM SPSS. Steve Simon P.Mean Consulting

Simple Regression Theory II 2010 Samuel L. Baker

Hands-On Math Algebra

Handling missing data in Stata a whirlwind tour

Research Methods & Experimental Design

A Basic Introduction to Missing Data

AP STATISTICS REVIEW (YMS Chapters 1-8)

Checking proportionality for Cox s regression model

Module 14: Missing Data Stata Practical

STATISTICAL ANALYSIS OF SAFETY DATA IN LONG-TERM CLINICAL TRIALS

Descriptive Statistics

Data Analysis, Research Study Design and the IRB

Dealing with Missing Data

SEQUENCES ARITHMETIC SEQUENCES. Examples

Part 1: Background - Graphing

Simple Linear Regression Inference

outreg help pages Write formatted regression output to a text file After any estimation command: (Text-related options)

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.

Chapter 5: Working with contours

Curve Fitting, Loglog Plots, and Semilog Plots 1

LOGIT AND PROBIT ANALYSIS

LOGISTIC REGRESSION ANALYSIS

25 Working with categorical data and factor variables

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Formulas, Functions and Charts

Module 5: Multiple Regression Analysis

Module 2: Introduction to Quantitative Data Analysis

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

Confidence Intervals for Cp

Rockefeller College University at Albany

Additional sources Compilation of sources:

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

A Guide to Imputing Missing Data with Stata Revision: 1.4

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

Guide to Biostatistics

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Advanced Quantitative Methods for Health Care Professionals PUBH 742 Spring 2015

Geometric Optics Converging Lenses and Mirrors Physics Lab IV

Logit and Probit. Brad Jones 1. April 21, University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science

Using Stata for Categorical Data Analysis

Introduction to Quantitative Methods

DATA INTERPRETATION AND STATISTICS

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Standards for Mathematical Practice: Commentary and Elaborations for 6 8

SAS Software to Fit the Generalized Linear Model

What Does the Normal Distribution Sound Like?

Analysing Questionnaires using Minitab (for SPSS queries contact -)

Transcription:

Fourth Italian Stata Users Group Meeting September 24 25, 2007 Rome Arjas plots with Stata Enzo Coviello Outline of the talk Graphical checks of the proportional hazards assumption Brief digression on the hazard plot Arjas plots by examples -starjas- 1

Graphical checks of the proportional hazards assumption The validity of Cox s regression analysis relies on the assumption of proportionality of the hazard rates of individuals with distinct values of covariates. There are several graphical techniques for checking this assumption. Some of them are readily available by using official Stata commands or can be easily produced by just a few instructions. The effect of two categorical covariates are introduced as examples: Distant metastases at diagnosis in Finnish colon cancer (available from the Paul Dickman web site net get http://www.pauldickman.com/rsmodel/stata_colon/strs), where hazard rates are not proportional Detection method of diagnosis in a sample of Italian breast cancer screening study, where the proportionality assumption holds 2

Distant Metastases in Colon Cancer. stcox distant, sch(sch) sca(sca) nolog. estat phtest Test of proportional-hazards assumption Time: Time ---------------------------------------------------------------- chi2 df Prob>chi2 ------------+--------------------------------------------------- global test 80.60 1 0.0000 ---------------------------------------------------------------- Invited to breast cancer screening test. stcox invited, sca(sca) sch(sch). estat phtest Test of proportional-hazards assumption Time: Time --------------------------------------------------------------- chi2 df Prob>chi2 -----------+--------------------------------------------------- global test 0.05 1 0.8281 --------------------------------------------------------------- scaled Schoenfeld - distant -2 0 2 4 6 Distant Metastases in Colon Cancer Time from diagnosis bandwidth =.8 Invited to breast cancer screening test scaled Schoenfeld - Invited -2-1 0 1 2 Time from diagnosis 3

ln Cumulative Hazard -4-3 -2-1 0 1 stphplot, by(distant) noneg noln distant = 0 distant = 1 analysis time Distant Metastases in Colon Cancer ln Cumulative Hazard -8-6 -4-2 0 Not Invited stphplot, by(invited) noneg nosh These curves should be approximately parallel if the hazard rates are proportional invitate = All Invited analysis tim e Invited to breast cancer screening test An alternative to the previous approach is to plot log cumulative hazard difference. If the proportionality holds, we should see a roughly horizontal line that is easier to assess than comparing parallel curves. Differences in Log Cumulative Hazard 1.6 1.8 2 2.2 2.4 Distant Metastases in Colon Cancer Time from diag nosis Differences in Log Cumulative Hazard -2-1.5-1 -.5 0 Invited to breast cancer screening test Time from diag nosis 4

Andersen Plot Another graphical check based on cumulative hazard is the so-called Andersen plot. For a binary covariate as distant (or invited ) we simply plot: H (distant=1) versus H (distant=0) There is not a specific command to obtain this plot but the codes to be used are straightforward: stcox, estimate strata(distant) basecha(hdist) g double Hdist1 = Hdist if distant==1 g double Hdist0 = Hdist if distant==0 g Hhat= Hdist0 * 7.225 sort _t replace Hdist1 = Hdist1[_n-1] if Hdist1==. replace Hdist0 = Hdist0[_n-1] if Hdist0==. twoway line Hdist1 Hhat Hdist0 Note that the variable we checked has been specified as strata() in - stcox-. Andersen plot If the proportionality of hazards hold, we should see a straight line passing through the origin Hazard Not Invited 0.05.1.15.2.25 Invited to breast cancer screening test 0.05.1.15.2 Hazard All Invited 5

Andersen plot Hazard distant = 1 0 1 2 3 4 Distant Metastases in Colon Cancer Otherwise we see a concave or convex curve. A dashed line corresponding to the hazard ratio estimated in the Cox model has been drawn. For example, the hazard ratio of distant metastases at diagnosis is 7.2. The Andersen plot tells us that the actual effect is greater than the estimated effect in the first part of the follow-up and less than the estimated effect in the rest. 0.1.2.3.4.5 Hazard distant = 0 Brief digression on the hazard plot 6

Plotting the hazard function for each level of a categorical covariate is a direct approach to check the proportionality of hazards. To obtain this plot we have to use -sts graph, hazardand not -stcurve, hazardbecause we must compare non parametrical hazard estimates rather than model based hazard estimates. If the hazards are proportional, then the hazard curves are approximately parallel on a log scale. Hazard estimates are obtained by a kernel-smoothing technique applied to the cumulative hazard estimates. In Stata 10 smoothed hazard are bias-adjusted to take into account the restricted range of data available near the boundaries of the analysis time. Let me remember -stkerhaz-, a version 7 command available on SSC Archive, that computes the same bias-adjusted estimates by applying the so-called asymmetric kernel near the boundaries. Has -stkerhaz- still some utility since version 10 has been released? 7

In some case yes: -stkerhaz- allows non parametrical hazard estimates to be saved in a file (-stcurve- saves estimates, but -sts graph- does not). This could be useful for non standard graphs. For example, the bandwidth is a key parameter in the kernel smoothing method. This may be a sensible choice to employ short bandwidth at the start of the follow-up and larger at the end when the number of subjects still observed can be seriously reduced. By means of -stkerhaz- we can save smoothed hazard estimates obtained by using two (or more) bandwidths, we can combine the files and then produce a hazard plot with different degree of smoothing in different parts of the same curve. Excess risk model compares the mortality of a study population to the mortality of a reference population. By subtracting the cumulative expected hazard of the reference population (-stexpect-) to the cumulative hazard (-sts gen-) we can estimate the cumulative excess hazard. From the cumulative excess hazard, via -stkerhaz-, we can get and plot smoothed excess hazard estimates. 8

sts graph, hazard by(distant) width(3 3) k(epan2) ysca(log) /// yla(.012.025.05 0.1 0.2 0.4 0.8 1.6, angle(0)) 1600 800 Distant Metastases in Colon Cancer 400 hazard x 1000 200 100 50 25 12 analysis time distant = 0 dista nt = 1 bandwidth = 3 years Here we see a decreasing distance of the hazard plots, i.e. a decreasing hazard ratio. We should be cautious in evaluating the distance in the right tail of the curves since confidence intervals may be very large there. 1600 800 Distant Metastases in Colon Cancer 400 hazard x 1000 200 100 50 25 12 analysis time distant = 0 dista nt = 1 bandwidth = 3 years 9

sts graph, hazard by(invited) width(3 3) k(epan2) ti("") ysca(log) /// yla(.008.012.018.027.04, angle(0)) 40 Invited to breast cancer screening test 27 hazard x 1000 18 12 8 analysis time Not Invited All Invited bandwidth = 3 years Here we see a constant distance of the hazard plots, i.e. a constant hazard ratio. 40 Invited to breast cancer screening test 27 hazard x 1000 18 12 8 analysis time Not Invited All Invited bandwidth = 3 years 10

Arjas plot by examples Arjas plot compares the total number of expected events to the total number of observed events occurring up to each event time. The cumulative hazard, calculated by the Nelson-Aalen estimator or by fitting a Cox model, is the base for working out the expected number of events. The Arjas plot allows to check: 1. whether a covariate should be included in a proportional hazards model 2. whether a covariate has proportional hazards effect. 11

1. Should a covariate be included in the model? Assuming that a Cox model has been fitted with covariates x1 and x2, we wish to check whether a categorical covariate x3 should be included in the model. To this aim, we should first fit a Cox model with x1 and x2 covariates, ignoring x3. The resulting cumulative hazard estimate may be used to compute the total expected number of events at each event time and for each level of x3. If plotting the above estimate versus the total number of events for each level of x3 gives linear curves with slopes differing from 1, then x3 should be included in the model. Let we consider the effect of method of detection of cases from the breast cancer screening study. stcox invited, nolog nosh Cox regression -- Breslow method for ties No. of subjects = 2962 No. of failures = 456 -------------------------- _t Haz. Ratio ------------+------------- invited.6283022 12

Let we consider the effect of method of detection of cases from the breast cancer screening study starjas invited, twoway options In the plot we see a curve for each level of the invited covariate linearly differing from 45 Estimated Cumulative Number of Events 0 50 100 150 200 250 Arjas Plots - Invited 0 50 100 150 200 250 Cumulative Number of Events Not Invited All Invited In the breast cancer screening study we investigated whether the improved survival of cases invited to the screening test depended on the better tumour characteristics of cases diagnosed at the screening test or was biased by spurious factors (selection bias, lead time bias). To this aim we compared the survival experience of invited and not invited cases after adjusting for tumour characteristics: T, N and grading. If spurious factors play a role we should see some residual independent effect of the method of detection. In the Arjas plot we can investigate if a covariate should be included in the model after adjustment for other covariates. In the next slide Arjas plot for method of detection of cases is shown after adjusting for T, N and grading. 13

starjas invited, adjust(t2 T3 T4 T5 N2 N3 Gr2 Gr3 Gr9 yydx_m) The plot tell us that the method of detection has not residual effect after adjusting for T N and grading Arjas Plots Estimated Cumulative Number of Events 0 50 100 150 200 250 0 50 100 150 200 250 Cumulative Number of Events Not invited All Invited starjas invited, adjust(t2 T3 T4 T5 N2 N3 Gr2 Gr3 Gr9 yydx_m) The plot tell us that the method of detection has not residual effect after adjusting for T N and grading. xi: stcox invited T2 T3 T4 T5 N2 N3 Gr2 Gr3 Gr9 yydx_m Cox regression -- Breslow method for ties No. of subjects = 2962 Number of obs = 2962 No. of failures = 456 ------------------------------------------------------------------------------ _t Haz. Ratio Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- invited.9922653.1020464-0.08 0.940.8111267 1.213855 omisis ------------------------------------------------------------------------------ Thus, using Arjas plot allows us to give a graphical expression to the effect of confounders on the main factor 14

2. Does a covariate have proportional hazards effect? Let we refer to the Bone Marrow Transplants data set (Klein and Moeschberger, p. 10 and p. 373) where the hazards of the autologous and allogeneic bone marrow transplant are clearly non-proportional Smoothed hazard estimates.08.06.04 The hazard ratio of the allotransplant variable is greater than 1 at the beginning of the follow-up,.02 but it is sharply declining and becomes less than 1 after 6 months 0 5 10 15 analysis time allo transplan t auto tra nsplant starjas allotransplant Estimated Cumulative Number of Events 25 20 15 10 5 Arjas Plots - alloauto The curves differ nonlinearly from the 45 line show a decreasing divergence until they become parallel to the 45 line when the hazard ratio approaches the value 1 and, finally, converge toward the 45 line when the hazard ratio turns < 1 0 0 10 20 30 Cumulative Number of Events allo transplant auto transplant 15

Estimated Cumulative Number of Events starjas distant 3000 2000 1000 0 Arjas Plots - Distant Metastases 0 1000 2000 3000 Cumulative Number of Events distant = 0 distant = 1 Cases without distant metastases produce a curve as expected Cases with distant metastases produce an approximately straight line. Why this happens? Estimated Cumulative Number of Events starjas distant 3000 2000 1000 0 Arjas Plots - Distant Metastases 0 1000 2000 3000 Cumulative Number of Events distant = 0 distant = 1 1.6.8.4.2.1.05.025.012 Smoothed hazard estimates 95% CI 95% CI distant = 0 distant = 1 4866 (824) 3291 (333) 1998 (132) 1155 (39) 570 (10) 150 2908 (2273) 462 (172) 194 (42) 96 (9) 44 (2) 6 In the group with distant metastases only a few events (number in parentheses) occur after 2 years of follow-up. Comparing observed and expected deaths after 2 years in the Arjas plot provides a very short line. 16

Estimated Cumulative Number of Events starjas distant 3000 2000 1000 0 Arjas Plots - Distant Metastases 0 1000 2000 3000 Cumulative Number of Events distant = 0 distant = 1 1.6.8.4.2.1.05.025.012 Smoothed hazard estimates 95% CI 95% CI distant = 0 distant = 1 4866 (824) 3291 (333) 1998 (132) 1155 (39) 570 (10) 150 2908 (2273) 462 (172) 194 (42) 96 (9) 44 (2) 6 Arjas plot rises the doubt that proportional hazards assumption for this variable is not so heavily questionable starjas 17

help for starjas Title starjas Arjas plot to check proportional hazards assumption Syntax starjas varname [if exp] [in range ] [, adjust(varlist ) levelplot(# [#].. ) atobs(#) rrglance(#) twoway_options ] options description Options adjust specify the variables fitted in the Cox model before checking if varname is proportional levelplot specify the levels of varname to be displayed in the plot atobs restrict the plot to the first # events rrglance draw a line approximately corresponding in the case of binary covariate to a # relative risk Y-Axis, X-Axis, Caption, Legend twoway_options some of the options documented in [G] twoway_options Add plot plot(plot ) add other plots to generated graph You must stset your data before using starjas; see stset. levelplot (# [#] ) can be useful when the variable we are checking has more than two levels. As an example, consider the effect of age at diagnosis in the Finnish colon cancer data set: egen agecat, cut(0 50 60 70 100) icodes label Estimated Cumulative Number of Events 0 500 1000 1500 2000 starjas Arjas Plots agecat - 0-49 50-59 60-69 70+ Estimated Cumulative Number of Events 0 500 1000 1500 2000, Arjas levelplot(2 Plots - agecat 3) 60-69 70+ 0 500 1000 1500 2000 2500 Cumulative Number of Events 0 500 1000 1500 2000 2500 Cumulative Number of Events 18

In interpreting plots for categorical covariates having more than two levels we must consider that the 45 line corresponds to the plot of expected and observed counts in the overall cohort of patients. The plots for the age categories 0-49, 50-59 and 60-69 are above this line because the hazards of these groups are lower than the hazard of the overall cohort and vice versa for cases in the age category > 70. Estimated Cumulative Number of Events 0 500 1000 1500 2000 Arjas Plots - agecat 0-49 50-59 60-69 70+ 0 500 1000 1500 2000 2500 Cumulative Number of Events In the case of a binary covariate, the dashed line in the Arjas plot indicates absence of effect, i.e. hazard ratio = 1. rrglance(#) allows to move this line so that it corresponds to a # hazard ratio.. stcox invited, nolog nosh Cox regression -- Breslow method for ties No. of subjects = 2962 ------------------------- _t Haz. Ratio -----------+------------- invited.6283022 Estimated Cumulative Number of Events 0 100 200 300 starjas invited, rrgla(0.6283022) 0 50 100 150 200 250 Cumulative Number of Events Not Invited All Invited 19

Arjas plot is useful to address two critical questions in modeling survival data by a Cox regression: Does the variable have to be included in the model? Is its effect proportional? Conclusions By this approach, we can try to answer to both questions using the same graph. -starjas- allows us to easily obtain Arjas plot and may be a complement to other Cox diagnostic plots available in the official package. It is freely downloadable from the SSC Archive by typing from within Stata: ssc install starjas and hope the users enjoy it. Thanks 20