Score Test of Proportionality Assumption for Cox Models Xiao Chen, Statistical Consulting Group UCLA, Los Angeles, CA



ABSTRACT

Assessing the proportional hazards assumption is an important step in validating a Cox model for survival data. This paper provides a macro program for a score test based on scaled Schoenfeld residuals, written with SAS PROC IML, with several choices of functional form for the time variable. An example is presented to demonstrate the use of the score test and of graphical tools in assessing the proportionality assumption.

INTRODUCTION

Cox proportional hazards regression models are widely used for analyzing survival data, and a key assumption of these models is that the effect of any predictor variable is constant over time. There are two types of tests for the proportionality assumption. The first type tests time-varying covariates directly, with a Wald test for individual predictors and a partial likelihood ratio test for the global test. This can be performed using PROC PHREG in SAS by creating time-varying covariates and using the test statement. The second type is based on the scaled Schoenfeld residuals and is the one presented here. In this case, testing the time-dependent covariates is equivalent to testing for a non-zero slope in a generalized linear regression of the scaled Schoenfeld residuals on functions of time. A non-zero slope is an indication of a violation of the proportional hazards assumption. We can also perform an overall test on multiple predictor variables.

The common choices for functions of time are the log, rank and Kaplan-Meier transformations together with the identity function, all of which are included in the macro program, phreg_score_test, presented here. As with any regression, it is very helpful to graph the scaled Schoenfeld residuals against a time variable so that we can visually inspect possible patterns in addition to performing the tests of non-zero slopes.
There are certain types of non-proportionality that will not be detected by the tests of non-zero slopes alone but that may become obvious in the graphs of the residuals, such as a nonlinear relationship (e.g., a quadratic fit) between the residuals and the function of time, or undue influence of outliers.

SCORE TEST BASED ON SCALED SCHOENFELD RESIDUALS

Schoenfeld residuals after a Cox model are defined for each predictor variable in the model. That is, the number of Schoenfeld residual variables is the same as the number of predictor variables. They are based on the contribution of each predictor variable to the log partial likelihood. Grambsch and Therneau (1994) show that scaled Schoenfeld residuals can be of great use in diagnostics of Cox regression models, especially in assessing the proportional hazards assumption. In theory, the scaled Schoenfeld residuals are the Schoenfeld residuals adjusted by the inverse of the covariance matrix of the Schoenfeld residuals. Grambsch and Therneau (1994) suggest that, under the assumption that the distribution of the predictor variable is similar in the various risk sets, the adjustment can be performed using the variance-covariance matrix of the parameter estimates multiplied by the number of events in the sample.

The null hypothesis for the test of proportional hazards based on the scaled Schoenfeld residuals is that the slope of the scaled Schoenfeld residuals against a function of time is zero for each predictor variable. Once the scaled Schoenfeld residuals are created, we can perform this test using a generalized linear regression approach.
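In symbols, the adjustment described above can be sketched as follows. The notation here is an assumption chosen for illustration: r_i is the p-vector of Schoenfeld residuals for the i-th event, r_{s,i} is the corresponding vector of scaled residuals, V is the estimated variance-covariance matrix of the parameter estimates, and Δ is the total number of events.

```latex
r_{s,i} \;=\; \bigl[\widehat{\operatorname{Var}}(r_i)\bigr]^{-1} r_i
\;\approx\; \Delta \, V \, r_i
```

The approximation on the right is what the PROC IML step in the example computes, where the matrix of residuals is multiplied by V and by the total number of events.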
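The test statistic for an individual predictor variable is

$$ T_u = \frac{\left\{ \sum_i \delta_i \, \bigl(g(t_i) - \bar{g}\bigr) \, r_{s,i} \right\}^2}{\Delta \, V_{uu} \, \sum_i \delta_i \, \bigl(g(t_i) - \bar{g}\bigr)^2} $$

In this formula, $r_s$ is the variable of scaled Schoenfeld residuals for the predictor of interest, $g(t)$ is the function of time chosen before the test, $\bar{g}$ is the mean of $g(t_i)$ over the events (computed as gbar in the macro program), $\delta_i$ is the event indicator, $\Delta$ is the total number of events and $V_{uu}$ is the estimated variance of the parameter estimate of the predictor variable of interest. The sum is taken over all the observations in the data. The statistic is asymptotically distributed as $\chi^2$ with 1 degree of freedom.

The test statistic for the overall test on p predictor variables is

$$ T = \left\{ \sum_i \delta_i \, \bigl(g(t_i) - \bar{g}\bigr) \, r_i \right\}^{\top} \left[ \frac{\Delta \, V}{\sum_i \delta_i \, \bigl(g(t_i) - \bar{g}\bigr)^2} \right] \left\{ \sum_i \delta_i \, \bigl(g(t_i) - \bar{g}\bigr) \, r_i \right\} $$

where $r_i$ is the vector of the unscaled Schoenfeld residuals of interest and $V$ is the estimated variance-covariance matrix of the corresponding parameter estimates. It is asymptotically distributed as $\chi^2$ with p degrees of freedom.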
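As a minimal sketch of the individual-predictor statistic, the PROC IML step below evaluates it on a small made-up example with the identity function g(t) = t. The five event times, the residual values and the variance 0.01 are hypothetical numbers chosen only so that the arithmetic can be checked by hand; they do not come from any data set used in this paper. The steps mirror the macro segment shown later in the paper: the squared sum of centered time multiplied by the scaled residuals, divided by the number of events times the variance times the sum of squared centered time.

```sas
proc iml;
  /* Hypothetical toy inputs, not taken from the WHAS500 data */
  g   = {1, 2, 3, 4, 5};                 /* g(t) at the five event times        */
  rs  = {0.2, -0.1, 0.05, -0.3, 0.15};   /* scaled Schoenfeld residuals         */
  Vuu = 0.01;                            /* variance of the parameter estimate  */

  delta  = nrow(g);                      /* total number of events              */
  gbar   = sum(g) / delta;               /* mean of g(t) over the events        */
  top    = (sum((g - gbar) # rs)) ## 2;  /* squared numerator                   */
  bottom = delta * Vuu * ssq(g - gbar);  /* denominator                         */
  chi2   = top / bottom;                 /* score statistic, 1 df               */
  pvalue = 1 - probchi(chi2, 1);

  print chi2 pvalue;
quit;
```

By hand, the centered times are -2, -1, 0, 1, 2, the numerator is (-0.3)^2 = 0.09 and the denominator is 5 x 0.01 x 10 = 0.5, so the statistic is 0.18 on 1 degree of freedom. Replacing g with log(g) in the same step would give the log-time version of the test.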

AN EXAMPLE

The data set used for this example is taken from Applied Survival Analysis: Regression Modeling of Time to Event Data, Chapter 6. The data set can be downloaded following the link. The time to event variable is lenfol and the censor variable is fstat. The predictor variables that we will use for the example are age, bmi, hr (heart rate) and gender. In this example, we will show how to manually create scaled Schoenfeld residuals and how to graphically inspect possible deviations from the assumption of proportional hazards.

We first run the Cox model using PROC PHREG. In this run, two data sets are created: the data set that contains the variance-covariance matrix, named est, created using the outest= and covout options, and another data set containing the Schoenfeld residuals for each predictor variable, named res, created using the output statement.

    proc phreg data = whas500 outest=est covout;
      model lenfol*fstat(0) = age bmi hr gender;
      id id;
      output out=res ressch = age_r bmi_r hr_r gender_r;
    run;

In order to create the scaled Schoenfeld residuals, we need the total number of events. We use proc sql to sum up the censor variable and store the result in a macro variable called total.

    proc sql noprint;
      select sum(fstat) into :total from whas500;
    quit;

Now we have all the information we need for adjusting the Schoenfeld residuals using proc iml.

    proc iml;
      /* Schoenfeld residuals and time/censor variables, events only */
      use res;
      read all var {age_r bmi_r hr_r gender_r} into L where (fstat = 1);
      read all var {lenfol fstat} into X where (fstat = 1);
      /* variance-covariance matrix of the parameter estimates */
      use est;
      read all var {age bmi hr gender} into V where (_type_ = "COV");
      /* scale the residuals and append them to the time and censor columns */
      ssr = (&total)*L*V;
      W = X || ssr;
      create p var {lenfol fstat sage_r sbmi_r shr_r sgender_r};
      append from W;
    quit;

At this point, a data set called p has been created. This data set has the time variable, the censor variable and all the scaled Schoenfeld residual variables.
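One quick sanity check on the data set p can be sketched as follows; it is an addition for illustration and not part of the original example. At the maximum partial likelihood estimate the Schoenfeld residuals for each predictor sum to approximately zero over the events (up to convergence tolerance), so the scaled residuals should do so as well, and the number of rows in p should equal the number of events stored in the macro variable total.

```sas
/* Assumes the data set p and the macro variable total created in the steps above. */
proc means data=p n sum mean maxdec=6;
  var sage_r sbmi_r shr_r sgender_r;
run;

/* Write the stored event count to the log for comparison with N from PROC MEANS */
%put NOTE: number of events stored in total = &total;
```

If N differs from the event count, or the sums are far from zero, the read and where clauses in the PROC IML step are the first place to look.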
To visually inspect the trend, we can also make use of a nonparametric smoothing technique, such as the one provided by proc loess, shown here for the scaled Schoenfeld residual variable of the predictor variable hr (heart rate). This process has to be repeated for the scaled Schoenfeld residual variable of each predictor variable in the model. For illustration purposes, we show just one.

    proc loess data=p;
      model shr_r = lenfol / smooth=0.4;
      ods output OutputStatistics=myout;
    run;

Now we have done all the preparation for displaying the trend of the scaled Schoenfeld residuals for heart rate against the original time variable, lenfol.

    proc sort data = myout;
      by lenfol;
    run;

    symbol1 c = gray i = none v = circle h = .8;
    symbol2 c = black i = join v = none w = 2.5;
    axis1 order = (-.1 to .15 by .05) minor = none label = (a=90 'Scaled Schoenfeld Residuals');
    axis2 order = (0 to 2400 by 400) label = ('time') minor = none;
    proc gplot data = myout;
      plot DepVar*lenfol=1 Pred*lenfol=2 / vaxis = axis1 haxis = axis2 vref = 0 overlay;
    run;
    quit;

The plot does not show a strong trend along the original time variable, even though the loess estimate shows a slight sign of a negative slope. So far we have shown how to create the scaled Schoenfeld residuals from the Schoenfeld residuals that SAS provides, using PROC IML. We can also apply the macro program phreg_score_test to perform the test:

    %phreg_score_test(lenfol, fstat, age bmi hr gender, data=whas500);

In the macro output, the first column is the correlation of the scaled Schoenfeld residuals with the time variable. The second column is the test statistic defined previously. The global test tests simultaneously that all the slopes are zero. All the p-values are fairly large, so the tests provide no evidence against zero slopes, that is, no evidence against proportional hazards.

REMARK

Several common transformations of the time variable are available: rank, log and the Kaplan-Meier estimate. The default of the macro program phreg_score_test is the identity function. To specify another transformation of time, one can simply use the option type= as shown in the examples that follow. Even though some simulations have shown that the log-transformed time variable works well, there are situations where the different time variables behave differently. The decision on which time variable to use is made case by case and depends largely on the theory and the focus of the researchers.

    %phreg_score_test(lenfol, fstat, bmifp1 bmifp2, data=whas500);
    %phreg_score_test(lenfol, fstat, bmifp1 bmifp2, data=whas500, type="rank");
    %phreg_score_test(lenfol, fstat, bmifp1 bmifp2, data=whas500, type="logtime");
    %phreg_score_test(lenfol, fstat, bmifp1 bmifp2, data=whas500, type="km");

We include a segment of the macro program to show what is involved in the computation.

    %macro phreg_score_test(time, event, xvars, strata, weight=, data=_last_, type="time");
      /* build the list of residual variable names: one <xvar>_r per predictor */
      %let xvar_r =;
      %let k = 1;
      %let v = %scan(&xvars, 1);
      %do %while ("&v"~="");
        %let xvar_r = &xvar_r &v._r;
        %let k = %eval(&k + 1);
        %let v = %scan(&xvars, &k);
      %end;
      %let varnames = &time &xvars &xvar_r;

      ods listing close;
      proc phreg data=&data covout outest=_est_ (drop=_lnlike_);
        model &time*&event(0) = &xvars;
        strata &strata;
        output out = _res_ (where = (&event=1)) ressch = &xvar_r;
      run;
      proc sort data = _res_;
        by &time;
      run;
      /*counting the number of total events*/
      proc sql noprint;
        select sum(&event) into :delta from &data;
      quit;
      ods listing;

      /* the step that creates _tvars_ (transformed time variables) is not part of this segment */
      proc iml;
        reset noname printadv = 1;
        use _res_;
        read all var {&xvar_r} into S;
        use _tvars_;
        read all var {&time _logtime _Rtime s} into T;
        c = ncol(s);
        r = nrow(s);
        use _est_;
        read all var _num_ into V where (_TYPE_^="PARMS");
        read all var {_name_} into N where (_TYPE_^="PARMS");
        sv = J(r, c, 0);
        sv = &delta*s*v;    /* scaled Schoenfeld residuals */
        %if (%upcase(&type)="TIME") %then %do;
          gbar = sum(t[,1])/&delta;
          top = J(c, 1, 0);
          do i = 1 to c;
            top[i] = sum((t[, 1]-gbar)#sv[, i])**2;
          end;
          bottom = J(c, 1, 1);
          do i = 1 to c;
            bottom[i] = &delta*t(t[,1]-gbar)*(t[,1]-gbar)*v[i,i];
          end;
          chi2 = top/bottom;
          X = J(c+1, 4, 0);
          print "Score test of proportional hazards assumption";
          print "Time variable: &time";
          ct = T[, 1] - sum(t[,1])/&delta;
          norm_ct = sqrt(t(ct)*ct);
          do i = 1 to c;
            csv = sv[, i] - sum(sv[,i])/&delta;
            n_csv = sqrt(t(csv)*csv);
            X[i, 1] = t(ct)*csv/(norm_ct*n_csv); /*correlation*/
            X[i, 2] = chi2[i]; /*cstat*/
            X[i, 3] = 1;
            X[i, 4] = 1- probchi(chi2[i], 1); /*probchi2*/
          end; /* individual test*/
          rname = N//"Global test";
          cname={"rho" "Chi-Square" "df" "P-value"};
          rowmat = J(1,c,1);
          a = (T[,1]-gbar)#S;
          do i = 1 to c;
            rowmat[i] = sum(a[, i]);
          end;
          global = &delta*(rowmat*v*t(rowmat))/(t(t[,1]-gbar)*(t[,1]-gbar));
          probchi2 = 1-probchi(global,c);
          X[c+1, 1] = .;
          X[c+1, 2] = global;
          X[c+1, 3] = c;
          X[c+1, 4] = probchi2;
          print x[rowname=rname colname=cname format=12.3];
        %end;
      /* the branches for the other values of type= and the end of the macro are not shown in this segment */

CONCLUSION

This paper offers an implementation of the test of proportional hazards based on scaled Schoenfeld residuals. The implementation uses PROC IML and is embedded in a macro program. It offers both tests on individual predictors and a global test on all the variables of interest at once. It offers four different transformations of the time variable. The macro program can be downloaded following the link. For more examples on using this macro program, visit the textbook example page for Chapter 6 of Applied Survival Analysis created by the Statistical Consulting Group at UCLA.

REFERENCES

P. M. Grambsch and T. M. Therneau, Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81: 515-526, 1994.
D. W. Hosmer, Jr., S. Lemeshow and S. May, Applied Survival Analysis: Regression Modeling of Time to Event Data, 2nd Edition, 2008.
T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model, Springer-Verlag, New York, 2000.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Xiao Chen
Statistical Consulting Group
UCLA Academic Technology Services
5308 Math Sciences
Box 951557
Los Angeles, CA 90095
Work Phone: (310) 825-7431
Fax: (310) 206-7025
E-mail: xiao.chen@ucla.edu
Web: www.ats.ucla.edu/stat/

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.