# Survival Analysis And The Application Of Cox's Proportional Hazards Modeling Using SAS

Save this PDF as:

Size: px
Start display at page:

Download "Survival Analysis And The Application Of Cox's Proportional Hazards Modeling Using SAS"

## Transcription

1 Paper Survival Analysis And The Application Of Cox's Proportional Hazards Modeling Using SAS Tyler Smith, and Besa Smith, Department of Defense Center for Deployment Health Research, Naval Health Research Center, San Diego, CA Abstract In recent papers published in the American Journal of Epidemiology, the authors used Cox's proportional hazards regression modeling to model the time until an event of interest and compare the cumulative probability of hospitalization over time for two or more cohorts while adjusting for other influential covariates. In this presentation these statistical procedures will be looked at more closely by using SAS. The usefulness of the baseline option in PROC PHREG will be demonstrated with the creation and output of survival function estimates, which are a function of the cumulative probability estimates over time. This real data example using Cox modeling will show what increased risk for hospitalization for an event of interest might look like graphically and in a risk of event type ratio. Further analyses using the survivor function estimates will look for graphical representation of a temporal bias during the observation period. The SAS system's PROC PHREG with baseline option was instrumental in searching for possible temporal biases and dealing with attrition of subjects over our study period. Introduction to Survival Analysis The term "survival analysis" pertains to a statistical approach designed to take into account the amount of time an experimental unit contributes to a study. That is, it is the study of time between entry into observation and a subsequent event. Originally, the event of interest was death hence the term, "survival analysis." The analysis consisted of following the subject until death. The uses in the survival analysis of today vary quite a bit. Applications now include time until onset of disease, time until stockmarket crash, time until equipment failure, time until earthquake, and so on. The best way to define such events is simply to realize that these events are a transition from one discrete state to another at an instantaneous moment in time. Of course, the term "instantaneous", which may be years, months, days, minutes, or seconds, is relative and has only the boundaries set by the researcher. The History of Survival Analysis The origin of survival analysis goes back to mortality tables from centuries ago. However, it was not until World War II that a new era of survival analysis emerged. This new era was stimulated by interest in reliability (or failure time) of military equipment. At the end of the war these newly developed statistical methods emerging from strict mortality data research to failure time research, quickly spread through private industry as customers became more demanding of safer, more reliable products. As the uses of survival analysis grew, parametric models gave way to nonparametric and semiparametric approaches for their appeal in dealing with the ever-growing field of clinical trials in medical research. Survival analysis was well suited for such work because medical intervention follow-up studies could start without all experimental units enrolled at start of observation time and could end before all experimental units had experienced an event. This is extremely important because even in the best-developed studies, there will be subjects who choose to quit participating, who move too far away to follow, or who will die from some unrelated event. The researcher was no longer forced to withdraw the experimental unit and all associating data from the study, instead techniques called censoring enabled researchers to analyze incomplete data due to delayed entry or withdrawal from the study. This was important in allowing each experimental unit to contribute all of the information possible to the model for the amount of time the researcher was able to observe the unit. The last great strides in the application of survival analysis techniques has been a direct result of the

2 availability of software packages and high performance computers which are now able to run these difficult and computationally intensive algorithms relatively efficiently. Some Tools Used in Survival Analysis First, recall that time is continuous, which results in the probability of an event at a single point of a continuous distribution being zero. We are challenged to define the probability of these events over distribution. This is best described by graphing the distribution of event times. To ensure the readers will start with the same fundamental tools of survival analysis, a brief descriptive section of these important concepts will follow. A more detailed description of the probability density function, the cumulative distribution function, the hazard function, and the survivor function, can be found in any intermediate level statistical textbook. So that the reader will be able to look for certain relationships while reading, it is important to note before the brief descriptions the one-to-one relationship that these four functions possess. The pdf can be obtained by taking the derivative of the cdf and likewise, the cdf can be obtained by taking the integral of the pdf. The survivor function is simply 1 minus the cdf. Which leaves the hazard function as simply being the pdf over the survivor function. It will be these relationships later that will allow us to calculate the cdf from the survivor function estimates that the SAS procedure PROC PHREG will output. The Cumulative Distribution Function The cumulative distribution function (cdf) is very useful in describing the continuous probability distribution of a random variable, such as time, in a survival analysis. The cdf of a random variable T, denoted F T (t), is defined by F T (t) = P T (T t). This is interpreted as a function that will give the probability that the variable T will be less than or equal to any value t that we choose. Several properties of a distribution function F(t) can be listed as a consequence of the knowledge of probabilities. Because F(t) has the probability 0 F(t) 1, then F(t) is a nondecreasing function of t, and as t approaches ", F(t) approaches 1. The Probability Density Function The probability density function (pdf) is also very useful in describing the continuous probability distribution of a random variable. The pdf of a random variable T, denoted f T (t), is defined by f T (t) = d F T (t) / dt. That is, the pdf is the derivative or slope of the cdf. Every continuous random variable has its own density function, the probability P(a T b) is the area under the curve between times a and b. The Survival Function Let T 0 have a pdf f(t) and cdf F(t). Then the survival function takes on the following form: S(t) = P{T > t} = 1 - F(t) That is, the survival function gives the probability of surviving or being event-free beyond time t. Because S(t) is a probability, it is positive and ranges from 0 to 1. It is defined as S(0) = 1 and as t approaches ", S(t) approaches 0. The Kaplan- Meier estimator, or product limit estimator, is the estimator used by most software packages because of the simplistic step idea. The Kaplan-Meier estimator incorporates information from all of the observations available, both censored and uncensored, by considering any point in time as a series of steps defined by the observed survival and censored times. The survival curve describes the relationship between the probability of survival and time. The Hazard Function The hazard function h(t) is given by the following: h(t) = P{ t < T < (t + ) T >t} = f(t) / (1 - F(t)) = f(t) / S(t) The hazard function describes the concept of the risk of an outcome (e.g., death, failure, hospitalization) in an interval after time t, conditional on the subject having survived to time t. It is the probability that an individual dies somewhere between t and t +, divided by the probability that the individual survived beyond time t. The hazard function seems to be more intuitive to use in

3 survival analysis than the pdf because it attempts to quantify the instantaneous risk that an event will take place at time t given that the subject survived to time t. Incomplete Data Observation time has two components that must be carefully defined in the beginning of any survival analysis. There is a beginning point of the study where time=0 and a reason or cause for the observation of time to end. For example, in a complete observation cancer study, observation of survival time may begin on the day a subject is diagnosed with the cancer and end when that subject dies as a result of the cancer. This subject is what is called an uncensored subject, resulting from the event occurring within the time period of observation. Complete observation time data like this example are desired but not realistic in most studies. There is always a good possibility that the patient might recover completely or the patient might die due to an entirely unrelated cause. In other words, the study cannot go on indefinitely, waiting for an event from a participant and unforeseen things happen to study participants that make them unavailable for observation. The censoring of study participants therefore deals with the problems of incomplete observations of time due to assumed random factors not related to the study design. Note: This differs from truncation where observations of time are incomplete due to a selection process inherent to the study design. Left and Right Censoring The most common form of incomplete data is right censoring. This occurs when there is a defined time (t=0) where the observation of time is started for all subjects involved in the study. A right censored subject's time terminates before the outcome of interest is observed. For example, a subject could move out of town, die of an unexpected cause, or could simply choose not to participate in the study any longer. Right censoring techniques allow subjects to contribute to the model until they are no longer able to contribute (end of the study, or withdrawal), or they have an event. Conversely, an observation is left censored if the event of interest has already occurred when observation of time begins. For the purposes of this study we focused on right censoring. The following graph shows a simple study design where the observation times start at a consistent point in time (t=0). The X's represent events and the O's represent censored observations. Notice that all observations are classified with an event, or they are censored at time of separation or at the end of the study period. Some subjects have events early in the study period and others have events at the end of the study period. Likewise some subjects leave early, but most do not have an event during the entire study and are simply right censored at the end. There is no need for left censoring or truncation techniques in this simple example. Time Dependencies In some situations the researcher may find that the dynamical nature of a variable causes changes in value over the observation time. In other instances the researcher may find that certain trends effect the probability of the event of interest over time. There are easy ways to test and account for these temporal biases within PROC PHREG but be careful if you have a large number of observations as the computation of the subsequent partial likelihood is very taxing and time consuming. An easier way to see if there are existing temporal biases is to look at the plots of the cumulative distributions of the probability of event. If there is a steep increase or decrease in the cumulative probability, it may suggest more investigation is needed. It is important to note here that when a time dependent variable is introduced into the model, the ratios of the hazards will not remain steady. This only effects the model structure. We will still be doing a Cox regression but instead the model used is called the extended Cox model.

5 thereby not allowing a temporal bias to become an influential player on the endpoint. Use of the Partial Likelihood Function The Cox model has the flexibility to introduce timedependent explanatory variables and handle censoring of survival times due to its use of the partial likelihood function. This was important to our study in that any temporal biases due to differences in hospitalization practices for different strata of the significant covariates over the years of study needed to be handled correctly. This ensured that any differences in hospitalization experiences between the exposed and nonexposed would not be coming from these temporal differences. Survivor Function Estimates With the SAS option BASELINE, a SAS dataset containing survival function estimates can be created and output. These estimates correspond to the means of the explanatory variables for each stratum. Univariate Analyses Analysis Using PROC FREQ, and PROC UNIVARIATE, an initial univariate analysis of the demographic variables crossed with hospitalization experience was carried out to determine possible significant explanatory variables to be included in the model runs. All variables with a chi-square value or t statistic of.15 or less were considered possibly significant and were therefore retained for the model analysis. Additionally the distributions of attrition were checked to see if the cohorts separated from active duty military service equally. Modeling Approach Using PROC PHREG, a saturated Cox model was run after creating dummy variables, necessary for the output of hazard ratios for the categorical explanatory variables. A manual backward stepwise analysis was carried out to create a model with statistically significant effects of explanatory variables on survival times. Programming PROC PHREG DATA=ANALYDAT; MODEL INHOSP*CENSOR(0)= expose1 pwhsp status1 sex1 age1-age3 ms1 paygr1-paygr2 oc_cat1-oc_cat9 ccep /RL TIES=EFRON ; TITLE1 'Cox Regression With Exposure Status In the Model '; RUN; The options used in this survival analysis procedure are described below: DATA=ANALYDAT names the input data set for the survival analysis. RL requests for each explanatory variable, the 95% (the default alpha level because the ALPHA= option is not invoked) confidence limits for the hazard ratios. TIES=EFRON gives the researcher the approximations to the EXACT method without using the tremendous CPU it takes to run the EXACT method. Both the EFRON and the BRESLOW methods do reasonably well at approximating the EXACT when there are not a lot of ties. If there are a lot of ties, then the BRESLOW approximation of the EXACT will be very poor. If the time scale is not continuous and is therefore discrete, the option TIES=DISCRETE should be used. Stratification By Exposure Status These data were then stratified by exposure and the models were run with the exposure flag covariate withdrawn from the model. This allowed for inspection of interaction between exposure status and covariates. Running these separate models also allowed for the computation of survival function estimates using the BASELINE function in PROC PHREG. The survival curves (which are really step functions for such numerous events that they appear continuous) were now available to compute the cumulative distribution function for the separate cohorts. Time Dependent Covariates After the final model of significant explanatory variables was created, it was necessary to validate the proportional hazards assumption. If the researcher believes that there may be a time dependency from a certain variable then simply add

6 x1time to the list of independent variables and the following below the model statement. x1time=x1*(t) Where t is the time variable and x1 is the suspected time dependent variable. If the interaction term is found to be insignificant we can conclude that the proportional hazards assumption holds. This is necessary to ensure that there was no adverse effect from time-dependent covariates creating different rates for different subjects, thus making the ratios of their hazards nonconstant. Survivor Function Estimates By Exposure whether or not the proportional hazards assumption had been violated. The Plots Figure 1 is what the cumulative distribution function would look like if there were a violation of the proportional hazards assumption. Note the sharp increase in probability of hospitalization beginning right before the third year and lasting for approximately 1 year. After this one year period the top curve then levels off and becomes parallel again with the bottom curve. Figure 1. The following is the code used after the ANALYDAT was stratified into exposed or nonexposed. This produces the survivor function estimates by exposure while simultaneously checking to see if there were any interactions between the covariates and the exposure status. PROC PHREG DATA=EXPOSE1; MODEL INHOSP*CENSOR(0)=pwhsp status1 sex1 age1-age3 ms1 paygr1-paygr2 oc_cat1- oc_cat9 ccep /RL TIES=EFRON ; BASELINE OUT=SURVS SURVIVAL=S; RUN; The new options used in this survival analysis procedure are described below: BASELINE without the COVARIATES= option produces the survivor function estimates corresponding to the means of the explanatory variables for each stratum. OUT=SURVS names the data set output by the BASELINE option. Figure 2 shows what the cumulative distribution function would look like if there were no violation of the assumption of proportional hazards but there did happen to be an observed significant difference in the disease experience between the two cohorts over the length of the study period. Figure 2. SURVIVAL=S tells SAS to produce the survivor function estimates in the output data set. A simple calculation of 1-Survivor function estimates in SURVS, obtained from running the BASELINE option, produced the cumulative distribution functions. We could now see the cumulative probability estimates of hospitalization over time. We were then able to visually scan for differences in hospitalization experiences of between the cohorts and look for insight as to

7 Figure 3 shows what the cumulative distribution function would look like if there were no problem with the assumption of proportional hazards. The figure also shows what the curves would look like it there were not a significant difference observed between the diagnosis experience of the two cohorts. Figure 3. Probability of Hospitalization Years from March 10, 1991 Computing the Generalized R 2 Recently I was asked whether SAS computed the R 2 value and what it was for that particular model. If the researcher desires, the R 2 value can be computed easily from the output of your regression, although it is not an option of PROC PHREG. Simply compute R 2 = 1 - exp(lr 2 /n) Where LR is the Likelihood-ratio chi-square statistic for testing the null hypothesis that all variables included in the model have coefficients of 0, and n is the number of observations. The researcher needs to take extreme caution when comparing the R 2 values of Cox regression models. Remember from linear regression analysis, R 2 can be artificially increased by simply adding explanatory variables to the regression model (ie; more variables does not equal a better model necessarily). Also, the above computation does not give you the proportion of variance of the dependent variable explained by the independent variables as it would in linear regression, but does give you a measure of how associated the independent variables are with the dependent variable Residual Analysis A residual analysis is very important especially if the sample size is relatively small. Add the following after your model statement to output the martingale and deviance residuals: BASELINE OUT=SURVS SURVIVAL=S XBETA=XBET RESMART=MARTING RESDEV=RDEV; Then a simple plot of the residuals against the linear predictor scores will give the researcher an idea of the fit or lack of fit of the model to individual observations. PROC GPLOT DATA=SURVS; PLOT (MARTING RDEV) * XBET / VREF=0; SYMBOL1 VALUE=CIRCLE; Results Using the initial univariate comparisons for events occurring during the study, the following variables were selected for the subsequent model analyses: gender, age group, marital status, race/ethnicity, military occupational category, military pay grade, salary, service branch, pre-exposure period hospitalization, and exposure status. Home of record was not shown to be significantly affecting the endpoints in any either of the studies' models and was dropped. Salary and length of service were dropped from analyses due to colinearity with age. The subjects in cohort 1 had similar risks for two of the three diseases during the August 1, 1991, to July 31, 1997 study period compared with subjects in cohort 2. The corresponding cumulative probability plots (above) were nearly parallel for the follow-up period. However, the Cox model did reveal some consistently better predictors of hospitalization with the two diseases, which included female gender, preexposure period hospitalization, enlisted pay grade, and US Reserve service type. The subjects in cohort 1 had significantly different risks for the third disease during the August 1, 1991, to July 31, 1997 study period compared with subjects in cohort 2. The corresponding cumulative probability plots (above) were nearly parallel for the first three years of follow-up, then there was a drastic increase in hospitalization for a period of about 1 year and then once again the curves became nearly parallel again. Time-dependent

8 variables included in the modeling also confirmed this result. The hazards ratio was not significantly greater than 1 for the first 3 years of follow-up and the last 3 years as well. However, during the 1 year in question, though, the risk of hospitalization with that particular disease was almost 3 times that of cohort 1. Further investigation found that certain treatment facilities had adopted an approach of administratively hospitalizing these subjects for extensive clinical evaluations. This approach was later dropped almost 1 year to the date of the start. Conclusions The Cox proportional hazard model's robust nature allows us to closely approximate the results for the correct parametric model when the parametric is unknown or in question. Using the SAS system procedure PROC PHREG, Cox's proportional hazards modeling was used to compare the hospitalization experiences of two or more cohorts. The two studies produced models suggesting no increase in risk among the exposed which were later confirmed by producing the cumulative distribution function. The study also found at least one model which initially suggested an increase in risk among the exposed. Further analysis revealed that this sharp increase in risk for approximately 1 year was likely due to an outside factor affecting the process of hospitalizing personnel for this particular disease event. The SAS system's PROC PHREG with censoring and the baseline option is a powerful tool for handling early departure of subjects during the study period. It is also useful for producing data sets, including survival function estimates, which can be used in a simple equation to produce estimates of probability of events. When graphed, these show cumulative probability of event curves as a function of time. If it were not for the graphs of the cumulative distribution functions, which showed a sharp temporal bias, results may have been reported and interpreted with much different results. References Hosmer JR. DW, Lemeshow S. Applied Survival Analysis; Regression Modeling of Time to Event Data. New York: John Wiley & Sons; 1999 Kleinbaum DG, Survival Analysis: A self-learning Text. New York: Springer-Verlag; 1996 SAS Institute Inc., SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 1, Cary, NC: SAS Institute Inc., pp. SAS Institute Inc., SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 2, Cary, NC: SAS Institute Inc., pp. SAS Institute Inc. SAS/STAT Software: Changes and Enhancements through Release Cary, NC: SAS Institute Inc., pp. Allison, Paul D., Survival Analysis Using the SAS system: A Practical Guide, Cary, NC: SAS Institute Inc., pp. Smith TC, Gray GC, Knoke JD. Is systemic lupus erythematosus, amyotrophic lateral sclerosis, or fibromyalgia associated with Persian Gulf War service? An examination of Department of Defense hospitalization. Amer J of Epidemiol; June 1, 2000, Vol 151. Gray GC, Smith TC, Knoke JD, Heller JM. The postwar hospitalization experience among Gulf War veterans exposed to chemical munitions destruction at Khamisiyah, Iraq. Amer J of Epidemiol; September 1, Volume 150 No 5. Acknowledgments Thank you to Navy CAPT Greg Gray, Director of the Department of Defense Center for Deployment Health Research at the Naval Health Research Center, San Diego. CAPT Gray's support and encouragement of learning new concepts and devising better ways to tackle statistical problems in pursuit of the best possible research answers. Approved for public release: distribution unlimited. This research was supported by the Department of Defense, Health Affairs, under work unit no SAS software is a registered trademark of SAS Institute, Inc. in the USA and other countries. About The Authors Besa Smith has used SAS for 4 years including work as an undergraduate in Biology and Chemistry at CSU, Chico, a graduate at the SDSU Graduate School of Public Health in Biostatistics, and currently as a biostatistician with the DoD Center for Deployment Health Research at NHRC.

9 Responsibilities include management of large military and demographic data, mathematical modeling and statistical analysis. She has been an invited presenter for WUSS 2000, and the San Diego Area's SAS users Group fall 2000 meeting. Besa Smith, MPH Biostatistician, Henry Jackson Foundation Department of Defense Center for Deployment Health Research, at the Naval Health Research Center, San Diego (619) Tyler Smith has used SAS for 9 years, including work as a student in Math and Statistics, as a graduate student at the University of Kentucky Department of Statistics, and currently as a senior statistician and data analyst with the Naval Health Research Center. His responsibilities include mathematical modeling, analysis, management, and documentation of large hospitalization and demographic data sets. He has been invited to speak at the International Biometrics Society meetings, WUSS 99, SUGI 2000, WUSS 2000, and the San Diego SAS users group 1999 and 2000 fall meetings. Tyler C. Smith, MS Senior Statistician, Henry Jackson Foundation Department of Defense Center for Deployment Health Research, at the Naval Health Research Center, San Diego (619)

### Survival Analysis Using Cox Proportional Hazards Modeling For Single And Multiple Event Time Data

Survival Analysis Using Cox Proportional Hazards Modeling For Single And Multiple Event Time Data Tyler Smith, MS; Besa Smith, MPH; and Margaret AK Ryan, MD, MPH Department of Defense Center for Deployment

### Survival Analysis Using SPSS. By Hui Bian Office for Faculty Excellence

Survival Analysis Using SPSS By Hui Bian Office for Faculty Excellence Survival analysis What is survival analysis Event history analysis Time series analysis When use survival analysis Research interest

### Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

Tips for surviving the analysis of survival data Philip Twumasi-Ankrah, PhD Big picture In medical research and many other areas of research, we often confront continuous, ordinal or dichotomous outcomes

### Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry

Paper 12028 Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Junxiang Lu, Ph.D. Overland Park, Kansas ABSTRACT Increasingly, companies are viewing

### Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT

### Introduction. Survival Analysis. Censoring. Plan of Talk

Survival Analysis Mark Lunt Arthritis Research UK Centre for Excellence in Epidemiology University of Manchester 01/12/2015 Survival Analysis is concerned with the length of time before an event occurs.

### SAS and R calculations for cause specific hazard ratios in a competing risks analysis with time dependent covariates

SAS and R calculations for cause specific hazard ratios in a competing risks analysis with time dependent covariates Martin Wolkewitz, Ralf Peter Vonberg, Hajo Grundmann, Jan Beyersmann, Petra Gastmeier,

### Predicting Customer Default Times using Survival Analysis Methods in SAS

Predicting Customer Default Times using Survival Analysis Methods in SAS Bart Baesens Bart.Baesens@econ.kuleuven.ac.be Overview The credit scoring survival analysis problem Statistical methods for Survival

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

### Competing-risks regression

Competing-risks regression Roberto G. Gutierrez Director of Statistics StataCorp LP Stata Conference Boston 2010 R. Gutierrez (StataCorp) Competing-risks regression July 15-16, 2010 1 / 26 Outline 1. Overview

### MATERIALS AND METHODS Population. Tyler C. Smith, Gregory C. Gray, and James D. Knoke

American Journal of Epidemiology Copyright O 2000 by The Johns Hopkins University School of Hygiene and Public Health All rights reserved Vol. 151, No. 11 Printed/n USA. Is Systemic Lupus Erythematosus,

### Survival Analysis of Dental Implants. Abstracts

Survival Analysis of Dental Implants Andrew Kai-Ming Kwan 1,4, Dr. Fu Lee Wang 2, and Dr. Tak-Kun Chow 3 1 Census and Statistics Department, Hong Kong, China 2 Caritas Institute of Higher Education, Hong

### The Cox Proportional Hazards Model

The Cox Proportional Hazards Model Mario Chen, PhD Advanced Biostatistics and RCT Workshop Office of AIDS Research, NIH ICSSC, FHI Goa, India, September 2009 1 The Model h i (t)=h 0 (t)exp(z i ), Z i =

### Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)

### X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

### Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1

Paper 11682-2016 Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1 Raja Rajeswari Veggalam, Akansha Gupta; SAS and OSU Data Mining Certificate

### Multivariate Logistic Regression

1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

### SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

### ˆx wi k = i ) x jk e x j β. c i. c i {x ik x wi k}, (1)

Residuals Schoenfeld Residuals Assume p covariates and n independent observations of time, covariates, and censoring, which are represented as (t i, x i, c i ), where i = 1, 2,..., n, and c i = 1 for uncensored

### Lecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods

Lecture 2 ESTIMATING THE SURVIVAL FUNCTION One-sample nonparametric methods There are commonly three methods for estimating a survivorship function S(t) = P (T > t) without resorting to parametric models:

### Raul Cruz-Cano, HLTH653 Spring 2013

Multilevel Modeling-Logistic Schedule 3/18/2013 = Spring Break 3/25/2013 = Longitudinal Analysis 4/1/2013 = Midterm (Exercises 1-5, not Longitudinal) Introduction Just as with linear regression, logistic

### Developing Business Failure Prediction Models Using SAS Software Oki Kim, Statistical Analytics

Paper SD-004 Developing Business Failure Prediction Models Using SAS Software Oki Kim, Statistical Analytics ABSTRACT The credit crisis of 2008 has changed the climate in the investment and finance industry.

### Estimating survival functions has interested statisticians for numerous years.

ZHAO, GUOLIN, M.A. Nonparametric and Parametric Survival Analysis of Censored Data with Possible Violation of Method Assumptions. (2008) Directed by Dr. Kirsten Doehler. 55pp. Estimating survival functions

Chapter 8 Hypothesis Tests Chapter Table of Contents Introduction...157 One-Sample t-test...158 Paired t-test...164 Two-Sample Test for Proportions...169 Two-Sample Test for Variances...172 Discussion

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Survival Analysis Approaches and New Developments using SAS. Jane Lu, AstraZeneca Pharmaceuticals, Wilmington, DE David Shen, Independent Consultant

PharmaSUG 2014 - Paper PO02 Survival Analysis Approaches and New Developments using SAS Jane Lu, AstraZeneca Pharmaceuticals, Wilmington, DE David Shen, Independent Consultant ABSTRACT A common feature

### Size of a study. Chapter 15

Size of a study 15.1 Introduction It is important to ensure at the design stage that the proposed number of subjects to be recruited into any study will be appropriate to answer the main objective(s) of

### Introduction to Event History Analysis DUSTIN BROWN POPULATION RESEARCH CENTER

Introduction to Event History Analysis DUSTIN BROWN POPULATION RESEARCH CENTER Objectives Introduce event history analysis Describe some common survival (hazard) distributions Introduce some useful Stata

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### STATISTICAL ANALYSIS OF SAFETY DATA IN LONG-TERM CLINICAL TRIALS

STATISTICAL ANALYSIS OF SAFETY DATA IN LONG-TERM CLINICAL TRIALS Tailiang Xie, Ping Zhao and Joel Waksman, Wyeth Consumer Healthcare Five Giralda Farms, Madison, NJ 794 KEY WORDS: Safety Data, Adverse

### Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

### Statistics Graduate Courses

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique

### Cancer Survival Analysis

Cancer Survival Analysis 2 Analysis of cancer survival data and related outcomes is necessary to assess cancer treatment programs and to monitor the progress of regional and national cancer control programs.

### A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

### II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

### LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

### A Tutorial on Logistic Regression

A Tutorial on Logistic Regression Ying So, SAS Institute Inc., Cary, NC ABSTRACT Many procedures in SAS/STAT can be used to perform logistic regression analysis: CATMOD, GENMOD,LOGISTIC, and PROBIT. Each

### Gamma Distribution Fitting

Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

### Univariable survival analysis S9. Michael Hauptmann Netherlands Cancer Institute Amsterdam, The Netherlands

Univariable survival analysis S9 Michael Hauptmann Netherlands Cancer Institute Amsterdam, The Netherlands m.hauptmann@nki.nl 1 Survival analysis Cohort studies (incl. clinical trials & patient series)

### Yiming Peng, Department of Statistics. February 12, 2013

Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop

### Kaplan-Meier Survival Analysis 1

Version 4.0 Step-by-Step Examples Kaplan-Meier Survival Analysis 1 With some experiments, the outcome is a survival time, and you want to compare the survival of two or more groups. Survival curves show,

### Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Chapter 730 Tests for Two Survival Curves Using Cox s Proportional Hazards Model Introduction A clinical trial is often employed to test the equality of survival distributions of two treatment groups.

### SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

### Distance to Event vs. Propensity of Event A Survival Analysis vs. Logistic Regression Approach

Distance to Event vs. Propensity of Event A Survival Analysis vs. Logistic Regression Approach Abhijit Kanjilal Fractal Analytics Ltd. Abstract: In the analytics industry today, logistic regression is

### 7.1 The Hazard and Survival Functions

Chapter 7 Survival Models Our final chapter concerns models for the analysis of data which have three main characteristics: (1) the dependent variable or response is the waiting time until the occurrence

### AP Statistics 1998 Scoring Guidelines

AP Statistics 1998 Scoring Guidelines These materials are intended for non-commercial use by AP teachers for course and exam preparation; permission for any other use must be sought from the Advanced Placement

### Prognostic Outcome Studies

Prognostic Outcome Studies Edwin Chan PhD Singapore Clinical Research Institute Singapore Branch, Australasian Cochrane Centre Duke-NUS Graduate Medical School Contents Objectives Incidence studies (Prognosis)

### Problem of Missing Data

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

### Primary Prevention of Cardiovascular Disease with a Mediterranean diet

Primary Prevention of Cardiovascular Disease with a Mediterranean diet Alejandro Vicente Carrillo, Brynja Ingadottir, Anne Fältström, Evelyn Lundin, Micaela Tjäderborn GROUP 2 Background The traditional

### Binary Logistic Regression

Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

### Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

### Regression Modeling Strategies

Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

### How to Conduct a Hypothesis Test

How to Conduct a Hypothesis Test The idea of hypothesis testing is relatively straightforward. In various studies we observe certain events. We must ask, is the event due to chance alone, or is there some

### Introduction to survival analysis

Chapter Introduction to survival analysis. Introduction In this chapter we examine another method used in the analysis of intervention trials, cohort studies and data routinely collected by hospital and

### Dealing with confounding in the analysis

Chapter 14 Dealing with confounding in the analysis In the previous chapter we discussed briefly how confounding could be dealt with at both the design stage of a study and during the analysis of the results.

### AP Statistics 2001 Solutions and Scoring Guidelines

AP Statistics 2001 Solutions and Scoring Guidelines The materials included in these files are intended for non-commercial use by AP teachers for course and exam preparation; permission for any other use

Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

### Using SAS Proc Mixed for the Analysis of Clustered Longitudinal Data

Using SAS Proc Mixed for the Analysis of Clustered Longitudinal Data Kathy Welch Center for Statistical Consultation and Research The University of Michigan 1 Background ProcMixed can be used to fit Linear

### Simple Linear Regression Chapter 11

Simple Linear Regression Chapter 11 Rationale Frequently decision-making situations require modeling of relationships among business variables. For instance, the amount of sale of a product may be related

### Measures of disease frequency

Measures of disease frequency Madhukar Pai, MD, PhD McGill University, Montreal Email: madhukar.pai@mcgill.ca 1 Overview Big picture Measures of Disease frequency Measures of Association (i.e. Effect)

### 13. Poisson Regression Analysis

136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

### Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY ABSTRACT PROC FREQ is an essential procedure within BASE

### Study Design and Statistical Analysis

Study Design and Statistical Analysis Anny H Xiang, PhD Department of Preventive Medicine University of Southern California Outline Designing Clinical Research Studies Statistical Data Analysis Designing

### Lecture 15 Introduction to Survival Analysis

Lecture 15 Introduction to Survival Analysis BIOST 515 February 26, 2004 BIOST 515, Lecture 15 Background In logistic regression, we were interested in studying how risk factors were associated with presence

### CHAPTER 6 EXAMPLES: GROWTH MODELING AND SURVIVAL ANALYSIS

Examples: Growth Modeling And Survival Analysis CHAPTER 6 EXAMPLES: GROWTH MODELING AND SURVIVAL ANALYSIS Growth models examine the development of individuals on one or more outcome variables over time.

### Fairfield Public Schools

Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

### Simple Sensitivity Analyses for Matched Samples. CRSP 500 Spring 2008 Thomas E. Love, Ph. D., Instructor. Goal of a Formal Sensitivity Analysis

Goal of a Formal Sensitivity Analysis To replace a general qualitative statement that applies in all observational studies the association we observe between treatment and outcome does not imply causation

### Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.) Logistic regression generalizes methods for 2-way tables Adds capability studying several predictors, but Limited to

### Measures of Prognosis. Sukon Kanchanaraksa, PhD Johns Hopkins University

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

### Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance

### 200609 - ATV - Lifetime Data Analysis

Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 2015 200 - FME - School of Mathematics and Statistics 715 - EIO - Department of Statistics and Operations Research 1004 - UB - (ENG)Universitat

### Some Issues in Using PROC LOGISTIC for Binary Logistic Regression

Some Issues in Using PROC LOGISTIC for Binary Logistic Regression by David C. Schlotzhauer Contents Abstract 1. The Effect of Response Level Ordering on Parameter Estimate Interpretation 2. Odds Ratios

### CHILDHOOD CANCER SURVIVOR STUDY Analysis Concept Proposal

CHILDHOOD CANCER SURVIVOR STUDY Analysis Concept Proposal 1. STUDY TITLE: Longitudinal Assessment of Chronic Health Conditions: The Aging of Childhood Cancer Survivors 2. WORKING GROUP AND INVESTIGATORS:

Goals of This Course Be able to understand a study design (very basic concept) Be able to understand statistical concepts in a medical paper Be able to perform a data analysis Understanding: PECO study

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations

### INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of

### Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

### INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

### UNDERGRADUATE DEGREE DETAILS : BACHELOR OF SCIENCE WITH

QATAR UNIVERSITY COLLEGE OF ARTS & SCIENCES Department of Mathematics, Statistics, & Physics UNDERGRADUATE DEGREE DETAILS : Program Requirements and Descriptions BACHELOR OF SCIENCE WITH A MAJOR IN STATISTICS

### Life Tables. Marie Diener-West, PhD Sukon Kanchanaraksa, PhD

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

### %DISCIT Macro: Pre-screening continuous variables for subsequent binary logistic regression analysis through visualization

ABSTRACT Paper 92 %DISCIT Macro: Pre-screening continuous variables for subsequent binary logistic regression analysis through visualization Mohamed S. Anany, Videa (Cox Media Group subsidiary) In binary

### MEASURES OF LOCATION AND SPREAD

Paper TU04 An Overview of Non-parametric Tests in SAS : When, Why, and How Paul A. Pappas and Venita DePuy Durham, North Carolina, USA ABSTRACT Most commonly used statistical procedures are based on the

### Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

### IBM SPSS Complex Samples 22

IBM SPSS Complex Samples 22 Note Before using this information and the product it supports, read the information in Notices on page 51. Product Information This edition applies to version 22, release 0,

### Sample Size Calculation and Power Analysis for Design of Experiments Using Proc GLMPOWER Chii-Dean Joey Lin, SDSU, San Diego, CA

Sample Size Calculation and Power Analysis for Design of Experiments Using Proc GLMPOWER Chii-Dean Joey Lin, SDSU, San Diego, CA ABSTRACT How many samples do I need? When a statistician or a quantitative

### Permutation Tests for Comparing Two Populations

Permutation Tests for Comparing Two Populations Ferry Butar Butar, Ph.D. Jae-Wan Park Abstract Permutation tests for comparing two populations could be widely used in practice because of flexibility of

### An Application of the G-formula to Asbestos and Lung Cancer. Stephen R. Cole. Epidemiology, UNC Chapel Hill. Slides: www.unc.

An Application of the G-formula to Asbestos and Lung Cancer Stephen R. Cole Epidemiology, UNC Chapel Hill Slides: www.unc.edu/~colesr/ 1 Acknowledgements Collaboration with David B. Richardson, Haitao

### Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

### Title: Proton Pump Inhibitors and the risk of pneumonia: a comparison of cohort and self-controlled case series designs

Author's response to reviews Authors: Emmae Ramsay (emmae.ramsay@adelaide.edu.au) Nicole Pratt (nicole.pratt@unisa.edu.au) Philip Ryan (philip.ryan@adelaide.edu.au) Elizabeth Roughead (libby.roughead@unisa.edu.au)

### Data Analysis, Research Study Design and the IRB

Minding the p-values p and Quartiles: Data Analysis, Research Study Design and the IRB Don Allensworth-Davies, MSc Research Manager, Data Coordinating Center Boston University School of Public Health IRB

### SPSS: Descriptive and Inferential Statistics. For Windows

For Windows August 2012 Table of Contents Section 1: Summarizing Data...3 1.1 Descriptive Statistics...3 Section 2: Inferential Statistics... 10 2.1 Chi-Square Test... 10 2.2 T tests... 11 2.3 Correlation...

### HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

### Learning Objective. Simple Hypothesis Testing

Exploring Confidence Intervals and Simple Hypothesis Testing 35 A confidence interval describes the amount of uncertainty associated with a sample estimate of a population parameter. 95% confidence level

### , then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we