# For usage of the functions, it is necessary to install the "survival" and the "penalized" package.



Similar documents
Linda Staub & Alexandros Gekenidis

# load in the files containing the methyaltion data and the source # code containing the SSRPMM functions

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Exploratory Data Analysis

Package survpresmooth

Statistical Models in R

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Graphics in R. Biostatistics 615/815

M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

STATISTICA Formula Guide: Logistic Regression. Table of Contents

SUMAN DUVVURU STAT 567 PROJECT REPORT

Package smoothhr. November 9, 2015

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

An Application of Weibull Analysis to Determine Failure Rates in Automotive Components

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Vignette for survrm2 package: Comparing two survival curves using the restricted mean survival time

Introduction. Survival Analysis. Censoring. Plan of Talk

Package bigdata. R topics documented: February 19, 2015

Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Stat 5303 (Oehlert): Tukey One Degree of Freedom 1

Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16

Journal of Statistical Software

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Introduction of geospatial data visualization and geographically weighted reg

Checking proportionality for Cox s regression model

Survival analysis methods in Insurance Applications in car insurance contracts


Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Notating the Multilevel Longitudinal Model. Multilevel Modeling of Longitudinal Data. Notating (cont.) Notating (cont.)

Computational Assignment 4: Discriminant Analysis

Statistical issues in the analysis of microarray data

Time Series Analysis AMS 316

Lasso on Categorical Data

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

BIOL 933 Lab 6 Fall Data Transformation

Multivariate Logistic Regression

A Brief Introduction to SPSS Factor Analysis

Lecture 6: Poisson regression

Getting started with qplot

Personalized Predictive Medicine and Genomic Clinical Trials

Cluster Analysis using R

Maximally Selected Rank Statistics in R

Viewing Ecological data using R graphics

Principal Component Analysis

R Graphics II: Graphics for Exploratory Data Analysis

Package SurvCorr. February 26, 2015

Chapter 6: Multivariate Cointegration Analysis

Package tagcloud. R topics documented: July 3, 2015

ISyE 2028 Basic Statistical Methods - Fall 2015 Bonus Project: Big Data Analytics Final Report: Time spent on social media

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

Package trimtrees. February 20, 2015

Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

A new score predicting the survival of patients with spinal cord compression from myeloma

Statistical Models in R

Generalized Linear Models

Study Design. Date: March 11, 2003 Reviewer: Jawahar Tiwari, Ph.D. Ellis Unger, M.D. Ghanshyam Gupta, Ph.D. Chief, Therapeutics Evaluation Branch

Nonparametric statistics and model selection

Logistic Regression (1/24/13)

Regression and Programming in R. Anja Bråthen Kristoffersen Biomedical Research Group

Time Series Database Interface: R MySQL (TSMySQL)

Statistical Machine Learning

2 intervals-package. Index 33. Tools for working with points and intervals

Predicting Market Value of Soccer Players Using Linear Modeling Techniques

SAS Syntax and Output for Data Manipulation:

Package copa. R topics documented: August 9, 2016

Easily Identify Your Best Customers

Package plsgenomics. May 11, 2015

Linear Discriminant Analysis

We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Regression Modeling Strategies

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

PAM Prediction Analysis of Microarrays Users guide and manual

Chapter 15. Mixed Models Overview. A flexible approach to correlated data.

An Introduction to Machine Learning

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Modelling and added value

Survival Analysis: An Introduction

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

Regression III: Advanced Methods

Practical Differential Gene Expression. Introduction

Psychology 205: Research Methods in Psychology

Package GSA. R topics documented: February 19, 2015

Lecture 3: Linear methods for classification

Package TSfame. February 15, 2013

!!!!!!!!!!!!!!!!!!!!!!!!!!

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

Time Series Analysis with R - Part I. Walter Zucchini, Oleg Nenadić

Basic Statistical and Modeling Procedures Using SAS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

Sun Li Centre for Academic Computing

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

Lecture 19: Conditional Logistic Regression

Introduction to Matrix Algebra

Collections MAX Screen Pop Web Service

Transcription:

###################################################################### ### R-script for the manuscript ### ### ### ### Survival models with preclustered ### ### gene groups as covariates ### ### ### ### ### ###################################################################### # The following R-Code consists of four model selection and evaluation functions (L1, L2, univ., forw.). # In each function the evaluation procedure (see Section 3) is implemeted. # The input for each function is a data matrix "x" (rows: samples, columns: genes/go groups), # a vector "time" containing the survival times and a vector "status" containing the non-censoring indicators of all patients. # For usage of the functions, it is necessary to install the "survival" and the "penalized" package. ################################################ ########## function for L1 regression ########## ################################################ L1.reg <- function(x, time, status){ list.sel.l1 <<- list(null) # save all selected covariates num.l1 <<- NULL # save the number of chosen covariates lambda.l1 <<- NULL # save the optimal tuning parameter lambda p.median.l1 <<- NULL # save the p-value obtained from logrank test p.pi.l1 <<- NULL # save the p-value obtained from likelihood ratio test for the prognostic index count <- 0 for (j in 1:1000){ # in case of matrix failure caused by the profiling and/or optimation procedure (penalized package) if(!count == 100){ print(count) id <- sample(dim(x)[1], round(2/3*dim(x)[1])) # get IDs for training set (2/3 of all patients) ### training data x.train <- as.data.frame(x[id, ]) # training data matrix time.train <- matrix(time[id]) # training data survival times status.train <- matrix(status[id]) # training data non-censoring indicator S.train <- Surv(time.train, status.train) # survival object for training data ### test data x.test <- as.data.frame(x[-id, ]) # test data matrix time.test <- matrix(time[-id]) # test data survival times status.test <- matrix(status[-id]) # test data non-censoring indicator S.test <- Surv(time.test, status.test) # survival object for test data Seite 1

# number of folds for cross-validation (cv) k <- 10 ########## L1 model selection # calculate the profile for the cross-validated log likelihood (cvl) # set minlambda1 in case of matrix inversion failure (approx. between 3 and 5) prof.l1 <- try(profl1(s.train, x.train, minlambda1 = 5, fold = k)) if(!inherits(prof.l1, "try-error")){ # profile plot for the cvl plot(prof.l1$lambda, prof.l1$cvl, type = "l") # first approximation for optimal lambda lambda.opt <- prof.l1$lambda[which.max(prof.l1$cvl)] print(lambda.opt) print(max(prof.l1$cvl)) # define the range for the optimation procedure steplength <- prof.l1$lambda[2]-prof.l1$lambda[1] # find the optimal tuning parameter via cvl opt.l1 <- try(optl1(s.train, x.train, fold = prof.l1$fold, minlambda1 = lambda.opt-2*steplength, maxlambda1 = lambda.opt+2*steplength)) if(!inherits(opt.l1, "try-error")){ # optimal lambda1 lambda.l1 <- opt.l1$lambda # selected covariates sel.l1 <- names(coefficients(opt.l1$fullfit)) if(!is.null(sel.l1)){ + 1 # number of selected covariates num.l1 <- length(sel.l1) # penalized object pen.l1 <- penalized(s.train, x.train, lambda1 = opt.l1$lambda) ########## evaluation for L1 # risk score rs.l1 <- as.matrix(x.test[, sel.l1]) %*% as.matrix(coefficients(pen.l1)) # median time good.prog.l1 <- (rs.l1 < median(rs.l1)) fit.l1 <- survfit(s.test ~ good.prog.l1) # kaplan-meier curves (for visualization) Seite 2

plot(fit.l1, lwd = 2, lty = c(1,1), col = c("red","blue"), xlab = 'Time (years)', ylab = 'Estimated Survival Function') legend(60, 0.8, legend=c('pi > median', 'PI < median'), lty = c(1,1), col = c("red", "blue"), lwd = 2) title("kaplan-meier Curves") # logrank test and p-value logrank.l1 <- survdiff(s.test ~ good.prog.l1, data = x.test, rho = 0) p.median.l1 <- 1-pchisq(logrank.L1$chisq, 1) # prognostic index (risk score) and p-value pi.l1 <- coxph(s.test ~ rs.l1) p.pi.l1 <- 1-pchisq(pi.L1$score, 1) else { sel.l1 <- NA num.l1 <- NA lambda.l1 <- NA p.median.l1 <- NA p.pi.l1 <- NA else { sel.l1 <- NA num.l1 <- NA lambda.l1 <- NA p.median.l1 <- NA p.pi.l1 <- NA else { sel.l1 <- NA num.l1 <- NA lambda.l1 <- NA p.median.l1 <- NA p.pi.l1 <- NA # evaluation values list.sel.l1[[j]] <<- sel.l1 num.l1[j] <<- num.l1 lambda.l1[j] <<- lambda.l1 p.median.l1[j] <<- p.median.l1 p.pi.l1[j] <<- p.pi.l1 ################################################ ########## function for L2 regression ########## ################################################ Seite 3

L2.reg <- function(x, time, status){ lambda.l2 <<- NULL p.median.l2 <<- NULL p.pi.l2 <<- NULL count <- 0 for (j in 1:1000){ if(!count == 100){ print(count) id <- sample(dim(x)[1], round(2/3*dim(x)[1])) ### training data x.train <- as.data.frame(x[id, ]) time.train <- matrix(time[id]) status.train <- matrix(status[id]) S.train <- Surv(time.train, status.train) ### test data x.test <- as.data.frame(x[-id, ]) time.test <- matrix(time[-id]) status.test <- matrix(status[-id]) S.test <- Surv(time.test, status.test) # number of folds for cv k <- 10 ########## L2 model selection # find the optimal tuning parameter via cvl opt.l2 <- try(optl2(s.train, as.data.frame(x.train), fold = k)) if(!inherits(opt.l2, "try-error")){ lambda.l2 <- opt.l2$lambda pen.l2 <- try(penalized(s.train, x.train, lambda2 = opt.l2$lambda)) if(!inherits(pen.l2, "try-error")){ + 1 ########### evaluation for L2 # risk score rs.l2 <- as.matrix(x.test) %*% as.matrix(coefficients(pen.l2)) # median time good.prog.l2 <- (rs.l2 < median(rs.l2)) fit.l2 <- survfit(s.test ~ good.prog.l2) # kaplan-meier curves plot(fit.l2, lwd = 2, lty = c(1,1), col = c("red","blue"), xlab = 'Time (years)', ylab = 'Estimated Survival Function') legend(60, 0.8, legend=c('pi > median', 'PI < median'), lty = c(1,1), col = c("red", "blue"), lwd = 2) Seite 4

title("kaplan-meier Curves") # logrank test logrank.l2 <- survdiff(s.test ~ good.prog.l2, data = x.test, rho = 0) p.median.l2 <- 1-pchisq(logrank.L2$chisq, 1) # prognostic index (risk score) pi.l2 <- coxph(s.test ~ rs.l2) p.pi.l2 <- 1-pchisq(pi.L2$score, 1) else { lambda.l2 <- NA p.median.l2 <- NA p.pi.l2 <- NA else { lambda.l2 <- NA p.median.l2 <- NA p.pi.l2 <- NA # evaluation values lambda.l2[j] <<- lambda.l2 p.median.l2[j] <<- p.median.l2 p.pi.l2[j] <<- p.pi.l2 ####################################################### ########## function for univariate selection ########## ####################################################### uni.reg <- function(x, time, status){ list.sel.uni <<- list(null) num.uni <<- NULL lambda.uni <<- NULL p.median.uni <<- NULL p.pi.uni <<- NULL count <- 0 for (j in 1:1000){ if(!count == 100){ print(count) id <- sample(dim(x)[1], round(2/3*dim(x)[1])) ### training data Seite 5

x.train <- as.data.frame(x[id, ]) time.train <- matrix(time[id]) status.train <- matrix(status[id]) S.train <- Surv(time.train, status.train) ### test data x.test <- as.data.frame(x[-id, ]) time.test <- matrix(time[-id]) status.test <- matrix(status[-id]) S.test <- Surv(time.test, status.test) # number of folds for cv k <- 10 # number of covariates included m <- dim(x)[2] ########## univariate selection cox.p.uni <- NULL for (i in 1:m){ t <- coxph(s.train ~ x.train[, i], method = "breslow") cox.p.uni[i] <- 1-pchisq(t$score, 1) sel.uni <- order(cox.p.uni) # cvl for the first covariates according to smallest p-value cvl <- NULL i <- 1 fit <- cvl(s.train, x.train[, sel.uni[i]], fold = k) cvl[1] <- fit$cvl while (cvl[i] > -Inf) { i <- i+1 fit <- try(cvl(s.train, x.train[, sel.uni[1:i]], fold = k)) if(!inherits(fit, "try-error")){ cvl[i] <- fit$cvl else{ break # optimal tuning parameter lambda.uni <- which.max(cvl) # covariates included in the model sel.uni <- sel.uni[1:lambda.uni] num.uni <- length(sel.uni) # creat a penalized object (= coxph(method = "breslow")) with optimal lambda.uni pen.uni <- try(penalized(s.train, x.train[, sel.uni], data = x.train)) if(!inherits(pen.uni, "try-error")){ + 1 Seite 6

######### evaluation for univariate selection # risk score rs.uni <- as.matrix(x.test[, sel.uni]) %*% as.matrix(coefficients(pen.uni)) # median time good.prog.uni <- (rs.uni < median(rs.uni)) fit.uni <- survfit(s.test ~ good.prog.uni) # kaplan-meier curves plot(fit.uni, lwd = 2, lty = c(1,1), col = c("red","blue"), xlab = 'Time (years)', ylab = 'Estimated Survival Function') legend(15, 0.8, legend=c('pi > median', 'PI < median'), lty = c(1,1), col = c("red", "blue"), lwd = 2) title("kaplan-meier Curves") # logrank test logrank.uni <- survdiff(s.test ~ good.prog.uni, rho = 0, data = x.test) p.median.uni <- 1-pchisq(logrank.uni$chisq, 1) # prognostic index (risk score) pi.uni <- coxph(s.test ~ rs.uni, method = "breslow") p.pi.uni <- 1-pchisq(pi.uni$score, 1) else { sel.uni <- NA num.uni <- NA lambda.uni <- NA p.median.uni <- NA p.pi.uni <- NA # evaluation values list.sel.uni[[j]] <<- sel.uni num.uni[j] <<- num.uni lambda.uni[j] <<- lambda.uni p.median.uni[j] <<- p.median.uni p.pi.uni[j] <<- p.pi.uni ####################################################### ########## function for forward selection ############# ####################################################### forw.reg <- function(x, time, status){ list.sel.forw <<- list(null) num.forw <<- NULL lambda.forw <<- NULL Seite 7

p.median.forw <<- NULL p.pi.forw <<- NULL count <- 0 for (j in 1:100){ print(count) id <- sample(dim(x)[1], round(2/3*dim(x)[1])) ### training data x.train <- as.data.frame(x[id, ]) time.train <- matrix(time[id]) status.train <- matrix(status[id]) S.train <- Surv(time.train, status.train) ### test data x.test <- as.data.frame(x[-id, ]) time.test <- matrix(time[-id]) status.test <- matrix(status[-id]) S.test <- Surv(time.test, status.test) # number of folds for cv k <- 10 # number of covariates included m <- dim(x)[2] ######################### forward selection # first covariate selected via univariate approach cox.p.uni <- NULL for (i in 1:m){ t <- coxph(s.train ~ x.train[, i], method = "breslow") cox.p.uni[i] <- 1-pchisq(t$score, 1) # include covariates step by step (max 100 covariates) x.train <- as.matrix(x.train) active.set <- NULL for (z in 1:100){ cox.p <- rep(inf, m) for (i in (1:m)[!(1:m)%in%active.set]){ t <- coxph(s.train ~ x.train[, c(active.set, i)], method = "breslow") cox.p[i] <- 1-pchisq(t$score, 1) print(min(cox.p)) if(min(cox.p)>0){ active.set <- c(active.set, which.min(cox.p)) else{ active.set <- c(active.set, setdiff(order(cox.p.uni), active.set)) break Seite 8

# cvl for the first covariates according to smallest p-value cvl <- NULL i <- 1 fit <- cvl(s.train, x.train[, active.set[1:i]], fold = k) cvl[1] <- fit$cvl while (cvl[i] > -Inf) { i <- i+1 fit <- try(cvl(s.train, x.train[, active.set[1:i]], fold = k)) if(!inherits(fit, "try-error")){ cvl[i] <- fit$cvl else{ break # optimal tuning parameter lambda.forw <- which.max(cvl) # covariates included in the model sel.forw <- active.set[1:lambda.forw] num.forw <- length(sel.forw) # create a penalized object (= coxph(method = "breslow")) with optimal lambda.forw pen.forw <- try(penalized(s.train, x.train[, sel.forw], data = as.data.frame(x.train))) if(!inherits(pen.forw, "try-error")){ ########## evaluation for forward selection # risk score rs.forw <- as.matrix(x.test[, sel.forw]) %*% as.matrix(coefficients(pen.forw)) # median time good.prog.forw <- (rs.forw < median(rs.forw)) fit.forw <- survfit(s.test ~ good.prog.forw) # kaplan-meier curves plot(fit.forw, lwd = 2, lty = c(1,1), col = c("red","blue"), xlab = 'Time (years)', ylab = 'Estimated Survival Function') legend(15, 0.8, legend=c('pi > median', 'PI < median'), lty = c(1,1), col = c("red", "blue"), lwd = 2) title("kaplan-meier Curves") # logrank test logrank.forw <- survdiff(s.test ~ good.prog.forw, rho = 0, data = x.test) p.median.forw <- 1-pchisq(logrank.forw$chisq, 1) # prognostic index (risk score) pi.forw <- coxph(s.test ~ rs.forw, method = "breslow") p.pi.forw <- 1-pchisq(pi.forw$score, 1) Seite 9

else { sel.forw <- NA num.forw <- NA lambda.forw <- NA p.median.forw <- NA p.pi.forw <- NA # evaluation values list.sel.forw[[j]] <<- sel.forw num.forw[j] <<- num.forw lambda.forw[j] <<- lambda.forw p.median.forw[j] <<- p.median.forw p.pi.forw[j] <<- p.pi.forw + 1 Seite 10