Building patient-level predictive models

Martijn J. Schuemie, Marc A. Suchard and Patrick Ryan

2015-11-01

Contents

1 Introduction
  1.1 Specifying the cohort of interest and outcomes
2 Installation instructions
3 Data extraction
  3.1 Configuring the connection to the server
  3.2 Preparing the cohort and outcome of interest
  3.3 Extracting the data from the server
4 Fitting the model
  4.1 Train-test split
  4.2 Fitting the model on the training data
  4.3 Model evaluation
  4.4 Inspecting the model
5 Acknowledgments

1 Introduction

This vignette describes how you can use the PatientLevelPrediction package to build patient-level prediction models. We will walk through all the steps needed to build an exemplar model, using the well-studied topic of predicting re-hospitalization. To reduce the scope, we limit ourselves to a type 2 diabetes population.

1.1 Specifying the cohort of interest and outcomes

The PatientLevelPrediction package requires longitudinal observational healthcare data in the OMOP Common Data Model (CDM) format. The user needs to specify two things:

1. Time periods for which we wish to predict the occurrence of an outcome. We will call this the cohort of interest, or cohort for short. One person can have multiple time periods, but time periods should not overlap.
2. Outcomes for which we wish to build a predictive model.

The cohort and outcomes should be provided as data in a table on the server. This table should have the same structure as the cohort table in the OMOP CDM, meaning it should have the following columns:

- cohort_concept_id (CDM v4) or cohort_definition_id (CDM v5 and later), a unique identifier for distinguishing between different types of cohorts, e.g. cohorts of interest and outcome cohorts.
- subject_id, a unique identifier corresponding to the person_id in the CDM.
- cohort_start_date, the start of the time period for which we wish to predict the occurrence of the outcome.
- cohort_end_date, which can be used to determine the end of the prediction window. Can be NULL for outcomes.

The prediction window always starts on the cohort_start_date. When the useCohortEndDate argument is set to TRUE, the prediction window ends on the cohort_end_date plus the number of days specified using the windowPersistence argument. When useCohortEndDate is set to FALSE, the prediction window ends on the cohort_start_date plus the number of days specified using the windowPersistence argument.

The package will use data from the time period preceding (and including) the cohort_start_date to build a large set of features that can be used to predict outcomes during the prediction window (including the first and last day of that window). These features can include binary indicators for the occurrence of any individual drug, condition, or procedure, as well as demographics and comorbidity indices.

Important: since the cohort_start_date is included both in the time used to construct predictors and in the time when outcomes can occur, it is up to the user to either remove outcomes occurring on the cohort_start_date, or to remove predictors that are in fact indicators of the occurrence of an outcome on the cohort_start_date.
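To make the expected structure concrete, here are a few purely hypothetical rows for such a table, expressed as an R data frame (all identifiers and dates are made up for illustration):

# Hypothetical example rows (CDM v4 naming); cohort_concept_id 1 marks the
# cohort of interest, 2 marks the outcome cohort. All values are made up.
exampleCohorts <- data.frame(cohort_concept_id = c(1, 2),
                             subject_id = c(12345, 12345),
                             cohort_start_date = as.Date(c("2010-01-01", "2010-06-15")),
                             cohort_end_date = as.Date(c("2010-12-31", NA)))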

2 Installation instructions

Before installing the PatientLevelPrediction package, make sure you have Java available. Java can be downloaded from www.java.com. For Windows users, RTools is also necessary; RTools can be downloaded from CRAN.

The PatientLevelPrediction package is currently maintained in a GitHub repository, and has dependencies on other packages in GitHub. All of these packages can be downloaded and installed from within R using the devtools package:

install.packages("devtools")
library(devtools)
install_github("ohdsi/SqlRender")
install_github("ohdsi/DatabaseConnector")
install_github("ohdsi/OhdsiRTools")
install_github("ohdsi/Cyclops")
install_github("ohdsi/PatientLevelPrediction")

Once installed, you can type library(PatientLevelPrediction) to load the package.

3 Data extraction

The first step in running PatientLevelPrediction is extracting all necessary data from the database server holding the data in the Common Data Model (CDM) format.

3.1 Configuring the connection to the server

We need to tell R how to connect to the server where the data are. PatientLevelPrediction uses the DatabaseConnector package, which provides the createConnectionDetails function. Type ?createConnectionDetails for the specific settings required for the various database management systems (DBMS). For example, one might connect to a PostgreSQL database using this code:

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "joe",
                                             password = "supersecret")

cdmDatabaseSchema <- "my_cdm_data"
cohortsDatabaseSchema <- "my_results"
cdmVersion <- "4"

The last three lines define the cdmDatabaseSchema and cohortsDatabaseSchema variables, as well as the CDM version. We will use these later to tell R where the data in CDM format live, where we want to create the cohorts of interest, and which CDM version is used. Note that for Microsoft SQL Server, database schemas need to specify both the database and the schema, so for example cdmDatabaseSchema <- "my_cdm_data.dbo".
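As a quick sanity check (an optional step, not part of the original workflow), you can open a connection and run a trivial query to verify that the connection details and schema name are correct:

# Optional sanity check: count the persons in the CDM. connect() and
# querySql() are provided by DatabaseConnector, which is loaded along
# with PatientLevelPrediction.
conn <- connect(connectionDetails)
querySql(conn, paste0("SELECT COUNT(*) FROM ", cdmDatabaseSchema, ".person"))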

3.2 Preparing the cohort and outcome of interest

Before we can start using the PatientLevelPrediction package itself, we need to construct the cohort of interest for which we want to perform the prediction, and the outcome: the event that we would like to predict. We do this by writing SQL statements against the CDM that populate a table containing the persons and events of interest. The resulting table should have the same structure as the cohort table in the CDM. For CDM v4, this means it should have the fields cohort_concept_id, cohort_start_date, cohort_end_date, and subject_id. For CDM v5 and later, the cohort_concept_id field must be called cohort_definition_id.

For our example study, we need to create the cohort of diabetics that have been hospitalized and have a minimum amount of observation time available before and after the hospitalization. We also need to define re-hospitalizations, which we define as any hospitalization occurring after the original hospitalization. For this purpose we have created a file called HospitalizationCohorts.sql with the following contents:

/***********************************
File HospitalizationCohorts.sql
***********************************/
IF OBJECT_ID('@cohortsDatabaseSchema.rehospitalization', 'U') IS NOT NULL
  DROP TABLE @cohortsDatabaseSchema.rehospitalization;

SELECT visit_occurrence.person_id AS subject_id,
    MIN(visit_start_date) AS cohort_start_date,
    DATEADD(DAY, @post_time, MIN(visit_start_date)) AS cohort_end_date,
    1 AS cohort_concept_id
INTO @cohortsDatabaseSchema.rehospitalization
FROM @cdmDatabaseSchema.visit_occurrence
INNER JOIN @cdmDatabaseSchema.observation_period
    ON visit_occurrence.person_id = observation_period.person_id
INNER JOIN @cdmDatabaseSchema.condition_occurrence
    ON condition_occurrence.person_id = visit_occurrence.person_id
WHERE place_of_service_concept_id IN (9201, 9203)
    AND DATEDIFF(DAY, observation_period_start_date, visit_start_date) > @pre_time
    AND visit_start_date > observation_period_start_date
    AND DATEDIFF(DAY, visit_start_date, observation_period_end_date) > @post_time
    AND visit_start_date < observation_period_end_date
    AND DATEDIFF(DAY, condition_start_date, visit_start_date) > @pre_time
    AND condition_start_date <= visit_start_date
    AND condition_concept_id IN (
        SELECT descendant_concept_id
        FROM @cdmDatabaseSchema.concept_ancestor
        WHERE ancestor_concept_id = 201826) /* Type 2 DM */
GROUP BY visit_occurrence.person_id;

INSERT INTO @cohortsDatabaseSchema.rehospitalization
SELECT visit_occurrence.person_id AS subject_id,
    visit_start_date AS cohort_start_date,
    visit_end_date AS cohort_end_date,
    2 AS cohort_concept_id
FROM @cohortsDatabaseSchema.rehospitalization
INNER JOIN @cdmDatabaseSchema.visit_occurrence
    ON visit_occurrence.person_id = rehospitalization.subject_id
WHERE place_of_service_concept_id IN (9201, 9203)
    AND visit_start_date > cohort_start_date
    AND visit_start_date <= cohort_end_date
    AND cohort_concept_id = 1;

This is parameterized SQL which can be used by the SqlRender package. We use parameterized SQL so we do not have to pre-specify the names of the CDM and result schemas. That way, if we want to run the SQL on a different schema, we only need to change the parameter values; we do not have to change the SQL code. By also making use of the translation functionality in SqlRender, we can make sure the SQL code can be run in many different environments:

library(SqlRender)
sql <- readSql("HospitalizationCohorts.sql")
sql <- renderSql(sql,
                 cdmDatabaseSchema = cdmDatabaseSchema,
                 cohortsDatabaseSchema = cohortsDatabaseSchema,
                 post_time = 30,
                 pre_time = 365)$sql
sql <- translateSql(sql, targetDialect = connectionDetails$dbms)$sql

connection <- connect(connectionDetails)
executeSql(connection, sql)

In this code, we first read the SQL from the file into memory. In the next line, we replace the four parameter names with their actual values. We then translate the SQL into the dialect appropriate for the DBMS we already specified in the connectionDetails. Next, we connect to the server and submit the rendered and translated SQL.

If all went well, we now have a table with the events of interest. We can see how many events there are per type:

sql <- paste("SELECT cohort_concept_id, COUNT(*) AS count",
             "FROM @cohortsDatabaseSchema.rehospitalization",
             "GROUP BY cohort_concept_id")
sql <- renderSql(sql, cohortsDatabaseSchema = cohortsDatabaseSchema)$sql
sql <- translateSql(sql, targetDialect = connectionDetails$dbms)$sql
querySql(connection, sql)

  cohort_concept_id  count
1                 1 395910
2                 2 165350

3.3 Extracting the data from the server

Now we can tell PatientLevelPrediction to extract all necessary data for our analysis:

covariateSettings <- createCovariateSettings(useCovariateDemographics = TRUE,
                                             useCovariateConditionOccurrence = TRUE,
                                             useCovariateConditionOccurrence365d = TRUE,
                                             useCovariateConditionOccurrence30d = TRUE,
                                             useCovariateConditionOccurrenceInpt180d = TRUE,
                                             useCovariateConditionEra = TRUE,
                                             useCovariateConditionEraEver = TRUE,
                                             useCovariateConditionEraOverlap = TRUE,
                                             useCovariateConditionGroup = TRUE,
                                             useCovariateDrugExposure = TRUE,
                                             useCovariateDrugExposure365d = TRUE,
                                             useCovariateDrugExposure30d = TRUE,
                                             useCovariateDrugEra = TRUE,
                                             useCovariateDrugEra365d = TRUE,
                                             useCovariateDrugEra30d = TRUE,
                                             useCovariateDrugEraOverlap = TRUE,
                                             useCovariateDrugEraEver = TRUE,
                                             useCovariateDrugGroup = TRUE,
                                             useCovariateProcedureOccurrence = TRUE,
                                             useCovariateProcedureOccurrence365d = TRUE,
                                             useCovariateProcedureOccurrence30d = TRUE,
                                             useCovariateProcedureGroup = TRUE,
                                             useCovariateObservation = TRUE,
                                             useCovariateObservation365d = TRUE,
                                             useCovariateObservation30d = TRUE,
                                             useCovariateObservationCount365d = TRUE,
                                             useCovariateMeasurement = TRUE,
                                             useCovariateMeasurement365d = TRUE,
                                             useCovariateMeasurement30d = TRUE,
                                             useCovariateMeasurementCount365d = TRUE,
                                             useCovariateMeasurementBelow = TRUE,
                                             useCovariateMeasurementAbove = TRUE,
                                             useCovariateConceptCounts = TRUE,
                                             useCovariateRiskScores = TRUE,
                                             useCovariateRiskScoresCharlson = TRUE,
                                             useCovariateRiskScoresDCSI = TRUE,
                                             useCovariateRiskScoresCHADS2 = TRUE,
                                             useCovariateRiskScoresCHADS2VASc = TRUE,
                                             useCovariateInteractionYear = FALSE,
                                             useCovariateInteractionMonth = FALSE,
                                             excludedCovariateConceptIds = c(),
                                             deleteCovariatesSmallCount = 100)

plpData <- getDbPlpData(connectionDetails = connectionDetails,
                        cdmDatabaseSchema = cdmDatabaseSchema,
                        oracleTempSchema = oracleTempSchema,
                        cohortDatabaseSchema = cohortsDatabaseSchema,
                        cohortTable = "rehospitalization",
                        cohortIds = 1,
                        washoutWindow = 183,
                        useCohortEndDate = TRUE,
                        windowPersistence = 0,
                        covariateSettings = covariateSettings,
                        outcomeDatabaseSchema = cohortsDatabaseSchema,
                        outcomeTable = "rehospitalization",
                        outcomeIds = 2,
                        firstOutcomeOnly = FALSE,
                        cdmVersion = cdmVersion)

There are many parameters, but they are all documented in the PatientLevelPrediction manual. In short, we are using getDbPlpData to get all data on the cohort of interest, the outcomes, and the covariates from the database. The resulting plpData object uses the ff package to store information in a way that ensures R does not run out of memory, even when the data are large.
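Because ff stores its data in temporary files on disk, it can help to point it at a drive with plenty of free space before running the extraction. A minimal sketch (the path shown is an illustrative placeholder):

# Direct ff's temporary files to a location with ample disk space.
# "/data/fftemp" is a placeholder; use any existing folder on your system.
options(fftempdir = "/data/fftemp")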

We can get some overall statistics using the generic summary() method:

summary(plpData)

PlpData object summary

Cohort ID: 1
Outcome concept ID(s): 2
Using cohort end date: TRUE
Window persistence: 0

Persons: 446375
Windows: 446375

Outcome counts:
  Event count Window count
2      118294       118294

Covariates:
Number of covariates: 70539
Number of non-zero covariate values: 590416131

3.3.1 Saving the data to file

Creating the plpData object can take considerable computing time, so it is probably a good idea to save it for future sessions. Because plpData uses ff, we cannot use R's regular save function. Instead, we have to use the savePlpData() function:

savePlpData(plpData, "rehosp_plp_data")

We can use the loadPlpData() function to load the data in a future session.

4 Fitting the model

4.1 Train-test split

We typically want not only to build a model, but also to know how good it is. Because evaluating a model on the same data used to fit it leads to an overestimate of performance, one uses either a train-test split of the data or cross-validation. We can use the splitData() function to split the data in a 75%-25% split:

parts <- splitData(plpData, c(0.75, 0.25))

The parts variable is a list of plpData objects. We can now fit the model on the first part of the data (the training set).

4.2 Fitting the model on the training data

model <- fitPredictiveModel(parts[[1]], modelType = "logistic")

The fitPredictiveModel() function uses the Cyclops package to fit a large-scale regularized regression. To fit the model, Cyclops needs to know the hyperparameter value that specifies the variance of the prior. By default, Cyclops will use cross-validation to estimate the optimal hyperparameter; be aware that this can take a long time. You can use the prior and control parameters of fitPredictiveModel() to specify Cyclops' behaviour, including using multiple CPUs to speed up the cross-validation.
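For example, here is a sketch of what overriding the defaults could look like, assuming fitPredictiveModel() passes the prior and control objects through to Cyclops; the specific settings shown (Laplace prior, 10-fold cross-validation, 4 threads) are illustrative choices, not recommendations:

# Sketch: explicitly specify the Cyclops prior and control objects.
# createPrior() and createControl() come from the Cyclops package.
library(Cyclops)
prior <- createPrior("laplace", useCrossValidation = TRUE)
control <- createControl(cvType = "auto",
                         fold = 10,
                         threads = 4,
                         noiseLevel = "quiet")
model <- fitPredictiveModel(parts[[1]],
                            modelType = "logistic",
                            prior = prior,
                            control = control)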

4.3 Model evaluation

We can evaluate how well the model is able to predict the outcome. For example, on the test data we can compute the area under the ROC curve:

prediction <- predictProbabilities(model, parts[[2]])
computeAuc(prediction, parts[[2]])

[1] 0.6841528

And we can plot the ROC curve itself:

plotRoc(prediction, parts[[2]])

[Figure: ROC curve, plotting sensitivity against 1 - specificity]

We can also look at the calibration:

plotCalibration(prediction, parts[[2]], numberOfStrata = 10)

[Figure: calibration plot, showing the observed fraction against the predicted probability]

4.4 Inspecting the model

Now that we have some idea of the operating characteristics of our predictive model, we can investigate the model itself:

modelDetails <- getModelDetails(model, plpData)
head(modelDetails)

           coefficient         id                             covariateName
1006        14.1141012       1006    ...in 365d on or prior to cohort index
0           -1.1706427          0                                 Intercept
4092289252  -0.9220897 4092289252     ...condition group: 4092289-Livebirth
321042102   -0.4142039  321042102     ...ohort index: 321042-Cardiac arrest
4092289152  -0.3537466 4092289152     ...condition group: 4092289-Livebirth
2313815702  -0.2813409 2313815702    ..., without interpretation and report

This shows the strongest predictors in the model with their corresponding betas.
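Since modelDetails is an ordinary data frame, standard R idioms can be used to explore it further. For example, sorting by coefficient shows the strongest protective and risk-increasing predictors side by side:

# Sort the model coefficients to inspect both extremes of the model.
sorted <- modelDetails[order(modelDetails$coefficient), ]
head(sorted, 5)  # most negative (protective) coefficients
tail(sorted, 5)  # most positive (risk-increasing) coefficients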

citation("patientlevelprediction") To cite package 'PatientLevelPrediction' in publications use: Martijn J. Schuemie, Marc A. Suchard and Patrick B. Ryan (2015). PatientLevelPrediction: Package for patient level prediction using data in the OMOP Common Data Model. R package version 1.0.0. A BibTeX entry for LaTeX users is @Manual{, title = {PatientLevelPrediction: Package for patient level prediction using data in the OMOP Comm author = {Martijn J. Schuemie and Marc A. Suchard and Patrick B. Ryan}, year = {2015}, note = {R package version 1.0.0}, } ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'. Further, PatientLevelPrediction makes extensive use of the Cyclops package. citation("cyclops") To cite Cyclops in publications use: Suchard MA, Simpson SE, Zorych I, Ryan P and Madigan D (2013). "Massive parallelization of serial inference algorithms for complex generalized linear models." _ACM Transactions on Modeling and Computer Simulation_, *23*, pp. 10. <URL: http://dl.acm.org/citation.cfm?id=2414791>. A BibTeX entry for LaTeX users is @Article{, author = {M. A. Suchard and S. E. Simpson and I. Zorych and P. Ryan and D. Madigan}, title = {Massive parallelization of serial inference algorithms for complex generalized linear mo journal = {ACM Transactions on Modeling and Computer Simulation}, volume = {23}, pages = {10}, year = {2013}, url = {http://dl.acm.org/citation.cfm?id=2414791}, } This work is supported in part through the National Science Foundation grant IIS 1251151. 10