REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS

Edward McCaney, Centocor Inc., Malvern, PA
Gail Stoner, Centocor Inc., Malvern, PA
Anthony Malinowski, Centocor Inc., Malvern, PA
Robert Karvois, Centocor Inc., Malvern, PA

ABSTRACT

A primary need in the reporting of clinical trial data is the ability to periodically extract clean data from a clinical database management system. This paper presents the methodology currently in use by Centocor. REx (Reporting Effort Extraction) is a Microsoft Access 97-based application that extracts clinical trial data from an Oracle database using SAS/ACCESS and creates permanent SAS data sets. REx uses SAS macros to perform additional processing based on modifications (actions) requested by users. REx provides an interface to define actions, create and run extract code, archive extractions, and run reports. Additionally, the system has the ability to obtain metadata, update it with user-defined actions, and pass it to downstream processes.

INTRODUCTION

When Centocor switched its clinical data management system (CDMS), the need arose to design a new process to extract clinical data from an Oracle database into SAS data sets for analysis and reporting. The new extraction system, REx, is a component of a redesigned analysis and reporting infrastructure.

Previously, SAS data sets and format catalogs were pushed by data management as straight dumps of the database, including standard fields that did not pertain to the trial to be analyzed. Data set documentation was provided separately in paper format and could easily become outdated. From the snapshots of the clinical data, programmers created SAS analysis data sets for reporting and submission. Analysis data set definitions were stored in Word documents, a method that made it difficult to maintain standards across studies. Data definition documentation was generated from SAS PROC CONTENTS and required a substantial manual effort to comply with FDA electronic submission guidelines. All systems were VAX-based legacy hardware and software that received limited support from the company's IT department.

The new system (Figure 1) runs on a Windows NT platform and enables users (programmers) to pull and snapshot the data model (metadata), clinical data, code lists and dictionaries from the CDMS. It also allows the user to define straightforward additions, modifications and deletions to the extracted data sets. The goal of these modifications or actions is to enable consistent creation of standard reporting variables, to perform dictionary merges, and to perform other processing to facilitate analysis and review. All actions are reflected in an updated data model that serves as the foundation for an electronic submission. The clinical data, format catalog, dictionary, and data model are then passed to another system for analysis data set definition.

Figure 1 REx Process Overview
ACCESS DATABASE

The information used by REx is accessed through MS Access tables. REx consists of both internal and external tables. The external tables are part of a larger Access database referred to as the Data Model, a set of metadata tables that describe the clinical data. An example of a metadata table is the list of data sets included in a particular study; this list is used, for instance, to populate the list boxes on the entry screens. The Data Model is used throughout the REx process.

The second, and probably more pivotal, table is internal: the Action table. The Action table records all user input to REx and is used to create the extract code that ultimately creates the SAS data sets. When the extract code is created, the action variables are assembled into calls to the SAS macros. REx thus acts as a code generator, with the Action table as its information source.

Once REx applies the actions to the data, the Data Model must be updated to reflect the changes. For instance, if REx drops a variable from a SAS data set, the variable must also be dropped from the Data Model. REx performs this step automatically after the programmer has extracted the clinical data. The user is able to produce a post-rex Data Model reflecting the applied actions. This is accomplished systematically by merging the pre-rex Data Model with the Action table.

ORACLE DATABASE

As mentioned in the previous section, the Data Model and the support tables for REx are stored in MS Access. However, these tables do not contain any clinical data. The clinical data used to create permanent data sets is stored in an Oracle database. While REx's primary goal is to extract clinical data, REx never actually accesses the Oracle database. Instead, it produces extract programs that are run in batch mode on a server. The extract programs contain code that references the SAS views (discussed in the next section), which access the Oracle database.
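The Data Model update described under ACCESS DATABASE, in which the pre-rex Data Model is merged with the Action table, can be illustrated with a minimal Python sketch. This is not the REx implementation: the record layout (data set, variable, label) and the function name `apply_actions` are assumptions made for illustration, and only two actions named in the extract code example (DROP_VAR and RENAME) are handled.

```python
# Hypothetical sketch: derive a post-rex data model from a pre-rex data
# model plus the recorded actions. The field names and tuple layout are
# assumptions; the paper does not describe the actual Access schema.

def apply_actions(pre_rex_model, actions):
    """Return a new list of variable records with the actions applied.

    pre_rex_model: list of dicts with keys 'dataset', 'variable', 'label'
    actions: list of (action, dataset, variable, parameter) tuples
    """
    # Copy each record so the pre-rex model is left untouched.
    model = [dict(rec) for rec in pre_rex_model]
    for action, dataset, variable, parameter in actions:
        if action == "DROP_VAR":
            # A dropped variable disappears from the metadata as well.
            model = [r for r in model
                     if not (r["dataset"] == dataset
                             and r["variable"] == variable)]
        elif action == "RENAME":
            # A renamed variable keeps its label but changes its name.
            for r in model:
                if r["dataset"] == dataset and r["variable"] == variable:
                    r["variable"] = parameter
        # Other actions would need their own metadata rules.
    return model


pre_rex = [
    {"dataset": "DEMOG", "variable": "REC_ID", "label": "Record Identifier"},
    {"dataset": "DEMOG", "variable": "EVENT_ID", "label": "Event Identifier"},
    {"dataset": "DEMOG", "variable": "SEX", "label": "Sex"},
]
actions = [
    ("DROP_VAR", "DEMOG", "REC_ID", None),
    ("RENAME", "DEMOG", "EVENT_ID", "VISIT"),
]
post_rex = apply_actions(pre_rex, actions)
print([r["variable"] for r in post_rex])
```

Keeping the pre-rex records unchanged mirrors the paper's distinction between the pre-rex and post-rex Data Model: both remain available after the actions are applied.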
The data accessed through these SAS views is then manipulated and saved permanently as SAS data sets.

SAS VIEWS

REx accesses the clinical data, dictionaries, and code lists (formats) contained in Oracle via permanent SAS views stored in a production view directory. These views are defined in SAS/ACCESS scripts (automatically generated by the CDMS) which contain PROC SQL modules for each data set. When the view scripts are run (external to REx), the PROC SQL code connects to the Oracle database and creates the views by selecting the columns defined in the data model. Variable labeling and formatting are also defined by the view according to the data model. A sample PROC SQL script for a demographics (DEMOG) data set is shown below (Figure 2).

    /* Module: DEMOG */
    /* The inner Oracle query converts the date of birth to a SAS date */
    /* value (days since 01JAN1960); the outer SELECT attaches labels  */
    /* and formats from the data model.                                */
    PROC SQL;
      CONNECT TO ORACLE(USER=XXXXXXXX PASSWORD=XXXXXXXX PATH='XXXXX');
      CREATE VIEW EXAMPLE.DEMOG as
        SELECT REC_ID   label = 'Record Identifier',
               PNO      label = 'Protocol Number',
               CNO      label = 'Center ID',
               PATNO    label = 'Patient Number',
               EVENT_ID label = 'Event Identifier',
               DOB_DT   label = 'Date of birth' Format=DATE9.,
               SEX      label = 'Sex'           Format=SEX.,
               RACE     label = 'Race'          Format=RACE.
        FROM CONNECTION TO ORACLE
          (SELECT to_char(rec_id), PNO, CNO, PATNO, EVENT_ID,
                  trunc(c_dob_dt) - to_date('01-01-1960', 'DD-MM-YYYY'),
                  c_sex, c_race
           FROM TESTPROT.TEST_DEMOG
           ORDER BY CNO, PATNO, EVENT_ID, REC_ID
          ) AS S(REC_ID, PNO, CNO, PATNO, EVENT_ID, DOB_DT, SEX, RACE);
      DISCONNECT FROM ORACLE;
    QUIT;

Figure 2 View Creation Script

ACTION ENTRY

Actions are user-specified, parameter-driven SAS macro modules invoked by the system to create SAS code that is applied to a selected variable. Actions prepare the data so that it is easier to use. REx actions can be applied either to a single variable within one data set or to a single variable contained in multiple data sets. SAS macro parameters are specified through the action entry screens and stored in the Action table (Figure 3).
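How a stored action might later be turned into a macro call can be sketched in Python. The sketch assumes that an Action table row holds an action name plus an ordered list of parameter values; the `Action` class and the `to_macro_call` helper are hypothetical names used only for illustration, not part of REx.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Action:
    """Assumed shape of one Action table row (not the real Access schema)."""
    name: str                 # action / macro name, e.g. RESET
    params: Tuple[str, ...]   # ordered macro parameters


def to_macro_call(action: Action) -> str:
    """Render the row as a positional SAS macro call."""
    return "%{}({});".format(action.name, ",".join(action.params))


row = Action("RESET", ("DEMOG", "SEX", "SEX."))
print(to_macro_call(row))
```

Because each row carries everything needed to build its call, the generated program can be reproduced at any time from the Action table alone.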
Figure 3 Action Entry Variable Level Screen

Variable level actions are applied to a single variable within a data set. A data set, variable, and action are selected from the entry screen. If additional information is required to complete the action, an action-specific prompt (see Figure 4) is displayed, requesting the user to enter the necessary parameter information. The standard parameters created are:

parameter 1 - the data set on which the action will be performed
parameter 2 - the variable
parameter 3 - the information specific to the action

Figure 4 - Action Entry Variable Level Screen Showing Action Specific Prompt

An example of a variable level action is changing a coded value to a decoded value. In REx, this is known as a RESET action. The RESET action requires:

parameter 1 - DEMOG (data set name)
parameter 2 - SEX (variable to be processed)
parameter 3 - $SEX. (the SAS format)

The user selects the data set, variable, and action from the entry screen. The RESET action triggers the display of an additional prompt where the user specifies the format used to decode the variable. The actions are implemented via SAS macros. Figure 5 shows the SAS macro code for the RESET action.

    /* Example call: %RESET(DEMOG, SEX, $SEX.); */
    %MACRO RESET(DATASET,VARIABLE,FORMAT);

      /* Upcase the parameter VARIABLE */
      %let VARIABLE = %upcase(&variable);

      /* Obtain the label from the original variable. */
      Proc contents data = &DATASET
                    out = _TLABEL (keep = NAME LABEL) noprint;
      run;

      /* Create a macro variable containing the original variable label. */
      Data _TLABEL;
        Set _TLABEL (where = (upcase(NAME) = "&VARIABLE"));
        Call symput("m_label", trim(label));
      run;

      /* Use the format to create the new value for the variable.
         Attach the label from the original variable. */
      Data &DATASET (drop = TEMPVAR);
        Set &DATASET (rename = (&VARIABLE = TEMPVAR));
        &VARIABLE = put(tempvar, &format);
        label &VARIABLE = "&M_LABEL";
        If left(&variable) = '.' then &VARIABLE = ' ';
      run;

      /* Get rid of the work data set. */
      Proc Datasets nolist;
        delete _TLABEL;
      quit;

    %MEND RESET;

Figure 5 %RESET Action Macro

Data set level action entry is very similar to variable level entry; the difference is the number of data sets to which the action is applied. This screen allows the user to efficiently specify the same action on many data sets. The user first selects multiple data sets from the data set column, then the variable, and then the action to be applied. Again, if additional information is needed to complete the action, an action-specific screen will appear. An individual action is created in the Action table for each data set. For example, to RESET a variable contained in multiple data sets, the user selects all of the data sets, the variable, and the action RESET. A prompt will request the user to enter the format. Variables that are not selected for action processing are passed through without an action being added to the Action table.
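The data set level behavior described above, where one screen selection produces an individual action for each data set, can be sketched in Python as follows. The function name, the dictionary keys, and the data set names VITALS and LABS are hypothetical and chosen only for illustration; the paper does not give the actual Action table columns.

```python
# Hypothetical sketch: expand one data set level selection into one
# action record per data set, as the Action table is described to hold.

def expand_dataset_level_action(datasets, variable, action, parameter=None):
    """Create one action record for each selected data set."""
    records = []
    for dataset in datasets:
        record = {"action": action, "dataset": dataset, "variable": variable}
        if parameter is not None:
            # Action-specific information, e.g. the format for RESET.
            record["parameter"] = parameter
        records.append(record)
    return records


# VITALS and LABS are made-up data set names for this example.
records = expand_dataset_level_action(
    ["DEMOG", "VITALS", "LABS"], "SEX", "RESET", "$SEX.")
for rec in records:
    print(rec)
```

Storing one record per data set, rather than a single multi-data-set record, means each per-data-set extract program can be generated from only the records that name that data set.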
CREATING EXTRACT CODE

REx creates a single program for each data set. The code created by REx includes all the libname statements necessary to execute the program. The programs include any actions that the user has specified in the Action Entry screens (Figures 3 and 4). When the extract code is created, the actions and their parameters are assembled so that they call existing SAS macro modules. The following example (Figure 6) displays the DEMOG extract program that REx created from the information stored in the Action table.

    %macro REx_DEMOG(Raw_Data=,Rex_Data=);

      * Allocate libraries;
      Libname DS 'data library';
      Libname DEMOG 'SAS view library';
      Libname Raw_Data "&Raw_Data";
      Libname Rex_Data "&Rex_Data";
      Libname Library 'SAS format library';

      * Perform REx actions;
      Data DEMOG;
        set DEMOG.DEMOG;
      run;
      %DROP_VAR(DEMOG,REC_ID);
      %PATID(DEMOG);
      %RENAME(DEMOG,EVENT_ID,VISIT);
      %RESET(DEMOG,SEX,SEX.);
      %RESET(DEMOG,RACE,RACE.);
      %UNIQUE_P(DEMOG);

      * Create archived copy of the pre-rex data;
      Data Raw_Data.DEMOG;
        set DEMOG.DEMOG;
      run;

      * Create permanent copy of post-rex data for analysis;
      Data DS.DEMOG;
        set DEMOG;
      run;

      * Create archived date/time stamped copy of post-rex data
        for audit purposes;
      Data Rex_Data.DEMOG;
        set DEMOG;
      run;

    %mend REx_DEMOG;

Figure 6 Extract Code (REx_DEMOG)

Notice that this SAS macro accepts two parameters. These parameters pass the archive directory names (which contain the date/time of the REx extraction) to the macro. Every time REx extracts data from the Oracle database, a new archive directory is created, giving the user an audit trail of REx activities.

RUNNING EXTRACT CODE

Once the extract code has been created, REx can use it to create the SAS data sets. The user selects the data sets to be extracted from the Create Data Sets screen (Figure 7).

Figure 7 Reporting Effort Extraction Screen

The user then clicks the Run Program button, and REx creates a macro call to the extract code, as shown in Figure 8.

    %REx_demog (raw_data= C:\prod\archive\01oct20011205\data,
                rex_data= C:\prod\archive\01oct20011205\rex);

Figure 8 Extract Code Macro Call

This program contains the extract code macro call, which passes the values of the archive parameters to the extract code macro (i.e. REX_DEMOG, see Figure 6). These parameters are the names of the archive directories that have been created by REx. A new directory and call are created every time REx is run. Once this file has been created, it is sent to the server to be executed.
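The way a date/time stamped archive directory and the corresponding macro call might be derived for each run can be sketched in Python. The stamp layout (day, month abbreviation, year, hour, minute) is inferred from the single example path in Figure 8 and is an assumption, as are the helper names `archive_stamp` and `build_macro_call`; REx itself is an Access application and its actual logic is not shown in this paper.

```python
from datetime import datetime

# Month abbreviations are spelled out to avoid locale-dependent formatting.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def archive_stamp(when):
    """Build a stamp such as 01oct20011205 (assumed layout: ddmonyyyyhhmm)."""
    return "{:02d}{}{:04d}{:02d}{:02d}".format(
        when.day, MONTHS[when.month - 1], when.year, when.hour, when.minute)


def build_macro_call(dataset, archive_root, when):
    """Return the macro call that passes both archive directories."""
    stamp = archive_stamp(when)
    raw_dir = "{}\\{}\\data".format(archive_root, stamp)   # pre-rex copy
    rex_dir = "{}\\{}\\rex".format(archive_root, stamp)    # post-rex copy
    return "%REx_{}(raw_data={}, rex_data={});".format(dataset, raw_dir, rex_dir)


call = build_macro_call("DEMOG", r"C:\prod\archive",
                        datetime(2001, 10, 1, 12, 5))
print(call)
```

Deriving both directory names from one timestamp keeps the pre-rex and post-rex archives of a run side by side, which is what makes the archive usable as an audit trail.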
OUTPUT

After running the REx extraction code, SAS data sets containing clinical and dictionary data, and a SAS format catalog are produced. These data sets reflect a snapshot of the raw clinical data contained in the CDMS with the REx actions applied. The clinical data is written to a production data directory. The dictionaries and formats are written to a production formats directory. These data sets are now available for use in creation of analysis data sets.

REx also creates archived copies of the extracted data. Copies of both the pre-rex (raw clinical data prior to application of REx actions) and post-rex data sets are archived in a date/time stamped folder in a production archive directory. These archived snapshots provide an audit trail of REx activities.

In addition to generating the SAS data sets, the clinical data model is updated to reflect the effects of the REx actions. This updated data model is used by a downstream application in generating analysis data set definitions.

REx also provides a reporting function. These reports reflect the contents of the action table for a particular reporting effort. Users can generate reports of REx actions by data set or by action type. These reports are used to ensure that the desired actions and parameters have been defined for a data extraction.

CONCLUSION

The new system has been successfully designed and implemented to run on Windows NT. It has grown from a single-user, single-project prototype to a production system that supports multiple users across multiple projects. It has fulfilled requirements for pulling data and performing user-defined actions as well as updating the data model to reflect those actions.

The REx process of specifying actions and the use of the macros to execute those requests has increased the consistency of variable definition across projects and brought uniformity to the code used to create these variables.
The ready availability of the data model reduces the documentation burden and allows programmers and statisticians to focus on the definition, coding and quality control of more critical analysis variables.

In summary, the implementation of the REx system has significantly enhanced the programming group's ability to efficiently perform downstream processing. This includes generation of analysis data sets, maintenance of standard variable definitions and automatic generation of data definition documentation for electronic submission to the FDA.

ACKNOWLEDGEMENTS

SAS and all other SAS Institute, Inc. product or service names are registered trademarks or trademarks of SAS Institute, Inc. in the United States and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Edward McCaney mccaneye@centocor.com
Gail Stoner stoner@centocor.com
Anthony Malinowski malinowskia@centocor.com
Robert Karvois karvoisr@centocor.com