A Macro to Create Data Definition Documents

Aileen L. Yam, sanofi-aventis Inc., Bridgewater, NJ

ABSTRACT

Data Definition documents are one of the requirements for NDA submissions. This paper presents a macro that automates the creation of the two parts of a Data Definition document: a Table of Contents and a set of Data Description tables. The macro has the flexibility to create Data Definition documents based on either Item 11 of the Guidance document of 1999 or the current CDISC data structures and terminology. Solutions to three major challenges in preparing Data Definition documents are provided. The challenges are:

1. Comments and derivation rules imported from a Statistical Analysis Plan or a standards document contain carriage return ('0D'x) and line feed ('0A'x) characters that split the comments and derivation rules into more than one column in SAS;

2. Long comments or derivation rules output to RTF files do not automatically flow from one page onto the next, so long text gets truncated and the next page begins with a new variable;

3. Data prepared according to the CDISC SDTM SDS guidelines have a decoded variable for each coded variable, so it is no longer necessary to attach formats to variables, and yet formats must still be provided in the Data Definition documents.

I. INTRODUCTION

A Data Definition document consists of a Table of Contents (TOC) and a collection of Data Description tables. The TOC lists all the data sets included in a submission and provides the location of, and information on, each data domain. The Data Description tables describe the attributes and the origins or derivations of all the variables in each data domain. In other words, the former contains domain metadata, while the latter contains variable metadata (metadata being information about the data). Here are some examples, with the major differences in style highlighted in gray:

1.
TOC based on Item 11 of the Guidance document:

   Datasets for Study ABC1234

   Dataset   Description of dataset   Location
   DM        Demographics             C:\level1\level2\DM.xpt
   AE        Adverse Events           C:\level1\level2\AE.xpt

2. TOC based on CDISC data structures and terminology:

   Datasets for Study ABC1234

   Dataset   Description      Structure                          Purpose      Key Variables            Location
   DM        Demographics     One record per subject             Tabulation   USUBJID                  C:\level1\level2\DM.xpt
   AE        Adverse Events   One record per subject per event   Tabulation   USUBJID, AETERM, AESEQ   C:\level1\level2\AE.xpt
3. Data Description tables based on Item 11 of the Guidance document:

   Study ABC1234 - data set variables for Demographics (DM.xpt)

   Variable   Label                       Type   Codes   Comment
   STUDYID    Study Identifier            char
   DOMAIN     Domain Abbreviation         char
   USUBJID    Unique Subject Identifier   char

4. Data Description tables based on CDISC data structures and terminology:

   Study ABC1234 data set variables for Demographics (DM.xpt)

   Variable   Label                       Type   Controlled Terms or Format   Origin     Role         Comment
   STUDYID    Study Identifier            text                                CRF page   Identifier
   DOMAIN     Domain Abbreviation         text   DM                           CRF page   Identifier
   USUBJID    Unique Subject Identifier   text                                Derived    Identifier

II. FEATURES

The macro program has three major sections:

Section 1 verifies that domain names and variable names satisfy the SAS Transport (XPT) file requirements, checks whether labels are blank, and provides an option to output warning messages if

1. domain names or variable names are longer than 8 characters,
2. domain labels or variable labels are longer than 40 characters,
3. domain labels or variable labels are missing.

Section 2 creates the TOC based on either Item 11 of the Guidance document or the CDISC data structures and terminology, depending on user selection.

Section 3 creates the Data Description tables based on either Item 11 of the Guidance document or the CDISC data structures and terminology, depending on user selection.

Sometimes a domain and its associated comment files need to be tested before integration with the rest of the Data Description tables, or an additional domain needs to be prepared at the end of a project. The macro therefore provides an option to create either a Data Description table for a single data domain, for testing or ad hoc use, or Data Description tables for all the domains in a directory, for submission purposes. The macro can also be used to generate the TOC and Data Description tables for raw data.

III. CHALLENGES

Comments that are copied and pasted directly from Word documents usually contain special characters such as carriage return ('0D'x) and line feed ('0A'x), which can also be combined to form an end-of-record marker. When comments containing these special characters are pasted into a .csv file, and the .csv file is converted into a SAS data set for merging with other variable metadata to create Data Description tables, the special characters act as field delimiters, breaking the comments up into different fields. To circumvent this problem, the special characters are located and replaced with a text string during the conversion from the .csv file into a SAS data set, and after the
conversion, the text strings are turned back into special characters to restore the proper carriage return, line feed, or end-of-record marker.

Long comments or derivation rules output from SAS ODS report programs into RTF files do not automatically wrap to the next page, so long text gets truncated when a page ends and the next page begins with a new variable. To get around this problem, a data step splits long text into additional rows.

Since every code variable has a corresponding decode variable under the CDISC SDTM SDS guidelines, it is no longer necessary to attach formats to variables. To obtain the information for the "Codes" column or the "Controlled Terms or Format" column of the Data Description tables, format names are saved into a SAS data set before a permanent SDS domain is created. The format names, together with their codes and decodes from the format libraries, are then merged back into the variable metadata.

IV. SOME MAJOR PARTS OF THE MACRO

The following code checks and outputs warning messages if domain names and labels or variable names and labels do not meet the XPT requirements, or if labels are blank.

a.) domain names and labels:

   data _null_;
     set sashelp.vtable;
     where libname="&lib"
     %if %length(&domname)>0 %then %do;
       and memname="&mem"
     %end;
     ;
     if memlabel=' ' then
       put 'WAR' 'NING: ' 'Domain labels missing: (' memname ') ' memlabel;
     %if %upcase(&xptchk)=YES %then %do;
       if length(memname)>8 then
         put 'WAR' 'NING: ' 'Domain names >8 characters: ' memname;
       if length(memlabel)>40 then
         put 'WAR' 'NING: ' 'Domain labels >40 characters: (' memname ') ' memlabel;
     %end;
   run;

b.) variable names and labels:

   data _null_;
     set sashelp.vcolumn;
     where libname="&lib"
     %if %length(&domname)>0 %then %do;
       and memname="&mem"
     %end;
     ;
     if label=' ' then
       put 'WAR' 'NING: ' memname 'has variable labels missing: (' name ') ' label;
     %if %upcase(&xptchk)=YES %then %do;
       if length(name)>8 then
         put 'WAR' 'NING: ' memname 'has variable names >8 characters: ' name;
       if length(label)>40 then
         put 'WAR' 'NING: ' memname 'has variable labels >40 characters: (' name ') ' label;
     %end;
   run;

The first challenge, as mentioned above, is the special characters, such as carriage return ('0D'x), line feed ('0A'x), or an end-of-record marker, that break the comment field up into multiple columns. The following code locates the special characters and replaces each one with a text string during the conversion of the comment file
into a SAS data set. After the conversion, the text strings are turned back into special characters to restore the proper carriage return, line feed, or end-of-record marker.

   %***Read in a comment file (&cmtfile). Replace comments containing
       special characters with a user-defined text string given in the
       &trace macro parameter, and output a temporary file (tmpfile.csv);
   data _null_;
     infile "&cmtfile" recfm=n;
     file "tmpfile.csv" recfm=n;
     input a $char1.;
     if a='0D'x then do;
       put '0D0A'x;
       input +1;
     end;
     else if a='0A'x then put "&trace";
     else put a $char1.;
   run;

   %***Input the temporary comment file with a data step. Provide
       flexibility to input data where Origin and Role may or may not
       exist in the comment file;
   data comment;
     infile "tmpfile.csv" delimiter=',' missover dsd lrecl=32767 firstobs=2;
     attrib memname  length=$32    informat=$32.    format=$32.
            name     length=$10    informat=$10.    format=$10.
            comments length=$30000 informat=$30000. format=$30000.;
     %if %upcase(&rpttype)=CDISC %then %do;
       attrib origin length=$200 informat=$200. format=$200.
              role   length=$40  informat=$40.  format=$40.;
     %end;
     input memname $ name $ comments $
     %if %upcase(&rpttype)=CDISC %then %do;
       origin $ role $
     %end;
     ;
   run;

   %***Upcase memname and name for merging with the variable metadata
       file (varmeta);
   data comment(drop=xmemname xname);
     set comment(rename=(memname=xmemname name=xname));
     length memname name $32;
     memname=upcase(xmemname);
     name=upcase(xname);
   run;

   proc sort data=comment; by memname name; run;
   proc sort data=varmeta; by memname name; run;

   %***Merge in the comments, and turn the traced text back into return
       characters or line breaks;
   data varmeta;
     merge varmeta(in=a drop=comments) comment(in=b);
     by memname name;
     if a;
     comments=tranwrd(comments,"&trace"," \line");
   run;

   %***Delete the temporary file;
   options noxwait noxsync;
   x "del tmpfile.csv";
   data _null_;
     x=sleep(3);
   run;

The second challenge is the truncation of long text. The solution, given below, is to split the long text into chunks, each of which fills up a page.

   %***Split long comments into chunks of 1200 characters and put the
       chunks into additional rows. Allow the comment length to be $30,000;
   data varmeta(drop=comment1--comment25 bin num i j);
     set varmeta;
     by memname name;
     if first.name then seq=.;
     array comment(25) $1200;
     length bin $30000;
     num=int(length(comments)/1200)+1;
     if num=1 then output;
     else do;
       %***Create a holding bin of the remaining text to be parsed as
           each chunk is removed;
       bin=comments;
       do j=1 to num;
         %***Parse backwards to break on a space as a word boundary;
         do i=1200 to 1 by -1;
           comments=bin;
           %***When a word boundary is found, put the text before the
               boundary in an array variable, output, and keep the rest
               in the bin for further processing;
           if substr(comments,i,1)=' ' then do;
             comments=substr(bin,1,i);
             seq+1;
             output;
             %***Assign the rest back into the bin, starting one byte
                 after the location of the space;
             bin=substr(bin,i+1);
             %***No need to search for a space anymore, so leave this
                 inner do loop;
             leave;
           end;
         end;
       end;
     end;
   run;

The third challenge is that formats are no longer attached to the variables in any domain. The following code saves the formats into a SAS data set for merging with the metadata during the creation of the Data Description tables. The code can be executed right before a domain is created as a permanent SAS data set.

   %macro msavefmt(inlib=work,indom=,outlib=,outdom=&domain);
     %***Upcase the macro variables for merging purposes;
     %let libin=%upcase(&inlib);
     %let domin=%upcase(&indom);
     %let libout=%upcase(&outlib);
     %let domout=%upcase(&outdom);
     proc sql noprint;
       create table &domout as
       select %left(%trim("&domout")) label="domain name" as MEMNAME,
              upcase(name)   label="variable" as NAME,
              upcase(format) label="format"   as FORMAT
       from dictionary.columns
       where libname="&libin" and upcase(memname) in ("&domin")
       order by memname;
     quit;
     %***Only character formats need to be saved for the Data Description
         tables;
     data &libout..fmt_&domout;
       set &domout;
       if indexc(upcase(format),'ABCDEFGHIJKLMNOPQRSTUVWXYZ')>0;
     run;
   %mend msavefmt;

V. REFERENCES

1. Guidance for Industry: Providing Regulatory Submissions in Electronic Format - NDAs. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER), IT 3, January 1999.

2. Case Report Tabulation Data Definition Specification (define.xml). Prepared by the CDISC define.xml Team, 2005.

VI. ACKNOWLEDGMENT

The author would like to thank Bari Lawhorn and Janice Bloom for their help with ODS RTF.

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. (R) indicates USA registration.

For additional information, contact:

Aileen L. Yam
sanofi-aventis, Inc.
200 Crossing Blvd.
Bridgewater, NJ 08807
email: aileen.yam@sanofi-aventis.com
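As a closing illustration of the first challenge, the trace-string round trip of Section IV (hide the line breaks as ordinary text on the way into the .csv file, then turn them into RTF line breaks after import) can be sketched outside SAS. The Python below is a hypothetical rendering for illustration only: the trace string "@@LINE@@" stands in for the &trace macro parameter, and the function names are invented here, not part of the macro.

```python
# Illustrative only: Python sketch of the trace-string round trip.
# "@@LINE@@" is a stand-in for the &trace macro parameter.
TRACE = "@@LINE@@"

def hide_breaks(comment):
    """Replace carriage returns and line feeds with the trace string so
    the comment travels through a .csv file as one ordinary field."""
    return (comment.replace("\r\n", TRACE)
                   .replace("\r", TRACE)
                   .replace("\n", TRACE))

def restore_breaks(comment, target=" \\line"):
    """Turn the trace string back into the marker the report needs; the
    paper substitutes the RTF control word \\line via TRANWRD."""
    return comment.replace(TRACE, target)

raw  = "Derived per SAP.\r\nSee Section 9.2."
safe = hide_breaks(raw)       # no CR/LF left to split the .csv field
rtf  = restore_breaks(safe)   # trace string becomes the RTF line break
```

The same two-step substitution works for any delimiter-sensitive characters, as long as the trace string is guaranteed never to occur in the comments themselves.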
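The chunking logic used for the second challenge (split a long comment into pieces of at most 1,200 characters, cutting only at a space so words stay whole) can also be checked in miniature outside SAS. Below is a hypothetical Python sketch of the same backwards scan for a word boundary; the function name is illustrative, not the macro's.

```python
def split_on_word_boundary(text, width=1200):
    """Split text into chunks of at most `width` characters, cutting only
    at spaces: mirrors the inner DO loop in Section IV that scans
    backwards from the chunk limit looking for a blank."""
    chunks = []
    remaining = text  # the "holding bin" of the SAS version
    while len(remaining) > width:
        # scan backwards from the limit for a space (word boundary)
        cut = remaining.rfind(" ", 0, width + 1)
        if cut <= 0:
            # no space found: hard-cut at the limit (one very long word)
            chunks.append(remaining[:width])
            remaining = remaining[width:]
        else:
            chunks.append(remaining[:cut])
            remaining = remaining[cut + 1:]  # skip the space itself
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each chunk then becomes one additional observation for the variable, just as the SAS data step OUTPUTs one row per chunk so the RTF table can continue the text on the next page.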