SIMPLIFYING NDA PROGRAMMING WITH PROC SQL

Aileen L. Yam, Besselaar Associates, Princeton, NJ

ABSTRACT

Programming the Integrated Summary of Safety (ISS) for a New Drug Application (NDA) usually involves obtaining patient counts, percentages, and other summary statistics such as the mean, standard deviation and range. This paper shows how to obtain all of these results with the SQL procedure. While PROC SQL is often perceived as a data retrieval tool, its unique features allow programmers to write compact code to obtain data summaries for any application similar to the NDA ISS or the safety summary tables in individual new drug studies. This paper also shows that several DATA or other PROC steps can be reduced to one or two steps with PROC SQL.

OVERVIEW

At the end of this paper are three tables that represent the types of summary statistics most commonly presented in safety tables in pharmaceutical research. The data in those tables are fictitious and are for illustration purposes only. The three types of summary tables are:

1. counts, percentages, mean, standard deviation, range and missing value frequencies of demographic data;
2. counts and percentages of adverse events by body system;
3. counts and percentages of adverse events by body system and COSTART term.

This paper shows that the summary statistics of each of these three types of tables can be obtained entirely within one or two PROC SQL steps. The unique features of PROC SQL make it possible to reduce many DATA or other PROC steps in summing, grouping, sorting, selecting first occurrences of each subgroup, merging, concatenating, conditional processing, and calculating percentages, mean, standard deviation, range and missing values. Such unique features are boldfaced in the following programs. Repeated uses of the same features in a subsequent program are not boldfaced or discussed again. Variable names, data set names, macro variable names and macro variable references from the programs are italicized in the discussion.
The intention of this paper is not to advocate PROC SQL over DATA steps or other procedures, and no benchmark statistics are offered to compare their performance. The objective, rather, is to present the SQL procedure as a valuable alternative for summarizing data with fewer steps.

TOTAL PATIENT COUNTS

The following SQL procedure obtains total patient counts for drug 1, drug 2, and all drug groups (drug 1 and drug 2 combined, assigned as drug 3 for report writing purposes). Since total patient counts appear in all three summary tables, the counts are calculated once and saved in a permanent data set called totpat.
    %let numdrug=2;
    proc sql;
      create table perm.totpat as
      /* patient counts within each drug group */
      select drug,
             count(distinct patient) as totn
      from raw.data
      group by drug
      union
      /* patient count for all drug groups combined */
      select %eval(&numdrug+1) as drug,
             count(distinct patient) as totn
      from raw.data;
    quit;

The DISTINCT keyword eliminates duplicate rows before counting. The GROUP BY clause is used to classify patient counts into drug groups. The UNION operator combines two queries, putting the result from the first query on top of the result from the second query. The AS keyword assigns the selected value to a named variable.

Assume that there are two drug groups, 1 and 2. The first SELECT statement counts the number of nonduplicating patients in each drug group. The second SELECT statement counts the number of nonduplicating patients without grouping patients by drug. Notice that the variable drug is given a value of 3 in the second SELECT statement. Since PROC SQL allows the selection of a literal numeric value or a character string for a variable, any arbitrary value can be assigned. The number of drug groups is set to 2 in the macro variable numdrug; therefore, the two drug groups combined are assigned the value 3, that is, the number of drug groups, &numdrug, plus one. The results from the two SELECT statements are concatenated into a permanent data set, totpat. Totpat consists of patient counts in drug 1 and drug 2, as well as in drug 1 and drug 2 combined. The value of numdrug can be adjusted according to the number of drug groups in a study.

Several steps are saved. The data do not need to be sorted by drug group and patient. There is no need to set the data by the sorted variables in a DATA step to get the first observation of each patient for a count of nonduplicating patients. There is no need to count the patients in two steps, one by drug group and the other without drug group. The resulting counts do not need to be passed into another DATA step to be concatenated together, or to be reorganized by _TYPE_ if a PROC step for summary statistics is used.
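The counting logic above can be tried outside SAS. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module rather than PROC SQL; the table name data, the toy rows and the constant NUMDRUG (standing in for &numdrug) are hypothetical and serve only to illustrate COUNT(DISTINCT ...) combined with UNION and a literal group code.

```python
import sqlite3

# Hypothetical toy data: (patient, drug). A patient may have several rows.
rows = [(1, 1), (1, 1), (2, 1), (3, 2), (3, 2), (4, 2), (5, 2)]
NUMDRUG = 2  # plays the role of the &numdrug macro variable

con = sqlite3.connect(":memory:")
con.execute("create table data (patient integer, drug integer)")
con.executemany("insert into data values (?, ?)", rows)

# First SELECT: distinct patients per drug group.
# Second SELECT: distinct patients overall, labelled as group NUMDRUG + 1.
totpat = con.execute(
    """
    select drug, count(distinct patient) as totn
    from data
    group by drug
    union
    select ? as drug, count(distinct patient) as totn
    from data
    order by drug
    """,
    (NUMDRUG + 1,),
).fetchall()

print(totpat)  # [(1, 2), (2, 3), (3, 5)]
```

Duplicate rows for patients 1 and 3 do not inflate the counts, and the combined group appears as drug 3 without any sorting or BY-group processing.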
DEMOGRAPHIC TABLE

The demographic table consists of two parts, so two SQL procedures are written. The first SQL procedure generates counts (cnt) and percentages (pct) of gender and race groups in Table 1.

    %macro xx(outds=,var=);
    proc sql;
      create table &outds as
      /* counts and percentages within each drug group */
      select *,
             round(cnt/
                   case sum(cnt)
                     when 0 then .
                     else sum(cnt)
                   end*100) as pct
      from (select drug, &var,
                   count(distinct patient) as cnt
            from raw.data
            group by drug, &var)
      group by drug
      union
      /* counts and percentages for all drug groups combined */
      select *,
             round(cnt/
                   case sum(cnt)
                     when 0 then .
                     else sum(cnt)
                   end*100) as pct
      from (select %eval(&numdrug+1) as drug, &var,
                   count(distinct patient) as cnt
            from raw.data
            group by &var)
      order by 1, 2;
    quit;
    %mend xx;

    %xx(outds=gencnt,var=gender);
    %xx(outds=racecnt,var=race);

In the macro calls to xx, there are two major queries joined by the UNION operator. In each of these queries, a subquery is used by nesting the second SELECT statement within the first SELECT statement. The CASE expression is used to perform conditional processing. The SUM function is used to calculate the grand total for the denominator. The ORDER BY clause sorts the results by the order-by items in the default sequence, from the lowest value to the highest value.

An asterisk (*) after the SELECT keyword in the outer query indicates that all the values returned by the second SELECT statement (drug, &var and cnt) are used. In the second SELECT statement, the number (cnt) of nonduplicating patients is counted by drug group and by the macro variable reference, &var, when it is resolved. Percentages (pct) are calculated in the outer query using cnt as the numerator and the SUM of cnt as the denominator. The CASE expression is used to prevent an error message when the denominator is zero: WHEN the SUM of cnt is zero, THEN it is set to missing; ELSE the SUM of cnt is the denominator. Similar calculations are done after the UNION operator without grouping patients by drug. Thus, the counts (cnt) and percentages (pct) of gender and race for the two drug groups separately and combined are obtained.

The results are ordered by the values in the first and second columns, as indicated by 1 and 2 in the ORDER BY clause. The first column is the first variable specified in the SELECT statement, which is drug. Similarly, the second column refers to the second variable in the SELECT statement, which is a macro variable that varies depending on the values supplied in the macro calls. In other words, the results are ordered by drug and gender in the first macro call, and by drug and race in the second macro call.

Besides the steps mentioned under the Total Patient Counts section, two additional steps are saved. One is not having to pass the patient counts into a DATA step for calculating percentages. The other is not having to sort the result table in ascending order.
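The percentage calculation with a guarded denominator can be sketched as follows. This is an illustration in SQLite through Python's sqlite3 module, not the SAS program itself, and the table, columns and rows are hypothetical. PROC SQL remerges SUM(cnt) back onto every row of the group; standard SQL does not, so this sketch swaps in an explicit join to a per-drug totals subquery to obtain the same denominator. It covers only the per-drug half of the UNION.

```python
import sqlite3

# Hypothetical toy data: (patient, drug, gender), with one duplicate row.
rows = [
    (1, 1, "F"), (1, 1, "F"), (2, 1, "M"), (3, 1, "M"), (4, 1, "M"),
    (5, 2, "F"), (6, 2, "M"),
]
con = sqlite3.connect(":memory:")
con.execute("create table data (patient integer, drug integer, gender text)")
con.executemany("insert into data values (?, ?, ?)", rows)

query = """
    with c as (            -- inner query: distinct patients per drug and gender
        select drug, gender, count(distinct patient) as cnt
        from data
        group by drug, gender
    )
    select c.drug, c.gender, c.cnt,
           -- multiply by 100.0 first so SQLite does not use integer division;
           -- a zero denominator becomes NULL (SAS: missing) instead of an error
           round(c.cnt * 100.0 /
                 case t.tot when 0 then null else t.tot end) as pct
    from c
    join (select drug, sum(cnt) as tot from c group by drug) as t
      on t.drug = c.drug
    order by 1, 2          -- ordinal positions: drug, then gender
"""
gencnt = con.execute(query).fetchall()

for row in gencnt:
    print(row)
# (1, 'F', 1, 25.0)
# (1, 'M', 3, 75.0)
# (2, 'F', 1, 50.0)
# (2, 'M', 1, 50.0)
```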
The second SQL procedure generates the mean, standard deviation, range and number of missing values of age, weight and height in Table 1.

    %macro yy(outds=,var=);
    proc sql;
      create table &outds as
      /* summary statistics within each drug group */
      select drug,
             "&var" as var,
             mean(&var) as mean,
             std(&var) as std,
             min(&var) as min,
             max(&var) as max,
             nmiss(&var) as nmiss
      from raw.data
      group by drug
      union
      /* summary statistics for all drug groups combined */
      select %eval(&numdrug+1) as drug,
             "&var" as var,
             mean(&var) as mean,
             std(&var) as std,
             min(&var) as min,
             max(&var) as max,
             nmiss(&var) as nmiss
      from raw.data;
    quit;
    %mend yy;

    %yy(outds=agestat,var=age);
    %yy(outds=wtstat,var=weight);
    %yy(outds=htstat,var=height);

In the macro calls to yy, the functions MEAN, STD, MIN, MAX and NMISS are used to calculate summary statistics. The character string resolved from the macro variable reference, "&var", is used to associate each variable in the macro call with its corresponding summary statistics. All the summary statistics for the two drug groups separately and combined are calculated and concatenated within one SQL procedure.

ADVERSE EVENTS TABLES

The summary statistics for the two adverse events tables, Table 2 and Table 3, can be obtained by calling the same single PROC SQL step below.

    %macro zz(inds1=,inds2=,outds=,var=,selectif=,sortord=);
    proc sql;
      create table &outds (drop=totn) as
      /* seq 1: patients with any adverse event, by drug */
      select distinct *,
             round(cnt/
                   case totn
                     when 0 then .
                     else totn
                   end*100) as pct,
             1 as seq
      from (select &inds1..drug,
                   count(distinct patient) as cnt,
                   &inds2..totn
            from raw.&inds1 left join perm.&inds2
              on &inds1..drug=&inds2..drug
            where &selectif
            group by &inds1..drug)
      outer union corresponding
      /* seq 2: patients with adverse events, by drug and &var */
      select distinct *,
             round(cnt/
                   case totn
                     when 0 then .
                     else totn
                   end*100) as pct,
             2 as seq
      from (select &inds1..drug, &var,
                   count(distinct patient) as cnt,
                   &inds2..totn
            from raw.&inds1 left join perm.&inds2
              on &inds1..drug=&inds2..drug
            where &selectif
            group by &inds1..drug, &var)
      order by seq, drug, &sortord;
    quit;
    %mend zz;

    %zz(inds1=ae,inds2=totpat,outds=aebcnt,var=body,
        selectif=%str(period=2),sortord=cnt desc);
    %zz(inds1=ae,inds2=totpat,outds=aebccnt,var=%str(body,costart),
        selectif=%str(period=2),sortord=%str(body, cnt desc));

The first macro call to zz groups adverse events by body system. The second macro call to zz groups adverse events by body system and COSTART term.

The DISTINCT keyword eliminates duplicate rows of data. The LEFT JOIN operator retrieves the matching rows plus the nonmatching rows from the data set specified on the left (raw.&inds1). The ON clause specifies the columns used for matching rows in the two data sets to be joined. The WHERE clause specifies a condition for selecting the data. The OUTER UNION CORRESPONDING operator concatenates results from SELECT statements, similar to using a DATA step with a SET statement. The differences between UNION and OUTER UNION CORRESPONDING are as follows. UNION matches columns in a table expression by ordinal position, keeping the column names from the first table in the result table. OUTER UNION CORRESPONDING, on the other hand, matches columns by column name. In addition, when the OUTER UNION CORRESPONDING operator is used, the nonmatching columns are retained in the result table. The DESC keyword sorts the result table in descending order.
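As an illustration of the body-system query (the second half of the macro), the sketch below counts distinct patients per drug and body system, joins the counts to the drug totals with a LEFT JOIN, and sorts by descending count. It assumes SQLite through Python's sqlite3 module rather than SAS; the tables ae and totpat, their columns and their rows are hypothetical toy data, and OUTER UNION CORRESPONDING is not shown because SQLite does not offer it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table totpat (drug integer, totn integer)")
con.execute(
    "create table ae (patient integer, drug integer, period integer, body text)"
)
# Hypothetical totals per drug group.
con.executemany("insert into totpat values (?, ?)", [(1, 10), (2, 20)])
# Hypothetical adverse event records: (patient, drug, period, body system).
con.executemany(
    "insert into ae values (?, ?, ?, ?)",
    [
        (1, 1, 2, "DIGESTIVE"),
        (1, 1, 2, "DIGESTIVE"),  # repeated event: patient counted once only
        (2, 1, 2, "DIGESTIVE"),
        (2, 1, 2, "SKIN"),
        (3, 1, 1, "SKIN"),       # period 1: excluded by the WHERE clause
        (4, 2, 2, "SKIN"),
    ],
)

aebcnt = con.execute(
    """
    select a.drug, a.body,
           count(distinct a.patient) as cnt,
           round(count(distinct a.patient) * 100.0 /
                 case t.totn when 0 then null else t.totn end) as pct
    from ae as a
    left join totpat as t on a.drug = t.drug
    where a.period = 2
    group by a.drug, a.body, t.totn
    order by a.drug, cnt desc
    """
).fetchall()

for row in aebcnt:
    print(row)
# (1, 'DIGESTIVE', 2, 20.0)
# (1, 'SKIN', 1, 10.0)
# (2, 'SKIN', 1, 5.0)
```

Grouping by t.totn as well as the adverse event columns keeps the query valid in standard SQL, which plays the role that SELECT DISTINCT plays in the macro after the join.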
Two sets of queries, identified by the variable seq as 1 and 2 for report writing purposes, are concatenated by the OUTER UNION CORRESPONDING operator. The first set of queries counts the total number (cnt) of nonduplicating patients with adverse events, merges the results with the totpat permanent data set by drug group, keeping only the rows from the adverse events counts with the LEFT JOIN operator, and calculates the percentages (pct) of cnt. Only patients from the double-blind period (period=2) are selected in the WHERE clause. The second set of queries performs similar calculations, except that the patient counts (cnt) and percentages (pct) are by body system or by body system and COSTART term.

For the Adverse Events tables, the DISTINCT option is used in two different ways: to count the number of nonduplicating patients for each adverse event category, and to eliminate duplicate rows resulting from the LEFT JOIN. The DISTINCT option is particularly useful for counting patients with adverse events, because patients with multiple occurrences of the same adverse event are to be counted once only.

Among the steps mentioned previously, the most important steps saved here are not having to sort the adverse events data and not having to set the sorted data to get the first occurrences of adverse events by patient. The selection of first occurrences of each adverse event, the conditional processing, the sorting, the summing, the calculation of percentages, the concatenation of data sets, and the sorting of the result table by seq, drug, body system, and descending adverse event counts (cnt) can all take place within the same SQL procedure.

SUMMARY

This paper uncovers the potential of PROC SQL as a very useful data summary tool, in addition to being a data retrieval tool. The beauty of PROC SQL lies in the simplicity and resourcefulness of the code. Several steps can be condensed to make one-step programming possible. The tradeoff is that it generally takes more time to write and debug SQL programs, because with PROC SQL the intermediate results from each step are produced internally, and all the query expressions produce a single output table. The programs in this paper were originally developed for an NDA, but the programming logic and techniques can be used for similar data summaries.

(Three sample NDA Integrated Summary of Safety tables are included on the next two pages.)

For additional information, contact:

Aileen L. Yam
Besselaar Associates
210 Carnegie Center
Princeton, NJ 08540-6681
Tel.: (609) 452-4200

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
TABLE 1
SUMMARY OF DEMOGRAPHIC DATA

                          Drug 1       Drug 2       All Drug Groups
Total Patients            849          851          1700

Gender
  Male                    429 (51%)    432 (51%)    861 (51%)
  Female                  420 (49%)    419 (49%)    839 (49%)

Race
  White                   467 (55%)    471 (55%)    938 (55%)
  Black                   362 (43%)    371 (44%)    733 (43%)
  Other                   20 (2%)      9 (1%)       29 (2%)

Age
  Mean                    36.2         37.1         36.8
  Standard Deviation      16.3         16.1         16.2
  Range                   17-69        17-69        17-69
  # Missing               1            2            3

Weight (pounds)
  Mean                    155.2        156.1        155.8
  Standard Deviation      10.6         10.9         10.8
  Range                   96-209       9?-212       96-212
  # Missing               0            1            1

Height (inches)
  Mean                    64.7         65.6         65.2
  Standard Deviation      8.8          9.0          8.9
  Range                   59-72        60-76        59-76
  # Missing               0            0            0


TABLE 2
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM

                                          Drug 1       Drug 2
Total Patients                            849          851
Total Patients with Adverse Events        420 (49%)    320 (38%)

BODY AS A WHOLE                           360 (42%)    277 (33%)
DIGESTIVE SYSTEM                          280 (33%)    230 (27%)
SKIN AND APPENDAGES                       200 (24%)    207 (24%)
RESPIRATORY SYSTEM                        39 (5%)      32 (4%)
CARDIOVASCULAR SYSTEM                     30 (4%)      28 (3%)
ENDOCRINE SYSTEM                          9 (1%)       3 (.4%)
NERVOUS SYSTEM                            2 (.2%)      1 (.1%)
etc.
TABLE 3
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM AND COSTART TERM

                                          Drug 1       Drug 2
Total Patients                            849          851
Total Patients with Adverse Events        420 (49%)    320 (38%)

BODY AS A WHOLE
  HEADACHE                                120 (14%)    110 (13%)
  CHILLS                                  70 (8%)      62 (7%)
  FLU SYNDROME                            64 (8%)      52 (6%)
  ALLERGIC REACTION                       52 (6%)      42 (5%)
  INFECTION                               47 (6%)      30 (4%)
  FEVER                                   18 (2%)      12 (1%)
  PAIN                                    3 (.4%)      1 (.1%)
  Subtotal                                360 (42%)    277 (33%)

DIGESTIVE SYSTEM
  DIARRHEA                                120 (14%)    100 (12%)
  NAUSEA                                  80 (9%)      70 (8%)
  FLATULENCE                              72 (8%)      34 (4%)
  STOMATITIS                              40 (5%)      30 (4%)
  GASTRITIS                               12 (1%)      8 (1%)
  ESOPHAGITIS                             6 (1%)       3 (.4%)
  CONSTIPATION                            2 (.2%)      1 (.1%)
  Subtotal                                280 (33%)    230 (27%)
etc.