SIMPLIFYING NDA PROGRAMMING WITH PROC SQL




Aileen L. Yam, Besselaar Associates, Princeton, NJ

ABSTRACT

The programming of New Drug Application (NDA) Integrated Summary of Safety (ISS) usually involves obtaining patient counts, percentages, and other summary statistics such as mean, standard deviation and range. This paper shows how to obtain all of these results with the SQL procedure. While PROC SQL is often perceived as a data retrieval tool, its unique features allow programmers to write compact code to obtain data summaries for any application similar to the NDA ISS or the safety summary tables in individual new drug studies. This paper also shows that several DATA or other PROC steps can be reduced to one or two steps with PROC SQL.

OVERVIEW

At the end of this paper are three tables that represent the types of most commonly presented summary statistics in safety tables in pharmaceutical research. The data in those tables are fictitious, for illustration purposes only. The three types of summary tables are:

1. counts, percentages, mean, standard deviation, range and missing value frequencies of demographic data;
2. counts and percentages of adverse events by body system;
3. counts and percentages of adverse events by body system and COSTART term.

This paper shows that the summary statistics of each of these three types of tables can be obtained entirely within one or two PROC SQL steps. The unique features of PROC SQL make it possible to reduce many DATA or other PROC steps in summing, grouping, sorting, selecting first occurrences of each subgroup, merging, concatenating, conditional processing, and calculating percentages, mean, standard deviation, range and missing values. Such unique features are boldfaced in the following programs. Repeated uses of the same features in a subsequent program are not boldfaced or discussed again. Variable names, data set names, macro variable names and macro variable references from the programs are italicized in the discussion.
The intention of this paper is not to advocate PROC SQL over DATA steps or other procedures, and there are no benchmark statistics to compare their performance differences. The objective, however, is to present the SQL procedure as a valuable alternative for summarizing data with fewer steps.

TOTAL PATIENT COUNTS

The following SQL procedure obtains total patient counts for drug 1, drug 2, and all drug groups (drug 1 and drug 2 combined, assigned as drug 3 for report writing purposes). Since total patient counts appear in all three summary tables, the counts are calculated once and saved in a permanent data set called totpat.

    %let numdrug=2;
    proc sql;
      create table perm.totpat as
      select drug,
             count(distinct patient) as totn
        from raw.data
        group by drug
      union
      select %eval(&numdrug+1) as drug,
             count(distinct patient) as totn
        from raw.data;

The DISTINCT keyword eliminates duplicate rows before counting. The GROUP BY clause is used to classify patient counts into drug groups. The UNION operator combines two queries, putting the result from the first query on top of the result from the second query. The AS keyword assigns a value or expression to a named variable.

Assume that there are two drug groups, 1 and 2. The first SELECT statement counts the number of nonduplicating patients in each drug group. The second SELECT statement counts the number of nonduplicating patients without grouping patients by drug. Notice that the variable drug is given a value of 3 in the second SELECT statement. Since PROC SQL allows the selection of a literal numeric value or a character string for a variable, any arbitrary value can be assigned. The number of drug groups is set to 2 in the macro variable reference, &numdrug; therefore, the two drug groups combined are assigned as 3, that is, the number of drug groups, &numdrug, plus one. The results from the two SELECT statements are concatenated into a permanent data set, totpat. Totpat consists of patient counts in drug 1, drug 2, as well as in drug 1 and drug 2 combined. The macro variable reference, &numdrug, can be adjusted according to the number of drug groups in a study.

Several steps are saved. The data do not need to be sorted by drug group and patient. There is no need to set the data by the sorted variables into a DATA step to get the first observation of each patient for a count of nonduplicating patients. There is no need to count the patients in two steps, one by drug group and the other without drug group. The resulting counts will not need to be passed into another DATA step to be concatenated together, or to be reorganized by _TYPE_ if a PROC step for summary statistics is used.
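The counting pattern described above (distinct patients per drug group, plus one combined row stacked on with UNION) can be tried outside SAS. The following is only a minimal sketch, assuming Python's built-in sqlite3 module in place of PROC SQL; the table contents are hypothetical, and NUMDRUG merely stands in for the &numdrug macro variable.

```python
import sqlite3

# Hypothetical miniature of raw.data: one row per record, so patients repeat.
rows = [
    (1, 101), (1, 101), (1, 102),
    (2, 201), (2, 202), (2, 202), (2, 203),
]

NUMDRUG = 2  # stands in for the &numdrug macro variable (assumption)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (drug INTEGER, patient INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# First SELECT: distinct patients per drug group.
# Second SELECT: distinct patients overall, labelled as drug NUMDRUG + 1.
totpat = conn.execute(
    """
    SELECT drug, COUNT(DISTINCT patient) AS totn
      FROM data
     GROUP BY drug
    UNION
    SELECT ? AS drug, COUNT(DISTINCT patient) AS totn
      FROM data
     ORDER BY 1
    """,
    (NUMDRUG + 1,),
).fetchall()

print(totpat)  # [(1, 2), (2, 3), (3, 5)]
```

The repeated rows for patients 101 and 202 are counted once each, and no prior sort or first-observation step is required.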
DEMOGRAPHIC TABLE

The demographic table consists of two parts, so two SQL procedures are written. The first SQL procedure generates counts (cnt) and percentages (pct) of gender and race groups in Table 1.

    %macro xx(outds=,var=);
    proc sql;
      create table &outds as
      select *,
             round(cnt/
                   case sum(cnt)
                     when 0 then .
                     else sum(cnt)
                   end*100) as pct
        from (select drug, &var,
                     count(distinct patient) as cnt
                from raw.data
                group by drug, &var)
        group by drug
      union
      select *,
             round(cnt/
                   case sum(cnt)
                     when 0 then .
                     else sum(cnt)
                   end*100) as pct
        from (select %eval(&numdrug+1) as drug, &var,
                     count(distinct patient) as cnt
                from raw.data
                group by &var)
      order by 1, 2;
    %mend xx;

    %xx(outds=gencnt,var=gender);
    %xx(outds=racecnt,var=race);

In the macro calls to xx, there are two major queries joined by the UNION operator. In each of these queries, a subquery is used by nesting the second SELECT statement within the first SELECT statement. The CASE expression is used to perform conditional processing. The SUM function is used to calculate the grand total for the denominator. The ORDER BY clause sorts the results by the order-by items in a default sequence, from the lowest value to the highest value.

An asterisk (*) after the SELECT statement in the outer query indicates that all the values, drug, &var and cnt, returned by the second SELECT statement are used. In the second SELECT statement, the number (cnt) of nonduplicating patients is counted by drug group and by the macro variable reference, &var, when it is resolved. Percentages (pct) are calculated in the outer query using cnt as numerator and the SUM of cnt as denominator. The CASE expression is used to prevent an error message when the denominator is zero. WHEN the SUM of cnt is zero, then it is set to missing, ELSE the SUM of cnt is the denominator. Similar calculations are done after the UNION operator without grouping patients by drug. Thus, the counts (cnt) and percentages (pct) of gender and race for the two drug groups separately and combined are obtained.

The results are ordered by the values in the first and second columns, as indicated by 1 and 2 in the ORDER BY clause. The first column is the first variable specified in the SELECT statement, and the first variable is drug. Similarly, the second column refers to the second variable in the SELECT statement, and the second variable is a macro variable that varies depending on the values supplied in the macro calls. In other words, the results are ordered by drug and gender in the first macro call, and by drug and race in the second macro call.

Besides the steps mentioned under the Total Patient Counts section, two additional steps are saved. One is not having to pass the patient counts into a DATA step for calculating percentages. The other is not having to sort the result table in ascending order.
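The percentage calculation with a guarded denominator can be illustrated with a small runnable sketch. It assumes Python's built-in sqlite3 module rather than SAS, and the data are hypothetical. SAS PROC SQL remerges SUM(cnt) back onto each detail row automatically; SQLite does not, so this sketch swaps in an explicit join to a per-drug total. Only the per-drug half of the query is shown; the combined group is left out for brevity.

```python
import sqlite3

# Hypothetical records: (drug, gender, patient); patient 101 appears twice.
rows = [
    (1, "F", 101), (1, "F", 101), (1, "M", 102), (1, "M", 103), (1, "M", 104),
    (2, "F", 201), (2, "M", 202),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (drug INTEGER, gender TEXT, patient INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

gencnt = conn.execute(
    """
    WITH counts AS (              -- inner query: distinct patients per cell
        SELECT drug, gender, COUNT(DISTINCT patient) AS cnt
          FROM data
         GROUP BY drug, gender
    )
    SELECT c.drug, c.gender, c.cnt,
           ROUND(c.cnt * 100.0 /
                 CASE t.tot WHEN 0 THEN NULL ELSE t.tot END) AS pct
      FROM counts AS c
      JOIN (SELECT drug, SUM(cnt) AS tot   -- denominator: SUM of cnt per drug
              FROM counts
             GROUP BY drug) AS t
        ON c.drug = t.drug
     ORDER BY 1, 2
    """
).fetchall()

for row in gencnt:
    print(row)
```

The CASE expression plays the same role as in the macro: a zero denominator becomes a missing value (NULL here) instead of raising a division error.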
The second SQL procedure generates mean, standard deviation, range and number of missing values of age, weight and height in Table 1.

    %macro yy(outds=,var=);
    proc sql;
      create table &outds as
      select drug,
             "&var" as var,
             mean(&var) as mean,
             std(&var) as std,
             min(&var) as min,
             max(&var) as max,
             nmiss(&var) as nmiss
        from raw.data
        group by drug
      union
      select %eval(&numdrug+1) as drug,
             "&var" as var,
             mean(&var) as mean,
             std(&var) as std,
             min(&var) as min,
             max(&var) as max,
             nmiss(&var) as nmiss
        from raw.data;
    %mend yy;

    %yy(outds=agestat,var=age);
    %yy(outds=wtstat,var=weight);
    %yy(outds=htstat,var=height);

In the macro calls to yy, the functions MEAN, STD, MIN, MAX and NMISS are used to calculate summary statistics. The character string, when resolved from the macro variable reference, &var, is used to associate each variable in the macro call with its corresponding summary statistics. All the summary statistics for the two drug groups separately and combined are calculated and concatenated within one SQL procedure.

ADVERSE EVENTS TABLES

The summary statistics for the two adverse events tables, Table 2 and Table 3, can be obtained by calling the same single PROC SQL statement below.

    %macro zz(inds1=,inds2=,outds=,var=,selectif=,sortord=);
    proc sql;
      create table &outds (drop=totn) as
      select distinct *,
             round(cnt/
                   case totn
                     when 0 then .
                     else totn
                   end*100) as pct,
             1 as seq
        from (select &inds1..drug,
                     count(distinct patient) as cnt,
                     &inds2..totn
                from raw.&inds1 left join perm.&inds2
                  on &inds1..drug=&inds2..drug
                where &selectif
                group by &inds1..drug)
      outer union corresponding
      select distinct *,
             round(cnt/
                   case totn
                     when 0 then .
                     else totn
                   end*100) as pct,
             2 as seq
        from (select &inds1..drug, &var,
                     count(distinct patient) as cnt,
                     &inds2..totn
                from raw.&inds1 left join perm.&inds2
                  on &inds1..drug=&inds2..drug
                where &selectif
                group by &inds1..drug, &var)
      order by seq, drug, &sortord;
    %mend zz;

    %zz(inds1=ae,inds2=totpat,outds=aebcnt,var=body,
        selectif=%str(period=2),sortord=cnt desc);
    %zz(inds1=ae,inds2=totpat,outds=aebccnt,var=%str(body,costart),
        selectif=%str(period=2),sortord=%str(body, cnt desc));

The first macro call to zz groups adverse events by body system. The second macro call to zz groups adverse events by body system and COSTART term.

The DISTINCT keyword eliminates duplicate rows of data. The LEFT JOIN operator retrieves matching rows and nonmatching rows based on the data specified on the left (raw.&inds1). The ON clause specifies the columns for matching rows in two data sets to be joined. The WHERE clause specifies a condition for selecting the data. The OUTER UNION CORRESPONDING operator concatenates results from SELECT statements, similar to using a DATA step with a SET statement. The differences between UNION and OUTER UNION CORRESPONDING are: UNION matches columns in a table expression by ordinal position, keeping the column names in the result table from the first table. OUTER UNION CORRESPONDING, on the other hand, matches columns by column names. In addition, when the OUTER UNION CORRESPONDING operator is used, the non-matching columns are retained in the result table. The DESC keyword sorts the result table in descending order.
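The core of the adverse events query (count each patient once per body system, attach the drug-group total with a left join, and sort by descending count) can be sketched as follows. This is an illustration only, assuming Python's built-in sqlite3 module instead of SAS; the tables ae and totpat hold hypothetical rows, and OUTER UNION CORRESPONDING is not shown because SQLite has no equivalent operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE totpat (drug INTEGER, totn INTEGER);
    INSERT INTO totpat VALUES (1, 4), (2, 5);

    CREATE TABLE ae (drug INTEGER, patient INTEGER, body TEXT, period INTEGER);
    INSERT INTO ae VALUES
      (1, 101, 'DIGESTIVE SYSTEM', 2),
      (1, 101, 'DIGESTIVE SYSTEM', 2),   -- repeat event, same patient
      (1, 102, 'DIGESTIVE SYSTEM', 2),
      (1, 102, 'BODY AS A WHOLE', 2),
      (1, 103, 'BODY AS A WHOLE', 1),    -- outside period 2, filtered out
      (2, 201, 'BODY AS A WHOLE', 2);
    """
)

aebcnt = conn.execute(
    """
    SELECT a.drug, a.body,
           COUNT(DISTINCT a.patient) AS cnt,
           ROUND(COUNT(DISTINCT a.patient) * 100.0 /
                 CASE t.totn WHEN 0 THEN NULL ELSE t.totn END) AS pct
      FROM ae AS a
      LEFT JOIN totpat AS t ON a.drug = t.drug
     WHERE a.period = 2
     GROUP BY a.drug, a.body, t.totn
     ORDER BY a.drug, cnt DESC
    """
).fetchall()

for row in aebcnt:
    print(row)
```

Patient 101 has two digestive events but contributes one to cnt, and the percentage uses the drug-group total from totpat rather than the number of patients with events.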
Two sets of queries, identified by the variable seq as 1 and 2 for report writing purposes, are concatenated by the OUTER UNION CORRESPONDING operator. The first set of queries counts the total number (cnt) of nonduplicating patients with adverse events, merges the results with the totpat permanent data set by drug group, keeping only the rows from the adverse events counts with the LEFT JOIN operator, and calculates the percentages (pct) of cnt. Only patients from the double-blind period (period=2) are selected in the WHERE clause. The second set of queries performs similar calculations, except that the patient counts (cnt) and percentages (pct) are by body system or by body system and COSTART term.

For the Adverse Events tables, the DISTINCT option is used in two different ways: to count the number of nonduplicating patients for each adverse event category, and to eliminate duplicate rows as a result of the LEFT JOIN. The DISTINCT option is particularly useful for counting patients with adverse events, because patients with multiple occurrences of the same adverse event are to be counted once only.

Among the steps mentioned previously, the most important steps saved here are not having to sort the adverse events data and to set the sorted data to get the first occurrences of adverse events by patient. The selection of first occurrences of each adverse event, the conditional processing, the sorting, the summing, the calculation of percentages, the concatenation of data sets, and the sorting of the result table by seq, drug, body system, and by descending adverse event counts (cnt) can all take place within the same SQL procedure.

SUMMARY

This paper uncovers the potential of PROC SQL as a very useful data summary tool, in addition to being a data retrieval tool. The beauty of PROC SQL lies in the simplicity and resourcefulness of the code. Several steps can be condensed to make one-step programming possible. The tradeoff is that it generally takes more time to write and debug SQL programs, because with PROC SQL, the intermediate results from each step take place internally, and all the query expressions produce a single output table. The programs in this paper were originally developed for an NDA, but the programming logic and techniques can be used for similar data summaries.

(Three sample NDA Integrated Summary of Safety tables are included on the next two pages.)

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

For additional information, contact:
Aileen L. Yam
Besselaar Associates
210 Carnegie Center
Princeton, NJ 08540-6681
Tel.: (609) 452-4200

TABLE 1
SUMMARY OF DEMOGRAPHIC DATA

                          Drug 1       Drug 2       All Drug Groups
Total Patients            849          851          1700
Gender
  Male                    429 (51%)    432 (51%)    861 (51%)
  Female                  420 (49%)    419 (49%)    839 (49%)
Race
  White                   467 (55%)    471 (55%)    938 (55%)
  Black                   362 (43%)    371 (44%)    733 (43%)
  Other                   20 (2%)      9 (1%)       29 (2%)
Age
  Mean                    36.2         37.1         36.8
  Standard Deviation      16.3         16.1         16.2
  Range                   17-69        17-69        17-69
  # Missing               1            2            3
Weight (pounds)
  Mean                    155.2        156.1        155.8
  Standard Deviation      10.6         10.9         10.8
  Range                   96-209       98-212       96-212
  # Missing               0            1            1
Height (inches)
  Mean                    64.7         65.6         65.2
  Standard Deviation      8.8          9.0          8.9
  Range                   59-72        60-76        59-76
  # Missing               0            0            0

TABLE 2
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM

                                          Drug 1       Drug 2
Total Patients                            849          851
Total Patients with Adverse Events        420 (49%)    320 (38%)
BODY AS A WHOLE                           360 (42%)    277 (33%)
DIGESTIVE SYSTEM                          280 (33%)    230 (27%)
SKIN AND APPENDAGES                       200 (24%)    207 (24%)
RESPIRATORY SYSTEM                        39 (5%)      32 (4%)
CARDIOVASCULAR SYSTEM                     30 (4%)      28 (3%)
ENDOCRINE SYSTEM                          9 (1%)       3 (.4%)
NERVOUS SYSTEM                            2 (.2%)      1 (.1%)
etc.

TABLE 3
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM AND COSTART TERM

                                          Drug 1       Drug 2
Total Patients                            849          851
Total Patients with Adverse Events        420 (49%)    320 (38%)
BODY AS A WHOLE
  HEADACHE                                120 (14%)    110 (13%)
  CHILLS                                  70 (8%)      62 (7%)
  FLU SYNDROME                            64 (8%)      52 (6%)
  ALLERGIC REACTION                       52 (6%)      42 (5%)
  INFECTION                               47 (6%)      30 (4%)
  FEVER                                   18 (2%)      12 (1%)
  PAIN                                    3 (.4%)      1 (.1%)
  Subtotal                                360 (42%)    277 (33%)
DIGESTIVE SYSTEM
  DIARRHEA                                120 (14%)    100 (12%)
  NAUSEA                                  80 (9%)      70 (8%)
  FLATULENCE                              72 (8%)      34 (4%)
  STOMATITIS                              40 (5%)      30 (4%)
  GASTRITIS                               12 (1%)      8 (1%)
  ESOPHAGITIS                             6 (1%)       3 (.4%)
  CONSTIPATION                            2 (.2%)      1 (.1%)
  Subtotal                                280 (33%)    230 (27%)
etc.