Eliminating Tedium by Building Applications that Use SQL Generated SAS Code Segments

David A. Mabey, Reader's Digest Association Inc., Pleasantville, NY

ABSTRACT

When SAS applications are driven by data-generated code segments, the programmer or operator is freed from tedious code modifications. SAS/SQL now adds powerful ways of generating these segments as simple macrovariables, macrovariable "arrays", and macrovariables that contain delimited strings up to 32K long. This paper will discuss three business needs and propose solutions that use SQL in conjunction with other code generation techniques:

1. A macro that generates an MS/EXCEL friendly ".txt" file (tab delimited) from any SAS dataset (%FLATFILE revisited). This solution is driven by the metadata in DICTIONARY.COLUMNS.
2. A macro that generates "dummy variables" with labels. This solution is driven by the data in the dataset or a user written format.
3. An application that does a full outer join on two or more datasets that have the same structure but different data. This solution can be either table or Frame driven.

GOOD NEWS - BAD NEWS

Most of us are familiar with the old "good news / bad news" scenarios. For example, a man was flying in an airplane when:

BN A wing fell off
GN He had a parachute
BN The parachute did not open
GN He was over a soft, fluffy haystack
BN There was a pitchfork in the haystack
GN He missed the pitchfork
BN He missed the haystack

More to the point, a "good news / bad news" scenario might go like this. Business (or research) often needs large volumes of data:

GN Since the advent of the computer, we have amassed huge volumes of data.
BN These data are often scattered all over, in different formats and on different platforms, but seldom, it seems, in the form needed.
GN SAS is very good for extracting, transforming, manipulating, and analyzing most forms of data.
BN Performing these operations on many different data structures can become very tedious.
GN SAS tools permit users to eliminate much of this tedium by building applications that generate code that is specific to the data, metadata, and tasks that are being manipulated.

A SURVEY OF CODE GENERATING TOOLS

Since no single tool will fit every application need, we will begin with a survey of common tools. Sometimes we "mix and match" tools as required for a specific application.

SAS built-in capabilities: Examples of built-in capabilities might be selecting variables by category or by a variable range, e.g. _NUMERIC_ or FIRST--LAST. Or we can copy code segments from a SAS catalog or include them from an external file. We can use the new import and export Wizards or pick from a multitude of data engines that make SAS interface smoothly with the rest of the world. There are new SAS products, like the SAS Data Warehouse Administrator, that simplify many data administration tasks. Views and data templates can change the way the data are presented to the application.

The Macro Facility: This old standby for code segment generation can be broken down into two major components: MACROs and MACROVARIABLEs. Macros are programs that generate code. Macrovariables are named character strings up to 32K characters long that can contain code segments.

SCL: Formerly Screen Control Language, SCL provides a convenient way to build OOP and/or GUI based applications that generate and submit code segments.

Data Step Code Generators: We can generate code based upon the contents of a dataset with a DATA step that puts the code out to a file to be included later. Alternatively, we can use CALL EXECUTE so that the code runs immediately after the DATA step finishes.
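As an illustrative sketch (not one of the paper's examples), a DATA step code generator might read dataset names from a hypothetical control table WORK.DSLIST and print each one with CALL EXECUTE:

data _null_;
   set work.dslist;    /* hypothetical control table with a MEMNAME column */
   call execute('proc print data=work.' || trim(memname) || '; run;');
run;

The generated PROC PRINT steps execute as soon as the DATA step above finishes.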
SQL: Like the DATA step code generators, SQL provides a way to generate macrovariable arrays or macrovariables containing delimited strings. Although this technique lacks the flexibility of some of the other techniques, it provides a remarkably quick and simple way to get information out of a dataset (or, if you prefer, a table) and into a SAS code segment.
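For example, a single SELECT ... INTO query can pull variable names out of DICTIONARY.COLUMNS and drop them into a later step. This sketch (illustrative only, assuming a dataset WORK.DAT1) builds a space-delimited list of the numeric variables and then analyzes them:

proc sql noprint;
   select name into :numlist separated by ' '
      from dictionary.columns
      where libname='WORK' and memname='DAT1' and type='num';
quit;

proc means data=work.dat1;
   var &numlist;     /* resolves to the generated list of numeric variables */
run;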
PREREQUISITES: To get the most from this paper, you will need an understanding of basic SAS syntax and the SAS macro facility. SAS/SQL is based on the ANSI SQL-92 standard and may be somewhat unfamiliar to ANSI-87 SQL users. And to make things better, or worse depending upon the point of view, SAS has enhanced its SQL with many features unique to SAS, making it one of the most powerful and underutilized SQLs on the market. The examples here require SAS version 6.11 or higher on machines supporting those versions, and SAS version 6.09e on other machines. For clarity, the example code has been stripped of the error condition tests that would be required for a robust application. The focus is on coding technique, not program efficiency. The author feels these examples give acceptable performance with reasonable dataset size and hardware, but perceptions of what is "acceptable" and "reasonable" vary widely. The Appendix contains a macro for building some test datasets of various sizes.

%FLATFILE REVISITED

The popular macro %FLATFILE has been presented at several SAS user group conferences. (See H. Ian Whitlock's paper "How to Write A Macro to Make External Flat Files" from SUGI 19 and M. Michelle Buchecker's paper "%FLATFILE, and Make Your Life Easier" from SUGI 21.) It reads a SAS dataset and produces a "flat" or "text" file that can be imported into some other application. Most versions of %FLATFILE produce column delimited output, but the macro can be easily modified to produce either comma delimited or tab delimited files. The %FLATFILE macro uses metadata from the input dataset to generate a PUT statement to output the "flattened" data. An early version of %FLATFILE used the metadata output from a PROC CONTENTS, a technique that is still necessary if the input resides on tape. But more recent versions use the DICTIONARY.COLUMNS table or SASHELP.VCOLUMN view to read the input dataset's metadata.

%FLATFILA

We'll call the first revision of this macro FLATFILA. We will use SQL to generate sequentially numbered macro variables, sometimes referred to as a macro variable array or stub variables.

%macro flatfila (lib=,   /* libref for input dataset  */
                 dsn=,   /* memname for input dataset */
                 file=); /* fileref for output file   */
   %let lib=%upcase(&lib);
   %let dsn=%upcase(&dsn);
   proc sql noprint;
      select name,
             case when format ne ' ' then format
                  when type = 'num'  then 'BEST10.'
                  else '$' || put(length,z3.) || '.'
             end
         into :var1-:var9999, :fmt1-:fmt9999
         from dictionary.columns
         where libname = "&lib" and memname = "&dsn";
   quit;
   data _null_;
      set &lib..&dsn;
      file &file;
      put %do i = 1 %to &sqlobs;
             &&var&i &&fmt&i +1
          %end;
      ;   /* end of the PUT statement */
   run;
%mend flatfila;

- The output file needs to be preassigned using a FILENAME statement. You can FTP the results directly to a remote server by using the FTP filename engine.
- NAME will simply be copied into the macro variable.
- We can manipulate the format to be used on the output flatfile. For example, we may need to change the way date values are presented to meet the date input requirement of another application. In this example, however, the FORMAT will be handled in one of three ways: 1) If the variable has a format assigned in the dataset, then that is the format used on the flatfile. 2) If a numeric variable has no assigned format, then it is given a format of BEST10. Note that the TYPE value "num" is in lower case, an exception to the general rule that dictionary tables use uppercase. 3) Character variables with no assigned format are given a format the same length as the variable.
- Note how we use concatenation and a SAS function to build the format.
- The first variable name will be placed in macrovariable VAR1, the second variable name will be placed in VAR2, and so on up through the number of variables on the dataset. Macro variables will not be generated beyond the scope of the metadata; if there are 3 variables on the dataset, then macrovariables VAR1, VAR2, and VAR3 will be generated, but VAR4 through VAR9999 will not be created. By default, the macrovariables are LOCAL to the macro.
- DICTIONARY.COLUMNS is the source of the metadata. We could have used SASHELP.VCOLUMN, the output from PROC CONTENTS, or even a special table that we maintain just for that purpose. In this case we use the WHERE clause to find the metadata in the table that applies to the input dataset. The WHERE clause could be expanded to subset the variables to be included or excluded.
- The macrovariable SQLOBS is automatically generated by SQL and here contains the number of macrovariables in the "array". In some applications we store this value in a "stub-zero" macrovariable, because SQLOBS is reset each time a SELECT statement is executed, e.g. %let var0 = &sqlobs;.
- For each iteration of I, a variable name/format pair is added to the PUT statement. The +1 will separate the columns on the output file by at least one space. The columns could be separated by a comma or a tab by replacing the +1 with ',' or '09'x.
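A typical call might look like the following sketch (the fileref, path, and dataset are illustrative):

filename outtxt 'flatfila.txt';              /* preassign the output file */
%flatfila(lib=work, dsn=dat1, file=outtxt);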
%FLATFILB

The next example, FLATFILB, is very similar except that we build the PUT statement code in the SQL instead of with a macro loop. Instead of an array of macrovariables, we have one macro variable that contains a long string of characters, in this case variable name/format pairs separated by +1. Until Version 7 is released, we are limited to 200 characters per observation, but the macrovariable itself can be up to 32K characters long. The output from FLATFILB is identical to the output from FLATFILA; so is much of the code.

%macro flatfilb (lib=,   /* libref for input dataset  */
                 dsn=,   /* memname for input dataset */
                 file=); /* fileref for output file   */
   %let lib=%upcase(&lib);
   %let dsn=%upcase(&dsn);
   proc sql noprint;
      select name || ' ' ||
             case when format ne ' ' then format
                  when type = 'num'  then 'BEST10.'
                  else '$' || put(length,z3.) || '.'
             end
         into :string separated by ' +1 '
         from dictionary.columns
         where libname = "&lib" and memname = "&dsn";
   quit;
   data _null_;
      set &lib..&dsn;
      file &file;
      put &string;
   run;
%mend flatfilb;

- In this example, we concatenate the variable name and the format into pairs. Using the SEPARATED BY clause, we build +1 into the string between each pair. Once again this does not have to be +1; it could be a comma or a tab.
- Just include the macrovariable that contains the string in the code. There is nothing in this example that cannot be done in open code (it does not have to be wrapped in a macro).

%FLATFILC

%FLATFILC is really not so much a flat file as it is an MS/EXCEL friendly output file format. The variable names are placed over the columns in the first row, the variable labels are in the second row, and the columns are tab delimited. The code is similar to FLATFILB, but we build three separate long string macrovariables.

%macro flatfilc (lib=,   /* libref for input dataset  */
                 dsn=,   /* memname for input dataset */
                 file=); /* filename of output file   */
   %let lib=%upcase(&lib);
   %let dsn=%upcase(&dsn);
   proc sql noprint;
      select quote(name),
             quote(case when label ne ' ' then label else name end),
             name || ' ' ||
             case when format ne ' ' then format
                  when type = 'num'  then 'BEST10.'
                  else '$' || put(length,z3.) || '.'
             end
         into :names  separated by ' "09"x ',
              :labels separated by ' "09"x ',
              :string separated by ' "09"x '
         from dictionary.columns
         where libname = "&lib" and memname = "&dsn";
   quit;
   data _null_;
      set &lib..&dsn;
      file "&file";
      if _n_=1 then put &names / &labels;
      put &string;
   run;
%mend flatfilc;

- We use the QUOTE function so that the macrovariable string will contain the variable names in quotes. Therefore, the PUT statement will output the variable names, not the variables' values. The variable names are not quoted on the output file.
- If a variable has no label, then the variable name is repeated on the second line of the output file.
- The IF _N_=1 statement puts the first two lines to the output file.

DUMMY VARIABLES

A dummy variable, for our purpose here, is a true/false variable representing one of the possible states of the original variable. For example, if the variable STATE can take a value of NY, NJ, or CT, then it can be replaced by three dummy variables, VAR_NY, VAR_NJ, and VAR_CT. Dummy variables are useful in building statistical models and reports. But when building regression models with dummy variables, we should use the DESIGN option in PROC TRANSREG. This option produces three-value dummies (-1, 0, 1) and drops one variable, so we end up with k-1 variables, where k is the number of unique values in the original variable. Using the PROC TRANSREG method prevents oversaturation of the model and erroneous results.
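Hand-coded in a DATA step, the STATE example might look like this sketch (illustrative only; the macros below generate equivalent code for us):

data with_dummies;           /* hypothetical output dataset */
   set dat1;
   ct = (state = 'CT');      /* each comparison yields 1 (true) or 0 (false) */
   nj = (state = 'NJ');
   ny = (state = 'NY');
run;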
But it can be a challenge to build dummy variables, and all too often the meaning of the variable gets lost in the process. And for some reports, it is desirable to have a dummy variable for every value the original variable can take, as defined by a FORMAT, regardless of whether that value occurs in the dataset. The following two macros address these issues: %BLD_DUMY will read all values of a variable in a dataset and build a dummy variable for each discrete value. %FMT_DUMY will build a dummy variable for each LABEL in a user defined format. Let's work backwards for a while, looking first at the output from running %BLD_DUMY, then at the generated code that produced the output, and lastly at the macro that generated the code that produced the output.
The output dataset (transposed): The input dataset is DAT1, which was generated by %BLD_SAMP in the appendix of this paper. The output below is produced by:

%bld_samp(num_obs=100,num_ds=1);
%bld_dumy(inds=dat1,var=state,fmt=$.,outds=out1);

CT, NJ, and NY are the dummy variables built from the data in STATE. The other 6 variables are simply brought forward from the original data. Note that only three states are represented in the data.

variable  label                              first obs
CT        state $. value of CT               1
NJ        state $. value of NJ               0
NY        state $. value of NY               0
OBS_ID    observation identification number  a7398502792
STATE     state of residence                 CT
DOB       date of birth                      27APR38
DFP       date of first purchase             03JUN38
NOP       number of purchases                1
TOT_DOL   total dollars spent                $438.55

The generated code that produced the output:

PROC SQL NOPRINT;
CREATE TABLE OUT1 AS
SELECT PUT(STATE,$.) = "CT" AS CT LABEL="state $. value of CT",
       PUT(STATE,$.) = "NJ" AS NJ LABEL="state $. value of NJ",
       PUT(STATE,$.) = "NY" AS NY LABEL="state $. value of NY",
       *
FROM DAT1;
QUIT;

The first line of the SELECT statement reads as follows: if the STATE variable, when formatted as $., is "CT", then store 1 in the variable named CT, otherwise store a 0. The label for CT is "state $. value of CT".

The macro that generated the code that produced the output (%BLD_DUMY):

%macro bld_dumy (inds=,   /* input dataset     */
                 var=,    /* variable to dummy */
                 fmt=,    /* format to use     */
                 outds=); /* output dataset    */
   proc sql noprint;
      select distinct(put(&var,&fmt)) into :val1 - :val99999
         from &inds;
      create table &outds as
         select %do i=1 %to &sqlobs;
                   put(&var,&fmt) = "&&val&i" as &&val&i
                      label="&var &fmt value of &&val&i",
                %end;
                *
         from &inds;
   quit;
%mend bld_dumy;

- The first SELECT builds a macrovariable array that contains all the unique values for STATE in the data.
- The CREATE TABLE builds the dummy variables using the macrovariable array that was just created.

A USER WRITTEN FORMAT

Suppose we want to build dummy variables based on a user written format. In this example we are assigning each observation one of five groups based upon TOT_DOL. The PROC FORMAT looks like this:

proc format;
   value val_grp
      low -< 0    = 'NEG'
      0   -< 50   = 'POOR'
      50  -< 150  = 'GOOD'
      150 -< 500  = 'BETTER'
      500 -  high = 'BEST'
   ;
run;

So when we run:

%bld_dumy(inds=dat1,var=tot_dol,fmt=val_grp.,outds=out2);

We get this output:

variable  label                              first obs
BEST      tot_dol val_grp. value of BEST     0
BETTER    tot_dol val_grp. value of BETTER   1
GOOD      tot_dol val_grp. value of GOOD     0
POOR      tot_dol val_grp. value of POOR     0
OBS_ID    observation identification number  a7398502792
STATE     state of residence                 CT
DOB       date of birth                      27APR38
DFP       date of first purchase             03JUN38
NOP       number of purchases                1
TOT_DOL   total dollars spent                $438.55

%FMT_DUMY

Notice that there are only four dummy variables, even though there are five different labels in the format. This is because there are no negative values in the data. But what if, for report consistency, we wanted a dummy variable for the fifth group? We can force that to occur by driving the macro from the format instead of from the data. Consider the following code:

%macro fmt_dumy (inds=,   /* input dataset     */
                 var=,    /* variable to dummy */
                 fmt=,    /* format to use     */
                 outds=); /* output dataset    */
   proc format cntlout=fmt_cntl;
      select %scan(&fmt,1);
   run;
   proc sql noprint;
      select distinct label into :val1 - :val99999
         from fmt_cntl;
      create table &outds as
         select %do i=1 %to &sqlobs;
                   put(&var,&fmt) = "&&val&i" as &&val&i
                      label="&var &fmt value of &&val&i",
                %end;
                *
         from &inds;
   quit;
%mend fmt_dumy;

- The PROC FORMAT CNTLOUT= step extracts the metadata from the format.
- The %SCAN function removes the "." from the format name so that the PROC FORMAT SELECT statement receives a bare format name.
- The SELECT on the CNTLOUT= dataset uses the extracted format metadata to build the macrovariable array.

Now when we run:

%fmt_dumy(inds=dat1,var=tot_dol,fmt=val_grp.,outds=out3);

We get all five dummy variables:

variable  label                              first obs
BEST      tot_dol val_grp. value of BEST     0
BETTER    tot_dol val_grp. value of BETTER   1
GOOD      tot_dol val_grp. value of GOOD     0
NEG       tot_dol val_grp. value of NEG      0
POOR      tot_dol val_grp. value of POOR     0
OBS_ID    observation identification number  a7398502792
STATE     state of residence                 CT
DOB       date of birth                      27APR38
DFP       date of first purchase             03JUN38
NOP       number of purchases                1
TOT_DOL   total dollars spent                $438.55

JOINING DATASETS

In the next example, we will build code to do full outer joins on multiple datasets. We will use data generated by %BLD_SAMP (see the appendix of this paper), which simulates several "point in time" or "snapshot" datasets. The object is to create a macro to join several datasets on keys which have unique values within each dataset, but may not have been assigned a value on all datasets (referential integrity not guaranteed). The macro will be driven by two tables: the directory table, which contains the names of the datasets to be joined, and the variable table, which contains the names of the variables to be saved. Since the variables have the same names and labels on all the input datasets, new variable names and labels will be generated from information in the driving tables. For this example, both the directory and the variable tables are hardcoded. In an actual application these tables would be maintained on the system. The driver tables can be manipulated using any of a number of alternate techniques.

data dirtable (label="directory table - example code");
   length libname memname $8 prefix $4;
   input libname $ memname $ prefix $;
   cards;
work dat1 d1
work dat2 d2
work dat3 d3
work dat4 d4
work dat5 d5
;;;;

data vartable (label="variable table - example code");
   length label $22 var $8 suffix $4;
   input var $ suffix $ label $22.;
   cards;
NOP PRCS Num of Purchases
TOT_DOL DOL Dollars Spent to Date
;;;;

- LIBNAME and MEMNAME identify the datasets to be joined. In some applications, it may be desirable to have "PATH" or other descriptive information.
- The prefix will be the first four characters of the new variable name and will identify which dataset contributed the value. The first character must be a letter or an underscore.
- VAR names the variable that exists on all the input datasets and is to be included on the output dataset.
- The suffix is the final character string of the new variable name.
- The label will be combined with dataset information to document the variable's meaning.

%macro fulljoin (DIRTABLE=dirtable,  /* Directory Table */
                 VARTABLE=vartable,  /* Variable Table  */
                 OUTTABLE=outtable,  /* Output Table    */
                 KEY     =obs_id );  /* Join Key        */
   proc sql noprint;
      select var, suffix, label
         into :var1  - :var9999,
              :sufx1 - :sufx9999,
              :labl1 - :labl9999
         from &vartable;
      %let numvar = &sqlobs;
      select compress(libname||'.'||memname),
             prefix,
             memname
         into :tblref1 - :tblref99,
              :prefix1 - :prefix99,
              :mn1     - :mn99
         from &dirtable;
      %let numdir = &sqlobs;
   quit;
   %do id=1 %to &numdir;
      proc sort data=&&tblref&id;
         by &key;
      run;
   %end;
   data &outtable (sortedby=&key);
      merge
      %do id=1 %to &numdir;
         &&tblref&id (keep = &key
                      %do iv=1 %to &numvar;
                         &&var&iv
                      %end;
                      rename = (
                      %do iv=1 %to &numvar;
                         &&var&iv = &&prefix&id.&&sufx&iv
                      %end;
                      ))
      %end;
      ;
      %do id=1 %to &numdir;
         %do iv=1 %to &numvar;
            label &&prefix&id.&&sufx&iv = "&&prefix&id &&labl&iv";
         %end;
      %end;
      by &key;
   run;
%mend fulljoin;

- The first SELECT builds the macrovariable array pertaining to the variables to be included.
- The second SELECT builds the macrovariable array pertaining to the datasets to be used.
- The PROC SORT loop sorts the datasets by the key.
- The DATA statement creates the output dataset. Since we know the sort order, we use the SORTEDBY= option to save sorting on future steps.
- The MERGE loop puts the input datasets in the MERGE statement, lists the KEEP variables, and assigns new variable names as part of the input.
- The LABEL loop builds the new labels for the variables.

Author's note

The author has built several versions of the macros in this paper. The examples used have been stripped of the checking and the "bells and whistles" that make macros robust and useful, but harder to follow. For example, %FULLJOIN can be made to perform calculations on data across datasets. The techniques shown here have been tested on OS/2 and UNIX systems.

CONCLUSION

Relatively simple SQL queries can be used in conjunction with other SAS code generating techniques to save time and effort. And it can be kind of fun... when it works!

TRADEMARKS

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. MS/EXCEL is a registered trademark of Microsoft, Inc. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT AUTHOR AT

David A. Mabey
Reader's Digest Association Inc.
Reader's Digest Road
Pleasantville, NY 10570
Telephone: (914) 244-5513
Email: dave.mabey@readersdigest.com

APPENDIX

SAMPLE DATA GENERATOR MACRO

/************************************************/
/* Macro BLD_SAMP builds 1 to NUM_DS sample     */
/* datasets that have 0 to NUM_OBS observations */
/* each on SAS library LIBNAME. The datasets    */
/* are populated with randomly generated values */
/* for variables named OBS_ID, STATE, DOB,      */
/* DFP, NOP, and TOT_DOL (see code for variable */
/* labels and formats). Datasets are named      */
/* DAT1, DAT2, DAT3, ... DAT(NUM_DS).           */
/* This macro is the source of scalable sample  */
/* data for Dave Mabey's 1997 NESUG paper.      */
/************************************************/
%macro bld_samp (num_ds=5,      /* how many data sets    */
                 num_obs=100,   /* how many observations */
                 libname=work); /* where to build        */

   /* build first data set */
   data &libname..dat1 (label="sample dataset 1 -- dummy data"
                        type=sample);
      do obsnum=1 to &num_obs;
         drop obsnum;
         attrib obs_id length=$12
                label="observation identification number";
         obs_id="a" || put(int(ranuni(4)*1e10), z10.);
         attrib state length=$2
                label="state of residence";
         select(int(ranuni(8)*10));
            when (1,2)     state='NJ';
            when (3,4,5,6) state='NY';
            otherwise      state='CT';
         end;
         attrib dob format=date7.
                label="date of birth";
         dob='4jul05'd + int(ranuni(5)*2e4);
         attrib dfp format=date7.
                label="date of first purchase";
         dfp=dob + int(ranuni(6)*1e3);
         attrib nop format=5.
                label="number of purchases";
         nop=ranbin(7,10,.1);
         attrib tot_dol format=dollar8.2
                label="total dollars spent";
         tot_dol=nop * round(ranuni(9)*500,.01);
         output;
      end;
   run;

   /* Make more datasets based on obs_ids from the 1st */
   %do i=2 %to &num_ds;
      data &libname..dat&i (label="sample dataset &i -- dummy data"
                            type=sample);
         set _last_;
         /* drop old and make new obs_ids in 5% of the observations */
         if (ranbin(&i*3,1,.05)) then
            obs_id="a" || put(int(ranuni(&i*4)*1e10), z10.);
         /* add purchases in 65% of the observations */
         if (ranbin(&i*8,1,.65)) then do;
            nnp = ranbin(&i*7,10,.3);   /* number of new purchases */
            nop + nnp;
            tot_dol + nnp * round(ranuni(&i*9)*500,.01);
            drop nnp;
         end;
      run;
   %end;
%mend bld_samp;
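For reference, here is a hedged sketch of how the sample data and the macros in this paper might be exercised together (it assumes the DIRTABLE and VARTABLE driver datasets from the JOINING DATASETS section have already been created):

%bld_samp(num_ds=5, num_obs=1000, libname=work);      /* build WORK.DAT1 - WORK.DAT5      */
%bld_dumy(inds=dat1, var=state, fmt=$., outds=out1);  /* dummy variables from the data    */
%fulljoin(dirtable=dirtable, vartable=vartable,
          outtable=outtable, key=obs_id);             /* full outer join of the snapshots */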