Proc SQL: A Powerful Tool in SAS

Swetha Kongara, PVR Technologies Inc
Raja Panchumarthi, Smith Hanley Consulting Group Inc

ABSTRACT

Proc SQL is a powerful tool in the SAS system that can be used in a variety of ways. Its uses include creating SAS data sets or data views, macro variables, and data listings. The power of SQL lies in its ability to combine the functionality of several procedures into one set of statements. You can combine data from multiple data sets, calculate and integrate multiple summary statistics, and sort the resulting data set in one step. This paper explains how to create SAS data sets or data views, macro variables, and data listings.

INTRODUCTION

The intent of this document is to describe the syntax of a PROC SQL step and to provide the user with a description of many of the useful statements and capabilities of the procedure. This document is not intended to be a complete reference for PROC SQL. For more detailed information, please refer to the SAS Guide to the SQL Procedure provided by SAS Institute. The document covers how to create SAS data sets, data views, data listings, and macro variables.

PROC SQL Syntax to Create a Data Set

The syntax of the PROC SQL step is one continuous set of statements that ends in a single semicolon. You do not need to end each statement with a semicolon. There are also no requirements for placing statements on the same line or on separate lines of the program; the statements are grouped below only to separate the individual parts of a PROC SQL step logically. The order of these statements does matter, however: it must follow the order given below. Also, notice that the PROC SQL step ends with a QUIT statement and not with a RUN statement.

    PROC SQL;
       CREATE TABLE output-dataset AS
       SELECT variable <, variable <AS new-variable-name> >
       FROM input-dataset <AS alias> <, input-dataset <AS alias> >
       <WHERE expression>
       <GROUP BY variable <, variable> >
       <HAVING expression>
       <ORDER BY variable <, variable> >;
    QUIT;

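The clause order in the template above can be tried out with a small sketch. The LABS data set, its variables (SITE, PATNO, VAL), and the cut-off values are hypothetical and exist only for illustration; they are not taken from the paper's study data.

```sas
/* Hypothetical input data: one row per measurement */
data labs;
   input site $ patno val;
   datalines;
01 101 12
01 101 18
01 102 30
07 201 22
07 201 26
07 202 9
;
run;

proc sql;
   create table patmean as            /* CREATE: name the output data set  */
   select site,                       /* SELECT: variables to keep         */
          patno,
          mean(val) as meanval,       /* new variable from a summary function */
          count(*)  as nobs
   from labs                          /* FROM: input data set              */
   where val > 10                     /* WHERE: filters rows before grouping */
   group by site, patno               /* GROUP BY: sub-groups for summaries */
   having mean(val) > 20              /* HAVING: filters whole sub-groups  */
   order by site, patno;              /* ORDER BY: sort the result         */
quit;

proc print data=patmean;
run;
```

With these hypothetical rows, the WHERE condition drops the single value of 9 before grouping, and the HAVING condition then drops the sub-group whose mean is 15, so PATMEAN should contain two rows. Note that only one semicolon ends the whole query, after the ORDER BY list.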
All data set and variable names must follow standard SAS naming conventions. You may use most of the data set options (such as KEEP, DROP, RENAME, and WHERE) with both the input and output data sets in a PROC SQL step. You may use the colon modifier to name variables with the KEEP and DROP data set options. When using these data set options, variable name lists are space delimited, as they are in a DATA step. Unlike the DATA step, PROC SQL does not support the IN= data set option.

CREATE

The CREATE statement is used to name the data set that will contain the results of the SQL statements. The data set name in the CREATE statement can reference a temporary or permanent data set. If this statement is omitted, then PROC SQL creates a data listing, like PROC PRINT.

    CREATE TABLE TEMP
    CREATE TABLE PERM.STATOUT
    CREATE TABLE COUNTS (DROP=PERCENT)

SELECT

The SELECT statement is used to name, rename, and/or create the variables that will make up the resulting data set. You may use SAS functions as well as the summary statistical functions available in PROC SQL to create new variables. To specify which variables to keep in the resulting data set, provide a comma separated list on the SELECT statement. To rename a variable or to create a new variable using SAS functions or summary statistics, use the AS keyword.

    SELECT *
    SELECT AE.*, PERPAT.TRT, PERPAT.POP
    SELECT AE.PATNO, AE_LT, AESOC, TRT, POP

FROM:

The FROM statement is used to list the data sets that will be used as input to the PROC SQL step. If you are combining data from multiple input data sets, provide a comma separated list of input data sets.

    FROM TEMP
    FROM PERM.PERPAT (WHERE=(POP>=1))
    FROM TEMP (DROP=STATUS SORTVAL), PERM.PERPAT (KEEP=SITE PATNO TRT POP)
    FROM TEMP AS T, PERM.PERPAT AS P

WHERE:

The WHERE statement is used to specify subsetting criteria or merging criteria for observation selection and processing. If a variable name exists in more than one data set listed on the FROM statement, then you must give the two-level variable name using the data set name or alias. You can specify conditions in the WHERE statement that use SAS functions such as SUBSTR, SCAN, or INDEX.

    WHERE CALCULATED MEANX <= 25
    WHERE SUBSTR(ATC_CD,9,3) = '001'
    WHERE AE.PATNO = PERPAT.PATNO AND AE.AESOC LIKE 'B%'
    WHERE A.ID = B.ID = C.DIFFID AND (A.VISIT < B.VISIT OR A.VISIT > C.VISIT)

GROUP BY:

The GROUP BY statement in PROC SQL is used to identify the sub-groups to which summary functions will be applied.

    GROUP BY TRT
    GROUP BY SITE, PATNO
    GROUP BY A.LABTEST, B.GENDER, B.AGE

HAVING:

The HAVING statement specifies a condition that must be met by each sub-group in a GROUP BY statement in order for that sub-group to be kept in the resulting data set. A HAVING statement normally includes at least one summary function (otherwise, a WHERE statement would do the same job), but it can also contain conditions that do not involve summary functions.

    HAVING X > MEAN(X)
    HAVING (DIFF > 0 AND DIFF = MAX(DIFF))

ORDER BY:

The ORDER BY statement specifies a comma separated list of variables or valid SAS formulas to use to sort the resulting table. As in PROC SORT, the default sort sequence is ascending; you can sort in descending order by using the keyword DESC after the variable name or formula.

    ORDER BY P.SITE, P.PATNO
    ORDER BY COUNT/TOTN DESC
    ORDER BY A.SITE, A.PATNO, AEON_DT, AESEV DESC, AEREL

Example for PROC SQL statements to Create a Data Set:

The first example illustrates how to use PROC SQL to create a temporary data set by combining a permanent data set and a temporary data set, and how to select a subset of existing variables and calculated variables. This example uses a CASE expression to conditionally create a new variable. Note that the resulting data set will only contain observations that have matching SITE and RANDID values in both data sets (notice that RANDID 3778 is not in the resulting COMM data set). It is helpful to remember that, unlike a DATA step merge, PROC SQL does not require the input data sets to be sorted prior to the SQL step. The PROC SQL step may require less processing time if they are sorted in advance, but it is not a requirement.

A sample of the CLOS data set

    OBS  SITE  RANDID
      1   01    3914
      4   07    5334

A sample of the PERM.MARGINCO data set

    OBS  SITE  SCREENID  RANDID  TABLE_NA  COLUMN_N  VISIT  COMMENT2
      1   01     0009     3778   ALL007    PEBODSYS  BM     <>
      2   01     0009     3778   ALL007    VITAL_DD  T28D   <>
      3   01     0009     3778   ALL007    VMN_NM    VP1    <>
      4   01     0009     3778   ALL007    VT_NM     VP1    <>
      5   01     0009     3778   ALL007    VHR_NM    VP1    <>
      6   01     0009     3778   ALL007    CHGHR_NM  VC     <>
      7   01     0003     3914   ALL007    NEUT_NM   BM     <>
      8   01     0003     3914   ALL007    XRYHR_NM  D180   <>
      9   07     0003     5334   ALL007    ALTUN     BM     <>
     10   07     0003     5334   ALL007    GROUP_NM  A01    IMPROVED
     11   07     0003     5334   ALL007    ALTUN     T4D    <>

    PROC SQL;
       CREATE TABLE COMM AS
       SELECT M.SITE, M.SCREENID, M.RANDID,
              SUBSTR(M.TABLE_NA,8) AS TABLE_NA,
              M.COLUMN_N,
              M.VISIT AS VIS,
              M.COMMENT1,
              CASE
                 WHEN COMPRESS(M.COMMENT2) = '<>' THEN ' '
                 ELSE M.COMMENT2
              END AS COMMENT2
       FROM PERM.MARGINCO AS M, CLOS AS C
       WHERE M.SITE = C.SITE AND M.RANDID = C.RANDID
       ORDER BY M.SITE, M.RANDID, TABLE_NA;
    QUIT;

A sample of the COMM data set

    OBS  SITE  SCREENID  RANDID  TABLE_NA  COLUMN_N  VIS   COMMENT2
      1   01     0003     3914   HEM       NEUT_NM   BM
      2   01     0003     3914   XRAY      XRYHR_NM  D180
      3   07     0003     5334   ALIQUOT   GROUP_NM  A01   IMPROVED
      4   07     0003     5334   BLOODGAS  BDATE_YY  BG1
      5   07     0003     5334   CHEM      ALTUN     BM
      6   07     0003     5334   CHEM      ALTUN     T4D

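Because PERM.MARGINCO is not available to the reader, the matching behavior of the example above can be reproduced with a cut-down sketch. The two small data sets below are hypothetical stand-ins that borrow a few rows and variable names from the samples shown; the variable lengths and the reduced variable list are assumptions made for illustration.

```sas
/* Hypothetical stand-in for the CLOS data set */
data clos;
   input site $ randid;
   datalines;
01 3914
07 5334
;
run;

/* Hypothetical stand-in for PERM.MARGINCO (fewer variables and rows) */
data marginco;
   input site $ randid column_n :$8. visit $ comment2 :$10.;
   datalines;
01 3778 PEBODSYS BM <>
01 3914 NEUT_NM BM <>
07 5334 GROUP_NM A01 IMPROVED
07 5334 ALTUN T4D <>
;
run;

proc sql;
   create table comm as
   select m.site,
          m.randid,
          m.column_n,
          m.visit as vis,
          /* Blank out the placeholder comment, keep real comments */
          case
             when compress(m.comment2) = '<>' then ' '
             else m.comment2
          end as comment2
   from marginco as m, clos as c
   /* Only rows with matching SITE and RANDID in both inputs are kept */
   where m.site = c.site
     and m.randid = c.randid
   order by m.site, m.randid, m.column_n;
quit;

proc print data=comm;
run;
```

In this sketch the row for RANDID 3778 has no partner in CLOS, so it does not appear in COMM, and neither input was sorted before the PROC SQL step.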
PROC SQL Syntax to Create Views

Creating a view works the same way as creating a data set. The purpose of using views is to reduce the real time required to complete a job by eliminating one or more I/O bound segments. If a forty-minute DATA step that takes only ten minutes of CPU time can be converted into a DATA step view, the potential real time savings for the entire job could be as much as thirty minutes. In addition to time savings, SAS data views can reduce the peak disk space requirements for a given job by reducing the redundant copies of data that must be held on disk at any given instant. If you process vast volumes of data, using SQL and DATA step views may cut a significant percentage off the real time of your large SAS jobs.

The syntax for creating a view is the same as for creating a data set, except that CREATE VIEW replaces CREATE TABLE. See the example below.

    data numbers;
       infile cards;
       input number @@;
       cards;
    2 3 -4 2.1 -2.2 6 -34 0
    ;
    run;

    proc sql;
       /* Create a view with 3 additional variables:               */
       /* NEGATIVE is 1 if NUMBER is negative, otherwise it's 0.   */
       /* ZERO is 1 if NUMBER is zero, otherwise it's 0.           */
       /* LOG is the log of the absolute value of NUMBER.          */
       /* There will be one observation in WITHLOG for each one in */
       /* NUMBERS.                                                 */
       create view withlog as
       select numbers.number as number,
              (sign(number) = -1) as negative,
              (number=0) as zero,
              log(abs(number)) as log
       from numbers;
    quit;

PROC SQL Syntax to Create Data Listings:

To create a data listing, follow the steps below:

1) Create a data set with the required variables by using PROC SQL.
2) Use PROC PRINT or PROC REPORT on that data set to produce the listing.

See the example below:

    data dads;
       input famid name $ inc;
       cards;
    2 Art 22000
    1 Bill 30000
    3 Paul 25000
    ;
    run;

    data faminc;
       input famid faminc96 faminc97 faminc98;
       cards;
    3 75000 76000 77000
    1 40000 40500 41000
    2 45000 45400 45800
    ;
    run;

    proc sql;
       create table dadfam1 as
       select *
       from dads, faminc
       where dads.famid=faminc.famid
       order by dads.famid;
    quit;

    proc print data=dadfam1;
    run;

PROC SQL Syntax to Create Macro Variables

Another useful property of PROC SQL is its ability to create macro variables. PROC SQL allows the programmer to concatenate summary information from multiple data groupings into a single macro variable. The PROC SQL statements for creating macro variables are nearly identical to those for creating data sets.

    PROC SQL <NOPRINT>;
       SELECT expression1 <, expression2 <, expression3> >
       INTO :macro-variable1 <-:macro-variablen> <, :macro-variable2 <, :macro-variable3> >
            <SEPARATED BY separating-character-string>
       FROM input-dataset1 <, input-dataset2 <, input-dataset3> >
       <WHERE expression>
       <GROUP BY variable <, variable> >
       <ORDER BY variable <, variable> >;
    QUIT;

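A minimal sketch of the macro variable syntax above follows. It re-creates the small DADS data set from the listing example so that it can run on its own; the macro variable names (MEANINC, NAMES, DAD1 to DAD3) are hypothetical and chosen only for illustration.

```sas
/* Same rows as the DADS data set in the listing example */
data dads;
   input famid name $ inc;
   datalines;
2 Art 22000
1 Bill 30000
3 Paul 25000
;
run;

proc sql noprint;
   /* One summary value stored in one macro variable */
   select mean(inc) format=8.
      into :meaninc
      from dads;

   /* All rows concatenated into one macro variable, blank delimited */
   select name
      into :names separated by ' '
      from dads
      order by famid;

   /* One macro variable per row: &dad1, &dad2, &dad3 */
   select name
      into :dad1-:dad3
      from dads
      order by famid;
quit;

/* Write the resulting values to the SAS log */
%put meaninc=&meaninc;
%put names=&names;
%put dad1=&dad1 dad2=&dad2 dad3=&dad3;
```

Each SELECT here is a separate query ending in its own semicolon, all inside one PROC SQL step. Because NOPRINT is specified, the values go only to the macro variables and not to the printed output.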
Standard SAS naming rules apply to macro variables created by PROC SQL. As of Version 6.12, macro variable names may be up to 8 characters in length. The NOPRINT option is optional. If it is omitted, the values stored in the macro variables will also be part of the printed output. This section only covers the SELECT, INTO, and SEPARATED BY statements. The other statements are used in the same manner as they are for creating data sets.

SELECT expression

The expression used on the SELECT statement can be any of the following: a variable in the input data set, a valid SAS formula or function combining variables and/or summary functions, the result of a summary function, or any valid combination of variables, formulas, functions, and summary functions.

    SELECT MEAN(VAL)
    SELECT PUT(COUNT(DISTINCT RANDID),3.)
    SELECT (I - MEAN(I))**2
    SELECT SCAN(VNAME,1,'_') || ' = INPUT(' || TRIM(VNAME) || ', 8.)'
    SELECT DISTINCT DSNAME

INTO :macro-variable <SEPARATED BY separating character string>

The INTO statement is used to name the macro variables that will contain the results of the SELECT statement. Each macro variable name in the list must be immediately preceded by a colon (:). Multiple rows of output can be concatenated into a single macro variable by using the SEPARATED BY statement, which provides the character string used to delimit the individual values of that concatenation.

    INTO :MEANX, :STDX, :NX
    INTO :NTRT1-:NTRT&ngrps
    INTO :MVLIST SEPARATED BY ' =.; '
    INTO :NAMES SEPARATED BY ' ', :VAL1-:VAL3

CONCLUSION

Based on the above presentation, we have seen how to use PROC SQL to create data sets and data views, data listings, and macro variables.

REFERENCES:

SAS Institute Inc., Getting Started with the SQL Procedure, Version 6, First Edition
SAS Institute Inc., SAS Guide to the SQL Procedure: Usage and Reference, Version 6, First Edition

ACKNOWLEDGMENTS

I would like to thank the Coders Corner Co-Chairs for accepting my abstract and paper. I also thank Chauthi Nguyen and Shi-Tao Yeh for their support in presenting this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors:

Raja Panchumarthi
SAS Certified Professional
Smith Hanley Consulting Group Inc
E-mail: panchumarthi@yahoo.com
Ph: 510-691-1490

Swetha Kongara
PVR Technologies Inc
350 Parsippany Rd, Suite #70
Parsippany, NJ 07054
E-mail: swetha@pvrtech.com
Ph: 973-885-4712

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.