Adding PROC SQL to your SAS toolbox for merging multiple tables without common variable names

Susan Wancewicz, Moores Cancer Center, UCSD, La Jolla, CA

ABSTRACT

The SQL procedure offers an efficient method of creating a new data set by merging tables in SAS. Advantages of PROC SQL over the MERGE statement include: tables do not need to be sorted before they are joined, and several tables that lack a commonly named variable can be joined in a single step. This paper leads the SAS user through the following steps in PROC SQL: the basic join, a join of tables without common variable names, and a demonstration of grouping in order to obtain a new variable containing the average of an existing variable. At the end of this discussion the SAS user will have added flexibility to their SAS toolbox for creating a new merged table.

INTRODUCTION

The SAS programmer is often confronted with data from multiple tables containing variable names which are different but refer to the same data. At times, these tables may need to be merged (joined) together for further processing. While it is possible to do this in SAS using the MERGE statement together with RENAME, PROC SQL offers another option. Topics include the basic join with and without common variable names. In addition to the basic code, there is a demonstration of unanticipated results and how to avoid them. The use of grouping to create new variables is also discussed. At the end of this paper the SAS user will have added a new tool to their SAS toolbox for joining tables.

SAMPLE TABLES USED

The following fictitious data sets will be used for this demonstration:

Table 1. Demog

id     education
B002   5
b003   8
b005   5
b007   9
b008   7
b009   8
b010   8

The demog table contains 7 rows. The id field always begins with the letter B; however, it was entered without regard to case. Education levels for the subjects were obtained and coded. This table is currently sorted by the id variable.

Table 2. Ids

id2    sid   namelast   dob
B002   462   Ferry      3/6/1988
B003   463   CableCar   6/6/1987
b005   464   Scooter    2/4/1988
b006   465   Barge      3/3/1986
b007   466   Boat       9/9/1986
b008   467   Airplane   4/4/1987
b009   468   Walk       6/12/1988
b010   469   BART       8/8/1987
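For readers who want to follow along, the two tables above could be created with DATA steps such as the following. This is only a sketch: the paper does not state variable types or lengths, so treating sid and education as numeric, giving namelast a length of 10, and reading dob with the mmddyy10. informat are all assumptions.

```sas
/* Sketch only: values copied from Tables 1 and 2.                      */
/* Assumed types: id, id2, namelast are character; education and sid    */
/* are numeric; dob is stored as a SAS date.                             */
DATA demog;
    INPUT id $ education;
    DATALINES;
B002 5
b003 8
b005 5
b007 9
b008 7
b009 8
b010 8
;
RUN;

DATA ids;
    /* The colon modifier reads each value up to the next blank */
    INPUT id2 $ sid namelast :$10. dob :mmddyy10.;
    FORMAT dob mmddyy10.;
    DATALINES;
B002 462 Ferry 3/6/1988
B003 463 CableCar 6/6/1987
b005 464 Scooter 2/4/1988
b006 465 Barge 3/3/1986
b007 466 Boat 9/9/1986
b008 467 Airplane 4/4/1987
b009 468 Walk 6/12/1988
b010 469 BART 8/8/1987
;
RUN;
```

The mixed case of the id and id2 values is kept exactly as listed, since it matters for the joins discussed later.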
The ids table contains 8 rows, which is one more row than the demog table. The variable id2 in the ids table refers to the same data as the variable id in the demog table. The ids table has an additional row for subject b006, who is not listed in the demog table. We also have the variable sid, which identifies the subject by a second id. The other variables in this table are namelast and dob (date of birth). As in the demog table, the id2 values seem to have been entered without regard to case. This table is sorted by both id2 and sid.

Table 3. Event

sid   questnum
462   2005
462   2144
462   2145
462   2154
463   2193
463   2210
467   2211
464   2212
467   2215
468   2275
467   2321
464   2325
466   2364
464   2451
466   2481
466   2491
463   2492
463   2493
467   2494
466   2512
464   2614
468   2739
468   2740
468   2858

The event table has 24 rows and contains the variable sid, which was seen in the ids table. There is also a variable questnum.

Table 4. Nutrients

intnum   kcal   fatgm   carbgm   proteingm
2005     753    10      145      25
2193     937    21      168      25
2144     842    13      156      28
2145     909    41      111      31
2154     936    25      150      34
2481     1080   29      157      36
2325     1034   37      135      43
2614     1151   47      142      43
2739     1168   39      137      45
2321     1027   40      123      47
2210     954    22      141      54
2212     1021   26      129      54
2211     992    43      94       55
2215     1022   51      86       56
2493     1123   35      129      56
2858     1211   48      148      57
2491     1084   21      174      58
2275     1022   15      163      60
2451     1080   25      139      61
2364     1036   31      113      63
2740     1176   36      145      65
2492     1100   23      165      67
2494     1125   39      130      69
2512     1146   28      150      70

The nutrients table contains 24 rows. The intnum variable in the nutrients table contains the same information as the questnum variable in the event table. This table is sorted by the proteingm variable.

USING THE SQL PROCEDURE

JOINING TABLES USING PROC SQL

Let us begin by joining the demog and ids tables using PROC SQL. We will be creating a new table called demogid. We want to select the variables id, education, sid, and namelast to include in our new table. The information comes from the table demog joined with the table ids. For the join variables we would like to use id from the demog table and id2 from the ids table. Two equally acceptable forms of the inner join are shown below. Note that the first form also selects id2, which gives the five columns shown in the output; the second form omits id2 and gives the four columns reported in the log note.

    PROC SQL;
    CREATE TABLE demogid as
    SELECT id, id2, education, sid, namelast
    FROM demog d, ids i
    WHERE d.id = i.id2;
    QUIT;

or

    PROC SQL;
    CREATE TABLE demogid as
    SELECT id, education, sid, namelast
    FROM demog d JOIN ids i
    ON d.id = i.id2;
    QUIT;

NOTE: Table WORK.DEMOGID created, with 6 rows and 4 columns.

Output for DEMOGID:

id     id2    education   sid   namelast
B002   B002   5           462   Ferry
b005   b005   5           464   Scooter
b007   b007   9           466   Boat
b008   b008   7           467   Airplane
b009   b009   8           468   Walk
b010   b010   8           469   BART

The newly created demogid table has only six rows, but our two originating tables have seven and eight rows, respectively. Further investigation is in order to determine why we are missing data. Upon examination of the demog table we see that id b003 is missing from our results. Let us try using a left join to force all of the rows from the table demog (on the left side of our join statement) to be in the new table.
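Before turning to the left join, the unmatched id can also be confirmed with a query rather than by visual inspection. The following is a minimal sketch that is not part of the original paper's code; it assumes the demog and ids tables exactly as listed in Tables 1 and 2.

```sas
PROC SQL;
    /* List demog ids that have no exact (case-sensitive) match in ids */
    SELECT d.id
        FROM demog d
        WHERE d.id NOT IN (SELECT id2 FROM ids);
QUIT;
```

With the sample data this should list only b003, consistent with the missing row noted above. The left join itself follows.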
    PROC SQL;
    CREATE TABLE demogidleft as
    SELECT id, id2, education, sid, namelast
    FROM demog d LEFT JOIN ids i
    ON d.id = i.id2;
    QUIT;

NOTE: Table WORK.DEMOGIDLEFT created, with 7 rows and 5 columns.

Output for DEMOGIDLEFT:

id     id2    education   sid   namelast
B002   B002   5           462   Ferry
b003          8
b005   b005   5           464   Scooter
b007   b007   9           466   Boat
b008   b008   7           467   Airplane
b009   b009   8           468   Walk
b010   b010   8           469   BART

This looks better. We have the 7 rows found in the demog table, but id2, sid and namelast, which come from the ids table, are missing for id b003. The comparison is case sensitive, so b003 in the demog table does not match B003 in the ids table. We would like B003 in the ids table to match b003 in the demog table. The next step is to adjust our procedure for case sensitivity. We will use upcase to force all the letters to be uppercase for comparison purposes.

    PROC SQL;
    CREATE TABLE demogidup as
    SELECT id, id2, education, sid, namelast
    FROM demog d LEFT JOIN ids i
    ON upcase(d.id) = upcase(i.id2);
    QUIT;

NOTE: Table WORK.DEMOGIDUP created, with 7 rows and 5 columns.

Output for DEMOGIDUP:

id     id2    education   sid   namelast
B002   B002   5           462   Ferry
b003   B003   8           463   CableCar
b005   b005   5           464   Scooter
b007   b007   9           466   Boat
b008   b008   7           467   Airplane
b009   b009   8           468   Walk
b010   b010   8           469   BART

As expected, we now have a row for each id in the demog table without any missing data. If we would like to see all of the data from the ids table we can use a right join. Let's see what happens if we use a right join instead of a left join.

    PROC SQL;
    CREATE TABLE demogidrt as
    SELECT id, upcase(id2) as id2, education, sid, namelast
    FROM demog d RIGHT JOIN ids i
    ON upcase(d.id) = upcase(i.id2);
    QUIT;
NOTE: Table WORK.DEMOGIDRT created, with 8 rows and 5 columns.

Output for DEMOGIDRT:

id     id2    education   sid   namelast
B002   B002   5           462   Ferry
b003   B003   8           463   CableCar
b005   B005   5           464   Scooter
       B006               465   Barge
b007   B007   9           466   Boat
b008   B008   7           467   Airplane
b009   B009   8           468   Walk
b010   B010   8           469   BART

As expected, we have 8 rows of data. Let's look at the output more closely. Since we used upcase in the select statement, it was necessary to alias the resulting variable. In this case we used the same name, id2. B006 is not in the demog table, so we do not have an id or education value for that row. For illustration both id and id2 have been included in the DEMOGIDRT table, but generally only one of the variables would be necessary.

GROUPING VARIABLES IN ORDER TO OBTAIN A NEW VARIABLE

Looking at the nutrients table (Table 4) we see nutrition information for each intnum value. We would like to join the newly created demogidrt table with the nutrients table. Because these two tables share no variable, it will be necessary to use the event table as a bridge between them. The demogidrt table and the event table have a common variable, sid, so we will join them first. Notice that the demogidrt table is sorted by sid while the event table is sorted by questnum; PROC SQL does not require a matching sort order. Next, we will join the nutrients table using intnum from the nutrients table and questnum from the event table. We would also like to create variables for the averages of the various nutrients. To improve readability of our results we will use ORDER BY to sort the table by id2. The wildcard * is used to include all of the columns from the table demogidrt.

    PROC SQL;
    CREATE TABLE Nutrient as
    SELECT d.*, avg(kcal) as avgkcal, avg(fatgm) as avgfat,
           avg(carbgm) as avgcarb, avg(proteingm) as avgprotein
    FROM demogidrt d
    LEFT JOIN event e ON d.sid = e.sid
    LEFT JOIN nutrients n ON e.questnum = n.intnum
    GROUP BY id2
    ORDER BY id2;
    QUIT;

NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.NUTRIENT created, with 26 rows and 9 columns.

Output for NUTRIENT:

id     id2    education   sid   namelast   avgkcal   avgfat   avgcarb   avgprotein
b003   B003   8           463   CableCar   1028.5    25.25    150.75    50.5
b003   B003   8           463   CableCar   1028.5    25.25    150.75    50.5
b003   B003   8           463   CableCar   1028.5    25.25    150.75    50.5
b003   B003   8           463   CableCar   1028.5    25.25    150.75    50.5
b005   B005   5           464   Scooter    1071.5    33.75    136.25    50.25
b005   B005   5           464   Scooter    1071.5    33.75    136.25    50.25
b005   B005   5           464   Scooter    1071.5    33.75    136.25    50.25
b005   B005   5           464   Scooter    1071.5    33.75    136.25    50.25
       B006               465   Barge
b007   B007   9           466   Boat       1086.5    27.25    148.5     56.75
b007   B007   9           466   Boat       1086.5    27.25    148.5     56.75
b007   B007   9           466   Boat       1086.5    27.25    148.5     56.75
b007   B007   9           466   Boat       1086.5    27.25    148.5     56.75
b008   B008   7           467   Airplane   1041.5    43.25    108.25    56.75
b008   B008   7           467   Airplane   1041.5    43.25    108.25    56.75
b008   B008   7           467   Airplane   1041.5    43.25    108.25    56.75
b008   B008   7           467   Airplane   1041.5    43.25    108.25    56.75
b009   B009   8           468   Walk       1144.25   34.5     148.25    56.75
b009   B009   8           468   Walk       1144.25   34.5     148.25    56.75
b009   B009   8           468   Walk       1144.25   34.5     148.25    56.75
b009   B009   8           468   Walk       1144.25   34.5     148.25    56.75
b010   B010   8           469   BART

There seems to be a problem with this table. We have multiple rows of data with exactly the same information. This happens because the select list contains columns (d.*) that are not part of the GROUP BY clause, so PROC SQL remerges each group average back onto every joined row, as the first log note indicates. The missing nutrient values for B006 and B010 are expected: their sid values (465 and 469) do not appear in the event table, so there are no nutrients rows to match. Let's use DISTINCT to eliminate the multiple row problem.

    PROC SQL;
    CREATE TABLE Nutrientdistinct as
    SELECT DISTINCT id2, d.*, avg(kcal) as avgkcal, avg(fatgm) as avgfat,
           avg(carbgm) as avgcarb, avg(proteingm) as avgprotein
    FROM demogidrt d
    LEFT JOIN event e ON d.sid = e.sid
    LEFT JOIN nutrients n ON e.questnum = n.intnum
    GROUP BY id2
    ORDER BY id2;
    QUIT;

WARNING: Column named id2 is duplicated in a select expression (or a view). Explicit references to it will be to the first one.
NOTE: The query requires remerging summary statistics back with the original data.
WARNING: Variable id2 already exists on file WORK.NUTRIENTDISTINCT.
NOTE: Table WORK.NUTRIENTDISTINCT created, with 8 rows and 9 columns.
Output for NUTRIENTDISTINCT:

id2    id     education   sid   namelast   avgkcal   avgfat   avgcarb   avgprotein
B003   b003   8           463   CableCar   1028.5    25.25    150.75    50.5
B005   b005   5           464   Scooter    1071.5    33.75    136.25    50.25
B006                      465   Barge
B007   b007   9           466   Boat       1086.5    27.25    148.5     56.75
B008   b008   7           467   Airplane   1041.5    43.25    108.25    56.75
B009   b009   8           468   Walk       1144.25   34.5     148.25    56.75
B010   b010   8           469   BART
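An alternative that avoids both the remerge note and the duplicated id2 warnings is to compute the averages per sid in an in-line view first, and then join that one-row-per-sid result to demogidrt. The sketch below is not from the original paper: the table name nutrientavg and the alias a are hypothetical, and it assumes sid is unique within demogidrt.

```sas
PROC SQL;
    CREATE TABLE nutrientavg AS       /* hypothetical table name */
    SELECT d.*,
           a.avgkcal, a.avgfat, a.avgcarb, a.avgprotein
        FROM demogidrt d
        LEFT JOIN
             /* In-line view: one row of averages per sid */
             (SELECT e.sid,
                     avg(n.kcal)      AS avgkcal,
                     avg(n.fatgm)     AS avgfat,
                     avg(n.carbgm)    AS avgcarb,
                     avg(n.proteingm) AS avgprotein
                  FROM event e
                  INNER JOIN nutrients n
                      ON e.questnum = n.intnum
                  GROUP BY e.sid) a
            ON d.sid = a.sid
        ORDER BY d.id2;
QUIT;
```

Because the grouping happens inside the in-line view, every column selected there is either the grouping variable or a summary function, so no remerging is needed and DISTINCT is not required. Subjects with no event rows should still appear, with missing averages, because of the left join.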
The use of DISTINCT did give us one row of data for each distinct id2.

In the final example we would like to subset our data by including only rows where the last name is a common form of public transportation in San Francisco.

    PROC SQL;
    CREATE TABLE NutrientSF as
    SELECT DISTINCT id2, d.*, avg(kcal) as avgkcal, avg(fatgm) as avgfat,
           avg(carbgm) as avgcarb, avg(proteingm) as avgprotein
    FROM demogidrt d
    LEFT JOIN event e ON d.sid = e.sid
    LEFT JOIN nutrients n ON e.questnum = n.intnum
    WHERE namelast in ('BART', 'CableCar', 'Ferry')
    GROUP BY id2
    ORDER BY id2;
    QUIT;

WARNING: Column named id2 is duplicated in a select expression (or a view). Explicit references to it will be to the first one.
NOTE: The query requires remerging summary statistics back with the original data.
WARNING: Variable id2 already exists on file WORK.NUTRIENTSF.
NOTE: Table WORK.NUTRIENTSF created, with 3 rows and 9 columns.

Output for NUTRIENTSF:

id2    id     education   sid   namelast   avgkcal   avgfat   avgcarb   avgprotein
B003   b003   8           463   CableCar   1028.5    25.25    150.75    50.5
B010   b010   8           469   BART

With the use of IN we are able to include only the last names Ferry, CableCar and BART in the NUTRIENTSF table.

CONCLUSION

Using PROC SQL can add versatility to your SAS repertoire. The SAS programmer should now have a basic understanding of joining tables as well as an appreciation for some of the potential problems. By using PROC SQL, steps can be saved in presorting, grouping and joining tables without common variable names. While PROC SQL may not always be the right choice for your application, it adds another tool to your SAS toolbox.

ACKNOWLEDGMENTS

Thank you to Shirley Flatt and Martha White for their assistance in bringing this paper to fruition.
CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Susan Wancewicz
University of California, San Diego
swancewicz@ucsd.edu