Adding PROC SQL to your SAS toolbox for merging multiple tables without common variable names

Susan Wancewicz, Moores Cancer Center, UCSD, La Jolla, CA

ABSTRACT

The SQL procedure offers an efficient method of creating a new data set by merging tables in SAS. Advantages of PROC SQL over the SAS MERGE statement include: tables do not need to be sorted before joining them; and tables without a common variable name can be joined simultaneously. This paper will lead the SAS user through the following steps in PROC SQL: the basic join, the join of tables without common variable names, and a demonstration of grouping in order to obtain a new variable containing the average of an existing variable. At the end of this discussion the SAS user will have added flexibility to their SAS toolbox for creating a new merged table.

INTRODUCTION

The SAS programmer is often confronted with data from multiple tables containing variable names which are different but refer to the same data. At times, these tables may need to be merged (joined) together for further processing. While it is possible to do this in SAS using the MERGE and RENAME statements, PROC SQL offers another option. Topics will include the basic join with and without common variable names. In addition to the basic code there will be a demonstration of unanticipated results and how to avoid them. The use of grouping to create new variables will also be discussed. At the end of this paper the SAS user will have added a new tool to their SAS toolbox for joining tables.

SAMPLE TABLES USED

The following fictitious data sets will be used for this demonstration:

Table 1. Demog

    id      education
    B002    5
    b003    8
    b005    5
    b007    9
    b008    7
    b009    8
    b010    8

The demog table contains 7 rows. The id field always begins with the letter B. However, it was entered without regard to case. Education levels for the subjects were obtained and coded. This table is currently sorted by the id variable.

Table 2. Ids

    id2     sid    namelast    dob
    B002    462    Ferry       3/6/1988
    B003    463    CableCar    6/6/1987
    b005    464    Scooter     2/4/1988
    b006    465    Barge       3/3/1986
    b007    466    Boat        9/9/1986
    b008    467    Airplane    4/4/1987
    b009    468    Walk        6/12/1988
    b010    469    BART        8/8/1987
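For readers who want to follow along, the two tables above can be entered with DATA steps. This is only a sketch: the paper does not show how the tables were created, so the use of list input, the $10. length for namelast, and the mmddyy10. informat and format for dob are assumptions made here for illustration.

```sas
/* Sketch only: builds WORK.DEMOG and WORK.IDS from the values in     */
/* Tables 1 and 2. Informats and lengths are illustrative assumptions. */
data demog;
    input id $ education;
    datalines;
B002 5
b003 8
b005 5
b007 9
b008 7
b009 8
b010 8
;
run;

data ids;
    /* The colon modifier reads each value up to the next blank */
    input id2 $ sid namelast :$10. dob :mmddyy10.;
    format dob mmddyy10.;
    datalines;
B002 462 Ferry 3/6/1988
B003 463 CableCar 6/6/1987
b005 464 Scooter 2/4/1988
b006 465 Barge 3/3/1986
b007 466 Boat 9/9/1986
b008 467 Airplane 4/4/1987
b009 468 Walk 6/12/1988
b010 469 BART 8/8/1987
;
run;
```

The mixed case of the id and id2 values is kept exactly as listed, since it drives the matching behavior examined in the joins.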

The ids table contains 8 rows, which is one more row than the demog table. The variable id2 in the ids table refers to the same data as the variable id in the demog table. There is an additional row in the ids table for subject b006, who is not listed in the demog table. We also have the variable sid, which identifies the subject by a second id. Other variables in this table are namelast and dob (date of birth). As seen in the demog table, the id2 information seems to have been entered without regard to case. This table is sorted by both id2 and sid.

Table 3. Event

    sid    questnum
    462    2005
    462    2144
    462    2145
    462    2154
    463    2193
    463    2210
    467    2211
    464    2212
    467    2215
    468    2275
    467    2321
    464    2325
    466    2364
    464    2451
    466    2481
    466    2491
    463    2492
    463    2493
    467    2494
    466    2512
    464    2614
    468    2739
    468    2740
    468    2858

The event table has 24 rows and contains the variable sid, which was seen in the ids table. There is also a variable for questnum.

Table 4. Nutrients

    intnum    kcal    fatgm    carbgm    proteingm
    2005       753    10       145       25
    2193       937    21       168       25
    2144       842    13       156       28
    2145       909    41       111       31
    2154       936    25       150       34
    2481      1080    29       157       36
    2325      1034    37       135       43
    2614      1151    47       142       43
    2739      1168    39       137       45
    2321      1027    40       123       47
    2210       954    22       141       54
    2212      1021    26       129       54
    2211       992    43        94       55
    2215      1022    51        86       56
    2493      1123    35       129       56
    2858      1211    48       148       57
    2491      1084    21       174       58
    2275      1022    15       163       60
    2451      1080    25       139       61
    2364      1036    31       113       63
    2740      1176    36       145       65
    2492      1100    23       165       67
    2494      1125    39       130       69
    2512      1146    28       150       70

The nutrients table contains 24 rows. The intnum variable in the nutrients table contains the same information as the questnum variable in the event table. This table is sorted by the proteingm variable.

USING THE SQL PROCEDURE

JOINING TABLES USING PROC SQL

Let us begin by joining the demog and ids tables using PROC SQL. We will be creating a new table called demogid. We want to select the variables id, education, sid, and namelast to include in our new table. We will be obtaining information from the table demog joined with the table ids. For the join variables we would like to use id from the demog table and id2 from the ids table. Each CREATE TABLE statement shown in this paper is submitted between a PROC SQL; statement and a QUIT; statement. Notice there are two equally acceptable formats which may be used for the inner join example below.

    CREATE TABLE demogid as
    SELECT id, id2, education, sid, namelast
    FROM demog d, ids i
    WHERE d.id = i.id2;

or

    CREATE TABLE demogid as
    SELECT id, education, sid, namelast
    FROM demog d JOIN ids i
    on d.id = i.id2;

The two forms differ only in that the first also selects id2, so it produces five columns (as in the output below), while the log note reflects the four columns of the second form.

    NOTE: Table WORK.DEMOGID created, with 6 rows and 4 columns.

Output for DEMOGID:

    id      id2     education    sid    namelast
    B002    B002    5            462    Ferry
    b005    b005    5            464    Scooter
    b007    b007    9            466    Boat
    b008    b008    7            467    Airplane
    b009    b009    8            468    Walk
    b010    b010    8            469    BART

The newly created demogid table has only six rows, but our two originating tables have seven and eight rows, respectively. Further investigation is in order to determine why we are missing data. Upon examination of the demog table we see id b003 is missing from our results. Let us try using a left join to force all of the rows from the table demog (on the left side of our join statement) to be in the new table.
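Before changing the join type, it can help to confirm which id values fail to match exactly. The query below is a minimal sketch of such a check and is not part of the original paper; it assumes the demog and ids tables exist in the WORK library as described above.

```sas
/* Sketch only: list demog ids that have no exact, case-sensitive */
/* match among the id2 values in the ids table.                   */
proc sql;
    select d.id
        from demog d
        where d.id not in (select id2 from ids);
quit;
```

With the sample data this should list b003, because character comparisons in SAS are case sensitive and the ids table holds that subject as B003. The left join that follows keeps that row in the result.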

    CREATE TABLE demogidleft as
    SELECT id, id2, education, sid, namelast
    FROM demog d LEFT JOIN ids i
    on d.id = i.id2;

    NOTE: Table WORK.DEMOGIDLEFT created, with 7 rows and 5 columns.

Output for DEMOGIDLEFT:

    id      id2     education    sid    namelast
    B002    B002    5            462    Ferry
    b003            8
    b005    b005    5            464    Scooter
    b007    b007    9            466    Boat
    b008    b008    7            467    Airplane
    b009    b009    8            468    Walk
    b010    b010    8            469    BART

This looks better. We have the 7 rows found in the demog table, but id2, sid and namelast, which come from the ids table, are missing for id b003. The ids table holds this subject as B003, and the comparison d.id = i.id2 is case sensitive, so the two values do not match. We would like B003 in the ids table to match b003 in the demog table. The next step will be to adjust our procedure for case sensitivity. We will use upcase to force all the letters to be uppercase for comparison purposes.

    CREATE TABLE demogidup as
    SELECT id, id2, education, sid, namelast
    FROM demog d LEFT JOIN ids i
    on upcase(d.id) = upcase(i.id2);

    NOTE: Table WORK.DEMOGIDUP created, with 7 rows and 5 columns.

Output for DEMOGIDUP:

    id      id2     education    sid    namelast
    B002    B002    5            462    Ferry
    b003    B003    8            463    CableCar
    b005    b005    5            464    Scooter
    b007    b007    9            466    Boat
    b008    b008    7            467    Airplane
    b009    b009    8            468    Walk
    b010    b010    8            469    BART

As expected, we now have a row for each id in the demog table without any missing data. If we would like to see all of the data from the ids table we can use a right join. Let's see what happens if we use a right join instead of a left join.

    CREATE TABLE demogidrt as
    SELECT id, upcase(id2) as id2, education, sid, namelast
    FROM demog d RIGHT JOIN ids i
    on upcase(d.id) = upcase(i.id2);

    NOTE: Table WORK.DEMOGIDRT created, with 8 rows and 5 columns.

Output for DEMOGIDRT:

    id      id2     education    sid    namelast
    B002    B002    5            462    Ferry
    b003    B003    8            463    CableCar
    b005    B005    5            464    Scooter
            B006                 465    Barge
    b007    B007    9            466    Boat
    b008    B008    7            467    Airplane
    b009    B009    8            468    Walk
    b010    B010    8            469    BART

As expected, we have 8 rows of data. Let's look at the output more closely. Since we used upcase in the select statement it was necessary to alias the resulting variable. In this case we used the same name, id2. B006 is not in the demog table, so we do not have an id or education value for that row. For illustration both id and id2 have been included in the DEMOGIDRT table, but generally only one of the variables would be necessary.

GROUPING VARIABLES IN ORDER TO OBTAIN A NEW VARIABLE

Looking at the nutrients table (Table 4) we see some nutrition information for each value of the intnum variable. We would like to join the newly created demogidrt table with the nutrients table. It will be necessary to use the event table as a bridge between the demogidrt and nutrients tables. The demogidrt table and the event table have a common variable, sid, so we will join them first. Notice that the demogidrt table is sorted by sid while the event table is sorted by questnum. Secondly, we will join the nutrients table using intnum from the nutrients table and questnum from the event table. We would also like to create variables for averages of the various nutrients. To improve readability of our results we will use order by to sort the table by id2. The wildcard * is also used to include all of the columns from the table demogidrt.

    CREATE TABLE Nutrient as
    SELECT d.*,
           avg(kcal) as avgkcal,
           avg(fatgm) as avgfat,
           avg(carbgm) as avgcarb,
           avg(proteingm) as avgprotein
    FROM demogidrt d
    LEFT JOIN event e ON d.sid = e.sid
    LEFT JOIN nutrients n ON e.questnum = n.intnum
    GROUP BY id2
    ORDER BY id2;

    NOTE: The query requires remerging summary statistics back with the original data.
    NOTE: Table WORK.NUTRIENT created, with 26 rows and 9 columns.

Output for NUTRIENT:

    id      id2     education    sid    namelast    avgkcal    avgfat    avgcarb    avgprotein
    b003    B003    8            463    CableCar    1028.5     25.25     150.75     50.5
    b003    B003    8            463    CableCar    1028.5     25.25     150.75     50.5
    b003    B003    8            463    CableCar    1028.5     25.25     150.75     50.5
    b003    B003    8            463    CableCar    1028.5     25.25     150.75     50.5
    b005    B005    5            464    Scooter     1071.5     33.75     136.25     50.25
    b005    B005    5            464    Scooter     1071.5     33.75     136.25     50.25
    b005    B005    5            464    Scooter     1071.5     33.75     136.25     50.25
    b005    B005    5            464    Scooter     1071.5     33.75     136.25     50.25
            B006                 465    Barge
    b007    B007    9            466    Boat        1086.5     27.25     148.5      56.75
    b007    B007    9            466    Boat        1086.5     27.25     148.5      56.75
    b007    B007    9            466    Boat        1086.5     27.25     148.5      56.75
    b007    B007    9            466    Boat        1086.5     27.25     148.5      56.75
    b008    B008    7            467    Airplane    1041.5     43.25     108.25     56.75
    b008    B008    7            467    Airplane    1041.5     43.25     108.25     56.75
    b008    B008    7            467    Airplane    1041.5     43.25     108.25     56.75
    b008    B008    7            467    Airplane    1041.5     43.25     108.25     56.75
    b009    B009    8            468    Walk        1144.25    34.5      148.25     56.75
    b009    B009    8            468    Walk        1144.25    34.5      148.25     56.75
    b009    B009    8            468    Walk        1144.25    34.5      148.25     56.75
    b009    B009    8            468    Walk        1144.25    34.5      148.25     56.75
    b010    B010    8            469    BART

(The log reports 26 rows; the four rows for B002 are not reproduced in this listing.)

There seems to be a problem with this table. We have multiple rows of data with exactly the same information. This is what the remerging note in the log describes: because d.* selects columns that are not listed in the GROUP BY clause, PROC SQL computes the averages for each id2 group and then merges them back onto every joined row, one row per event. B006 and b010 have no rows in the event table, so no nutrient information is available for them. Let's use DISTINCT to eliminate the multiple row problem.

    CREATE TABLE Nutrientdistinct as
    SELECT DISTINCT id2, d.*,
           avg(kcal) as avgkcal,
           avg(fatgm) as avgfat,
           avg(carbgm) as avgcarb,
           avg(proteingm) as avgprotein
    FROM demogidrt d
    LEFT JOIN event e ON d.sid = e.sid
    LEFT JOIN nutrients n ON e.questnum = n.intnum
    GROUP BY id2
    ORDER BY id2;

    WARNING: Column named id2 is duplicated in a select expression (or a view). Explicit references to it will be to the first one.
    NOTE: The query requires remerging summary statistics back with the original data.
    WARNING: Variable id2 already exists on file WORK.NUTRIENTDISTINCT.
    NOTE: Table WORK.NUTRIENTDISTINCT created, with 8 rows and 9 columns.
Output for NUTRIENTDISTINCT:

    id2     id      education    sid    namelast    avgkcal    avgfat    avgcarb    avgprotein
    B003    b003    8            463    CableCar    1028.5     25.25     150.75     50.5
    B005    b005    5            464    Scooter     1071.5     33.75     136.25     50.25
    B006                         465    Barge
    B007    b007    9            466    Boat        1086.5     27.25     148.5      56.75
    B008    b008    7            467    Airplane    1041.5     43.25     108.25     56.75
    B009    b009    8            468    Walk        1144.25    34.5      148.25     56.75
    B010    b010    8            469    BART

(The log reports 8 rows; the row for B002 is not reproduced in this listing.)
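An alternative to DISTINCT, offered here only as a sketch and not as the paper's method, is to compute the averages per sid in an inline view and then join that one-row-per-subject summary to demogidrt. The table name nutrientavg and the alias a are hypothetical names chosen for this illustration.

```sas
/* Sketch only: summarize first, then join, so each subject       */
/* appears once without relying on DISTINCT.                      */
proc sql;
    create table nutrientavg as
    select d.*, a.avgkcal, a.avgfat, a.avgcarb, a.avgprotein
        from demogidrt d
        left join
            /* Inline view: one row per sid with the nutrient averages */
            (select e.sid,
                    avg(n.kcal)      as avgkcal,
                    avg(n.fatgm)     as avgfat,
                    avg(n.carbgm)    as avgcarb,
                    avg(n.proteingm) as avgprotein
                from event e
                left join nutrients n
                    on e.questnum = n.intnum
                group by e.sid) a
            on d.sid = a.sid
        order by d.id2;
quit;
```

Because the inner query selects only the grouping column and summary functions, the log should not show the remerging note, and since id2 is selected only once through d.*, the duplicate-column warnings should not appear either. Subjects with no events, such as B006 and B010, keep their row with missing averages.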

The use of DISTINCT did give us one row of data for each distinct id2. In the final example we would like to subset our data by only including rows where the last name is a common form of public transportation in San Francisco.

    CREATE TABLE NutrientSF as
    SELECT DISTINCT id2, d.*,
           avg(kcal) as avgkcal,
           avg(fatgm) as avgfat,
           avg(carbgm) as avgcarb,
           avg(proteingm) as avgprotein
    FROM demogidrt d
    LEFT JOIN event e ON d.sid = e.sid
    LEFT JOIN nutrients n ON e.questnum = n.intnum
    WHERE namelast in ('BART', 'CableCar', 'Ferry')
    GROUP BY id2
    ORDER BY id2;

    WARNING: Column named id2 is duplicated in a select expression (or a view). Explicit references to it will be to the first one.
    NOTE: The query requires remerging summary statistics back with the original data.
    WARNING: Variable id2 already exists on file WORK.NUTRIENTSF.
    NOTE: Table WORK.NUTRIENTSF created, with 3 rows and 9 columns.

Output for NUTRIENTSF:

    id2     id      education    sid    namelast    avgkcal    avgfat    avgcarb    avgprotein
    B003    b003    8            463    CableCar    1028.5     25.25     150.75     50.5
    B010    b010    8            469    BART

(The log reports 3 rows; the row for B002, Ferry, is not reproduced in this listing.)

With the use of IN we are able to include only the last names Ferry, CableCar and BART in the NUTRIENTSF table.

CONCLUSION

Using PROC SQL can add versatility to your SAS repertoire. The SAS programmer should now have a basic understanding of joining tables as well as an appreciation for some of the potential problems: case-sensitive join keys that silently drop rows, and summary functions that are remerged onto every detail row. By using PROC SQL, steps can be saved in presorting, grouping and joining tables without common variable names. While PROC SQL may not always be the right choice for your application, it adds another tool to your SAS toolbox.

ACKNOWLEDGMENTS

Thank you to Shirley Flatt and Martha White for their assistance in bringing this paper to fruition.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Susan Wancewicz
    University of California, San Diego
    swancewicz@ucsd.edu