Using DATA Step MERGE and PROC SQL JOIN to Combine SAS Datasets Dalia C. Kahane, Westat, Rockville, MD



ABSTRACT

This paper demonstrates important features of combining datasets in SAS. The ability to combine data from different sources and store the result in one convenient location is one of the best tools offered to the SAS programmer. Whether you merge data via the SAS DATA step or join data via PROC SQL, you need to be aware of important rules that you must follow. By reading this paper you will gain a deeper understanding of how the DATA Step MERGE works and will be able to compare it to parallel PROC SQL JOIN examples. You may then select what is best for your own programming style and data situation. The paper includes descriptions and examples that cover the following:

- how to combine data from multiple sources where records share a relationship of one-to-one, one-to-many or many-to-many
- the importance of the BY statement in the DATA Step and the WHERE clause in PROC SQL
- the dangers inherent in using (or not using) the BY and WHERE
- the importance of knowing your data and the fields that are common to all datasets or unique in each
- the syntax for properly performing different types of joins in SQL (inner vs. outer join, left vs. right join, etc.)

INTRODUCTION

The DATA step language with the MERGE and BY statements provides a powerful method for doing one-to-one combining of data. It can also be used as a powerful look-up tool where blocks of observations require combining with, or looking up information in, a corresponding single observation in another dataset. This is an important part of SAS programming. The SQL procedure offers another tool for combining datasets through a JOIN. In particular it offers the inner join, left join, right join, full join and natural join. This paper will look at the two different methods of combining data and compare them.
One-to-one, one-to-many and many-to-many relationships will be explored, and the importance of the MERGE BY and the JOIN WHERE will be described. But first let's look at some examples of why you might want to combine data:

- Educational data about students appear in multiple files, one per class; you want to combine the data to show a student's performance across all classes
- Your client wants to see educational performance data where you need to combine teachers' data with that of their students
- You have research data about physicians and you want to compare individual against national or regional averages, combining individual data with summary data
- You have two datasets, one containing data from parents and another containing data from children; you want to combine them, and parents may have multiple children and vice versa

All the examples used in this paper limit the merge/join to two source datasets, so that the rules can be more easily demonstrated for the beginner programmer. Also, in order to keep things simple, the examples are about toys and children with very few observations in each dataset/table.

DATASETS USED TO ILLUSTRATE COMBINING OF DATA

The datasets that will be used as examples for the purpose of discussing the Merge and Join actions include the following:

Toy: includes the code, description and company code for various toys

    Code  Description     CompanyCode
    1202  Princess        1000
    0101  Baby Doll       1000
    1316  Animal Set      1000
    3220  Model Train     1000
    3201  Electric Truck  1000
    4300  Animal Cards    2000
    4400  Teddy Bear      2000

ToyGenderAge: provides associated gender and recommended-age information for each toy

    Code  Gender  AgeRangeLow  AgeRangeHigh
    1202  F       6            9
    0101  F       4            9
    1316  B       3            6
    3220  M       6            9
    3201  M       6            9
    5500  M       2            6

Company: provides the company code and name for toy manufacturers

    CompanyCode  CompanyName
    1000         Kids Toys
    2000         More Toys

Factory: provides data on all factories associated with each company

    CompanyCode  FactoryCode  FactoryState
    1000         1111         MD
    1000         1112         NY
    1000         1113         VT
    2000         2221         AZ
    2000         2222         ME
    2000         2223         CA
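For readers who want to run the examples, the four tables above can be recreated with a few DATA steps. This is only a sketch: the paper does not state variable types or lengths, so treating Code as a character variable (to preserve the leading zero in 0101) and the company and factory codes as numeric are assumptions.

```sas
/* Sketch: recreate the example datasets from the tables above.          */
/* Assumption: Code is character ($4) so that 0101 keeps its leading     */
/* zero; all other types and lengths are guesses, not from the paper.    */
data Toy;
   infile datalines dsd;               /* comma-delimited input lines    */
   length Code $4 Description $20 CompanyCode 8;
   input Code Description CompanyCode;
   datalines;
1202,Princess,1000
0101,Baby Doll,1000
1316,Animal Set,1000
3220,Model Train,1000
3201,Electric Truck,1000
4300,Animal Cards,2000
4400,Teddy Bear,2000
;
run;

data ToyGenderAge;
   length Code $4 Gender $1;
   input Code Gender AgeRangeLow AgeRangeHigh;
   datalines;
1202 F 6 9
0101 F 4 9
1316 B 3 6
3220 M 6 9
3201 M 6 9
5500 M 2 6
;
run;

data Company;
   infile datalines dsd;
   length CompanyCode 8 CompanyName $20;
   input CompanyCode CompanyName;
   datalines;
1000,Kids Toys
2000,More Toys
;
run;

data Factory;
   length CompanyCode FactoryCode 8 FactoryState $2;
   input CompanyCode FactoryCode FactoryState;
   datalines;
1000 1111 MD
1000 1112 NY
1000 1113 VT
2000 2221 AZ
2000 2222 ME
2000 2223 CA
;
run;
```

The rows are entered in the same order as the tables, which matters for the default (no BY) merge discussed next, since that merge pairs observations purely by position.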

DATA STEP MERGE

SAS Merge allows the programmer to combine data from multiple datasets. Each observation from dataset one is combined with a corresponding observation in dataset two (and dataset three, etc.) [1]. Which observations and which data fields from the source datasets will be included in the resulting dataset is determined by the detailed instructions provided by the programmer.

DEFAULT MERGE: ONE-TO-ONE NO MATCHING

The default action taken by SAS when the code requests a merge between two datasets is to simply combine the observations one by one in the order that they appear in the datasets.

    data Merged_ToyGA_default;
      merge Toy ToyGenderAge;
    run;

    INFO: The variable Code on data set WORK.TOY will be overwritten by data set WORK.TOYGENDERAGE.
    NOTE: There were 7 observations read from the data set WORK.TOY.
    NOTE: There were 6 observations read from the data set WORK.TOYGENDERAGE.
    NOTE: The data set WORK.MERGED_TOYGA_DEFAULT has 7 observations and 6 variables.

    Code  Description     CompanyCode  Gender  AgeRangeLow  AgeRangeHigh
    1202  Princess        1000         F       6            9
    0101  Baby Doll       1000         F       4            9
    1316  Animal Set      1000         B       3            6
    3220  Model Train     1000         M       6            9
    3201  Electric Truck  1000         M       6            9
    5500  Animal Cards    2000         M       2            6
    4400  Teddy Bear      2000                 .            .

By default the SAS Merge does the following:

- combines in order: observation #1 from Toy with observation #1 from ToyGenderAge, then observation #2 with observation #2, etc.
- does not try to match on common variable(s); it simply pairs the observations in whatever order they happen to appear in the datasets
- keeps all observations from both datasets
- because the system option MergeNoBy is set to NOWARN by default, neither a warning nor an error message is given to alert the user that a merge is being performed without matching observations through a BY statement
- in a one-to-one merge, common variables that are not part of a BY statement behave as follows: the values coming in from the right dataset override those coming in from the left dataset (see the INFO message in the log above)

This is usually not what the programmer would want, since it would make much more sense to combine all data related to the toy Princess together, all data related to Electric Truck together, etc.

ONE-TO-ONE MATCH-MERGE KEEPING ALL OBSERVATIONS

A Match-Merge combines observations from multiple datasets into a single observation in the result dataset based on the values of one or more common variables. It is more effective for the programmer to take control of the SAS Merge process and use matching variable(s). In our example, the matching variable is the Toy Code. It is highly recommended that you do the following:

- set the system option MergeNoBy = ERROR; this ensures that if a MERGE statement is used without a corresponding BY statement, the log will present an error message to that effect
- use a BY statement in the DATA step to force SAS to put values into a single observation, combining data from observations in the source datasets where the BY variable has the same value
- make sure to sort all source datasets by the matching variable(s) listed in the BY statement; if the datasets are not sorted properly you will get error messages in the log
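The first recommendation can be sketched as follows. The dataset name Merged_NoBy is hypothetical, and the exact wording of the log message is not reproduced here; the point is only that the positional merge stops being silent.

```sas
/* Sketch: turn a MERGE without BY into an error instead of a silent     */
/* positional pairing. Merged_NoBy is a hypothetical dataset name.       */
options mergenoby=error;

data Merged_NoBy;
   merge Toy ToyGenderAge;   /* no BY statement: the log reports an ERROR */
run;

/* Restore the default if other code in the session relies on it. */
options mergenoby=nowarn;
```

MERGENOBY also accepts WARN, which is a milder choice when existing programs intentionally merge by position.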

Note that by default a match-merge keeps all observations from all datasets, but each final observation gets its data values only from the datasets that contribute an observation with a matching Toy Code value; where there is no contributing matching observation, the data values are set to missing.

    proc sort data=toy; by Code; run;
    proc sort data=toygenderage; by Code; run;

    data Merged_ToyGA_ByCode;
      merge Toy (keep=Code Description) ToyGenderAge;
      by Code;
    run;

    NOTE: There were 7 observations read from the data set WORK.TOY.
    NOTE: There were 6 observations read from the data set WORK.TOYGENDERAGE.
    NOTE: The data set WORK.MERGED_TOYGA_BYCODE has 8 observations and 5 variables.

    Code  Description     Gender  AgeRangeLow  AgeRangeHigh
    0101  Baby Doll       F       4            9
    1202  Princess        F       6            9
    1316  Animal Set      B       3            6
    3201  Electric Truck  M       6            9
    3220  Model Train     M       6            9
    4300  Animal Cards
    4400  Teddy Bear
    5500                  M       2            6

Note that:

- all observations from both datasets are kept
- where an identical Code value was found on both datasets, all variables get filled with data coming in from the corresponding source datasets
- therefore, for Codes 4300 and 4400 the combined observations contain Description values from the Toy dataset but no Gender or Age Range values from the ToyGenderAge dataset
- for Code 5500, the combined observation contains data values from the ToyGenderAge dataset but no Description from the Toy dataset

ONE-TO-ONE MATCH-MERGE KEEPING SOME OBSERVATIONS

In many cases we may want to further control which observations (of those that match) will actually be included in the final dataset. This is done via a subsetting IF statement combined with an IN= dataset option.
There are several different choices, such as:

- keep only those observations where a match is found on the BY variable in all source datasets
- keep all those that appear in the 1st dataset, whether or not a match is found in the 2nd dataset
- keep all those that appear in the 2nd dataset, whether or not a match is found in the 1st dataset

    data Merged_ToyGA_KeepOnlyMatched;
      merge Toy (in=a keep=Code Description) ToyGenderAge (in=b);
      by Code;
      if a and b;
    run;

    NOTE: There were 7 observations read from the data set WORK.TOY.
    NOTE: There were 6 observations read from the data set WORK.TOYGENDERAGE.
    NOTE: The data set WORK.MERGED_TOYGA_KEEPONLYMATCHED has 5 observations and 5 variables.

    Code  Description     Gender  AgeRangeLow  AgeRangeHigh
    0101  Baby Doll       F       4            9
    1202  Princess        F       6            9
    1316  Animal Set      B       3            6
    3201  Electric Truck  M       6            9
    3220  Model Train     M       6            9

There are several things to note here:

- the in= option, which uniquely identifies each source dataset
- the by Code statement, which ensures a one-to-one matching
- the if a and b statement, which instructs SAS to keep only those observations with matching Code values in both datasets; observations whose Code value appears in only one or the other dataset are excluded from the final combined dataset
- a statement of if a will ensure that all observations, and only observations, from dataset Toy are kept in the final dataset; the data values coming from ToyGenderAge will be set to missing when no matching observation with that Code value is found there; the result in the log is:

    NOTE: The data set WORK.MERGED_TOYGA_KEEPONLYLEFT has 7 observations and 5 variables.

- a statement of if b will ensure that all observations, and only observations, from dataset ToyGenderAge are kept in the final dataset; the data values coming from Toy will be set to missing when no matching observation with that Code value is found in the Toy dataset; the result in the log is:

    NOTE: The data set WORK.MERGED_TOYGA_KEEPONLYRIGHT has 6 observations and 5 variables.

ONE-TO-MANY MERGE

Using our example datasets, an example of a one-to-many merge is to combine the toys with their company name. There are many toys per company.

    proc sort data=toy; by CompanyCode; run;
    proc sort data=company; by CompanyCode; run;

    data Merged_ToyCompany;
      merge Toy Company;
      by CompanyCode;
    run;

    NOTE: There were 7 observations read from the data set WORK.TOY.
    NOTE: There were 2 observations read from the data set WORK.COMPANY.
    NOTE: The data set WORK.MERGED_TOYCOMPANY has 7 observations and 4 variables.
    Code  Description     CompanyCode  CompanyName
    0101  Baby Doll       1000         Kids Toys
    1202  Princess        1000         Kids Toys
    1316  Animal Set      1000         Kids Toys
    3201  Electric Truck  1000         Kids Toys
    3220  Model Train     1000         Kids Toys
    4300  Animal Cards    2000         More Toys
    4400  Teddy Bear      2000         More Toys

Important issues to note for the one-to-many merge are:

- In DATA Step MERGE the one-to-many and many-to-one merges are the same! The order of the dataset names within the MERGE statement has no significance
- The actual merge action still combines one observation from a dataset with one observation from another

- The data from each observation coming from the one side is retained through all merges with the many side (until a new observation comes in from the one side)
- Common variables that are not included in the BY statement should be renamed, since bringing them in may lead to errors. In a one-to-one merge a common variable's value coming from a later dataset simply overwrites the value from an earlier dataset (in the order that the datasets appear in the MERGE statement). This is not always true in the case of a one-to-many merge. It is good practice to always rename common variables as they come in from the source files and calculate a final value in the result dataset. Use the RENAME= option per dataset in the MERGE statement.

MANY-TO-MANY MERGE

The many-to-many merge refers to the instance where at least two source datasets contain multiple repeats of values in the BY variables used for match-merging. An example of a many-to-many merge using our datasets is to present all possible factories from which all toys of a given company may be shipped. Another way to describe this: show any factory from which any toy may be shipped.
    proc sort data=toy; by CompanyCode Code; run;
    proc sort data=factory; by CompanyCode FactoryCode; run;

    /* create a crosswalk dataset that tells us how many factories there are
       per company and assigns them numbers */
    data Factory(drop=cnt)
         Company_Xwalk(keep=CompanyCode FactoryNumber rename=(factorynumber=maxfactory));
      retain cnt 0;
      set Factory;
      by CompanyCode;
      if first.companycode then cnt=0;
      cnt+1;
      FactoryNumber=cnt;
      output factory;
      if last.companycode then output Company_Xwalk;
    run;

    /* prepare the toys dataset by merging with the company crosswalk */
    data ToyWithFN (drop=i);
      merge Toy(in=a) Company_Xwalk(in=b);
      by CompanyCode;
      if a;
      if not b then put "No Company record found for Toy: " _all_;
      /* per toy - output multiple records - one per factory */
      do i=1 to MaxFactory;
        FactoryNumber=i;
        output;
      end;
    run;

    /* we now finally merge Toys with Factories */
    proc sort data=toywithfn; by CompanyCode FactoryNumber Code; run;
    proc sort data=factory; by CompanyCode FactoryNumber; run;

    data Merged_ToyFactory (drop=maxfactory FactoryNumber);
      merge ToyWithFN(in=a) Factory(in=b);
      by CompanyCode FactoryNumber;
    run;

    NOTE: The data set WORK.COMPANY_XWALK has 2 observations and 2 variables.
    NOTE: The data set WORK.TOYWITHFN has 21 observations and 5 variables.
    NOTE: There were 21 observations read from the data set WORK.TOYWITHFN.
    NOTE: There were 6 observations read from the data set WORK.FACTORY.
    NOTE: The data set WORK.MERGED_TOYFACTORY has 21 observations and 5 variables.

    Code  Description     CompanyCode  FactoryCode  FactoryState
    0101  Baby Doll       1000         1111         MD
    1202  Princess        1000         1111         MD
    1316  Animal Set      1000         1111         MD
    3201  Electric Truck  1000         1111         MD
    3220  Model Train     1000         1111         MD
    0101  Baby Doll       1000         1112         NY
    1202  Princess        1000         1112         NY
    1316  Animal Set      1000         1112         NY
    3201  Electric Truck  1000         1112         NY
    3220  Model Train     1000         1112         NY
    0101  Baby Doll       1000         1113         VT
    1202  Princess        1000         1113         VT
    1316  Animal Set      1000         1113         VT
    3201  Electric Truck  1000         1113         VT
    3220  Model Train     1000         1113         VT
    4300  Animal Cards    2000         2221         AZ
    4400  Teddy Bear      2000         2221         AZ
    4300  Animal Cards    2000         2222         ME
    4400  Teddy Bear      2000         2222         ME
    4300  Animal Cards    2000         2223         CA
    4400  Teddy Bear      2000         2223         CA

Important notes for the many-to-many merge:

- As shown in the SAS code, the goal here is to combine many toys with many factories. The MERGE statement, however, still ends up combining observations from the 1st dataset with observations of the other datasets, one by one
- Before performing the merge, several steps are needed to prepare the source datasets: we use PROC SORT for each source dataset, we create a crosswalk dataset, etc.
- Much less work is needed to accomplish this task of combining many-to-many in the PROC SQL JOIN (described later in this paper)
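To see why the preparation steps above are needed, it can help to try the merge without them. The sketch below is not from the paper; Naive_ToyFactory is a hypothetical dataset name. Because MERGE pairs observations positionally within each BY group and retains the last observation of the shorter side, the result does not contain every toy-factory combination. With the example data it should yield 8 observations rather than 21, and SAS typically adds a log note that the MERGE statement has more than one data set with repeats of BY values.

```sas
/* Sketch: a direct match-merge of two datasets that both repeat the BY  */
/* value. This is shown as a counter-example, not as a recommended step. */
proc sort data=Toy;     by CompanyCode; run;
proc sort data=Factory; by CompanyCode; run;

data Naive_ToyFactory;           /* hypothetical dataset name */
   merge Toy Factory;
   by CompanyCode;               /* both inputs have several rows per company */
run;

proc print data=Naive_ToyFactory noobs;
run;
```

Comparing this listing with the 21-row table above makes the one-observation-at-a-time nature of MERGE easy to see.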

PROC SQL JOIN

PROC SQL (which implements the Structured Query Language) allows the user to combine tables through join-queries. Here we use different terminology: tables instead of datasets, join instead of merge, but the concept of combining data from multiple sources into a single resulting table is still the same. As described in the SAS Guide to the SQL Procedure [2], the PROC SQL FROM clause is used in a query-expression to specify the table(s), view(s), or query-expressions which are referred to here as the source tables, and which can be combined to produce the joined result.

In addition to the various types of joins (inner and outer) that are described in this section, the SQL procedure also offers the feature of equi-join. The equi-join refers to the constraint of a matching condition, based on equality, which may be added to any type of join. The true significance of an equi-join is speed. There may be different conditions which help speed and shape the result of a join. Examples include:

- equality between column values coming from the tables being joined
- comparison between calculated values, etc.

The WHERE clause or ON clause contains the conditions under which some rows are kept or eliminated in the result table. WHERE is used to select rows from inner joins. ON is used to select rows from inner or outer joins.

DEFAULT JOIN

When the user specifies a simple join-query (i.e. defines a FROM clause with multiple source tables but no WHERE or ON clause) the result is a Cartesian product. When two tables are joined, each row of table A is matched with all the rows of table B, thereby creating a result table whose row count equals the product of the number of rows in each of the source tables.

    proc sql;   /* see note 3 */
      create table Joined_ToyGA_default as
        select toy.*, tga.*
        from Toy as toy, ToyGenderAge as tga;
    quit;

    NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
    WARNING: Variable Code already exists on file WORK.JOINED_TOYGA_DEFAULT.
    NOTE: Table WORK.JOINED_TOYGA_DEFAULT created, with 42 rows and 6 columns.

Only the 7 rows that combine the Toy rows with the 1st row of ToyGenderAge are shown here (for 7 Toy rows combined with 6 ToyGenderAge rows there are a total of 42 rows).

    Code  Description     CompanyCode  Gender  AgeRangeLow  AgeRangeHigh
    1202  Princess        1000         F       6            9
    0101  Baby Doll       1000         F       6            9
    1316  Animal Set      1000         F       6            9
    3220  Model Train     1000         F       6            9
    3201  Electric Truck  1000         F       6            9
    4300  Animal Cards    2000         F       6            9
    4400  Teddy Bear      2000         F       6            9

Please note that:

- SAS indicates that a Cartesian product was involved and implicitly recommends that, when appropriate, indexes should be created and maintained
- The select statement includes toy.* and tga.*; an exact list of data fields should have been specified for each of these tables instead, in order to avoid the ugly WARNING that appeared in the log
- Unlike the DATA step Merge default, which creates a one-to-one row-to-row result, here we get a row-combined-with-all-rows product
- The Cartesian product is also produced if the SQL procedure cross join is requested
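The last two points above can be sketched together: an explicit CROSS JOIN with an exact column list. The table name Joined_ToyGA_cross and the column aliases ToyCode and TgaCode are hypothetical; they are introduced only so that both Code columns can be kept without a name clash.

```sas
/* Sketch: explicit CROSS JOIN with a named column list, so that no      */
/* "Variable Code already exists" warning is produced.                   */
/* ToyCode and TgaCode are illustrative aliases, not from the paper.     */
proc sql;
   create table Joined_ToyGA_cross as
   select toy.Code         as ToyCode,
          toy.Description,
          toy.CompanyCode,
          tga.Code         as TgaCode,
          tga.Gender,
          tga.AgeRangeLow,
          tga.AgeRangeHigh
   from Toy as toy cross join ToyGenderAge as tga;
quit;
```

The result still has one row per Toy/ToyGenderAge pair; only the column handling differs from the toy.*, tga.* version.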

INNER JOIN

An inner join returns a result table for all the rows in a table that have one or more matching rows in the other table(s), as specified by the sql-expression. A two-table inner join may be viewed as the intersection between table A and table B, as shown in the following Venn diagram [5].

[Figure: Venn diagram of an inner join, the intersection of table A and table B]

    create table Joined_ToyGA_InnerJoin as
      select toy.code, toy.description, tga.gender
      from Toy as toy, ToyGenderAge as tga
      where toy.code=tga.code;

    NOTE: Table WORK.JOINED_TOYGA_INNERJOIN created, with 5 rows and 3 columns.

    Code  Description     Gender
    1202  Princess        F
    0101  Baby Doll       F
    1316  Animal Set      B
    3220  Model Train     M
    3201  Electric Truck  M

Here it is important to note that:

- the result here is similar to the Match-Merge Keeping Some Observations example shown above (i.e. it is similar to a MERGE with BY and if a and b)
- an inner join may involve up to 32 tables [4]

LEFT JOIN

A LEFT JOIN is one type of OUTER JOIN where the result table includes all the observations from the left table, whether or not a match is found for them in any of the tables specified to the right. A LEFT JOIN between two tables may be represented graphically as shown in the following Venn diagram.

[Figure: Venn diagram of a left join]

    create table Joined_ToyGA_LeftJoin as
      select toy.code, toy.description, tga.gender
      from Toy as toy LEFT JOIN ToyGenderAge as tga
      on toy.code=tga.code;

    NOTE: Table WORK.JOINED_TOYGA_LEFTJOIN created, with 7 rows and 3 columns.

    Code  Description     Gender
    0101  Baby Doll       F
    1202  Princess        F
    1316  Animal Set      B
    3201  Electric Truck  M
    3220  Model Train     M
    4300  Animal Cards
    4400  Teddy Bear

Note that the result of a LEFT JOIN, when using a unique key and an equality condition, is similar to the Match-Merge with the BY statement and if a demonstrated above.

RIGHT JOIN

The RIGHT JOIN is identical to the LEFT JOIN except that the result table includes all the observations from the right table, whether or not a match is found for them in any of the tables specified to the left. Therefore, the result is similar to the Match-Merge with a BY statement and if b. It is important to note that in SQL, LEFT and RIGHT have specific meaning but in SAS MERGE they do not. In the latter, the order in which the tables appear is not important. What matters in the subsetting criteria is the IN= value, which identifies the table responsible for the inclusion criteria.

    create table Joined_ToyGA_RightJoin as
      select tga.code, toy.description, tga.gender
      from Toy as toy RIGHT JOIN ToyGenderAge as tga
      on toy.code=tga.code;

    NOTE: Table WORK.JOINED_TOYGA_RIGHTJOIN created, with 6 rows and 3 columns.

    Code  Description     Gender
    0101  Baby Doll       F
    1202  Princess        F
    1316  Animal Set      B
    3201  Electric Truck  M
    3220  Model Train     M
    5500                  M

It is important to note that:

- for this join, tga.code is purposely preserved (instead of toy.code) because otherwise those observations that contribute data only from the right table would contribute only values of Gender and nothing else, i.e. the Code value would stay missing too
- a better option is to use the SAS COALESCE function (as shown in the next example)
- the COALESCE function overlays the two Code columns (it returns the first value that is a SAS nonmissing value)
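For comparison, the DATA step counterpart of this RIGHT JOIN can be sketched as below. The paper reports only the log note for this variant, so the code itself is a reconstruction; the KEEP= lists are an added assumption, chosen so that the result carries the same three columns as the SQL query.

```sas
/* Sketch: match-merge equivalent of the RIGHT JOIN above.               */
/* The keep= lists are an assumption made to mirror the SQL select list. */
proc sort data=Toy;          by Code; run;
proc sort data=ToyGenderAge; by Code; run;

data Merged_ToyGA_KeepOnlyRight;
   merge Toy          (in=a keep=Code Description)
         ToyGenderAge (in=b keep=Code Gender);
   by Code;
   if b;        /* keep every ToyGenderAge observation, matched or not */
run;
```

Because Code is the BY variable, it is populated from whichever dataset contributes the observation, so no COALESCE-style overlay is needed on the DATA step side.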

FULL JOIN

When a join is specified as a FULL JOIN, the result table includes all the observations from the Cartesian product of the two tables for which the sql-expression is true, plus rows from each table that do not match any row in the other table. The visual representation of the full outer join is shown in the following Venn diagram.

[Figure: Venn diagram of a full outer join]

    create table Joined_ToyGA_FullJoin as
      select coalesce(toy.code, tga.code) as Code, toy.description, tga.gender
      from Toy as toy FULL JOIN ToyGenderAge as tga
      on toy.code=tga.code;

    NOTE: Table WORK.JOINED_TOYGA_FULLJOIN created, with 8 rows and 3 columns.

    Code  Description     Gender
    0101  Baby Doll       F
    1202  Princess        F
    1316  Animal Set      B
    3201  Electric Truck  M
    3220  Model Train     M
    4300  Animal Cards
    4400  Teddy Bear
    5500                  M

Note that:

- a full join which uses a unique key and an equality condition is similar to the default match-merge which keeps all observations from all datasets
- a full join is limited to two source tables

ONE-TO-MANY JOIN

The following code performs an inner join where the many toys per company are combined with the company name data. Only records that match on Company Code are kept. The one-to-many join may be performed using outer joins as well. What is significant here is that the value of Company Name from the Company table is added to each matching toy record.

    create table Joined_ToyCompany as
      (select toy.*, c.*
       from Toy as toy, Company as c
       where toy.companycode=c.companycode);

    NOTE: Table WORK.JOINED_TOYCOMPANY created, with 7 rows and 4 columns.

    Code  Description     CompanyCode  CompanyName
    1202  Princess        1000         Kids Toys
    0101  Baby Doll       1000         Kids Toys
    1316  Animal Set      1000         Kids Toys
    3220  Model Train     1000         Kids Toys
    3201  Electric Truck  1000         Kids Toys
    4300  Animal Cards    2000         More Toys
    4400  Teddy Bear      2000         More Toys

MANY-TO-MANY JOIN

The many-to-many join functions in the same manner as a default join with an equality condition. Since the first step in any SQL join is, conceptually, building the Cartesian product of the tables, a many-to-many join is performed automatically and is then simply restricted by the WHERE clause.

    create table Joined_ToyFactory as
      select toy.*, f.factorycode, f.factorystate
      from Toy as toy, Factory as f
      where toy.companycode=f.companycode;

    NOTE: Table WORK.JOINED_TOYFACTORY created, with 21 rows and 5 columns.

    Code  Description     CompanyCode  FactoryCode  FactoryState
    1202  Princess        1000         1111         MD
    1202  Princess        1000         1112         NY
    1202  Princess        1000         1113         VT
    0101  Baby Doll       1000         1111         MD
    0101  Baby Doll       1000         1112         NY
    0101  Baby Doll       1000         1113         VT
    1316  Animal Set      1000         1111         MD
    1316  Animal Set      1000         1112         NY
    1316  Animal Set      1000         1113         VT
    3220  Model Train     1000         1111         MD
    3220  Model Train     1000         1112         NY
    3220  Model Train     1000         1113         VT
    3201  Electric Truck  1000         1111         MD
    3201  Electric Truck  1000         1112         NY
    3201  Electric Truck  1000         1113         VT
    4300  Animal Cards    2000         2221         AZ
    4300  Animal Cards    2000         2222         ME
    4300  Animal Cards    2000         2223         CA
    4400  Teddy Bear      2000         2221         AZ
    4400  Teddy Bear      2000         2222         ME
    4400  Teddy Bear      2000         2223         CA
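Since ON may be used to select rows from inner joins as well as outer joins, the same many-to-many join can also be written with the INNER JOIN ... ON syntax. The sketch below is an alternative spelling, not the paper's code; the table name Joined_ToyFactory_On and the ORDER BY clause are additions, the latter included only to make the row order explicit so that it lines up with the DATA step result shown earlier.

```sas
/* Sketch: the many-to-many join written with INNER JOIN ... ON.         */
/* Joined_ToyFactory_On is a hypothetical table name; the ORDER BY is an */
/* addition that fixes the row order (company, factory, toy).            */
proc sql;
   create table Joined_ToyFactory_On as
   select toy.*, f.FactoryCode, f.FactoryState
   from Toy as toy inner join Factory as f
        on toy.CompanyCode = f.CompanyCode
   order by toy.CompanyCode, f.FactoryCode, toy.Code;
quit;
```

Without an ORDER BY clause, PROC SQL does not promise any particular row order, which is one reason the listing above differs in order from the MERGE output.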

COMPARING MERGE TO JOIN

The following tables provide a quick reference for comparing the functionality of the MERGE and JOIN (see Note 6).

Characteristics which dominate how one should think of these two methods for combining data:

MERGE
- One-to-one oriented (consequently limited, but offers very tight control)
- Observations are read once (SAS data retained)
- BY variables must have the same name, type and length in all source tables
- The order of the data sets should not matter; therefore no variables (other than the BY variables) should be in common, and variables should be renamed if needed
- IN= variables are not reset by the system unless new data is read
- Has a missing value concept (supports special missing values, so it is possible for each special missing value to represent a different meaning for numeric variables)

JOIN
- Cartesian product oriented
- SQL expects you to use WHERE or ON conditions to control the matching process (different column names and manipulations are possible)
- Has a single null value (foreign to SAS and cannot be used in arithmetic expressions)

Similarities:

MERGE: Match-merge of two datasets (merge with "BY common variable(s)" but without a subsetting "IF")
JOIN:  Full outer join using the ON clause with equality as the matching condition

MERGE: Merge of two datasets with "BY common variable(s)" and "IF a AND b" (where a indicates the left dataset and b indicates the right dataset)
JOIN:  Inner join using a WHERE clause with equality as the matching condition

MERGE: Merge with "BY common variable(s)" and "IF a"
JOIN:  Left join using the ON clause with equality as the matching condition

MERGE: Merge with "BY common variable(s)" and "IF b"
JOIN:  Right join using the ON clause with equality as the matching condition

Differences:

MERGE: Default is a one-to-one "outer join" using the given order of observations in the source datasets
JOIN:  Default is the Cartesian product, joining all rows from one table with all the rows in the other table

MERGE: Match-merge results by default in a special case of "full outer join"
JOIN:  Match-join results by default in an inner join

MERGE: "Outer joins" may be done on multiple source datasets
JOIN:  Outer joins can be done on only two source tables at a time

MERGE: Need to sort or index the source datasets before the DATA step match-merge
JOIN:  No need to sort or index the source tables before the join

MERGE: Requires a variable or variables that are identical on all datasets
JOIN:  May join on columns which are named differently (see Note 7)

Differences (continued):

MERGE: In a one-to-one match-merge, for common variables that are not included in the match condition (i.e. are not part of the BY statement), the value from the later dataset sometimes overwrites the value coming from the left-most dataset
JOIN:  Common columns are not overwritten unless this is specifically managed by the code via the COALESCE function

MERGE: Impossible to do a many-to-many merge due to the one-to-one behavior of MERGE
JOIN:  Relatively easy to do a many-to-many join

MERGE: Fewer advantages when working with tables stored in database servers, since databases are designed with extensive matching tools of their own
JOIN:  Most advantageous when working with tables stored in database servers, because databases are designed for SQL processing

MERGE: Usually more efficient when combining small datasets, but too much overhead on large unsorted datasets
JOIN:  More efficient than MERGE when dealing with multiple large datasets (or when combining an indexed large table with a small table)

If you are trying to choose between using the MERGE and the JOIN, take into account the items listed in comparing the two methods, but also select the method that is easier for you to code and to maintain! If you are planning to run the code repeatedly in a production environment, it's a good idea to test both methods for the specific conditions of the task, and make the decision after evaluating the results and the performance indicators.

CONCLUSION

The DATA Step MERGE and the PROC SQL JOIN are viewed as two alternatives for use in combining multiple data sources. This paper provides basic information about both methods for beginning SAS programmers. It is intended to be used as a starting point and a reference. Programmers need to do a lot of trial-and-error experimentation with both of these methods of combining data. Most importantly, programmers need to know their data well before attempting to combine observations.
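As a starting point for that kind of experimentation, here is a minimal sketch (not the author's code) that runs two of the MERGE/JOIN pairings from the comparison side by side: a match-merge with "IF a AND b" against an inner join, and a plain match-merge against a full join with COALESCE. The Toy and ToyGenderAge data sets are rebuilt from the output listings earlier in the paper, so their values are assumptions; the age column of ToyGenderAge is not shown there and is omitted. The output table names (Merged_Both, Joined_Inner, Merged_All, Joined_Full) are hypothetical.

```sas
/* Data rebuilt from the output listings (assumed values; age column omitted) */
data Toy;
  infile datalines dsd;
  length code $4 description $20;
  input code $ description $ companycode;
  datalines;
0101,Baby Doll,1000
1202,Princess,1000
1316,Animal Set,1000
3201,Electric Truck,1000
3220,Model Train,1000
4300,Animal Cards,2000
4400,Teddy Bear,2000
;
run;

data ToyGenderAge;
  infile datalines dsd;
  length code $4 gender $1;
  input code $ gender $;
  datalines;
0101,F
1202,F
1316,B
3201,M
3220,M
5500,M
;
run;

/* MERGE requires both inputs sorted (or indexed) by the BY variable */
proc sort data=Toy;          by code; run;
proc sort data=ToyGenderAge; by code; run;

/* Match-merge keeping only codes found in both data sets */
data Merged_Both;
  merge Toy(in=a) ToyGenderAge(in=b);
  by code;
  if a and b;
run;

/* Default match-merge: keeps all observations from both data sets */
data Merged_All;
  merge Toy ToyGenderAge;
  by code;
run;

proc sql;
  /* Inner join counterpart of "IF a AND b" */
  create table Joined_Inner as
  select toy.code, toy.description, toy.companycode, tga.gender
  from Toy as toy, ToyGenderAge as tga
  where toy.code = tga.code
  order by toy.code;

  /* Full join counterpart of the default match-merge */
  create table Joined_Full as
  select coalesce(toy.code, tga.code) as code,
         toy.description, toy.companycode, tga.gender
  from Toy as toy full join ToyGenderAge as tga
  on toy.code = tga.code
  order by 1;
quit;

/* Compare each MERGE result with its JOIN counterpart */
proc compare base=Merged_Both compare=Joined_Inner; run;
proc compare base=Merged_All  compare=Joined_Full;  run;
```

With the values assumed here, the inner pairing should yield 5 rows and the full pairing 8 rows. The pairing holds because code is unique in both data sets; with repeated key values on both sides the two methods are expected to diverge, which is worth trying as a follow-up experiment.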
NOTES

1 Refer to the SAS Language: Reference, chapter 9, description of the Merge statement.
2 Refer to the SAS Guide to the SQL Procedure, chapter 5, joined-table section.
3 In the first example of the SQL Join code segment the PROC SQL and QUIT statements are included; they are implicit in all later example code segments.
4 Refer to the SAS Help for SAS 9.3, section "The SQL Procedure: Joined-Table".
5 All Venn diagram images were taken from www.codinghorror.com by Jeff Atwood.
6 Other excellent comparisons of the MERGE and JOIN (looking at other aspects of code and performance) are provided in papers by Kirk Lafler and Malachy J. Foley. Also, for a comprehensive description of the DATA Step MERGE, please review the paper by Howard Schreier.
7 Refer to the short description by Stephen Philp.

REFERENCES

Foley, Malachy J. (2005), MERGING vs. JOINING: Comparing the DATA Step with SQL, Proceedings of the 30th Annual SAS Users Group International Conference.

Lafler, Kirk Paul (2006), A Hands-on Tour Inside the World of PROC SQL, Proceedings of the 31st Annual SAS Users Group International Conference, Software Intelligence Corporation, Spring Valley, CA, USA.

Philp, Stephen (2008), Data Steps: SAS SQL Join, datasteps.blogspot.com/2008/04/sas-sql-join.html

Schreier, Howard (2005), Let Your Data Power Your DATA Step: Making Effective Use of the SET, MERGE, UPDATE, and MODIFY Statements, Proceedings of the 30th Annual SAS Users Group International Conference.

SAS Institute Inc. (1990), SAS Language: Reference, Version 6, 1st Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1989), SAS Guide to the SQL Procedure: Usage and Reference, Version 6, 1st Edition, Cary, NC: SAS Institute Inc.

DISCLAIMER

The contents of this paper are the work of the author and do not necessarily represent the opinions, recommendations, or practices of Westat.

ACKNOWLEDGEMENTS

I would like to thank Ian Whitlock, who reviewed my paper, provided comments, and in his usual fashion made this too a true learning experience for me.

CONTACT INFORMATION

If you have any questions or comments about the paper, you can reach the author at:

Dalia Kahane, Ph.D.
Westat
1650 Research Blvd.
Rockville, MD 20850
Kahaned1@Westat.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.