Demystifying PROC SQL Join Algorithms Kirk Paul Lafler, Software Intelligence Corporation



Similar documents
Simple Rules to Remember When Working with Indexes Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

DATA Step versus PROC SQL Programming Techniques

Exploring DATA Step Merges and PROC SQL Joins

Downloading, Configuring, and Using the Free SAS University Edition Software

Creating HTML Output with Output Delivery System

Using DATA Step MERGE and PROC SQL JOIN to Combine SAS Datasets Dalia C. Kahane, Westat, Rockville, MD

Powerful and Sometimes Hard-to-find PROC SQL Features

Five Little Known, But Highly Valuable, PROC SQL Programming Techniques. a presentation by Kirk Paul Lafler

SAS Programming Tips, Tricks, and Techniques

Chapter 9 Joining Data from Multiple Tables. Oracle 10g: SQL

Using the SQL Procedure

Proficiency in JMP Visualization

Effective Use of SQL in SAS Programming

A Review of "Free" Massive Open Online Content (MOOC) for SAS Learners

Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

KEYWORDS ARRAY statement, DO loop, temporary arrays, MERGE statement, Hash Objects, Big Data, Brute force Techniques, PROC PHREG

Efficient Techniques and Tips in Handling Large Datasets Shilong Kuang, Kelley Blue Book Inc., Irvine, CA

Dynamic Dashboards Using Base-SAS Software

Efficiency Techniques for Beginning PROC SQL Users Kirk Paul Lafler, Software Intelligence Corporation

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

Lab # 5. Retreiving Data from Multiple Tables. Eng. Alaa O Shama

SQL SUBQUERIES: Usage in Clinical Programming. Pavan Vemuri, PPD, Morrisville, NC

Building a Better Dashboard Using Base SAS Software

AN INTRODUCTION TO THE SQL PROCEDURE Chris Yindra, C. Y. Associates

Top Ten SAS Performance Tuning Techniques

Fun with PROC SQL Darryl Putnam, CACI Inc., Stevensville MD

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff

Table Lookups: From IF-THEN to Key-Indexing

Paper An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois

Displaying Data from Multiple Tables

Oracle Database 12c: Introduction to SQL Ed 1.1

Talking to Databases: SQL for Designers

Oracle Database: SQL and PL/SQL Fundamentals

Paper TU_09. Proc SQL Tips and Techniques - How to get the most out of your queries

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database 11g: SQL Tuning Workshop

Alternatives to Merging SAS Data Sets But Be Careful

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole

Oracle 10g PL/SQL Training

Oracle Database 11g: SQL Tuning Workshop Release 2

Oracle SQL. Course Summary. Duration. Objectives

Oracle Database: SQL and PL/SQL Fundamentals NEW

Normalizing SAS Datasets Using User Define Formats

SQL Server Query Tuning

Displaying Data from Multiple Tables. Copyright 2004, Oracle. All rights reserved.

Unit 3. Retrieving Data from Multiple Tables

The SET Statement and Beyond: Uses and Abuses of the SET Statement. S. David Riba, JADE Tech, Inc., Clearwater, FL

MySQL for Beginners Ed 3

Displaying Data from Multiple Tables. Copyright 2004, Oracle. All rights reserved.

Displaying Data from Multiple Tables. Copyright 2006, Oracle. All rights reserved.

Oracle Database 10g: Introduction to SQL

2015 Workshops for Professors

UNIX Operating Environment

Guide to SQL Programming: SQL:1999 and Oracle Rdb V7.1

CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases

Release 2.1 of SAS Add-In for Microsoft Office Bringing Microsoft PowerPoint into the Mix ABSTRACT INTRODUCTION Data Access

SQL Query Evaluation. Winter Lecture 23

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

Utilizing Clinical SAS Report Templates Sunil Kumar Gupta Gupta Programming, Thousand Oaks, CA

Using SAS With a SQL Server Database. M. Rita Thissen, Yan Chen Tang, Elizabeth Heath RTI International, RTP, NC

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data

Using SQL Queries to Insert, Update, Delete, and View Data: Joining Multiple Tables. Lesson C Objectives. Joining Multiple Tables

Crystal Reports Form Letters Replace database exports and Word mail merges with Crystal's powerful form letter capabilities.

Performing Queries Using PROC SQL (1)

Parallel Data Preparation with the DS2 Programming Language

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

SAS. Cloud. Account Administrator s Guide. SAS Documentation

Programming with SQL

Oracle Database: Introduction to SQL

What's New in SAS Data Management

Subsetting Observations from Large SAS Data Sets

Let SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA

Business Systems Analysis Certificate Program. Millennium Communications & Training Inc. 2013, All rights reserved

Structured Query Language (SQL)

OBJECT_EXIST: A Macro to Check if a Specified Object Exists Jim Johnson, Independent Consultant, North Wales, PA

Alternative Methods for Sorting Large Files without leaving a Big Disk Space Footprint

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Join Example. Join Example Cart Prod Comprehensive Consulting Solutions, Inc.All rights reserved.

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

Relational Database: Additional Operations on Relations; SQL

PharmaSUG Paper TT03

Chapter 1 Overview of the SQL Procedure

Understanding SQL Server Execution Plans. Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at

Querying Microsoft SQL Server 2012

Database CIS 340. lab#6. I.Arwa Najdi

Database Programming with PL/SQL: Learning Objectives

Oracle Database: Introduction to SQL

Oracle Database: Introduction to SQL

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Guide to Operating SAS IT Resource Management 3.5 without a Middle Tier

Transcription:

Paper TU01 Demystifying PROC SQL Join Algorithms Kirk Paul Lafler, Software Intelligence Corporation ABSTRACT When it comes to performing PROC SQL joins, users supply the names of the tables for joining along with the join conditions, and the PROC SQL optimizer determines which of the available join algorithms to use for performing the join operation. Attendees explore nested loop joins, merge joins, index joins, and hash join algorithms along with selective options to control processing. This presentation illustrates the various join algorithms; join processes including Cartesian Product joins, equijoins, many table joins, and outer joins. INTRODUCTION The SQL procedure is a simple and flexible tool for joining tables of data together. Certainly many of the join techniques can be accomplished using other methods, but the simplicity and flexibility found in the SQL procedure makes it especially interesting, if not indispensable, as a tool for the information practitioner. This paper presents the importance of joins, the join algorithms, how joins are performed without a WHERE clause, with a WHERE clause, using table aliases, and with three tables of data. WHY JOIN ANYWAY? As relational database systems continue to grow in popularity, the need to access normalized data stored in separate tables becomes increasingly important. By relating matching values in key columns in one table with key columns in two or more tables, information can be retrieved as if the data were stored in one huge file. Consequently, the process of joining data from two or more tables can provide new and exciting insights into data relationships. SQL JOINS A join of two or more tables provides a means of gathering and manipulating data in a single SELECT statement. Joins are specified on a minimum of two tables at a time, where a column from each table is used for the purpose of connecting the two tables using a WHERE clause. Typically, the tables connecting columns have "like" values and the same datatype attributes since the join's success is often dependent on these values. EXAMPLE TABLES A relational database is simply a collection of tables. Each table contains one or more columns and one or more rows of data. The examples presented in this paper apply an example database consisting of three tables:,, and. Each table appears below. CUST_NO NAME CITY STATE 11321 John Smith Miami FL 44555 Alice Jones Baltimore MD 21713 Ryan Adams Atlanta GA CUST_NO MOVIE_NO RATING CATEGORY 44555 1011 PG-13 Adventure 21713 3090 G Comedy 44555 2198 G Comedy 37753 4456 PG Suspense MOVIE_NO LEAD_ACTOR 1011 Mel Gibson 2198 Clint Eastwood 3090 Sylvester Stallone 1

PROC SQL JOIN ALGORITHMS When it comes to performing PROC SQL joins, users supply the names of the tables for joining along with the join conditions, and the PROC SQL optimizer determines which of the available join algorithms to use for performing the join operation. There are three algorithms used in the process of performing a join: Nested Loop Join When an equality condition is not specified, a read of the complete contents of the right table is processed for each row in the left table. Merge Join When the tables specified are already in the desired sort order, resources will not need to be extended to rearranging the tables. Hash Join When an equality relationship exists, the smaller of the tables is able to fit in memory, no sort operations are required, and each table is read only once. JOINING TWO TABLES WITH A WHERE CLAUSE Joining two tables together is a relatively easy process in SQL. To illustrate how a join works, a two-table join is linked in the following diagram. Name City State Movie_no The following SQL code references a join on two tables with CUST_NO specified as the connecting column. SELECT * FROM, WHERE.CUST_NO =.CUST_NO; In this example, tables and are used. Each table has a common column, CUST_NO which is used to connect rows together from each when the value of CUST_NO is equal, as specified in the WHERE clause. A WHERE clause restricts what rows of data will be included in the resulting join. CREATING A CARTESIAN PRODUCT When a WHERE clause is omitted, all possible combinations of rows from each table is produced. This form of join is known as the Cartesian Product. Say for example you join two tables with the first table consisting of 10 rows and the second table with 5 rows. The result of these two tables would consist of 50 rows. Very rarely is there a need to perform a join operation in SQL where a WHERE clause is not specified. The primary importance of being aware of this form of join is to illustrate a base for all joins. Visually, the two tables would be combined without a corresponding WHERE clause as illustrated in the following diagram. Consequently, no connection between common columns exists. Cust_no Name City State Cust_no Movie_no To inspect the results of a Cartesian Product, you could submit the same code as before but without the WHERE clause. SELECT * FROM, ; 2

TABLE ALIASES Table aliases provide a "short-cut" way to reference one or more tables within a join operation. One or more aliases are specified so columns can be selected with a minimal number of keystrokes. To illustrate how table aliases in a join works, a two-table join is linked in the following diagram. M Cust_no A Lead_actor The following SQL code illustrates a join on two tables with MOVIE_NO specified as the connecting column. The table aliases are specified in the SELECT statement as qualified names, the FROM clause, and the WHERE clause. SELECT M.MOVIE_NO, M.RATING, A.LEADING_ACTOR FROM M, A WHERE M.MOVIE_NO = A.MOVIE_NO; JOINING THREE TABLES In an earlier example, you saw where customer information was combined with the movies they rented. You may also want to display the leading actor of each movie along with the other information. To do this, you will need to extract information from three different tables:,, and. A join with three tables follows the same rules as in a two-table join. Each table will need to be listed in the FROM clause with appropriate restrictions specified in the WHERE clause. To illustrate how a three table join works, the following diagram should be visualized. Name City State Lead_actor The following SQL code references a join on three tables with CUST_NO specified as the connecting column for the and tables, and MOVIE_NO as the connecting column for the and tables. SELECT C.CUST_NO, M.MOVIE_NO, M.RATING, M.CATEGORY, A.LEADING_ACTOR FROM C, M, A WHERE C.CUST_NO = M.CUST_NO AND M.MOVIE_NO = A.MOVIE_NO; 3

OUTER JOINS Generally a join is a process of relating rows in one table with rows in another. But occasionally, you may want to include rows from one or both tables that have no related rows. This concept is referred to as row preservation and is a significant feature offered by the outer join construct. There are operational and syntax differences between inner (natural) and outer joins. First, the maximum number of tables that can be specified in an outer join is two (the maximum number of tables that can be specified in an inner join is 32). Like an inner join, an outer join relates rows in both tables. But this is where the similarities end because the result table also includes rows with no related rows from one or both of the tables. This special handling of matched and unmatched rows of data is what differentiates an outer join from an inner join. An outer join can accomplish a variety of tasks that would require a great deal of effort using other methods. This is not to say that a process similar to an outer join can not be programmed it would probably just require more work. Let s take a look at a few tasks that are possible with outer joins: List all customer accounts with rentals during the month, including customer accounts with no purchase activity. Compute the number of rentals placed by each customer, including customers who have not rented. Identify movie renters who rented a movie last month, and those who did not. Another obvious difference between an outer and inner join is the way the syntax is constructed. Outer joins use keywords such as LEFT JOIN, RIGHT JOIN, and FULL JOIN, and has the WHERE clause replaced with an ON clause. These distinctions help identify outer joins from inner joins. Finally, specifying a left or right outer join is a matter of choice. Simply put, the only difference between a left and right join is the order of the tables they use to relate rows of data. As such, you can use the two types of outer joins interchangeably and is one based on convenience. EXPLORING OUTER JOINS Outer joins process data relationships from two tables differently than inner joins. In this section a different type of join, known as an outer join, will be illustrated. The following code example illustrates a left outer join to identify and match movie numbers from the and tables. The resulting output would contain all rows for which the SQL expression, referenced in the ON clause, matches both tables and retaining all rows from the left table () that did not match any row in the right () table. Essentially the rows from the left table are preserved and captured exactly as they are stored in the table itself, regardless if a match exists. SQL Code SELECT movies.movie_no, leading_actor, rating FROM LEFT JOIN ON movies.movie_no = actors.movie_no; The result of a Left Outer join is illustrated by the shaded area (A and AB) in the following diagram. A AB B 4

The next example illustrates the result of using a right outer join to identify and match movie titles from the and tables. The resulting output would contain all rows for which the SQL expression, referenced in the ON clause, matches in both tables (is true) and all rows from the right table () that did not match any row in the left () table. SQL Code SELECT movies.movie_no, actor_leading, rating FROM RIGHT JOIN ON movies.movie_no = actors.movie_no; The result of a Right Outer join is illustrated by the shaded area (AB and B) in the following diagram. A AB B CONCLUSION The SQL procedure provides a powerful way to join two or more tables of data. It's easy to learn and use. More importantly, since the SQL procedure follows the ANSI (American National Standards Institute) guidelines, your knowledge is portable to other platforms and vendor implementations. The simplicity and flexibility of performing joins with the SQL procedure makes it an especially interesting, if not indispensable, tool for the information practitioner. REFERENCES Lafler, Kirk Paul (2008), Exploring the Undocumented PROC SQL _METHOD Option, Proceedings of the SAS Global Forum (SGF) 2008 Conference, Software Intelligence Corporation, Spring Valley, CA, USA. Lafler, Kirk Paul (2007), Undocumented and Hard-to-find PROC SQL Features, Proceedings of the PharmaSUG 2007 Conference, Software Intelligence Corporation, Spring Valley, CA, USA. Lafler, Kirk Paul and Ben Cochran (2007), A Hands-on Tour Inside the World of PROC SQL Features, Proceedings of the SAS Global Forum (SGF) 2007 Conference, Software Intelligence Corporation, Spring Valley, CA, and The Bedford Group, USA. Lafler, Kirk Paul (2006), A Hands-on Tour Inside the World of PROC SQL, Proceedings of the 31 st Annual SAS Users Group International Conference, Software Intelligence Corporation, Spring Valley, CA, USA. Lafler, Kirk Paul (2005), Manipulating Data with PROC SQL, Proceedings of the 30 th Annual SAS Users Group International Conference, Software Intelligence Corporation, Spring Valley, CA, USA. Lafler, Kirk Paul (2004). PROC SQL: Beyond the Basics Using SAS, SAS Institute Inc., Cary, NC, USA. Lafler, Kirk Paul (2003), Undocumented and Hard-to-find PROC SQL Features, Proceedings of the Eleventh Annual Western Users of SAS Software Conference. Lafler, Kirk Paul (1992-2006). PROC SQL for Beginners; Software Intelligence Corporation, Spring Valley, CA, USA. Lafler, Kirk Paul (1998-2006). Intermediate Software Intelligence Corporation, Spring Valley, CA, USA. SAS Guide to the SQL Procedure: Usage and Reference, Version 6, First Edition (1990). SAS Institute, Cary, NC. SAS SQL Procedure User s Guide, Version 8 (2000). SAS Institute Inc., Cary, NC, USA. 5

TRADEMARK CITATIONS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. and other countries. indicates USA registration. ACKNOWLEDGMENTS I would like to thank Nancy Brucken and Xiaohui Wang, PharmaSUG 2009 Tutorials Section Co-Chairs for accepting this paper; as well as Syamala Ponnapalli, Academic Co-Chair and Ellen Brookstein, Operations Co-Chair for a great conference! ABOUT THE AUTHOR Kirk Paul Lafler is consultant and founder of Software Intelligence Corporation and has been programming in SAS since 1979. As a SAS Certified Professional and SAS Institute Alliance Member (1996 2002), Kirk provides IT consulting services and training to SAS users around the world. As the author of four books including PROC SQL: Beyond the Basics Using SAS (SAS Institute. 2004), he has written more than three hundred peer-reviewed papers and articles that have appeared in professional journals and SAS User Group proceedings. Kirk has also been an Invited speaker and trainer at more than three hundred SAS International, regional, local, and special-interest user group conferences and meetings throughout North America. His popular SAS Tips column, Kirk s Korner of Quick and Simple Tips, appears regularly in several SAS User Group newsletters and Web sites, and his fun-filled SASword Puzzles is featured in SAScommunity.org. Comments and suggestions can be sent to: Kirk Paul Lafler Software Intelligence Corporation World Headquarters P.O. Box 1390 Spring Valley, California 91979-1390 E-mail: KirkLafler@cs.com 6