a presentation by Kirk Paul Lafler SAS Consultant, Author, and Trainer E-mail: KirkLafler@cs.com



Similar documents
Five Little Known, But Highly Valuable, PROC SQL Programming Techniques. a presentation by Kirk Paul Lafler

Demystifying PROC SQL Join Algorithms Kirk Paul Lafler, Software Intelligence Corporation

SAS Programming Tips, Tricks, and Techniques

Paper BB MOVIES Table

DATA Step versus PROC SQL Programming Techniques

Exploring DATA Step Merges and PROC SQL Joins

Powerful and Sometimes Hard-to-find PROC SQL Features

Simple Rules to Remember When Working with Indexes Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

Conditional Processing Using the Case Expression in PROC SQL Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

Alternative Methods for Sorting Large Files without leaving a Big Disk Space Footprint

PharmaSUG Paper TT03

Oracle Database 11g: SQL Tuning Workshop

C H A P T E R 1 Introducing Data Relationships, Techniques for Data Manipulation, and Access Methods

Efficiency Techniques for Beginning PROC SQL Users Kirk Paul Lafler, Software Intelligence Corporation

Chapter 9 Joining Data from Multiple Tables. Oracle 10g: SQL

Stars and Models: How to Build and Maintain Star Schemas Using SAS Data Integration Server in SAS 9 Nancy Rausch, SAS Institute Inc.

SQL Query Evaluation. Winter Lecture 23

Using DATA Step MERGE and PROC SQL JOIN to Combine SAS Datasets Dalia C. Kahane, Westat, Rockville, MD

SQL Server Query Tuning

Diving into SAS Software with the SQL Procedure Kirk Paul Lafler, Software Intelligence Corporatron

Downloading, Configuring, and Using the Free SAS University Edition Software

Introduction to Querying & Reporting with SQL Server

Joining Data: Data Step Merge or SQL? Harry Droogendyk, Stratia Consulting Inc., Lynden, ON Faisal Dosani, RBC Royal Bank, Toronto, ON

Fun with PROC SQL Darryl Putnam, CACI Inc., Stevensville MD

Lab # 5. Retreiving Data from Multiple Tables. Eng. Alaa O Shama

Oracle Database 12c: Introduction to SQL Ed 1.1

Oracle Database 11g: SQL Tuning Workshop Release 2

Query tuning by eliminating throwaway

Chapter 13: Query Processing. Basic Steps in Query Processing

Querying Microsoft SQL Server 2012

A Day in the Life of Data Part 2

Improving Maintenance and Performance of SQL queries

Paper TU_09. Proc SQL Tips and Techniques - How to get the most out of your queries

MOC 20461C: Querying Microsoft SQL Server. Course Overview

Database Programming with PL/SQL: Learning Objectives

Querying Microsoft SQL Server 2012

Course ID#: W 35 Hrs. Course Content

Course 10774A: Querying Microsoft SQL Server 2012

Course 10774A: Querying Microsoft SQL Server 2012 Length: 5 Days Published: May 25, 2012 Language(s): English Audience(s): IT Professionals

Displaying Data from Multiple Tables. Copyright 2006, Oracle. All rights reserved.

Introducing Microsoft SQL Server 2012 Getting Started with SQL Server Management Studio

Querying Microsoft SQL Server

Oracle BI 11g R1: Build Repositories

Guide to SQL Programming: SQL:1999 and Oracle Rdb V7.1

Using the SQL Procedure

About PivotTable reports

Querying Microsoft SQL Server Course M Day(s) 30:00 Hours

Using SAS Views and SQL Views Lynn Palmer, State of California, Richmond, CA

The Query Builder: The Swiss Army Knife of SAS Enterprise Guide

Relational Database: Additional Operations on Relations; SQL

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Creating HTML Output with Output Delivery System

Oracle Database: SQL and PL/SQL Fundamentals

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff

Database Design Strategies in CANDAs Sunil Kumar Gupta Gupta Programming, Simi Valley, CA

Inside the PostgreSQL Query Optimizer

Unit 3. Retrieving Data from Multiple Tables

Course 20461C: Querying Microsoft SQL Server Duration: 35 hours

Querying Microsoft SQL Server 20461C; 5 days

Performance Tuning for the Teradata Database

Advanced Oracle SQL Tuning

Querying Microsoft SQL Server 2012

FHE DEFINITIVE GUIDE. ^phihri^^lv JEFFREY GARBUS. Joe Celko. Alvin Chang. PLAMEN ratchev JONES & BARTLETT LEARN IN G. y ti rvrrtuttnrr i t i r

Relational Databases

Oracle Database: SQL and PL/SQL Fundamentals NEW

Displaying Data from Multiple Tables. Copyright 2004, Oracle. All rights reserved.

DBMS / Business Intelligence, SQL Server

Information Systems SQL. Nikolaj Popov

Useful Business Analytics SQL operators and more Ajaykumar Gupte IBM

PROC SQL. Beyond the Basics Using SAS Second Edition. Kirk Paul Lafler

Oracle Database: SQL and PL/SQL Fundamentals

9.1 SAS. SQL Query Window. User s Guide

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

MOC QUERYING MICROSOFT SQL SERVER

AV-005: Administering and Implementing a Data Warehouse with SQL Server 2014

Oracle Database: SQL and PL/SQL Fundamentals NEW

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole

A Closer Look at PROC SQL s FEEDBACK Option Kenneth W. Borowiak, PPD, Inc., Morrisville, NC

Querying Microsoft SQL Server (20461) H8N61S

Oracle Database: Introduction to SQL

UNIVERSE BEST PRACTICES. Rob Rohloff, OEM Sales Consultant

Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

What's New in SAS Data Management

Top Ten SAS Performance Tuning Techniques

Basics of Dimensional Modeling

Join Example. Join Example Cart Prod Comprehensive Consulting Solutions, Inc.All rights reserved.

SAP BO Course Details

Effective Use of SQL in SAS Programming

Oracle Database 10g: Introduction to SQL

The Database is Slow

MySQL for Beginners Ed 3

Normalizing SAS Datasets Using User Define Formats

Where? Originating Table Employees Departments

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Oracle SQL. Course Summary. Duration. Objectives

Alternatives to Merging SAS Data Sets But Be Careful

An Oracle White Paper June Creating an Oracle BI Presentation Layer from Imported Oracle OLAP Cubes

Lecture 4: SQL Joins. Morgan C. Wang Department of Statistics University of Central Florida

Displaying Data from Multiple Tables. Copyright 2004, Oracle. All rights reserved.

Oracle USF

Transcription:

a presentation by Kirk Paul Lafler SAS Consultant, Author, and Trainer E-mail: KirkLafler@cs.com 1

Copyright Kirk Paul Lafler, 1992-2010. All rights reserved. SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA. SAS Certified Professional is the trademark of SAS Institute Inc., Cary, NC, USA. All other company and product names mentioned are used for identification purposes only and may be trademarks of their respective owners. Copyright 1992-2010 by Kirk Paul Lafler 2

Objectives What is data federation? Characteristics associated with data federation What is a join? Why join? SAS DATA step merge versus a Join Cartesian product joins Two table joins Table aliases to reference tables Three table joins Left and Right outer joins What happens during a join? Available join algorithms Copyright 1992-2010 by Kirk Paul Lafler 3

Tables Used in Examples Movies Actors Copyright 1992-2010 by Kirk Paul Lafler 4

What is Data Federation? Process of integrating data from many different sources Leaves data in place without using resources to copy data Makes access to data sources as easy as possible Provides a degree of reusability A data federation approach often replaces a data warehouse Copyright 1992-2010 by Kirk Paul Lafler

Data Federation Characteristics It represents an integration approach Ability to aggregate data from many different sources Data sources can be in any location It can be implemented within any lifecycle methodology Provides flexibility It is user-centric Federated data does not copy and store data like a data warehouse It contains metadata (information about the actual data and its location) Is NOT always designed with optimality in mind Copyright 1992-2010 by Kirk Paul Lafler

What is a Join? Process of combining tables side-by-side (horizontally) Consists of a matching process between rows in tables Some or all of the tables contents are brought together Gather and manipulate data from across tables Visually, it would look something like this: Table One Table Two... Copyright 1992-2010 by Kirk Paul Lafler 7

Why Join? Data in a database is often stored in separate tables Joins allow data to be combined as if it were stored in one huge file Provide exciting insights into data relationships Types of joins: Inner joins a maximum of 256 tables can be joined Outer joins a maximum of 2 tables can be joined Copyright 1992-2010 by Kirk Paul Lafler 8

DATA Step Merge versus a Join Merges process data differently than a standard join The merge process overlays the duplicate by-column Joins adhere to ANSI guidelines Joins do not automatically overlay the duplicate matching column Copyright 1992-2010 by Kirk Paul Lafler 9

DATA Step Merge Process Customers Cust_no Name 3 Ryan 5 Anna-liese 10 Ronnie = = X Movies Cust_no Category 3 Adventure 5 Comedy 7 Suspense DATA merged; MERGE customers (IN=c) movies (IN=m); BY cust_no; IF c AND m; RUN; Merged Cust_no Name Category 3 Ryan Adventure 5 Anna-liese Comedy Copyright 1992-2010 by Kirk Paul Lafler 10

Join Process Customers Cust_no Name 3 Ryan 5 Anna-liese 10 Ronnie = = X Movies Cust_no Category 3 Adventure 5 Comedy 7 Suspense PROC SQL; SELECT * FROM customers, movies WHERE customers.cust_no = movies.cust_no; QUIT; Cust_no Name Cust_no Category 3 Ryan 3 Adventure 5 Anna-liese 5 Comedy Copyright 1992-2010 by Kirk Paul Lafler 11

Merge versus Join Results Merge Cust_no Name Category 3 Ryan Adventure 5 Anna-liese Comedy Versus Join Cust_no Name Cust_no Category 3 Ryan 3 Adventure 5 Anna-liese 5 Comedy Features 1. Data must be sorted using by-value. 2. Requires variable names to be same. 3. Duplicate matching column is overlaid. 4. Results are not automatically printed. Features 1. Data does not have to be sorted using by-value. 2. Does not require variable names to be same. 3. Duplicate matching column is not overlaid. 4. Results are automatically printed unless NOPRINT option is specified. Copyright 1992-2010 by Kirk Paul Lafler 12

Cartesian Product Join (Cross Join) Customers Cust_no Name 3 Ryan 5 Anna-liese 10 Ronnie No WHERE or Key Used Movies Cust_no Category 3 Adventure 5 Comedy 7 Suspense Result represents all possible combinations of rows and columns PROC SQL; SELECT * FROM customers, movies; QUIT; Absence of WHERE clause produces a Cartesian product Cust_no Name Cust_no Movie_no Category 3 Ryan 3 1011 Adventure 3 Ryan 5 3090 Comedy 3 Ryan 7 4456 Suspense 5 Anna-liese 3 1011 Adventure 5 Anna-liese 5 3090 Comedy 5 Anna-liese 7 4456 Suspense 10 Ronnie 3 1011 Adventure 10 Ronnie 5 3090 Comedy 10 Ronnie 7 4456 Suspense Copyright 1992-2010 by Kirk Paul Lafler 13

Example Cartesian Product A Cartesian product join, sometimes referred to as a cross join, can be very large because it represents all the possible combinations of rows and columns. PROC SQL; SELECT * FROM MOVIES, QUIT; ACTORS; Copyright 1992-2010 by Kirk Paul Lafler 14

Equi-Join with Two Tables The result of an Equi-join is illustrated by the shaded area (AB) in the Venn diagram. A AB B Copyright 1992-2010 by Kirk Paul Lafler 15

Equi-Join with Two Tables Customers Cust_no Name 3 Ryan 5 Anna-liese 10 Ronnie = Equi Join Movies Cust_no Category 3 Adventure 5 Comedy 7 Suspense PROC SQL; SELECT * FROM customers, movies WHERE customers.cust_no = movies.cust_no; QUIT; Cust_no Name Cust_no Movie_no Category 3 Ryan 3 1011 Adventure 5 Anna-liese 5 3090 Comedy Copyright 1992-2010 by Kirk Paul Lafler 16

Example WHERE-clause The most reliable way to join two tables together, and to avoid creating a Cartesian product, is to use a WHERE clause with common columns or keys. PROC SQL; SELECT MOVIES.TITLE, RATING, ACTOR_LEADING FROM MOVIES, QUIT; ACTORS WHERE MOVIES.TITLE = ACTORS.TITLE; Copyright 1992-2010 by Kirk Paul Lafler 17

Table Aliases Customers Cust_no Name 3 Ryan 5 Anna-liese 10 Ronnie = Equi Join Movies Cust_no Category 3 Adventure 5 Comedy 7 Suspense PROC SQL; SELECT * FROM customers c, movies m WHERE c.cust_no = m.cust_no; QUIT; Aliases Cust_no Name Cust_no Movie_no Category 3 Ryan 3 1011 Adventure 5 Anna-liese 5 3090 Comedy Copyright 1992-2010 by Kirk Paul Lafler 18

Example Table Alias Assigning a table alias is a not only a useful way to reference a table, but can reduce the number of keystrokes typed. PROC SQL; SELECT M.TITLE, RATING, ACTOR_LEADING FROM MOVIES M, QUIT; ACTORS A WHERE M.TITLE = A.TITLE; Copyright 1992-2010 by Kirk Paul Lafler 19

Joining Three Tables Customers Movies Actors Cust_no Name = Equi Join Cust_no Movie_no Category = Equi Join Movie_no Lead_actor PROC SQL; SELECT c.cust_no, c.name, m.movie_no, c.category, a.lead_actor FROM customers c, movies m, actors a WHERE c.cust_no = m.cust_no AND m.movie_no = a.movie_no; QUIT; Cust_no Name Movie_no Category Lead_actor 3 Ryan 1011 Adventure Mel Gibson Copyright 1992-2010 by Kirk Paul Lafler 20

Left Outer Joins The result of a Left Outer join is illustrated by the shaded areas (A and AB) in the Venn diagram. A AB B Copyright 1992-2010 by Kirk Paul Lafler 21

Example Left Outer Join The result of a Left Outer join produces both matched rows from both tables plus any unmatched rows from the left table. PROC SQL; SELECT MOVIES.TITLE, RATING, ACTOR_LEADING FROM MOVIES LEFT JOIN QUIT; ACTORS ON MOVIES.TITLE = ACTORS.TITLE; Copyright 1992-2010 by Kirk Paul Lafler 22

Right Outer Joins The results of a Right Outer join is illustrated by the shaded areas (B and AB) in the Venn diagram. A AB B Copyright 1992-2010 by Kirk Paul Lafler 23

Example Right Outer Join The result of a Right Outer join produces matched rows from both tables while preserving unmatched rows from the right table. PROC SQL; SELECT MOVIES.TITLE, RATING, ACTOR_LEADING FROM MOVIES RIGHT JOIN QUIT; ACTORS ON MOVIES.TITLE = ACTORS.TITLE; Copyright 1992-2010 by Kirk Paul Lafler 24

What Happens during a Join? When joining two tables: An intermediate Cartesian product is built from the two tables Rows are selected that match the WHERE clause, if present When joining more than two tables: SQL query optimizer evaluates the available methods for retrieving the data and attempts to use the most efficient method The join is reconstructed into several two-way joins Removes unwanted rows and columns from the intermediate tables Determines the order of processing to reduce the size of the intermediate Cartesian product Copyright 1992-2010 by Kirk Paul Lafler 25

Join Algorithms Users supply the list of tables for joining along with the join conditions, and the PROC SQL optimizer determines which join algorithm to use for performing the join. The algorithms include: Copyright 1992-2010 by Kirk Paul Lafler 26

Join Algorithms Users supply the list of tables for joining along with the join conditions, and the PROC SQL optimizer determines which of the join algorithms to use for performing the join. The algorithms include: Nested Loop Join (brute-force join) When an equality condition is not specified, a read of the complete contents of the right table is processed for each row in the left table. Copyright 1992-2010 by Kirk Paul Lafler 27

Nested Loop Join - Features Used with join relations of two tables One or both of the tables is relatively small I/O intensive This join generally performs fairly well with smaller tables, but generally performs poorly with larger join relations Copyright 1992-2010 by Kirk Paul Lafler 28

Join Algorithms Users supply the list of tables for joining along with the join conditions, and the PROC SQL optimizer determines which of the join algorithms to use for performing the join. The algorithms include: Nested Loop Join When an equality condition is not specified, a read of the complete contents of the right table is processed for each row in the left table. Sort-Merge Join When the specified tables are already in the desired sort order, resources are not expended for resorting. Copyright 1992-2010 by Kirk Paul Lafler 29

Sort-Merge Join - Features Used with joins of two tables Works best when one or both of the join relations are in the desired order One or both of the tables are of moderate size If the optimizer determines a sort is not needed no sort will be performed, otherwise sort resources are expended: - using an explicit sort operation <or> - by taking advantage of pre-existing ordering Generally performs well, particularly when the majority of the rows are being joined Copyright 1992-2010 by Kirk Paul Lafler 30

Join Algorithms Users supply the list of tables for joining along with the join conditions, and the PROC SQL optimizer determines which of the join algorithms to use for performing the join. The algorithms include: Nested Loop Join When an equality condition is not specified, a read of the complete contents of the right table is processed for each row in the left table. Sort-Merge Join When the specified tables are already in the desired sort order, resources are not expended for resorting. Indexed Join When an index exists on >=1 variable(s) to represent a key, matching rows may be accessed using the index. Copyright 1992-2010 by Kirk Paul Lafler 31

Indexed Join - Features Used with joins of two tables An index must be defined that produces a small subset of the total number of rows in a table Matching rows are accessed directly using the index One or both of the tables are of moderate to large size Generally performs well, particularly when a small number of rows are being joined Copyright 1992-2010 by Kirk Paul Lafler 32

Join Algorithms Users supply the list of tables for joining along with the join conditions, and the PROC SQL optimizer determines which of the join algorithms to use for performing the join. The algorithms include: Nested Loop Join When an equality condition is not specified, a read of the complete contents of the right table is processed for each row in the left table. Sort-Merge Join When the specified tables are already in the desired sort order, resources are not expended for resorting. Indexed Join When an index exists on >=1 variable(s) to represent a key, matching rows may be accessed using the index. Hash Join When an equality relationship exists and the smaller table is able to fit in memory, set-matching operations generally perform well. Copyright 1992-2010 by Kirk Paul Lafler 33

Hash Join - Features Used with joins of two tables The SQL optimizer attempts to estimate the amount memory required to build a hash table in memory A hash table data structure associates keys with values The optimizer tries to use the smaller of the tables as the hash table Its purpose is to perform more efficient lookups Requires an equi-join predicate This algorithm generally performs well with small to medium join relations Copyright 1992-2010 by Kirk Paul Lafler 34

Options MSGLEVEL=I By specifying the MSGLEVEL=I option, helpful notes describing index usage, sort utilities, and merge processing are displayed on the SAS Log. OPTIONS MSGLEVEL=I; PROC SQL; SELECT MOVIES.TITLE, RATING, LENGTH, ACTOR_LEADING FROM MOVIES, ACTORS WHERE MOVIES.TITLE = ACTORS.TITLE AND RATING = PG ; QUIT; Copyright 1992-2010 by Kirk Paul Lafler 35

MSGLEVEL=I Log SAS Log Results OPTIONS MSGLEVEL=I; PROC SQL; SELECT MOVIES.TITLE, RATING, LENGTH, ACTOR_LEADING FROM MOVIES, ACTORS WHERE MOVIES.TITLE = ACTORS.TITLE AND RATING = 'PG'; INFO: Index Rating selected for WHERE clause optimization. QUIT; Copyright 1992-2010 by Kirk Paul Lafler 36

_METHOD Option A _METHOD option can be specified on the PROC SQL statement to display the hierarchy of processing that takes place. Results are displayed on the Log. Codes sqxcrta sqxslct sqxjsl sqxjm sqxjndx sqxjhsh sqxsort sqxsrc sqxfil sqxsumg sqxsumn Description Create table as Select Select Step loop join (Cartesian) Merge join Index join Hash join Sort Source rows from table Filter rows Summary statistics with GROUP BY Summary statistics with no GROUP BY Copyright 1992-2010 by Kirk Paul Lafler 37

Program Example using _METHOD PROC SQL _METHOD; TITLE 2-Way Equi Join ; SELECT MOVIES.TITLE, RATING, ACTOR_LEADING FROM MOVIES, ACTORS WHERE MOVIES.TITLE = ACTORS.TITLE; QUIT; SAS Log Results NOTE: SQL execution methods chosen are: sqxslct sqxjhsh sqxsrc( MOVIES ) sqxsrc( ACTORS ) Copyright 1992-2010 by Kirk Paul Lafler 38

Conclusion Data federation characteristics A merge and join do not process data the same way A join combines tables side-by-side (horizontally) Joins adhere to ANSI guidelines Cartesian product represents all possible combinations of rows from the underlying tables 256 tables can be joined using an inner join construct 2 tables can be joined using an outer join construct Four types of join algorithms: Nested loop join Sort-Merge join Indexed join Hash join Copyright 1992-2010 by Kirk Paul Lafler 39

PROC SQL Beyond the Basics Using SAS PROC SQL Examples Book Kirk Paul Lafler sas Available Coming at Winter www.sas.com! 2004! Copyright 1992-2010 by Kirk Paul Lafler 40

SAS and Excel Transferring Data between SAS and Excel Using SAS A Book of Data Transfer Methods with Examples Kirk Paul Lafler William E. Benjamin, Jr. sas Coming Coming September Winter 2004! 2010! Copyright 1992-2010 by Kirk Paul Lafler 41

Questions? Joining Worlds of Data is exciting and fun! Kirk Paul Lafler SAS Consultant, Author, and Trainer Software Intelligence Corporation E-mail: KirkLafler@cs.com

We kindly thank the WUSS Leadership and the following organizations for their assistance and support for this symposium: BASAS

Mark your calendars for WUSS 2010! Three days of classes, workshops and presentations introducing the latest in SAS technology and applications. November 3-5 Hyatt Regency Mission Bay San Diego, CA More information will be available at www.wuss.org