Handling Missing Values in the SQL Procedure




Danbo Yi, Abt Associates Inc., Cambridge, MA
Lei Zhang, Domain Solutions Corp., Cambridge, MA

ABSTRACT

PROC SQL is a powerful database management tool that provides many features available in the DATA step and in the MEANS, TRANSPOSE, PRINT, and SORT procedures. Properly used, PROC SQL often yields concise solutions to data manipulation and query problems. PROC SQL follows most of the guidelines set by the American National Standards Institute (ANSI) in its implementation of SQL. However, it is not fully compliant with the current ANSI standard for SQL, especially with respect to missing values. PROC SQL uses the SAS System convention to express and handle missing values, which differs significantly from many ANSI-compatible SQL databases such as Oracle and Sybase. In this paper, we summarize the ways PROC SQL handles missing values in a variety of situations. Topics include missing values in logic, arithmetic, and string expressions; missing values in SQL predicates such as LIKE, ANY, and ALL; missing values in JOINs; missing values in aggregate functions; and missing value conversion.

INTRODUCTION

The SQL procedure follows the SAS System convention for handling missing values. Missing numeric values and missing character values are expressed in entirely different ways. A missing numeric value is usually expressed as a period (.), but it can also be stated as one of 27 other special missing values based on the underscore (_) and the letters A, B, ..., Z, that is, ._, .A, .B, ..., .Z. In the SAS SQL procedure, a particular missing value is equal to itself, but these 28 missing values are not equal to each other. They have their own order: ._ < . < .A < .B < ... < .Z. When missing numeric values are compared to non-missing numeric values, the missing values are always less than all the non-missing values. A missing character value is expressed and treated as a string of blanks.
Missing character values are always the same, whether they are expressed as one blank or as several blanks. Missing character values are therefore not necessarily the smallest strings. In the SAS System, missing DATE and DATETIME values are expressed and treated in the same way as missing numeric values.

This paper covers the following topics.

1. Missing Values and Expression
   Logic Expression
   Arithmetic Expression
   String Expression
2. Missing Values and Predicates
   IS NULL / IS MISSING
   [NOT] LIKE
   ALL, ANY, and SOME
   [NOT] EXISTS
3. Missing Values and JOINs
   Inner Join
   Left/Right Join
   Full Join
4. Missing Values and Aggregate Functions
   COUNT Functions
   SUM, AVG and STD Functions
   MIN and MAX Functions
5. Missing Value Conversion

To show how missing data are handled in the SQL procedure, we produce examples based on a very small data set that can be generated by the following SAS code.

    /* Column input: when this step is run, each data line must start in column 1 */
    data ABC;
    input x1 1-2 x2 4-5 y1 $7-9 y2 $11-12;
    datalines;
    -1  2 ABC 12
     0 -3 DE  34
     1
     1  1 CDE 56
    ._ .A     78
     . .Z ER
    .A  2 ABC 90
    ;
    run;

The data set has four variables: X1 and X2 are numeric variables, and Y1 and Y2 are character variables.

1. MISSING VALUES AND EXPRESSION

Because the SAS System expresses missing numeric values and missing character values differently, missing values are treated differently in logic, arithmetic, and string expressions.

LOGIC EXPRESSION

When a missing numeric value appears in a logic expression (NOT, AND, OR), it is regarded as 0 or FALSE. In other words, in the SAS System, 0 and missing values are regarded as FALSE, and all other numeric values are regarded as TRUE. Consider this example:

    proc sql;
    select not x1 as z1, x1 or . as z2, x1 and . as z3
    from ABC;

    Z1  Z2  Z3
    ----------
     0  -1   0
     1   0   0
     0   1   0
     0   1   0
     1   _   0
     1   .   0
     1   A   0

Notice that missing numeric values behave like FALSE or 0 in the NOT and AND expressions, but in the OR expression the missing value is simply dropped because of the nature of the OR operation, so the result is the value of the other operand.

ARITHMETIC EXPRESSIONS

When you use a missing numeric value in an arithmetic expression (+, -, *, /), the SQL procedure always sets the result of the expression to the period (.) missing value. If you use that result in another expression, the next result is also the period (.) missing value. This method of treating missing values is called propagation of missing values. For example:

    select x1+1 as z1
    from ABC
    where x1<0;

    Z1
    ------
     0
     .
     .
     .

Notice that all special missing values are changed into the period (.) missing value after the calculation. The three missing rows are selected because missing values compare as less than 0.

STRING EXPRESSIONS

When you use a missing character value in a string expression concatenated by ||, the SQL procedure sets the missing character value to a string of blanks whose length equals the length of the variable. Here is an example.

    select 'AA' || y1 || y2 as z1
    from ABC;

    Z1
    -------
    AAABC12
    AADE 34
    AA
    AACDE56
    AA   78
    AAER
    AAABC90
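For comparison with the behavior described in this section, the following minimal sketch shows how an ANSI-style database evaluates the same kinds of expressions. It uses Python's built-in sqlite3 module; SQLite is only a stand-in chosen for this sketch and is not one of the databases named in the paper. Under three-valued logic, NOT applied to NULL stays unknown rather than becoming TRUE, while arithmetic with NULL propagates the missing value much as the SQL procedure does.

```python
import sqlite3

# SQLite is an assumption of this sketch: it stands in for an
# ANSI-style database and is not mentioned in the paper.
conn = sqlite3.connect(":memory:")

# NOT, OR, and AND with a NULL operand follow three-valued logic.
not_null, or_null, and_null = conn.execute(
    "SELECT NOT NULL, 1 OR NULL, 1 AND NULL"
).fetchone()

# Arithmetic with a NULL operand propagates NULL.
(plus_null,) = conn.execute("SELECT NULL + 1").fetchone()

print(not_null, or_null, and_null, plus_null)  # None 1 None None
conn.close()
```

Python's None represents NULL here. The first and third results are the ones that differ from the Z1 and Z3 columns above, where a missing operand is treated as FALSE.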

2. MISSING VALUES AND PREDICATES

In the SQL procedure, predicates test conditions that have the truth values TRUE, FALSE, and NULL, and return TRUE or FALSE. The procedure also has special predicates to handle missing values, namely IS NULL and IS MISSING. Although missing character values are expressed as a string of blanks, a string of blanks is not always regarded as a missing value. You will find this situation in the LIKE predicate below.

IS [NOT] NULL / IS [NOT] MISSING

IS [NOT] NULL and IS [NOT] MISSING are two predicates designed specifically to deal with missing values. IS [NOT] NULL is a standard SQL predicate, and IS [NOT] MISSING is a SAS SQL predicate. They are generic because they can handle both numeric and character variables. For example:

    select x1, y1
    from ABC
    where x1 is not null and y1 is not missing;

    X1  Y1
    --------
    -1  ABC
     0  DE
     1  CDE

LIKE

As mentioned above, missing character values are always the same, but there is one exception. In the SAS SQL LIKE predicate, a missing character value is treated as a literal string of blanks rather than as a missing value. If the length of a character column is 3, then a missing character value is like one blank, two blanks, or three blanks, but not like four or more blanks. Consider the following SQL statements, which all return the same result because the length of y1 is 3.

    /* One Blank */
    select x1, y1 from ABC where y1 like ' ';
    /* Two Blanks */
    select x1, y1 from ABC where y1 like '  ';
    /* Three Blanks */
    select x1, y1 from ABC where y1 like '   ';

Same result:

    X1  Y1
    ------
     1
     _

However, the following SQL statement returns nothing.

    /* Four Blanks */
    select x1, y1 from ABC where y1 like '    ';

If you use the [NOT] IN predicate, missing character values are always equal to one or more blanks, regardless of the length of the variable. For example, the following SQL statement returns the same result as the first three statements with the LIKE predicate, even though the string contains more than three blanks.
    select x1, y1 from ABC where y1 in ('    '); /* Four Blanks */

For two character variables A and B with different lengths, if size(A) > size(B), then the predicate A like B is TRUE but the predicate B like A is FALSE when both have missing values. Note that the predicate A=B is always TRUE when both have missing values. For example,

    proc sql;
    select x1, y1, y2 from ABC where y1 like y2;

returns the following result:

    X1  Y1  Y2
    -------------
     1

However,

    proc sql;
    select x1, y1, y2 from ABC where y2 like y1;

returns nothing.

ALL, ANY, AND SOME

ALL, ANY, and SOME are special predicates that are oriented around subqueries. They are used in conjunction with a relational operator. In practice there are only two, because ANY and SOME are the same in the SQL procedure. ALL, ANY, and SOME are very similar to the IN predicate when it is used with subqueries: they take all the values produced by the subquery and treat them as a unit. However, unlike IN, they can be used only with subqueries.

In the SQL procedure, ALL and ANY differ in how they react when the subquery produces no values to use in a comparison. This difference can give your queries unexpected results if you do not account for it. Whenever a legal subquery fails to produce output, the comparison operator modified with ALL is automatically TRUE, and the comparison operator modified with ANY is automatically FALSE. This means that queries with the following conditions, where the subquery returns no rows,

    where x2 > ALL (select x1 ...)
    where x2 < ALL (select x1 ...)
    where x2 NE ALL (select x1 ...)

would produce the entire list of X2 values:

    X2
    ---
     2
    -3
     .
     1
     A
     Z
     2

whereas queries with the conditions

    where x2 > SOME (select x1 ...)
    where x2 < SOME (select x1 ...)
    where x2 NE SOME (select x1 ...)

would produce no output. Of course, neither of these comparisons is very meaningful.
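The empty-subquery rule for ALL and ANY is the usual convention for quantifying over an empty set, and it can be sketched outside SQL. The snippet below is only an analogy in Python, not SAS code: the list `x2_values` holds the non-missing X2 values of the sample data, and the empty list `subquery_rows` is a hypothetical stand-in for a subquery that returns no rows.

```python
# Analogy only: Python's all() and any() follow the same convention for
# an empty collection as ALL and ANY do for an empty subquery.
x2_values = [2, -3, 1, 2]   # non-missing X2 values from the sample data
subquery_rows = []          # hypothetical subquery result with no rows

# "x2 > ALL (empty)" holds for every row: nothing can violate it.
kept_by_all = [x2 for x2 in x2_values if all(x2 > v for v in subquery_rows)]

# "x2 > ANY (empty)" holds for no row: nothing can satisfy it.
kept_by_any = [x2 for x2 in x2_values if any(x2 > v for v in subquery_rows)]

print(kept_by_all)  # [2, -3, 1, 2]
print(kept_by_any)  # []
```

The comparison operator does not matter in this case, which is why the >, <, and NE variants all behave alike when the subquery is empty.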

[NOT] EXISTS

In the [NOT] EXISTS predicate, the SQL procedure deals with missing values differently from most SQL databases, because missing values are regarded as comparable values in the SQL procedure. Consider the following example.

    select x1 from ABC as a
    where exists (select x1 from ABC as b where a.x1=b.x1);

This query returns the entire list of X1 values, including all the missing values, whereas in most SQL databases it returns only the non-missing values. In most SQL databases, if missing values are used in the predicate of the subquery, the predicate is unknown in every case. This means the subquery produces no values, and EXISTS is false. This, naturally, makes NOT EXISTS true. In the SQL procedure, however, missing values are normal comparable values that are used in the evaluation of the subquery.

3. MISSING VALUES AND JOINS

In the SQL procedure, a missing value is equal to itself. When joining tables with missing values, the results are therefore likely to differ from those of most ANSI-compatible SQL databases such as Oracle and Sybase, because missing values are never equal to each other in those database systems. If the joined tables have missing values, SAS SQL will probably produce more observations for an INNER JOIN and fewer observations for a FULL JOIN. Here are two examples.

Example 1

    proc sql number;
    select T1.x1, T2.x2
    from ABC as T1 inner join ABC as T2
    on (T1.x1=T2.x2);

    Row  X1  X2
    ------------------------
      1   1   1
      2   1   1
      3   .   .
      4   A   A

Example 2

    proc sql number;
    select T1.x1, T2.x2
    from ABC as T1 left join ABC as T2
    on (T1.x1=T2.x2);

    Row  X1  X2
    ------------------------
      1   _   .
      2   .   .
      3   A   A
      4  -1   .
      5   0   .
      6   1   1
      7   1   1

4. MISSING VALUES AND AGGREGATE FUNCTIONS

The SQL procedure supports the most common aggregate functions, or statistical summary functions, such as count, average, sum, min, max, and standard deviation. These functions work on a set of values. An aggregate function first constructs a working column as defined by its parameter. The parameter is usually a single column variable, but it can be an arithmetic expression with scalar functions and other column variables. Once the working column is constructed, the aggregate function performs its operation on the set of known values; the unknown, or missing, values are given special treatment that depends on the individual function.
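The rule that an aggregate works on the known values only can be observed in a small sketch using Python's built-in sqlite3 module. SQLite is an assumption of this sketch and not a database discussed in the paper; it has a single NULL, so the three distinct missing values of X1 in the sample data are all stored as NULL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abc (x1 INTEGER)")
# X1 from the sample data; the special missing values ._, ., and .A are
# all stored as NULL because SQLite has only one NULL (an assumption).
rows = [(-1,), (0,), (1,), (1,), (None,), (None,), (None,)]
conn.executemany("INSERT INTO abc VALUES (?)", rows)

stats = conn.execute(
    "SELECT COUNT(*), COUNT(x1), COUNT(DISTINCT x1), SUM(x1), AVG(x1), "
    "MIN(x1), MAX(x1) FROM abc"
).fetchone()
print(stats)  # (7, 4, 3, 1, 0.25, -1, 1)

# With only missing inputs, COUNT gives 0 and SUM gives NULL.
only_missing = conn.execute(
    "SELECT COUNT(x1), SUM(x1) FROM abc WHERE x1 IS NULL"
).fetchone()
print(only_missing)  # (0, None)
conn.close()
```

Only COUNT(*) sees all seven rows; every other aggregate is computed from the four known values -1, 0, 1, and 1.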

COUNT FUNCTION

The COUNT function works on all SAS data types. COUNT(*) returns the number of rows in a table. It is the only aggregate function that uses an asterisk (*) as a parameter. Missing values in the table are counted because this function deals with observations, or rows, and not with individual values. For an empty table, COUNT(*) returns zero.

COUNT([ALL] <sql-expression>) returns the number of non-missing values in the <sql-expression> set. Missing values are excluded before the counting takes place. It returns zero when the <sql-expression> set is empty or consists only of missing values.

COUNT(DISTINCT <sql-expression>) returns the number of unique non-missing values in the <sql-expression> set. Missing values are excluded before the counting takes place, and then all redundant duplicates are removed. Again, it returns zero when the <sql-expression> set is empty or consists only of missing values.

If COUNT([ALL|DISTINCT] <sql-expression>) is used to count the values of a character variable, blank strings are excluded, which is undesirable in some situations.

The SAS functions N and FREQ have exactly the same functionality as COUNT([ALL|DISTINCT] <sql-expression>), but there are no N(*) and FREQ(*) functions in the SQL procedure. The other related SAS function is NMISS, which counts the number of missing values in a <sql-expression> set. You can even use NMISS(DISTINCT <sql-expression>) to count the number of distinct missing numeric values.
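The remark about blank strings can be made concrete by looking at a database that does distinguish an all-blank string from a missing value. The sketch below again uses Python's built-in sqlite3 module as a stand-in (an assumption, not a system discussed in the paper), with a hypothetical table `t` holding two ordinary strings, one all-blank string, and one NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (y1 TEXT)")
# Hypothetical rows: two real strings, one all-blank string, one NULL.
conn.executemany(
    "INSERT INTO t VALUES (?)", [("ABC",), ("DE",), ("   ",), (None,)]
)

# COUNT(y1) skips only NULL; the all-blank string still counts as a value.
# NULLIF(TRIM(y1), '') turns blank strings into NULL before counting,
# which gives the blank-excluding count.
total, non_null, non_blank = conn.execute(
    "SELECT COUNT(*), COUNT(y1), COUNT(NULLIF(TRIM(y1), '')) FROM t"
).fetchone()
print(total, non_null, non_blank)  # 4 3 2
conn.close()
```

In this stand-in the two counts can be chosen independently. In the SQL procedure a blank string and a missing character value are the same thing, so only the blank-excluding count is available from COUNT on a character variable.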
Below are two examples:

    create table DUAL (nothing num);
    insert into DUAL values(1);
    select (select count(*) from ABC) as C1,
           (select count(x1) from ABC) as C2,
           (select count(all x1) from ABC) as C3,
           (select count(distinct x1) from ABC) as C4
    from DUAL;

    C1  C2  C3  C4
    ----------------------------------
     7   4   4   3

    /* Table DUAL was already created in the previous example */
    select (select N(distinct x1) from ABC) as C5,
           (select freq(distinct x1) from ABC) as C6,
           (select NMISS(distinct x1) from ABC) as C7
    from DUAL;

    C5  C6  C7
    -----------------------
     3   3   3

MIN AND MAX FUNCTIONS

The MIN and MAX functions work on numeric, DATE, DATETIME, and character variables. If the <sql-expression> set is not empty and does not consist only of missing values, MAX(<sql-expression>) returns the greatest known value in the set and MIN(<sql-expression>) returns the smallest known value in the set. Missing values are excluded before the functions operate. When the <sql-expression> set is empty or consists only of missing values, the functions return a period (.) for DATE, DATETIME, and numeric values, and blanks for character variables.

The MAX() of a set of known numeric values is the largest one. The MAX() of a set of known DATE and DATETIME values is the one farthest in the future, or the most recent. The MAX() of a set of non-blank character strings is the last one in ascending sort order. Likewise, the MIN() of a set of known numeric values is the smallest one. The MIN() of a set of known DATE and DATETIME values is the least recent one. The MIN() of a set of non-blank character strings is the first one in ascending sort order. Here is an example:

    select min(x1) as minx, max(x1) as maxx,
           min(y1) as miny, max(y1) as maxy
    from ABC;

    MINX  MAXX  MINY  MAXY
    ---------------------------
      -1     1  ABC   ER

SUM AND OTHER RELATED FUNCTIONS

SUM works only on numeric values. SUM([ALL] <sql-expression>) returns the numeric total of all known numeric values. It returns a missing value when the <sql-expression> set is empty or consists only of missing values. SUM(DISTINCT <sql-expression>) returns the numeric total of all known, unique numeric values. Missing values and all duplicates are removed before the summation takes place. It returns a period (.) missing value if the <sql-expression> set is empty or consists only of missing values. The other related SAS SQL functions, such as AVG, MEAN, RANGE, STD, STDERR, CV, and CSS, use the same mechanism as the SUM function to handle missing numeric values. Here is an example.

    select sum(x1) as sumx, avg(x1) as avgx,
           std(x1) as stdx, range(x1) as rangex
    from ABC;

    SUMX  AVGX  STDX      RANGEX
    -----------------------------
       1  0.25  0.957427       2

5. MISSING VALUES CONVERSION

Missing numeric values and missing character values can be converted into each other by the SAS INPUT and PUT functions, and they can also be converted into character values in macro variables by using the INTO clause in the SELECT statement. A missing numeric value can be converted into a character value such as a period (.), an underscore (_), or one of the characters A-Z, depending on the corresponding missing value form. A missing character value is always converted into a period (.) missing value. Here is an example.

    select x1 into :valx1 from ABC where x1 = ._;
    select x1 into :valx2 from ABC where x1 = .;
    select x1 into :valx3 from ABC where x1 = .A;
    select y1 into :valy1 from ABC where y1 = ' ';
    %put valx1=**&valx1** is %length(&valx1);
    %put valx2=**&valx2** is %length(&valx2);
    %put valx3=**&valx3** is %length(&valx3);
    %put valy1=**&valy1** is %length(&valy1);

Log output:

    Valx1=** _** is 1
    Valx2=**.** is 1
    Valx3=** A** is 1
    Valy1=** ** is 0

CONCLUSION

In this paper we have discussed the ways the SQL procedure handles missing values in a variety of situations. Although the SQL procedure is a very powerful tool that can save a lot of programming time, you must have a good understanding of how missing values are handled in the SAS SQL procedure in order to avoid the possible pitfalls.

SAS and SAS/BASE mentioned in this paper are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

REFERENCES

SAS Institute Inc., SAS Guide to the SQL Procedure, 1989.

CONTACT ADDRESS

Questions pertaining to this article should be addressed to:

Danbo Yi
Abt Associates Inc.
55 Wheeler Street
Cambridge, MA 02138-1168, USA
Tel: 617-349-2346
Fax: 617-520-2940
e-mail: danbo_yi@abtassoc.com

Lei Zhang
DomainPharma Inc.
10 Maguire Road
Lexington, MA 02140, USA
Tel: 781-778-3880
Fax: 781-778-3700
e-mail: lzhang@domainpharma.com