Paper 2917. Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA



ABSTRACT

Creation of variables is one of the most common SAS programming tasks. However, it sometimes produces unexpected results without an error or warning message in the SAS log. These unexpected results can occur while performing illegal operations, such as division by zero, or while attempting calculations involving missing values. They can also happen while doing calculations in DATA steps containing MERGE or SET statements. One should also beware of the frequent issues concerning numeric comparisons, concatenation of character variables, and order of evaluation of numeric variables. This paper examines these issues, points out what to look for in the NOTE messages in the SAS log, and offers defensive programming techniques.

ROUNDING TO DECIMAL PLACES

According to SAS 9.2 Language Reference: Concepts (p. 42), imprecision can cause problems with comparisons. Consider, for example, computations involving the fraction 1/9:

    data _null_;
      x=1/9;
      if x=.111 then y = 1;
      put x= / y=;
    run;

The resulting SAS Log output is shown below:

    x=0.1111111111
    y=.

Trap: Some numbers cannot be represented exactly in decimal notation.

Good Programming Practice: Use the ROUND( ) function whenever fractions are involved:

    data _null_;
      x=1/9;
      if round(x,.001) =.111 then y = 1;
      put y=;
    run;

And we get the correct result:

    y=1

NUMERIC REPRESENTATION (ROUND-OFF PRECISION) ERROR

Consider the following example, modified from the SAS Support website at http://support.sas.com/techsup/technote/ts230.html:

    data test;
      x = 15.7-11.9;
      if x=3.8 then var1=1;
      else var1=0;
      put var1= ;
    run;

SAS Log:

    var1=0

However, if you look at the data set created, you'll see that x is displayed as 3.8, so many people would assume that var1 should be equal to 1, not zero.

Trap: The values 15.7, 11.9 and 3.8 cannot be represented exactly in binary floating point, so the computed difference is not exactly equal to the constant 3.8. Because no format is defined for x, SAS displays it with the default BEST12. format, which rounds the stored value to 3.8 and hides the discrepancy.

Good Programming Practice: With numeric variables, use the widest BESTw. format, BEST32., to display the stored value, along with the ROUND( ) function in comparisons:

    data test;
      x = 15.7-11.9;
      if round(x,.1)=3.8 then var1=1;
      else var1=0;
      put var1= ;
      format x best32.;
    run;

The resulting output is correct:

    var1=1

MERGING DATA SETS

According to Malachy J. Foley in MATCH-MERGING: 20 Some Traps and How to Avoid Them, "Use only the basic 4-statement match-merge and to do all other processing in a separate DATA step. In other words, whenever possible, NEVER use a WHERE, an IF statement, an assignment statement, or anything else in the match-merge code. If manipulation is required, if at all possible, do it in a separate DATA step after the merge DATA step. This is extreme, but it works." And here's why.

When a MERGE statement is present, all input variables are initialized to missing only at the beginning of each BY group. Within the same BY group, the values of all input variables are automatically retained for the next iteration. The effect is similar to that of a RETAIN statement. Consider these data sets:

    [Listings of data sets Data1 and Data2]

    data data3;
      merge data1 data2;
      by ID;
      if GRADE='D' then GOOD_POOR='Poor';
    run;

Trap: The second and third records are wrong since GRADE is B but GOOD_POOR is Poor! The value of GOOD_POOR for observation #2 has also been retained for observations #3 and #4. The reason is that observations #3 and #4 belong to the same BY group as observation #2. The value of GOOD_POOR for observation #5 is correct since it belongs to a different BY group and was reset. That is why it's good programming practice not to perform any calculations in a DATA step that contains MERGE or SET statements.

RETAIN STATEMENT

According to SAS Online documentation, the RETAIN statement causes a variable that is created by an INPUT or assignment statement to retain its value from one iteration of the DATA step to the next. Here are some points associated with frequent errors when applying the RETAIN statement.

FORGETTING TO RESET THE RETAINED VARIABLES

Let's see what happens if you forget to reset the retained variables. Take this data set as an example:

    [Listing of data set TEST1]

We want to retain the value of VAR3 in the (VAR1, VAR2) combination, provided that VAR2 is not missing.

    proc sort data=test1;
      by var1 var2;
    run;

    data test2;
      set test1;
      by var1 var2;
      retain var4;
      if first.var1 and var2 ne . then var4=var3;
    run;

    [Listing of data set TEST3]

Trap: Here, VAR4 was not supposed to be retained for observation #3, because VAR2 is missing. Many beginning programmers assume that the BY statement automatically resets retained variables for each BY group. But the BY statement only creates the FIRST.variable and LAST.variable flags, and we have to reset each retained variable explicitly.

Good Programming Practice: Set the variables you create with the RETAIN statement to missing values (or zeros, or other suitable default values) at the beginning of each BY group. Adding this statement to the previous DATA step does just what's needed:

    if first.var1 then var4=. ;

USING RETAIN STATEMENT TO INITIALIZE VALUES

You can use the RETAIN statement to initialize values for individual variables, a list of variables, or members of an array. This RETAIN statement retains the values of five variables and sets their initial values to 1:

    retain month1-month5 1;

Trap: If you don't initialize a retained variable and use it only in the RETAIN statement, the variable is not written to the data set, and the SAS log will state that the variable is uninitialized.

USAGE OF IF AND WHERE STATEMENTS WITH RETAIN STATEMENT

Let's continue with the previous example, where we wanted to retain the value of VAR3 in the (VAR1, VAR2) combination, provided that VAR2 is not missing. The data set is:

    [Listing of data set TEST1]

But this time, if VAR2 is missing, we want to retain the value of VAR3 from the next observation. In our example, for VAR1=2, we want to retain the value of VAR3 = 25. If we use the IF statement, we get an incorrect result:

    data test2;
      set test1;
      if var2 ne .;
      by var1 var2;
      retain var4;
      if first.var1 then var4=.;
      if first.var1 then var4=var3;
    run;

    [Listing of data set TEST2]

Why was the value of 22 erroneously retained?

Trap: Conditions in the subsetting IF statement are applied after the data enters the program data vector. FIRST.VAR1 is set on the observation with the missing VAR2, and the subsetting IF deletes that observation before VAR4 is reset, so VAR4 keeps the value from the previous BY group.

Good Programming Practice: Use the WHERE statement instead of IF. Conditions in the WHERE statement are applied before the data enters the program data vector, and we get the correct result:

    [Listing of data set TEST2]

RENAME, KEEP, DROP

If you need to use the DROP, KEEP and RENAME statements or the DROP=, KEEP= and RENAME= options in the same DATA step, keep in mind that they follow a special timing rule: DROP is evaluated first, followed by KEEP and then RENAME.

1. DROP=, KEEP=, RENAME= OPTIONS

The following code will run with errors:

    data test1;
      old_var=1;
    run;

    data test2;
      set test1 (keep=new_var   /* using new variable name here will produce errors */
                 rename=(old_var=new_var));
    run;

SAS Log:

    ERROR: Variable old_var is not on file WORK.TEST1.
    ERROR: Invalid DROP, KEEP, or RENAME option on file WORK.TEST1.

Trap: Since the KEEP= option is evaluated before the RENAME= option, it doesn't recognize the variable new_var, which hasn't been created yet.

Good Programming Practice: When the RENAME= option is used, use the old variable name with the KEEP= option, but use the new variable name in program statements:

    data test2;
      set test1 (keep=old_var   /* use old variable name */
                 rename=(old_var=new_var));
      if new_var=1 then other_var=2;   /* use new variable name */
      put new_var= other_var= ;
    run;

2. DROP, KEEP, RENAME STATEMENTS

If you use the DROP, KEEP and RENAME statements instead of the options, the same order of evaluation applies.

Trap: When applying the RENAME statement, using the new variable name in other statements of the same DATA step produces incorrect results:

    data test2;
      set test1;
      KEEP new_var;   /* using new variable name here will produce errors */
      RENAME old_var=new_var;
      if new_var=1 then other_var=2;   /* using new variable name here will produce errors */
      put old_var= / new_var= / other_var=;
    run;

SAS Log:

    NOTE: Variable new_var is uninitialized.
    WARNING: The variable old_var in the DROP, KEEP, or RENAME list has never been referenced.
    old_var=1
    new_var=.
    other_var=.

Good Programming Practice: If you use the RENAME statement, use the old variable name in the other statements of the same DATA step:

    data test2;
      set test1;
      RENAME old_var=new_var;
      if old_var=1 then other_var=2;   /* use OLD variable name here */
      KEEP old_var other_var;          /* use OLD variable name here */
      put old_var= / other_var=;
    run;

SAS Log:

    old_var=1
    other_var=2
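The same timing rule can be sketched for data set options on an output data set. This is a minimal illustration, not code from the paper: the variable extra and the PROC CONTENTS check are assumptions added here to make the effect visible.

```sas
/* Hypothetical input: old_var follows the paper's example; */
/* extra is added only for illustration.                    */
data test1;
  old_var = 1;
  extra   = 99;
run;

/* KEEP= is applied before RENAME=, so KEEP= lists the old name.   */
/* Program statements also use old_var, because output data set    */
/* options act only when the observation is written to test2.      */
data test2 (keep=old_var other_var rename=(old_var=new_var));
  set test1;
  if old_var = 1 then other_var = 2;
run;

/* List the variables of test2; new_var and other_var are expected. */
proc contents data=test2 short;
run;
```

Putting both options on the output data set keeps every program statement on a single name (old_var), which may be easier to review than mixing old and new names within one step.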

CONCATENATION

When creating variables based on the values of character variables, one should consider the following peculiarities of character variables.

1. CHARACTER VARIABLES ARE CASE-SENSITIVE

Therefore, always make sure you specify the correct case.

Good Programming Practice: Use the UPCASE( ) function to ensure the right character case:

    data test;
      a='notes';
      if UPCASE(a)='NOTES' then b=1;
      put b=;
    run;

SAS Log:

    b=1

2. CHARACTER VALUES MAY HAVE LEADING AND/OR TRAILING BLANKS

It's possible to see leading blanks in the raw data sets, but trailing blanks are not easy to detect. A good way to check for leading and trailing blanks is to attach a character (a colon, for example) to the beginning and end of a character value:

    data _null_;
      var1 = ' Day ';
      Join1 = ':' || var1 || ':';
      put Join1=;
    run;

SAS Log:

    Join1=: Day :

Good Programming Practice: Use the STRIP( ) function to get rid of these blanks:

    data new;
      a=' happy b-day ';
      b=strip(a);
      if a='happy b-day' then c=1;
      if STRIP(a)='happy b-day' then d=1;
      put c= / d=;
    run;

SAS Log:

    c=.
    d=1

If you need to concatenate several character variables, you might have to get rid of the leading and trailing blanks, too. Instead of using the STRIP( ) function for each variable, use either the CATS or the CATX function. Both strip leading and trailing blanks before concatenating strings, but with CATX you must supply a separator for the strings. Here are the examples:

    data _null_;
      Var1 = 'Blue ';
      Var2 = 'Sky ';
      Join1 = cats(var1,var2);
      Join2 = catx(' ',Var1,Var2);
      put Join1= / Join2= /;
    run;

SAS Log:

    Join1=BlueSky
    Join2=Blue Sky

MISSING VALUES

1. CALCULATIONS INVOLVING ARITHMETIC OPERATIONS WITH MISSING VALUES

Calculations involving arithmetic operators with missing values result in a missing value:

    data test;
      x=.;
      y=2;
      a=x+y;
      b=sum(x,y);
      put a= / b=;
    run;

SAS Log:

    a=.
    b=2
    NOTE: Missing values were generated as a result of performing an operation on missing values.

Good Programming Practice: If you want to omit missing values from computations, use sample statistic functions. For example, the SUM function disregards missing values: adding X to Y using the SUM function results not in a missing value but in 2. For a complete discussion and examples, see SAS Language Reference: Dictionary.

2. UNINITIALIZED VARIABLES

When a variable is compared to a variable that was not previously defined (var1 in the example below), the undefined variable is missing. Because a missing value is smaller than any number, the following condition is true for every non-missing value of age and produces an undesirable result:

    data new1;
      age=10;
      if age > var1 then group='teen age';
      put age= / group=;
    run;

SAS Log:

    NOTE: Variable var1 is uninitialized.
    age=10
    group=teen age
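The difference between the + operator and the sample statistic functions can be sketched a little further. The following is an illustrative example rather than code from the paper; the variables c, d and n are introduced here only to show edge cases of SUM and the NMISS counting function.

```sas
data _null_;
  x = .;
  y = 2;
  a = x + y;         /* missing: the + operator propagates missing values */
  b = sum(x, y);     /* 2: SUM ignores missing arguments                  */
  c = sum(x, .);     /* missing: every argument is missing                */
  d = sum(0, x, .);  /* 0: a constant 0 guarantees a non-missing result   */
  n = nmiss(x, y);   /* 1: NMISS counts the missing arguments             */
  put a= b= c= d= n=;
run;
```

The PUT statement should write a=. b=2 c=. d=0 n=1 to the log. Whether a total of 0 or a missing total is the right answer when all inputs are missing depends on the analysis, so the sum(0, ...) idiom is a design choice rather than a general recommendation.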

3. CHECKING FOR MISSING NUMERIC VALUES

When checking for missing numeric values, the most frequently used code is the following:

    if var1 = . then a=1;

Trap: A dot is only one of the 28 ways of representing a missing numeric value, and therefore in some instances the above code won't detect all the missing values. The others are the dot-underscore (._) and the dot-letter values (.A through .Z).

Good Programming Practice: Compare against the highest missing value in this list of 28, which catches all instances of a missing numeric value:

    if var1 <= .Z then a=1;

4. ILLEGAL OPERATIONS

According to SAS® 9.1 Language Reference: Concepts, SAS prints a note in the log and assigns a missing value to the result if you try to perform an illegal operation, such as dividing by zero, taking the logarithm of zero, or using an expression to produce a number too large to be represented as a floating-point number (known as overflow). The following example illustrates these points:

    data _null_;
      a=5;
      b=0;
      c = a/b;
      put c= ;
    run;

SAS Log:

    NOTE: Division by zero detected at line 374 column 10.
    c=.
    a=5 b=0 c=. _ERROR_=1 _N_=1
    NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values.

NUMERIC COMPARISONS

When it comes to the order of numeric values, a missing numeric value comes before any non-missing numeric value, and missing numeric values have their own sort order. Therefore, the following operation produces undesirable results:

    data new;
      a=.;
      if a<5 then b=1;
      put b= ;
    run;

SAS Log:

    b=1

See this paper's previous section on missing values. One of the suggested solutions was to use .Z for comparisons:

    data new;
      a=.;
      if .z<a<5 then b=1;
      put b= ;
    run;

This time, the value for b is correct:

    b=.

CONCLUSION

This paper has shown the traps associated with creating variables and the ways to avoid these traps. Special attention has been given to traps involving merging data sets, comparison of numeric variables and concatenation of character variables, as well as the RETAIN statement and calculations involving missing values. Avoiding these traps will save debugging time for novice and advanced SAS programmers.

REFERENCES

Christof Binder (2007). Proc Format - Tricks and Traps. PhUSE 2007 Conference Proceedings.
SAS Institute Inc. (1999). Dealing With Numeric Representation Error in SAS Applications. Technical Support TS-230. SAS Institute Inc., Cary, North Carolina. http://ftp.sas.com/techsup/download/technote/ts230.html
Malachy J. Foley (1998). MATCH-MERGING: 20 Some Traps and How to Avoid Them. SUGI 23 Conference Proceedings.
Malachy J. Foley (2007). MISSING VALUES: Everything You Ever Wanted to Know. WUSS 2007 Conference Proceedings.
Jyotheeswara Naidu Yellanki (2007). Importance of Warnings and Notes messages from SAS log. NESUG 2007 Proceedings.
Cody, Ron (2007). Learning SAS by Example: A Programmer's Guide. Cary, NC: SAS Institute Inc.
SAS Institute Inc. (2005). SAS 9.1.3 Language Reference: Concepts, Third Edition. Cary, NC: SAS Institute Inc.
SAS Institute Inc. (2010). SAS 9.2 Language Reference: Concepts, Second Edition. Cary, NC: SAS Institute Inc.
SAS Institute Inc. (2005). SAS 9.1.3 Language Reference: Dictionary, Third Edition. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Olena Galligan
    Clinops, LLC
    353 Sacramento St.
    San Francisco, CA 94111
    E-mail: stats208@gmail.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.