Paper 2917

Creating Variables: Traps and Pitfalls

Olena Galligan, Clinops LLC, San Francisco, CA

ABSTRACT

Creating variables is one of the most common SAS programming tasks. Sometimes, however, it produces unexpected results without any error or warning message in the SAS log. These unexpected results can occur when performing illegal operations, such as division by zero, or when attempting calculations that involve missing values. They can also occur in calculations performed in DATA steps that contain MERGE or SET statements. One should also beware of the frequent issues concerning numeric comparisons, concatenation of character variables, and the order of evaluation of numeric variables. This paper examines these issues, points out what to look for in the notes in the SAS log, and offers defensive programming techniques.

ROUNDING TO DECIMAL PLACES

According to SAS 9.2 Language Reference: Concepts (p. 42), imprecision can cause problems with comparisons. Consider, for example, computations involving the fraction 1/9:

    data _null_;
       x = 1/9;
       if x = .111 then y = 1;
       put x= / y=;
    run;

The resulting SAS log output is shown below:

    x=0.1111111111
    y=.

Trap: Some numbers cannot be represented exactly in decimal notation. The value of 1/9 is a repeating decimal, so it never equals the truncated literal .111.

Good Programming Practice: Use the ROUND( ) function whenever fractions are involved:

    data _null_;
       x = 1/9;
       if round(x, .001) = .111 then y = 1;
       put y=;
    run;

And we get the correct result:

    y=1

NUMERIC REPRESENTATION (ROUND-OFF PRECISION) ERROR

Consider the following example, modified from the SAS Support website at http://support.sas.com/techsup/technote/ts230.html:

    data test;
       x = 15.7 - 11.9;
       if x = 3.8 then var1 = 1;
       else var1 = 0;
       put var1=;
    run;
SAS Log:

    var1=0

However, if you look at the data set created, you'll see that x is displayed as 3.8, so it appears that var1 should be equal to 1, not 0. That is the assumption many people would make, yet var1 is not equal to 1.

Trap: The value stored in x is not exactly 3.8. Neither 15.7 nor 11.9 has an exact binary floating-point representation, so their difference differs very slightly from the stored value of the literal 3.8. Because the user has not defined a format for x, SAS displays it with the default format, BEST12., which rounds the value and hides the difference.

Good Programming Practice: With numeric variables, compare values through the ROUND( ) function, and use the widest BESTw. format, BEST32., when you need to inspect the stored value:

    data test;
       x = 15.7 - 11.9;
       if round(x, .1) = 3.8 then var1 = 1;
       else var1 = 0;
       put var1=;
       format x best32.;
    run;

The result is now correct: var1 is equal to 1.

MERGING DATA SETS

According to Malachy J. Foley in MATCH-MERGING: 20 Some Traps and How to Avoid Them, the advice is to "use only the basic 4-statement match-merge and to do all other processing in a separate DATA step. In other words, whenever possible, NEVER use a WHERE, an IF statement, an assignment statement, or anything else in the match-merge code. If manipulation is required, if at all possible, do it in a separate DATA step after the merge DATA step. This is extreme, but it works."

And here's why. When a MERGE statement is present, all input variables are initialized to missing only at the beginning of each BY group. Within the same BY group, the values of all input variables are automatically retained for the next iteration. The effect is similar to that of a RETAIN statement.

Consider these data sets:

Data1:    Data2:
    data data3;
       merge data1 data2;
       by ID;
       if GRADE = 'D' then GOOD_POOR = 'Poor';
    run;

Trap: The second and third records are wrong since GRADE is B but GOOD_POOR is Poor! The value of GOOD_POOR for observation #2 has been retained for observations #3 and #4 as well, because they belong to the same BY group as observation #2. The value of GOOD_POOR for observation #5 is correct, because it belongs to a different BY group and was reset.

That is why it's a good programming practice not to perform any calculations in a DATA step that contains MERGE or SET statements.

RETAIN STATEMENT

According to the SAS online documentation, the RETAIN statement causes a variable that is created by an INPUT or assignment statement to retain its value from one iteration of the DATA step to the next. The following points are associated with frequent errors in applying the RETAIN statement.

FORGETTING TO RESET THE RETAINED VARIABLES

Let's see what happens if you forget to reset the retained variables. Take this data set as an example:

Data set TEST1:

We want to retain the value of VAR3 in the (VAR1, VAR2) combination, provided that VAR2 is not missing.

    proc sort data=test1;
       by var1 var2;
    run;

    data test2;
       set test1;
       by var1 var2;
       retain var4;
       if first.var1 and var2 ne . then var4 = var3;
    run;
Data set TEST2:

Trap: Here, VAR4 was not supposed to be retained for observation #3, because VAR2 is missing. Many beginning programmers assume that the BY statement automatically resets retained variables for each BY group. But the BY statement only creates the FIRST.variable and LAST.variable indicators, and we have to reset each retained variable explicitly.

Good Programming Practice: Set the variables you create with the RETAIN statement to missing values (or zeros, or other suitable default values) at the start of each BY group. Adding this statement to the previous DATA step, before the assignment to var4, does just what's needed:

    if first.var1 then var4 = .;

USING RETAIN STATEMENT TO INITIALIZE VALUES

You can use the RETAIN statement to initialize values for individual variables, a list of variables, or members of an array. This RETAIN statement retains the values of five variables and sets their initial values to 1:

    retain month1-month5 1;

Trap: If you don't initialize a retained variable and use it only in the RETAIN statement, the variable is not written to the data set, and the SAS log will state that the variable is uninitialized.

USAGE OF IF AND WHERE STATEMENTS WITH RETAIN STATEMENT

Let's continue with the previous example, where we wanted to retain the value of VAR3 in the (VAR1, VAR2) combination, provided that VAR2 is not missing.

Data set TEST1:

But this time, if VAR2 is missing, we want to retain the value of VAR3 from the next observation. In our example, for VAR1=2, we want to retain the value VAR3=25. If we use the IF statement, we get an incorrect result:
    data test2;
       set test1;
       if var2 ne .;
       by var1 var2;
       retain var4;
       if first.var1 then var4 = .;
       if first.var1 then var4 = var3;
    run;

Data set TEST2:

Why was the value of 22 erroneously retained?

Trap: Conditions in the subsetting IF statement are applied after the observation has entered the program data vector. FIRST.VAR1 is therefore set on the observation with the missing VAR2, which the subsetting IF then deletes, so neither the reset nor the assignment to var4 executes for that BY group, and the previous value is carried forward.

Good Programming Practice: Use the WHERE statement instead of IF. Conditions in the WHERE statement are applied before the observation enters the program data vector, so FIRST.VAR1 is set on the first observation that passes the filter, and we get the correct result:

Data set TEST2:

RENAME, KEEP, DROP

If you need to use the DROP, KEEP, and RENAME statements or the DROP=, KEEP=, and RENAME= options in the same DATA step, keep in mind that they follow a special timing rule. DROP is evaluated first, followed by KEEP and then RENAME.

1. RENAME=, KEEP=, DROP= OPTIONS

The following code produces errors:

    data test1;
       old_var = 1;
    run;

    data test2;
       set test1 (keep=new_var   /* using the new variable name here produces errors */
                  rename=(old_var=new_var));
    run;

SAS Log:

    ERROR: Variable old_var is not on file WORK.TEST1.
    ERROR: Invalid DROP, KEEP, or RENAME option on file WORK.TEST1.
Trap: Since the KEEP= option is evaluated before the RENAME= option, it doesn't recognize the variable new_var, because that name hasn't been created yet.

Good Programming Practice: When the RENAME= option is used, use the old variable name with the KEEP= option, but use the new variable name in program statements:

    data test2;
       set test1 (keep=old_var   /* use the old variable name */
                  rename=(old_var=new_var));
       if new_var = 1 then other_var = 2;   /* use the new variable name */
       put new_var= other_var=;
    run;

2. DROP, KEEP, RENAME STATEMENTS

If you use the DROP, KEEP, and RENAME statements instead of the options, the same order of evaluation applies.

Trap: The RENAME statement takes effect only when the output data set is written. Using the new variable name in other statements of the same DATA step therefore gives wrong results, with only a note and a warning in the log:

    data test2;
       set test1;
       KEEP new_var;                        /* new variable name here causes problems */
       RENAME old_var=new_var;
       if new_var = 1 then other_var = 2;   /* new variable name here causes problems */
       put old_var= / new_var= / other_var=;
    run;

SAS Log:

    NOTE: Variable new_var is uninitialized.
    WARNING: The variable old_var in the DROP, KEEP, or RENAME list has never been referenced.
    old_var=1
    new_var=.
    other_var=.

Good Programming Practice: If you use the RENAME statement, use the old variable name in the other statements of the same DATA step:

    data test2;
       set test1;
       RENAME old_var=new_var;
       if old_var = 1 then other_var = 2;   /* use the OLD variable name here */
       KEEP old_var other_var;              /* use the OLD variable name here */
       put old_var= / other_var=;
    run;

SAS Log:

    old_var=1
    other_var=2
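If you would rather work with the new name throughout the step, one option is to move the rename from the RENAME statement to the RENAME= data set option on the SET statement. The following is a minimal sketch of that idea, not a definitive recipe. It reuses the data set and variable names from the examples above; extra_var is a hypothetical variable added only so that KEEP= has something to exclude.

```sas
/* Sketch only: extra_var is a hypothetical variable used for illustration. */
data test1;
   old_var   = 1;
   extra_var = 99;
run;

data test2;
   /* KEEP= is applied before RENAME=, so it lists the OLD name. */
   set test1 (keep=old_var rename=(old_var=new_var));

   /* From here on, the program data vector holds new_var, not old_var. */
   if new_var = 1 then other_var = 2;
   put new_var= other_var=;
run;

/* List the variables that were written to TEST2. */
proc contents data=test2 short;
run;
```

With this arrangement the PUT statement should write new_var=1 other_var=2, and TEST2 should contain only new_var and other_var. The design choice is simply to let the rename happen on input, so every program statement can use one name consistently.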
CONCATENATION

When creating variables based on the values of character variables, one should consider the following peculiarities of character values.

1. CHARACTER VALUES ARE CASE-SENSITIVE

Always make sure that you specify the correct case.

Good Programming Practice: Use the UPCASE( ) function to ensure the right character case:

    data test;
       a = 'notes';
       if UPCASE(a) = 'NOTES' then b = 1;
       put b=;
    run;

SAS Log:

    b=1

2. CHARACTER VALUES MAY HAVE LEADING AND/OR TRAILING BLANKS

It's possible to see leading blanks in raw data sets, but trailing blanks are not easy to detect. A good way to check for leading and trailing blanks is to attach a character (a colon, for example) to the beginning and end of a character value:

    data _null_;
       var1 = ' Day ';
       Join1 = ':' || var1 || ':';
       put Join1=;
    run;

SAS Log:

    Join1=: Day :

Good Programming Practice: Use the STRIP( ) function to get rid of these blanks:

    data new;
       a = ' happy b-day ';
       b = strip(a);
       if a = 'happy b-day' then c = 1;
       if STRIP(a) = 'happy b-day' then d = 1;
       put c= / d=;
    run;

SAS Log:

    c=.
    d=1

SAS pads the shorter operand of a character comparison with blanks, so it is the leading blank that makes the first comparison fail.

If you need to concatenate several character variables, you might have to get rid of the leading and trailing blanks, too. Instead of applying the STRIP( ) function to each variable, use either the CATS or the CATX function. Both strip leading and trailing blanks before concatenating the strings, but with CATX you must supply a separator for the strings. Here are the examples:
    data _null_;
       Var1 = 'Blue ';
       Var2 = 'Sky ';
       Join1 = cats(Var1, Var2);
       Join2 = catx(' ', Var1, Var2);
       put Join1= / Join2= /;
    run;

SAS Log:

    Join1=BlueSky
    Join2=Blue Sky

MISSING VALUES

1. CALCULATIONS INVOLVING ARITHMETIC OPERATIONS WITH MISSING VALUES

A calculation that applies an arithmetic operator to a missing value results in a missing value:

    data test;
       x = .;
       y = 2;
       a = x + y;
       b = sum(x, y);
       put a= / b=;
    run;

SAS Log:

    a=.
    b=2
    NOTE: Missing values were generated as a result of performing an operation on missing values.

Good Programming Practice: If you want to omit missing values from computations, use the sample statistic functions. For example, the SUM function disregards missing values: adding X to Y with the SUM function results not in a missing value but in 2. For a complete discussion and examples, see SAS Language Reference: Dictionary.

2. UNINITIALIZED VARIABLES

When the variable age is compared to a variable var1 that was not previously defined, var1 is missing, so the following condition is always true for a non-missing age and produces an undesirable result:

    data new1;
       age = 10;
       if age > var1 then group = 'teen age';
       put age= / group=;
    run;

SAS Log:

    NOTE: Variable var1 is uninitialized.
    age=10
    group=teen age
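One defensive way to handle the uninitialized-variable comparison above is to test the comparison variable for missing before comparing. The sketch below reuses the names from that example and assumes only the MISSING function; var1 is deliberately left unassigned so the guard has something to catch.

```sas
/* Sketch only: var1 is intentionally never assigned, as in the example above. */
data new1;
   age = 10;

   /* Compare only when var1 actually holds a value. */
   if not missing(var1) and age > var1 then group = 'teen age';

   put age= / group=;
run;
```

Here the condition is false because var1 is missing, so GROUP stays blank instead of being set by accident. The note about the uninitialized variable is still written to the log and remains the main clue that a name was mistyped or never created. Releases that provide the VARINITCHK= system option (an assumption about your installation; check the system options reference for your release) may also let you escalate that note to a warning or an error.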
3. CHECKING FOR MISSING NUMERIC VALUES

When checking for missing numeric values, the most frequently used code is the following:

    IF var1 = . THEN a = 1;

Trap: A dot is only one of the 28 ways to represent a missing numeric value, so in some instances the above code won't detect all the missing values. The other 27 are the dot-underscore (._) and the dot-letter values (.A through .Z).

Good Programming Practice: Compare against the highest of these 28 missing values, .Z, which catches every instance of a missing numeric value:

    IF var1 <= .Z THEN a = 1;

4. ILLEGAL OPERATIONS

According to SAS(R) 9.1 Language Reference: Concepts, SAS prints a note in the log and assigns a missing value to the result if you try to perform an illegal operation, such as the following:

    - dividing by zero
    - taking the logarithm of zero
    - using an expression to produce a number too large to be represented as a floating-point number (known as overflow).

The following example illustrates these points:

    data _null_;
       a = 5;
       b = 0;
       c = a/b;
       put c=;
    run;

SAS Log:

    NOTE: Division by zero detected at line 374 column 10.
    c=.
    a=5 b=0 c=. _ERROR_=1 _N_=1
    NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values.

NUMERIC COMPARISONS

In the order of numeric values, a missing numeric value comes before any non-missing numeric value, and the missing numeric values have their own sort order among themselves. Therefore, the following operation produces an undesirable result:

    data new;
       a = .;
       if a < 5 then b = 1;
       put b=;
    run;

SAS Log:

    b=1

See this paper's previous section for the discussion of missing values. One of the suggested solutions was to use .z for comparisons:
    data new;
       a = .;
       if .z < a < 5 then b = 1;
       put b=;
    run;

This time, the value of b is correct:

    b=.

CONCLUSION

This paper has shown the traps associated with creating variables and the ways to avoid them. Special attention has been given to traps involving merging data sets, comparison of numeric variables, and concatenation of character variables, as well as the RETAIN statement and calculations involving missing values. Avoiding these traps will save debugging time for novice and advanced SAS programmers alike.

REFERENCES

Binder, Christof (2007). Proc Format - Tricks and Traps. PhUSE 2007 Conference Proceedings.

SAS Institute Inc. (1999). Dealing With Numeric Representation Error in SAS Applications. Technical Support TS-230. SAS Institute Inc., Cary, North Carolina. http://ftp.sas.com/techsup/download/technote/ts230.html

Foley, Malachy J. (1998). MATCH-MERGING: 20 Some Traps and How to Avoid Them. SUGI 23 Conference Proceedings.

Foley, Malachy J. (2007). MISSING VALUES: Everything You Ever Wanted to Know. WUSS 2007 Conference Proceedings.

Yellanki, Jyotheeswara Naidu (2007). Importance of Warnings and Notes messages from SAS log. NESUG 2007 Proceedings.

Cody, Ron (2007). Learning SAS by Example: A Programmer's Guide. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (2005). SAS 9.1.3 Language Reference: Concepts, Third Edition. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (2010). SAS 9.2 Language Reference: Concepts, Second Edition. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (2005). SAS 9.1.3 Language Reference: Dictionary, Third Edition. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Olena Galligan
    Clinops, LLC
    353 Sacramento St.
    San Francisco, CA 94111
    E-mail: stats208@gmail.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.