How to Reduce the Disk Space Required by a SAS Data Set



Selvaratnam Sridharma, U.S. Census Bureau, Washington, DC

ABSTRACT

SAS data sets can be large, and disk space is often at a premium. This paper discusses how SAS options such as COMPRESS=, SAS statements such as LENGTH and ATTRIB, SAS views, and macros can be used to reduce the size of a SAS data set. An existing macro, %SQUEEZE, finds the minimum lengths required by the numeric and character variables in a SAS data set and applies those lengths to reduce the size of the data set. Another macro, %DROPMISS, which is developed here, automatically identifies and drops variables that contain only missing (null) values.

INTRODUCTION

Storing a large data set can exhaust the available storage space. Compressing the data set with the SAS COMPRESS= option can save a large amount of storage space, and so can using SAS views instead of SAS data sets. Another way to reduce the size of a SAS data set is to save only the needed variables, using the DROP= and/or KEEP= options. Assigning each variable the minimum length it requires, with a LENGTH or ATTRIB statement, also reduces the size of the data set.

It is often difficult to find the minimum length required by a variable in a SAS data set. An existing macro, %SQUEEZE, finds the minimum lengths required by the variables in a SAS data set and assigns those lengths to the variables. The %SQUEEZE macro is modified slightly here to make it more efficient.

Sometimes all the values of some variables in a SAS data set are missing, and we would like to drop these variables to save storage space. The %DROPMISS macro developed here does this.
SAS SYSTEM OPTIONS / STATEMENTS

SAS system options, data set options, and statements such as LENGTH, ATTRIB, KEEP, DROP, and COMPRESS can be used to reduce the space required by a SAS data set.

LENGTH AND ATTRIB

Controlling the lengths of individual variables may greatly reduce the size of a SAS data set. A LENGTH or ATTRIB statement can be used to assign a length to a numeric or a character variable. For character variables, the statement must appear in the DATA step before the first reference to those variables (for example, before the SET statement that reads them). For integer-valued numeric variables and for character variables with short values, this may dramatically decrease the size of the data set.

For character variables, one byte corresponds to one character. Hence, to minimize storage space, set the length of each character variable to the number of characters in its longest value. The minimum length required by a numeric variable depends on the operating environment. Two examples are given below.

    data X;
       length a b c 3
              d e 5
              f g $4;
       set Y;
    run;

    data X;
       attrib a b c length=3;
       attrib d e length=5;
       attrib f g length=$4;
       set Y;
    run;

The ATTRIB statement can also be used to change a variable's FORMAT, INFORMAT, and LABEL.
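One way to choose the character lengths for a LENGTH statement is to measure the longest stored value first. The sketch below assumes a hypothetical data set WORK.Y with two over-long character variables f and g; the sample values are invented for illustration and are not from the paper.

```sas
/* Hypothetical input: WORK.Y with over-long character variables f and g */
data work.y;
   length f g $40;
   input f $ g $;
   datalines;
ab wxyz
abc wx
;
run;

/* LENGTH() returns the position of the last non-blank character,    */
/* so MAX(LENGTH(var)) is the shortest length that holds every value */
proc sql;
   select max(length(f)) as max_len_f,
          max(length(g)) as max_len_g
   from work.y;
quit;

/* Apply the lengths reported above (3 and 4 for this sample data).  */
/* The LENGTH statement precedes SET so it takes effect for          */
/* character variables.                                              */
data work.x;
   length f $3 g $4;
   set work.y;
run;
```

Recent SAS releases may write a warning about multiple lengths to the log for the last step; here the shortening is intentional. Note that LENGTH() returns 1 for a blank value, so the result is never below the minimum character length of 1.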

One needs to be careful when assigning the length of a numeric variable with the LENGTH statement. If the length assigned to a numeric variable is not adequate, some of its values will be truncated in the output data set, and the statement will not generate an error. It is not advisable to shorten non-integer variables, because some of the non-integer values can lose precision. When the length assigned to a character variable is not adequate, its values are truncated as well; depending on the SAS release, the log shows a warning about multiple lengths rather than an error, so the log should be checked.

KEEP AND DROP

When a SAS data set is created, only the needed variables should be kept. This can save a large amount of the space required to store the data set. Use KEEP= and/or DROP= to remove the unnecessary variables. To save processing time, this should be done as early as logically possible, as in the following example.

    data A (keep=a b q r);
       set B (drop=h k);
       a = l + p;
       b = r + q;
    run;

SAS COMPRESS

COMPRESS= is both a SAS system option and a data set option, and it can greatly reduce the disk space required to store a SAS data set. You can set the option to either YES or BINARY. In newer versions, CHAR can be used instead of YES. If there are more character variables than numeric variables, COMPRESS=YES is generally the better choice. If there are more numeric variables than character variables, COMPRESS=BINARY is generally better. Both settings should be tried to find out which one works better for a given data set. The option is used as in the following examples.

    data A (compress=yes);
       set sashelp.eismsg;
    run;

    NOTE: There were 1470 observations read from the data set SASHELP.EISMSG.
    NOTE: The data set WORK.A has 1470 observations and 6 variables.
    NOTE: Compressing data set WORK.A decreased size by 57.14 percent.
          Compressed is 15 pages; un-compressed would require 35 pages.
    data A (compress=binary);
       set sashelp.eismsg;
    run;

    NOTE: There were 1470 observations read from the data set SASHELP.EISMSG.
    NOTE: The data set WORK.A has 1470 observations and 6 variables.
    NOTE: Compressing data set WORK.A decreased size by 54.29 percent.
          Compressed is 16 pages; un-compressed would require 35 pages.

When COMPRESS= is specified as a SAS system option at the beginning of a program (for example, OPTIONS COMPRESS=YES;), all the SAS data sets created by the program are compressed. An option to use together with COMPRESS= is REUSE=. Specifying REUSE=YES allows SAS to reuse space within the compressed SAS data set that has been freed by deleted observations.

SAS uncompresses a compressed data set before using it in computations in DATA or PROC steps. So, although compression saves disk space, it requires additional CPU time to compress and uncompress the data.

When the uncompressed file is small, compression can produce a file larger than the original. Beginning with Version 8, SAS will not compress a SAS data set when the result would be a larger file.

    data A (compress=yes);
       set sashelp.accpeo;
    run;

    NOTE: There were 20 observations read from the data set SASHELP.ACCPEO.
    NOTE: The data set WORK.A has 20 observations and 3 variables.
    NOTE: Compressing data set WORK.A increased size by 100.00 percent.
          Compressed is 2 pages; un-compressed would require 1 pages.

Here are some benchmark results for a large data set that is used at the Census Bureau. The compression ratio is the ratio of the size of the compressed data set to the size of the uncompressed data set.

    COMPRESS OPTION    SIZE (BYTES)    COMPRESSION RATIO
    None                 80,159,297    ------
    Binary               21,517,441    26.8%
    Char                 37,008,961    46.1%

SAS VIEW

As an alternative to a SAS data set, one can use a SAS view. SAS views provide all the functionality of a SAS data set. A SAS view contains only the instructions required to retrieve data values from other SAS data sets or files, so it occupies only a small fraction of the space required by the corresponding SAS data set. A SAS view can be created with a DATA step or with PROC SQL. The following is an example of a SAS view created with a DATA step.

    data B / view=B;
       set sashelp.eismsg;
    run;

    NOTE: DATA STEP view saved on file WORK.B.

A PROC SQL view can read data from DATA step views, SAS data sets, other PROC SQL views, and ORACLE or other DBMS data.

    proc sql;
       create view AB as
          select var1, var2, var3
          from A
          order by var3, var4;
    quit;

    NOTE: SQL view WORK.AB has been defined.

In the above example, A can be a SAS view, a SAS data set, an ORACLE table, or any other DBMS table.

Starting with Version 8, a DATA step view retains its source statements. One can retrieve these statements as in the following example.

    data view=B;
       describe;
    run;

    NOTE: DATA step view WORK.B is defined as:
          data B/view=B;
          set sashelp.eismsg;
          run;

To retrieve the source statements for an SQL view, one needs to use PROC SQL, as in the following example.

    proc sql;
       describe view AB;
    quit;

    NOTE: SQL view WORK.AB is defined as:
          select var1, var2, var3
          from A
          order by var3 asc, var4 asc;

SQUEEZING A SAS DATA SET

When a large data set is created, it is usually difficult to find the minimum length required by each individual variable. The following macros find the minimum lengths required by the numeric and character variables in a SAS data set and use these lengths to reduce the size of the data set. These macros can greatly reduce the storage space required by a SAS data set.

%SQUEEZE

The %SQUEEZE macro created by Ross Bettinger (see Reference 1) squeezes a data set by reducing the space required by its numeric and character variables. Variables that should not be squeezed can be excluded by listing them in the NOCOMPRESS= parameter of the macro. If you do not include the highlighted part of the code in Appendix A, you would have the code for the %SQUEEZE macro.

%SQUEEZE_1

The %SQUEEZE macro is modified slightly here to create the macro %SQUEEZE_1. %SQUEEZE finds the minimum length required by every numeric variable by applying the TRUNC function repeatedly to each and every value of the variable. This work is not always needed: the minimum length of a numeric variable need not be searched for if its length is already three, and the TRUNC function need not be applied to values small enough to be stored exactly in three bytes. %SQUEEZE_1 incorporates these improvements. It runs at least as fast as %SQUEEZE and generally faster.

The squeezing technique discussed here may be used on integer-valued numeric variables, but should not be used on non-integer-valued numeric variables. If you use this technique on non-integer-valued numeric variables, you might lose some accuracy for these variables. The code for this macro is given in Appendix A.
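The TRUNC-based test at the heart of the squeezing technique can be tried on a handful of values. The sketch below is not taken from either macro; the data set name WORK.MIN_LEN and the sample values are assumptions made for illustration. For each value it reports the shortest length, from 3 to 8 bytes, at which the truncated value still equals the original.

```sas
/* Hypothetical demonstration data set; values are invented */
data work.min_len;
   input value;
   min_len = 8;                 /* default when no shorter length is exact */
   do len = 3 to 7;
      /* TRUNC(value, len) returns the value as it would be stored in len bytes */
      if trunc(value, len) = value then do;
         min_len = len;
         leave;                 /* stop at the first (shortest) exact length */
      end;
   end;
   drop len;
   datalines;
1
8192
8193
2097152
0.1
;
run;

proc print data=work.min_len;
run;
```

On hosts that use IEEE floating point (such as Windows and UNIX), 8192 should be reported as needing 3 bytes and 8193 as needing 4, while a non-integer such as 0.1 needs all 8 bytes. This is why the technique is recommended only for integer-valued variables.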
COMPARING %SQUEEZE AND %SQUEEZE_1

%SQUEEZE and %SQUEEZE_1 were each used to squeeze a large SAS data set of 24,969,537 bytes, with the results below.

    %SQUEEZE_1    SIZE (bytes)    SQUEEZED RATIO
    No              24,969,537    ------
    Yes             21,515,265    86.1%

    Macros        TIME (minutes)
    %SQUEEZE      126
    %SQUEEZE_1    108

DROPPING VARIABLES WITH ONLY MISSING VALUES

When all the values of some numeric or character variables are missing, deleting these variables can save a large amount of disk space. Sometimes, however, you want to keep certain variables even though all of their values are missing. The macro %DROPMISS developed here (see Appendix B) automatically drops the variables in a data set whose values are always missing. Variables that should be kept regardless can be listed in the NODROP= parameter of the macro. This macro is more efficient than the program in Reference 2. The table below gives the results of both programs for one SAS data set.

    Programs                SIZE (bytes)    TIME (minutes)
    Program in Sample 53         567,889    13.2
    %DROPMISS                    567,889     3.3

COMBINING %SQUEEZE_1 AND %DROPMISS

Combining %SQUEEZE_1 and %DROPMISS yields another macro, %SQ_DROPMISS (see Appendix C). To save processing time, it is better to use %SQ_DROPMISS than to run %SQUEEZE_1 and %DROPMISS separately on a SAS data set.

When the methods in %SQUEEZE are applied to a SAS data set, they squeeze the lengths of character variables that are always missing to 1, and the lengths of numeric variables that are always missing to 3. So, after squeezing, only the character variables of length 1 and the numeric variables of length 3 need to be checked for values that are always missing. This saves a great amount of processing time.

CONCLUSIONS

There are many ways to reduce the size of a data set that you want to store. The SAS options and statements, SAS views, and macros discussed in this paper can all be used to reduce the space required by a SAS data set. Instead of running %SQUEEZE_1 and %DROPMISS separately on a data set, it is better to use the macro %SQ_DROPMISS to save processing time.

REFERENCES

1. Ross Bettinger, Sample 267: %SQUEEZE-ing before Compressing Data, Redux. 7 Jul. 2006 <http://support.sas.com/ctx/samples/index.jsp?sid=267>
2. Sample 53: Delete variables that have only missing data. 7 Jul. 2006 <http://support.sas.com/ctx/samples/index.jsp?sid=53&tab=code>

ACKNOWLEDGMENTS

We would like to thank David Chapman for offering valuable suggestions and comments.

SAS is a Registered Trademark of the SAS Institute, Inc. of Cary, North Carolina.

DISCLAIMER

This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a more limited review by the Census Bureau than its official publications.
This report is released to inform interested parties and to encourage discussion.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Selvaratnam Sridharma
Economic Planning and Coordination Division
U.S. Bureau of the Census
Washington, DC 20233-6100
301-763-6774
Email: selvaratnam.sridharma@census.gov

APPENDIX A: %SQUEEZE_1

%macro SQUEEZE_1( DSNIN        /* name of input SAS dataset  */
                , DSNOUT       /* name of output SAS dataset */
                , NOCOMPRESS=  /* [optional] variables to be omitted from the
                                  minimum-length computation process */
                ) ;

/* PURPOSE: create LENGTH statement for vars that minimizes the variable length
 * to:
 *    numeric vars: the fewest # of bytes needed to exactly represent the values
 *                  contained in the variable
 *    character vars: the fewest # of bytes needed to contain the longest
 *                  character string
 *
 * macro variable SQZLENTH is created which is then invoked in a subsequent
 * data step
 *
 * NOTE: if no char vars in dataset, produce no char var processing code
 * NOTE: length of format for char vars is changed to match computed length
 *       of char var
 *       e.g., if length( CHAR_VAR ) = 10 after %SQUEEZE-ing, then FORMAT CHAR_VAR
 *       $10. ; is generated
 * NOTE: variables in &DSNOUT are maintained in same order as in &DSNIN
 * NOTE: variables named in &NOCOMPRESS are not included in the minimum-
 *       length computation process and keep their original lengths as specified in
 *       &DSNIN
 *
 * EXAMPLE OF USE:
 *    %SQUEEZE_1( DSNIN, DSNOUT )
 *    %SQUEEZE_1( DSNIN, DSNOUT, NOCOMPRESS=A B C D--H X1-X100 )
 *    %SQUEEZE_1( DSNIN, DSNOUT, NOCOMPRESS=_numeric_ )
 *    %SQUEEZE_1( DSNIN, DSNOUT, NOCOMPRESS=_character_ )
 */

%global SQUEEZE ;
%local I ;

%if "&DSNIN" = "&DSNOUT" %then %do ;
   %put /------------------------------------------------\ ;
   %put ERROR from SQUEEZE: ;
   %put Input Dataset has same name as Output Dataset. ;
   %put Execution terminating forthwith. ;
   %put \------------------------------------------------/ ;
   %goto L9999 ;
%end ;

/*###############################################################################*/
/* begin executable code                                                          */
/*###############################################################################*/

/* Find the first positive integer n such that n+1 needs more than 3 bytes.   */
/* Negative of this number will be the first negative integer n such that n-1 */
/* needs more than 3 bytes.                                                   */
/* (DATA _NULL_ is used so that no work data set is created or overwritten.)  */
data _null_ ;
   do i = 1 to 10000 ;
      a = trunc( i, 3 ) ;
      if a ^= i then do ;
         call symput( 'max_3', left( put( a, 8. ))) ;
         stop ;
      end ;
   end ;
run ;

/* create dataset of variable names whose lengths are to be minimized */
/* exclude from the computation all names in &NOCOMPRESS              */
proc contents data=&dsnin( drop=&nocompress ) memtype=data noprint
              out=_cntnts_( keep= name type length ) ;
run ;

%let N_CHAR = 0 ;
%let N_NUM = 0 ;

/* count only the variables that can still be shortened */
data _null_ ;
   set _cntnts_ end=lastobs nobs=nobs ;
   where ( type = 1 and length ^= 3 ) or ( type = 2 and length ^= 1 ) ;
   if nobs = 0 then stop ;
   n_char + ( type = 2 ) ;
   n_num + ( type = 1 ) ;
   /* create macro vars containing final # of char, numeric variables */
   if lastobs then do ;
      call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
      call symput( 'N_NUM', left( put( n_num, 5. ))) ;
   end ;
run ;

/* if there are NO numeric or character vars in dataset, stop further processing */
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
   %put /----------------------------------\ ;
   %put ERROR from SQUEEZE: ;
   %put No variables in dataset. ;
   %put Execution terminating forthwith. ;
   %put \----------------------------------/ ;
   %goto L9999 ;
%end ;

/* put global macro names into global symbol table for later retrieval */
%do I = 1 %to &N_NUM ;
   %global NUM&I NUMLEN&I ;
%end ;
%do I = 1 %to &N_CHAR ;
   %global CHAR&I CHARLEN&I ;
%end ;

/* create macro vars containing variable names                               */
/* efficiency note: could compute n_char, n_num here, but must declare macro */
/* names to be global b4 stuffing them                                       */
/* note: if no char vars in data, do not create macro vars                   */
proc sql noprint ;
   %if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from _cntnts_ where type = 2 and length ne 1 ; ) ;
   %if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM from _cntnts_ where type = 1 and length ne 3 ; ) ;
quit ;

/* compute min # bytes (3 = min length, for portability over platforms) for */
/* numeric vars; compute min # bytes to keep rightmost character for char vars */
data _null_ ;
   set &DSNIN end=lastobs ;
   %if &N_NUM > 0 %then %str( array _num_len_ ( &N_NUM ) 3 _temporary_ ; ) ;
   %if &N_CHAR > 0 %then %str( array _char_len_ ( &N_CHAR ) _temporary_ ; ) ;
   if _n_ = 1 then do ;
      %if &N_CHAR > 0 %then %str( do i = 1 to &N_CHAR ; _char_len_( i ) = 0 ; end ; ) ;
      %if &N_NUM > 0 %then %str( do i = 1 to &N_NUM ; _num_len_ ( i ) = 3 ; end ; ) ;
   end ;
   %if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
      _char_len_( &I ) = max( _char_len_( &I ), length( &&CHAR&I )) ;
   %end ;
   %if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
      /* TRUNC is applied only to nonmissing values outside the 3-byte range */
      if &&NUM&I ne . then do ;
         if ( &&NUM&I > &max_3 or &&NUM&I < -&max_3 ) then do ;
            if &&NUM&I ne trunc( &&NUM&I, 7 ) then _num_len_( &I ) = max( _num_len_( &I ), 8 ) ;
            else if &&NUM&I ne trunc( &&NUM&I, 6 ) then _num_len_( &I ) = max( _num_len_( &I ), 7 ) ;
            else if &&NUM&I ne trunc( &&NUM&I, 5 ) then _num_len_( &I ) = max( _num_len_( &I ), 6 ) ;
            else if &&NUM&I ne trunc( &&NUM&I, 4 ) then _num_len_( &I ) = max( _num_len_( &I ), 5 ) ;
            else if &&NUM&I ne trunc( &&NUM&I, 3 ) then _num_len_( &I ) = max( _num_len_( &I ), 4 ) ;
         end ;
      end ;
   %end ;
   if lastobs then do ;
      %if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
         call symput( "CHARLEN&I", put( _char_len_( &I ), 5. )) ;
      %end ;
      %if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
         call symput( "NUMLEN&I", put( _num_len_( &I ), 1. )) ;
      %end ;
   end ;
run ;

proc datasets nolist ;
   delete _cntnts_ ;
run ;
quit ;

/* initialize SQZ_NUM, SQZ_CHAR global macro vars */
%let SQZ_NUM = LENGTH ;
%let SQZ_CHAR = LENGTH ;
%let SQZ_CHAR_FMT = FORMAT ;

%if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
   %let SQZ_CHAR = &SQZ_CHAR %qtrim( &&CHAR&I ) $%left( &&CHARLEN&I ) ;
   %let SQZ_CHAR_FMT = &SQZ_CHAR_FMT %qtrim( &&CHAR&I ) $%left( &&CHARLEN&I ). ;
%end ;
%if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
   %let SQZ_NUM = &SQZ_NUM %qtrim( &&NUM&I ) &&NUMLEN&I ;
%end ;

/* build macro var containing order of all variables */
data _null_ ;
   length retain $32767 ;
   retain retain 'retain ' ;
   dsid = open( "&DSNIN", 'I' ) ; /* open dataset for read access only */
   do _i_ = 1 to attrn( dsid, 'nvars' ) ;
      retain = trim( retain ) || ' ' || varname( dsid, _i_ ) ;
   end ;
   call symput( 'RETAIN', retain ) ;
run ;

/* apply SQZ_* to incoming data, create output dataset */
data &DSNOUT ;
   &RETAIN ;
   %if &N_CHAR > 0 %then %str( &SQZ_CHAR ; ) ;     /* optimize char var lengths        */
   %if &N_NUM > 0 %then %str( &SQZ_NUM ; ) ;       /* optimize numeric var lengths     */
   %if &N_CHAR > 0 %then %str( &SQZ_CHAR_FMT ; ) ; /* adjust char var format lengths   */
   set &DSNIN ;
run ;

%L9999:
%mend SQUEEZE_1 ;

APPENDIX B: %DROPMISS

%macro DROPMISS( DSNIN    /* name of input SAS dataset  */
               , DSNOUT   /* name of output SAS dataset */
               , NODROP=  /* [optional] variables to be omitted from dropping
                             even if they have only missing values */
               ) ;

/* PURPOSE: To find both the character and numeric variables that have only
 * missing values and drop them if they are not in &NODROP
 *
 * NOTE: if no char vars in dataset, produce no char var processing code
 *
 * EXAMPLE OF USE:
 *    %DROPMISS( DSNIN, DSNOUT )
 *    %DROPMISS( DSNIN, DSNOUT, NODROP=A B C D--H X1-X100 )
 *    %DROPMISS( DSNIN, DSNOUT, NODROP=_numeric_ )
 *    %DROPMISS( DSNIN, DSNOUT, NODROP=_character_ )
 */

%global DROP1 ;
%local I ;

%if "&DSNIN" = "&DSNOUT" %then %do ;
   %put /------------------------------------------------\ ;
   %put ERROR from DROPMISS: ;
   %put Input Dataset has same name as Output Dataset. ;
   %put Execution terminating forthwith. ;
   %put \------------------------------------------------/ ;
   %goto L9999 ;
%end ;

/*###############################################################################*/
/* begin executable code                                                          */
/*###############################################################################*/

/* create dataset of the names of the variables to be checked */
/* exclude from the computation all names in &NODROP           */
proc contents data=&dsnin( drop=&nodrop ) memtype=data noprint
              out=_cntnts_( keep= name type ) ;
run ;

%let N_CHAR = 0 ;
%let N_NUM = 0 ;

data _null_ ;
   set _cntnts_ end=lastobs nobs=nobs ;
   if nobs = 0 then stop ;
   n_char + ( type = 2 ) ;
   n_num + ( type = 1 ) ;
   /* create macro vars containing final # of char, numeric variables */
   if lastobs then do ;
      call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
      call symput( 'N_NUM', left( put( n_num, 5. ))) ;
   end ;
run ;

/* if there are NO numeric or character vars in dataset, stop further processing */
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
   %put /----------------------------------\ ;
   %put ERROR from DROPMISS: ;
   %put No variables in dataset. ;
   %put Execution terminating forthwith. ;
   %put \----------------------------------/ ;
   %goto L9999 ;
%end ;

/* put global macro names into global symbol table for later retrieval */
%do I = 1 %to &N_NUM ;
   %global NUM&I ;
%end ;
%do I = 1 %to &N_CHAR ;
   %global CHAR&I ;
%end ;

/* create macro vars containing variable names                               */
/* efficiency note: could compute n_char, n_num here, but must declare macro */
/* names to be global b4 stuffing them                                       */
/* note: if no char vars in data, do not create macro vars                   */
proc sql noprint ;
   %if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from _cntnts_ where type = 2 ; ) ;
   %if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM from _cntnts_ where type = 1 ; ) ;
quit ;

/* put MAXIMUM values of the variables into macro variables;       */
/* MAX ignores missing values, so a missing maximum means that all */
/* values of the variable are missing                              */
%if &N_CHAR > 1 %then %let N_CHAR_1 = %eval( &N_CHAR - 1 ) ;
%if &N_NUM > 1 %then %let N_NUM_1 = %eval( &N_NUM - 1 ) ;

proc sql noprint ;
   select
      %if &N_NUM > 1 %then %do ;
         %do I = 1 %to &N_NUM_1 ;
            max( &&NUM&I ),
         %end ;
      %end ;
      %if &N_NUM > 0 %then %do ;
         max( &&NUM&N_NUM )
      %end ;
      %if &N_CHAR > 0 and &N_NUM > 0 %then %do ;
         ,
      %end ;
      %if &N_CHAR > 1 %then %do ;
         %do I = 1 %to &N_CHAR_1 ;
            max( &&CHAR&I ),
         %end ;
      %end ;
      %if &N_CHAR > 0 %then %do ;
         max( &&CHAR&N_CHAR )
      %end ;
   into
      %if &N_NUM > 1 %then %do ;
         %do I = 1 %to &N_NUM_1 ;
            :NUMMAX&I,
         %end ;
      %end ;
      %if &N_NUM > 0 %then %do ;
         :NUMMAX&N_NUM
      %end ;
      %if &N_CHAR > 0 and &N_NUM > 0 %then %do ;
         ,
      %end ;
      %if &N_CHAR > 1 %then %do ;
         %do I = 1 %to &N_CHAR_1 ;
            :CHARMAX&I,
         %end ;
      %end ;
      %if &N_CHAR > 0 %then %do ;
         :CHARMAX&N_CHAR
      %end ;
   from &DSNIN ;
quit ;

/* initialize DROP_NUM, DROP_CHAR macro vars */
%let DROP_NUM = ;
%let DROP_CHAR = ;

%if &N_CHAR > 0 %then %do ;
   %do I = 1 %to &N_CHAR ;
      %if &&CHARMAX&I = %then %do ;
         %let DROP_CHAR = &DROP_CHAR %qtrim( &&CHAR&I ) ;
      %end ;
   %end ;
%end ;
%if &N_NUM > 0 %then %do ;
   %do I = 1 %to &N_NUM ;
      %if &&NUMMAX&I = . %then %do ;
         %let DROP_NUM = &DROP_NUM %qtrim( &&NUM&I ) ;
      %end ;
   %end ;
%end ;

/* apply DROP_* to incoming data, create output dataset */
data &DSNOUT ;
   %if &DROP_CHAR ^= %then %str( DROP &DROP_CHAR ; ) ; /* drop char variables that have only missing values */
   %if &DROP_NUM ^= %then %str( DROP &DROP_NUM ; ) ;   /* drop num variables that have only missing values  */
   set &DSNIN ;
run ;

%L9999:
%mend DROPMISS ;
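The test that %DROPMISS relies on can be seen in isolation: the MAX summary function in PROC SQL ignores missing values, so it returns a missing result only when every value in the column is missing. The sketch below uses an invented data set WORK.DEMO and invented macro variable names; it illustrates the idea and is not part of the macro.

```sas
/* Hypothetical data: id is populated, score and note are always missing */
data work.demo;
   length note $8;
   do id = 1 to 3;
      score = .;      /* numeric missing   */
      note  = ' ';    /* character missing */
      output;
   end;
run;

/* One pass over the data yields the maximum of every column */
proc sql noprint;
   select max(id), max(score), max(note)
      into :max_id, :max_score, :max_note
   from work.demo;
quit;

/* A maximum of . (numeric) or blank (character) marks an all-missing variable */
%put max_id=&max_id max_score=&max_score max_note=[&max_note];
```

Because all maxima are computed in a single SELECT, the data set is read only once no matter how many variables are checked.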

APPENDIX C: %SQ_DROPMISS

OPTIONS MPRINT MLOGIC SYMBOLGEN ;

%macro SQ_DROPMISS( DSNIN        /* name of input SAS dataset  */
                  , DSNOUT       /* name of output SAS dataset */
                  , NOCOMPRESS=  /* [optional] variables to be omitted from the
                                    minimum-length computation process */
                  , NODROP=      /* [optional] variables to be omitted from dropping
                                    even if they have only missing values */
                  ) ;

/* PURPOSE: Squeeze a data set to have minimum lengths required for the
 * variables excluding the variables in &NOCOMPRESS applying %SQUEEZE_1 and
 * then DROP the variables that have always missing values in a more
 * efficient way.
 *
 * EXAMPLE OF USE:
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS= )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=A B C D--H X1-X100 )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=_numeric_ )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=_character_ )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=_character_, NODROP=A C D )
 */

/*###############################################################################*/
/* begin executable code                                                          */
/*###############################################################################*/

/* Squeezing part */
/* The %SQUEEZE_1 macro (Appendix A) must be compiled before this call. */
/* NOCOMPRESS is a keyword parameter, so it is passed by name.          */
%SQUEEZE_1( &DSNIN, DSNSQUEEZED, NOCOMPRESS=&NOCOMPRESS )

/* Dropping part */
%global DROP1 ;
%local I ;

%if "&DSNIN" = "&DSNOUT" %then %do ;
   %put /------------------------------------------------\ ;
   %put ERROR from SQ_DROPMISS: ;
   %put Input Dataset has same name as Output Dataset. ;
   %put Execution terminating forthwith. ;
   %put \------------------------------------------------/ ;
   %goto L9999 ;
%end ;

/* create dataset of the names of the variables to be checked */
/* exclude from the computation all names in &NODROP           */
proc contents data=dsnsqueezed( drop=&nodrop ) memtype=data noprint
              out=_cntnts_( keep= name type length ) ;
run ;

%let N_CHAR = 0 ;
%let N_NUM = 0 ;

/* after squeezing, only numeric vars of length 3 and char vars of length 1 */
/* can have all values missing                                              */
data _null_ ;
   set _cntnts_ end=lastobs nobs=nobs ;
   where ( type = 1 and length = 3 ) or ( type = 2 and length = 1 ) ;
   if nobs = 0 then stop ;
   n_char + ( type = 2 ) ;
   n_num + ( type = 1 ) ;
   /* create macro vars containing final # of char, numeric variables */
   if lastobs then do ;
      call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
      call symput( 'N_NUM', left( put( n_num, 5. ))) ;
   end ;
run ;

/* if there are NO candidate numeric or character vars, stop further processing */
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
   %put /----------------------------------\ ;
   %put ERROR from SQ_DROPMISS: ;
   %put No variables in dataset to drop. ;
   %put Execution terminating forthwith. ;
   %put \----------------------------------/ ;
   %goto L9999 ;
%end ;

/* put global macro names into global symbol table for later retrieval */
%do I = 1 %to &N_NUM ;
   %global NUM&I ;
%end ;
%do I = 1 %to &N_CHAR ;
   %global CHAR&I ;
%end ;

/* create macro vars containing variable names                               */
/* efficiency note: could compute n_char, n_num here, but must declare macro */
/* names to be global b4 stuffing them                                       */
/* note: if no char vars in data, do not create macro vars                   */
/* the LENGTH conditions match the WHERE clause used for the counts above,   */
/* so that the names selected are the candidate variables that were counted  */
proc sql noprint ;
   %if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from _cntnts_ where type = 2 and length = 1 ; ) ;
   %if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM from _cntnts_ where type = 1 and length = 3 ; ) ;
quit ;

/* put MAXIMUM values of the variables into macro variables */
%if &N_CHAR > 1 %then %let N_CHAR_1 = %eval( &N_CHAR - 1 ) ;
%if &N_NUM > 1 %then %let N_NUM_1 = %eval( &N_NUM - 1 ) ;

proc sql noprint ;
   select
      %if &N_NUM > 1 %then %do ;
         %do I = 1 %to &N_NUM_1 ;
            max( &&NUM&I ),
         %end ;
      %end ;
      %if &N_NUM > 0 %then %do ;
         max( &&NUM&N_NUM )
      %end ;
      %if &N_CHAR > 0 and &N_NUM > 0 %then %do ;
         ,
      %end ;
      %if &N_CHAR > 1 %then %do ;
         %do I = 1 %to &N_CHAR_1 ;
            max( &&CHAR&I ),
         %end ;
      %end ;
      %if &N_CHAR > 0 %then %do ;
         max( &&CHAR&N_CHAR )
      %end ;
   into
      %if &N_NUM > 1 %then %do ;
         %do I = 1 %to &N_NUM_1 ;
            :NUMMAX&I,
         %end ;
      %end ;
      %if &N_NUM > 0 %then %do ;
         :NUMMAX&N_NUM
      %end ;
      %if &N_CHAR > 0 and &N_NUM > 0 %then %do ;
         ,
      %end ;
      %if &N_CHAR > 1 %then %do ;
         %do I = 1 %to &N_CHAR_1 ;
            :CHARMAX&I,
         %end ;
      %end ;
      %if &N_CHAR > 0 %then %do ;
         :CHARMAX&N_CHAR
      %end ;
   from &DSNIN ;
quit ;

/* initialize DROP_NUM, DROP_CHAR macro vars */
%let DROP_NUM = ;
%let DROP_CHAR = ;

%if &N_CHAR > 0 %then %do ;
   %do I = 1 %to &N_CHAR ;
      %if &&CHARMAX&I = %then %do ;
         %let DROP_CHAR = &DROP_CHAR %qtrim( &&CHAR&I ) ;
      %end ;
   %end ;
%end ;
%if &N_NUM > 0 %then %do ;
   %do I = 1 %to &N_NUM ;
      %if &&NUMMAX&I = . %then %do ;
         %let DROP_NUM = &DROP_NUM %qtrim( &&NUM&I ) ;
      %end ;
   %end ;
%end ;

/* apply DROP_* to the squeezed data, create output dataset */
data &DSNOUT ;
   %if &DROP_CHAR ^= %then %str( DROP &DROP_CHAR ; ) ; /* drop char variables that have only missing values */
   %if &DROP_NUM ^= %then %str( DROP &DROP_NUM ; ) ;   /* drop num variables that have only missing values  */
   set DSNSQUEEZED ;
run ;

%L9999:
%mend SQ_DROPMISS ;