Dropping variables from a large SAS data set when all their values are missing

Gwen Babcock, New York State Department of Health, Albany, NY


ABSTRACT

Several methods have been proposed to drop variables whose values are all missing from a SAS data set. These methods typically use PROC SQL, macro variables, and/or arrays, all of which may be slow or fail when applied to large data sets. One method uses hash tables and CALL EXECUTE, but it works only on character variables. Therefore, I have developed an alternative method which can be applied to data sets of any size. The strategy is to use PROC MEANS and PROC FREQ to identify variables whose values are all missing. These variables are then dropped using DATA steps generated by CALL EXECUTE. This method, which uses intermediate-level coding techniques, complements previously proposed methods because it works well on large data sets and on both character and numeric variables. An example is given using US Census American Community Survey data with SAS 9.3 under Windows XP.

INTRODUCTION

With relatively small data sets, you can often run PROC FREQ or another procedure that provides descriptive statistics, visually inspect the results, and use the DROP= data set option to remove variables whose values are all missing (Zdeb, 2). However, for large data sets, this method is tedious and time consuming. Therefore, two macros have been proposed to remove variables from a SAS data set when all their values are missing. The first, the %DROPMISS macro (Sridharma, 2006 and 2010), fails with large data sets. The second, the %drop_blank_vars macro proposed by Tanimoto (2), works only on character variables. Both use advanced coding techniques: Sridharma uses macros, and Tanimoto uses macros and hash tables. The method proposed here uses intermediate coding techniques, is fast, and works on the largest data sets. In addition, with minor modifications, this method can be used to remove variables whose values are all a coded missing value, such as 99999999.

PREPARATION

To prepare to drop all variables with only missing values, I first specified the libraries and data sets to use. I wished to preserve the label and compression of the original data set, so I obtained that information using PROC CONTENTS and placed it in macro variables using CALL SYMPUTX in a DATA step. (For a discussion of alternative ways to access table metadata, see Carpenter (2013).) I also made a copy of the original data set to use for manipulation, so that the original is preserved in case of need.

    /*specify any libraries used*/
    libname sas 'E:\census\ACS\27_2\SAS\sasblock';

    /*specify the input and output data sets*/
    %let dsin=sas.acs2_5yr_blkgrp;
    %let dsout=sas.acs2_5yr_blkgrp2;

    /*specify an alternative missing value for numeric variables. this value should be a number*/
    %let altnumericmissvalue=;

    /*get the label and compression of the input data set*/
    proc contents data=&dsin noprint out=mycontents (keep=memlabel compress);
    run;

    data _null_;
       set mycontents(obs=1);
       call symputx("mylabel",memlabel);
       call symputx("mycompress",compress);
    run;

    /*make a working copy so that the original data set is preserved*/
    data &dsout (compress=&mycompress.);
       set &dsin;
    run;

FINDING NUMERIC VARIABLES WITH ONLY MISSING VALUES

Once you have finished the preparation, you can find the number of non-missing values for all numeric variables by using the N statistic in PROC MEANS.
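To see what the N statistic returns before applying it to a large data set, the following minimal sketch runs the same kind of PROC MEANS call on a small made-up data set. The data set names toy and toycount and the variables x and y are hypothetical and are not part of the original program.

```sas
/* hypothetical toy data: x has two non-missing values, y is missing in every observation */
data toy;
   input x y;
   datalines;
1 .
2 .
. .
;
run;

/* N counts the non-missing values of each numeric variable */
proc means data=toy noprint;
   output out=toycount(drop=_:) n=;
run;

/* toycount has one observation, with x=2 and y=0 */
proc print data=toycount noobs;
run;
```

A count of zero for y marks it as a variable whose values are all missing, which is the condition used to select variables after the output is transposed.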

    proc means data=&dsout noprint;
       output out=checknumeric(drop=_:) n=; /*get the number of NON-MISSING for all numeric variables*/
    run;
    /*one observation, many variables*/

The data set output from PROC MEANS has only one observation. It has one variable for each numeric variable in the input data set. I transpose the data and select only those variables which have no non-missing values.

    proc transpose data=checknumeric out=checknumeric2 (drop=_label_ rename=(col1=numhere _name_=vname));
    run;

    data dropnumeric;
       set checknumeric2 (where=(numhere=0));
    run;

    Obs vname numhere
    1 BAe 0
    2 BAe2 0
    3 BAe3 0
    4 BAe4 0
    5 BAe5 0
    6 BAe6 0
    7 BAe7 0
    8 BAe8 0
    9 BAe9 0
    10 BAe 0

PROC MEANS is very fast. However, for datasets containing hundreds of variables and millions of observations, finding variables with all missing values can be done more quickly by first testing a subset of the data. For example, the OBS= data set option can be used to restrict the test to the first observations in the data set. Only the variables whose values are all missing in the subset would then be tested in the entire dataset. Appendix A shows the code for this approach.

FINDING NUMERIC VARIABLES WITH ONLY MISSING CODED VALUES

If the missing values are coded, a slight variation on the theme is needed. The assumption behind this code is that if the maximum value and minimum value are both equal to the coded missing value, then all of the values for the variable are equal to the coded missing value. First, I use PROC MEANS to get the minimum and maximum values of all the numeric variables, transpose them as before, and then merge them. Then I select those whose minimum and maximum values are both equal to the coded missing value.

    proc means data=&dsout noprint;
       output out=biggest(drop=_:) max=;
    run;

    proc means data=&dsout noprint;
       output out=smallest(drop=_:) min=;
    run;

    proc transpose data=biggest out=biggest2 (drop=_label_ rename=(col1=greatest _name_=vname));
    run;

    proc transpose data=smallest out=smallest2 (drop=_label_ rename=(col1=least _name_=vname));
    run;

    proc sort data=biggest2; by vname; run;
    proc sort data=smallest2; by vname; run;

    data dropnumericalt;
       merge biggest2 smallest2;
       by vname;
       if least=&altnumericmissvalue. and greatest=&altnumericmissvalue. then output;
    run;

    Obs vname greatest least
    1 Bm
    2 B2m
    3 B99m
    4 B99m2
    5 B99m3
    6 B992m
    7 B992m2
    8 B992m3
    9 B992m
    10 B992m2

FINDING CHARACTER VARIABLES WITH ONLY MISSING VALUES

Since PROC MEANS can only be used with numeric variables, I use PROC FREQ to find character variables whose values are all missing. The NLEVELS option produces a table that lists, for each variable, the number of distinct missing and non-missing values. If the number of distinct non-missing values is zero, then the variable should be dropped. I used the ODS OUTPUT statement to get the list of variables in a data set. I used _CHAR_ as a shortcut to list all of the character variables.

    ods output NLevels=checkchar;
    ods html close; /*suppress default output, because I only want a data set*/
    proc freq data=&dsout nlevels;
       tables _char_ /noprint;
    run;
    ods html;
    ods output close;

The resulting data set looks like this:

    Obs TableVar TableVarLabel NLevels NMissLevels NNonMissLevels
    1 FILEID File Identification
    2 STUSAB State Postal Abbreviation
    3 SUMLEVEL Summary Level 5463 5463
    4 COMPONENT geographic Component
    5 LOGRECNO Logical Record Number
    6 US US
    7 REGION Region
    8 DIVISION Division
    9 STATECE State (Census Code)
    10 STATE State (FIPS Code)

From this data set, I select only the variables to be dropped:

    data dropchar (rename=(tablevar=vname));
       set checkchar (where=(nnonmisslevels=0));
    run;

I could have used PROC FREQ with TABLES _ALL_ to count the missing values for all variables, including numeric ones, but it is slower than PROC MEANS. If your data set is small and you are not concerned about coded missing values, or you have few numeric variables, then using PROC FREQ alone may be more convenient. However, if your dataset has millions of observations, it would be faster to first use both PROC MEANS (for numeric variables) and PROC FREQ (for character variables) on a subset of the observations. Only those variables whose values are all missing in that subset would be used in a PROC MEANS/PROC FREQ of the entire dataset. That would require more complicated code (see Appendix A). With any of these approaches, I could also choose to drop character variables with only one distinct value; the information they contain may be more appropriately placed in the metadata rather than the data set itself.

COMBINING AND PARSING THE LIST OF VARIABLES TO DROP

A simple SET statement in a DATA step is used to combine the three lists of variables to drop. The new list contains one observation per variable to drop. However, for maximum speed, I want to drop as many variables as possible in one DATA step. Therefore, the list of variables must be converted to the minimum number of observations that can hold the list of variables.
Since the maximum length of a character variable is 32,767, and the maximum length of a variable name is 32, a character variable should be able to hold a list of at least 992 SAS variable names: each name needs at most 32 characters plus one separating blank, and 32,767/33 is about 992.9. It may be able to hold more if the variable names are shorter.

First, I combine the three lists of variables. The result is one data set containing one observation for each variable whose values are all missing or all the specified coded value.

    data allvarstodrop;
       format vname $32.; /*maximum length of SAS variable name. Use this statement to make sure the names are not truncated*/
       set dropchar(keep=vname) dropnumeric (drop=numhere) dropnumericalt (drop=greatest least);
       varsize=length(vname)+1; /*length of the name plus one separating blank*/
    run;

Next, I joined the variable names into lists using the CATX function. When the list reaches the maximum size of a SAS character variable, or the last variable name is read, an observation is output.
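The capacity estimate of 992 names can be checked with a short DATA step. This is a minimal sketch rather than part of the original program: the 32-character names are generated artificially with the Z31. format, and the variable names maxnames, name, and used are hypothetical.

```sas
data _null_;
   length varlist $32767 name $32;
   /* each name needs up to 32 characters plus one separating blank */
   maxnames=floor(32767/(32+1));
   do i=1 to maxnames;
      name=cats('v',put(i,z31.)); /* artificial 32-character name */
      varlist=catx(' ',varlist,name);
   end;
   used=length(varlist); /* 32 characters per name plus the blanks between names */
   putlog maxnames= used=;
run;
```

The log should show maxnames=992 and used=32735, so 992 names of maximum length fit within the 32,767-character limit.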

    data varstodroplist (keep=varlist varsizesum);
       format varlist $32767.; /*this is the maximum size of a character variable*/
       retain varsizesum varlist;
       set allvarstodrop end=eof;
       varsizesum=sum(of varsize varsizesum);
       if varsizesum>32767 then do;
          output; /*the list is full, so write it out*/
          putlog vname varsizesum;
          varsizesum=varsize;
          varlist=vname; /*reset*/
       end;
       else varlist=catx(' ',varlist,vname);
       if eof=1 then output;
    run;

DROPPING THE VARIABLES

CALL EXECUTE (Whitlock, 1997) is used to generate the code that drops the variables with missing values. For each observation in the varstodroplist data set, a DATA step is generated which drops a set of variables. The label and compression of the original data set are used for the new data set.

    data _null_;
       set varstodroplist;
       call execute("data &dsout(label='&mylabel' compress=&mycompress.);");
       call execute("set &dsout(drop=");
       call execute(varlist);
       call execute(");");
       call execute("run;");
    run;

Here is an example of the code generated by one iteration of this DATA step, with the list of variables to drop shortened for clarity. The full code as written into the SAS log is shown in Appendix B.

    data sas.acs2_5yr_blkgrp2(label='27-2 US Census American Survey estimates and margins of error for NYS block groups, processed by gdb2' compress=binary);
    set sas.acs2_5yr_blkgrp2(drop=
    B2529m3 B2529m32 B2529m33 B2529m34 B2529m35 B2529m36
    );
    run;

CONCLUSIONS

This method for dropping variables whose values are all missing is fast and works on data sets with large numbers of variables. It complements other methods. Since it does not use macros or hash tables, the code is less complex, and easier to use and debug. It can also be used to remove numeric variables whose values are all a code for missing, such as 999999.

REFERENCES

Carpenter, Arthur L. (2013) How Do I? There is more than one way to solve that problem; Why continuing to learn is so important. Proceedings of the SAS Global Forum 2013. 029-2013 p.2.

Sridharma, S. (2010) Dropping Automatically Variables with Only Missing Values. Proceedings of the SAS Global Forum 2010. 048-2010.

Sridharma, S. (2006) How to Reduce the Disk Space Required by a SAS Data Set. NESUG 2006 Proceedings.

SAS Institute, Inc. Sample 24622: Drop variables from a SAS data set when all its values are missing. http://support.sas.com/kb/24/622.html Accessed December 3, 2.

Tanimoto, P. (2) Meta-Programming in SAS with DATA Step. MWSUG Proceedings. 224.

Whitlock, HI (1997) CALL EXECUTE: How and Why. Proceedings of the Twenty-Second Annual SAS Users Group International Conference.

Xu, W. (2007) Identify and Remove Any Variables, Character or Numeric, That Have Only Missing Values. NESUG 2007 Proceedings.

Zdeb (2) An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step. NESUG 2 Proceedings.

ACKNOWLEDGMENTS

Thanks to Steve Forand, Thomas Talbot, and the New York State Department of Health, Center for Environmental Health for supporting this work. Thanks to Mark H. Keintz for helping make the code faster.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Gwen D LaSelva
New York State Department of Health
Bureau of Environmental and Occupational Epidemiology
Empire State Plaza-Corning Tower, Room 23
Albany, NY 2237
Work Phone: 58-42-7785
Fax: 58-42-7959
Email: gdb2@health.state.ny.us

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

APPENDIX A: FASTER VERSION OF CODE USING PRE-TESTING

This code is more complex, but faster than that discussed previously, especially for datasets containing hundreds of variables and millions of observations. The general approach is to pre-test a subset of observations using PROC MEANS or PROC FREQ to determine which variables contain only missing values in the subset.
These variables are then tested in the entire dataset, and dropped if they contain only missing values.

    /**************************************************************
    *the purpose of this program is to drop variables with
    *missing values from a SAS dataset.
    *This code is designed to have a similar function to
    *The DROPMISS macro of Selvaratnam Sridharma
    *see "Dropping Automatically Variables with Only Missing Values"
    * SAS Global Forum 2010
    *But the DROPMISS macro fails sometimes, so this is an alternative
    *Note: this program is more complex than the 'drop missing variables' program
    * but will be faster, especially on datasets with large numbers of observations
    * it was tested on NYS outpatient data with about 653 variables and over 9 million observations
    *programmed by Gwen LaSelva
    *inputs: library name
    *        name of input SAS dataset
    *        name of output SAS dataset
    *        alternative missing value for numeric variables, for example 999999
    *        sample size for preliminary determination of which variables to drop
    *outputs: SAS dataset
    *******************************************************************/

    /*specify any libraries used*/
    libname sas 'E:\census\ACS\27_2\SASblockout';
    libname op "P:\Sections\EPHTdata\SPARCSdeidentified\Outpatient";

    /*specify the input and output datasets*/
    %let dsin=op.outpatient_s_2_prime;
    %let dsout=sas.want;

    /*specify an alternative missing value for numeric variables. this value should be a number*/
    %let altnumericmissvalue=;

    /*specify the number of observations to sample. The size should be a fraction of the total
      number of observations. It should be your best guess of the minimum number of observations
      needed to find non-missing values in all variables whose values are not all missing.
      Changing this will affect the speed of the program*/
    %let samplesize=;

    /*put the time in the log*/
    data _null_;
       temptime=put(time(),hhmm5.);
       putlog "start time=" temptime;
    run;

    /*shouldn't need to modify below this line*/

    /*STEP 1: PREPARATION*/
    /*get the label and compression of the input dataset and put them into macro variables,
      use them at the end so the new dataset has the same label and compression as the original*/
    proc contents data=&dsin noprint out=mycontents (keep=memlabel compress);
    run;

    data _null_;
       set mycontents(obs=1);
       call symputx("mylabel",memlabel);
       call symputx("mycompress",compress);
    run;

    /*first, make output dataset to work with, so we don't accidentally mess up the original
      dataset if something goes wrong*/
    data &dsout (compress=&mycompress.);
       set &dsin;
    run;

    /*STEP 2: FIND ALL NUMERIC VARIABLES WITH ALL MISSING VALUES*/
    /*proc means with the N statistic should count number of non-missing values*/
    /*start with a subset of observations*/
    proc means data=&dsout (obs=&samplesize.) noprint;
       output out=sampledropnumeric(drop=_:) n=; /*get the number of NON-MISSING for all numeric variables*/
    run;

    proc transpose data=sampledropnumeric out=sampledropnumeric2 (rename=(col1=numhere _name_=vname));
    run;

    data maybedropnumeric (keep=vname varsize);
       set sampledropnumeric2 (where=(numhere=0));
       varsize=length(vname)+1;
    run;

    /*test the variables that are all missing in the first subset of observations in the entire dataset*/
    data numvarstotestlist (keep=varlist varsizesum);
       format varlist $32767.; /*this is the maximum size of a character variable*/
       retain varsizesum varlist;
       set maybedropnumeric end=eof;
       varsizesum=sum(of varsize varsizesum);
       if varsizesum>32767 then do;
          output;
          putlog vname varsizesum;
          varsizesum=varsize;
          varlist=vname; /*reset*/
       end;
       else varlist=catx(' ',varlist,vname);
       if eof=1 then output;
    run;

    /*checknumeric0 holds the counts for the current list of variables;
      checknumeric1 accumulates the counts from all lists*/
    data _null_;
       set numvarstotestlist;
       call execute("proc means data=&dsout noprint;");
       call execute("output out=checknumeric0(drop=_:) n=;");
       call execute("var");
       call execute(varlist);
       call execute("; run;");
       if _n_=1 then call execute("data checknumeric1; set checknumeric0; run;");
       else call execute("data checknumeric1; merge checknumeric1 checknumeric0; run;");
    run;

    proc transpose data=checknumeric1 out=checknumeric2 (drop=_label_ rename=(col1=numhere _name_=vname));
    run;

    data dropnumeric;
       set checknumeric2 (where=(numhere=0));
    run;

    /*Use PROC FREQ to get character variables whose values are all missing;
      it is slower than PROC MEANS*/
    /*start with a subset of observations*/
    ods output NLevels=maybedropchar;
    ods html close; /*suppress default output, because we only want a dataset*/
    proc freq data=&dsout (obs=&samplesize.) nlevels;
       tables _char_ /noprint;
    run;
    ods html;
    ods output close;

    data maybedropchar2 (rename=(tablevar=vname));
       set maybedropchar (where=(nnonmisslevels=0));
       varsize=length(tablevar)+1;
    run;

    /*test the variables that are all missing in the first subset of observations in the entire dataset*/
    data charvarstotestlist (keep=varlist varsizesum);
       format varlist $32767.; /*this is the maximum size of a character variable*/
       retain varsizesum varlist;
       set maybedropchar2 end=eof;
       varsizesum=sum(of varsize varsizesum);
       if varsizesum>32767 then do;
          output;
          putlog vname varsizesum;
          varsizesum=varsize;
          varlist=vname; /*reset*/
       end;
       else varlist=catx(' ',varlist,vname);
       if eof=1 then output;
    run;

    data _null_;
       set charvarstotestlist;
       call execute("ods output NLevels=checkchar;ods html close;");
       call execute("proc freq data=&dsout nlevels; tables");
       call execute(varlist);
       call execute("/noprint; run; ods html; ods output close;");
       if _n_=1 then call execute("data dropchar; set checkchar (rename=(tablevar=vname) where=(nnonmisslevels=0)); run;");
       else call execute("data dropchar; set dropchar checkchar (rename=(tablevar=vname) where=(nnonmisslevels=0)); run;");
    run;

    /*You could use proc freq for all of the variables, but it would be slow*/

    /*now, find out if any of the variables contain only the alternative missing value*/
    /*let us assume that if the minimum and maximum values are both equal to the alternative
      missing value, the variable should be dropped*/
    /*start with a subset of observations*/
    proc means data=&dsout (obs=&samplesize.) noprint;
       output out=maybebiggest(drop=_:) max=;
    run;

    proc means data=&dsout (obs=&samplesize.) noprint;
       output out=maybesmallest(drop=_:) min=;
    run;

    proc transpose data=maybebiggest out=maybebiggest2 (drop=_label_ rename=(col1=greatest _name_=vname));
    run;

    proc transpose data=maybesmallest out=maybesmallest2 (drop=_label_ rename=(col1=least _name_=vname));
    run;

    proc sort data=maybebiggest2; by vname; run;
    proc sort data=maybesmallest2; by vname; run;

    data testnumericalt;
       merge maybebiggest2 maybesmallest2;
       by vname;
       varsize=length(vname)+1;
       if least=&altnumericmissvalue. and greatest=&altnumericmissvalue. then output;
    run;

    /*test the variables that are all the alternative missing value in the first subset of
      observations in the entire dataset*/
    data numvarstotestlist (keep=varlist varsizesum);
       format varlist $32767.; /*this is the maximum size of a character variable*/
       retain varsizesum varlist;
       set testnumericalt end=eof;
       varsizesum=sum(of varsize varsizesum);
       if varsizesum>32767 then do;
          output;
          putlog vname varsizesum;
          varsizesum=varsize;
          varlist=vname; /*reset*/
       end;
       else varlist=catx(' ',varlist,vname);
       if eof=1 then output;
    run;

    /*biggest0 and smallest0 hold the results for the current list of variables;
      biggest1 and smallest1 accumulate the results from all lists*/
    data _null_;
       set numvarstotestlist;
       call execute("proc means data=&dsout noprint;");
       call execute("output out=biggest0(drop=_:) max=; var");
       call execute(varlist);
       call execute("; run;");
       call execute("proc means data=&dsout noprint;");
       call execute("output out=smallest0(drop=_:) min=; var");
       call execute(varlist);
       call execute("; run;");
       if _n_=1 then do;
          call execute("data biggest1; set biggest0; run;");
          call execute("data smallest1; set smallest0; run;");
       end;
       else do;
          call execute("data smallest1; merge smallest1 smallest0; run;");
          call execute("data biggest1; merge biggest1 biggest0; run;");
       end;
    run;

    proc transpose data=biggest1 out=biggest2 (drop=_label_ rename=(col1=greatest _name_=vname));
    run;

    proc transpose data=smallest1 out=smallest2 (drop=_label_ rename=(col1=least _name_=vname));
    run;

    proc sort data=biggest2; by vname; run;
    proc sort data=smallest2; by vname; run;

    data dropnumericalt;
       merge biggest2 smallest2;
       by vname;
       if least=&altnumericmissvalue. and greatest=&altnumericmissvalue. then output;
    run;

    /*STEP 3: COMBINE LISTS OF CHARACTER AND NUMERIC VARIABLES TO DROP*/
    /*for convenience, combine the lists of character and numeric variables to drop*/
    data allvarstodrop;
       format vname $32.; /*maximum length of SAS variable name. Use this statement to make sure the names are not truncated*/
       set dropchar(keep=vname) dropnumeric (drop=numhere) dropnumericalt (drop=greatest least);
       varsize=length(vname)+1;
    run;
    *proc print data=allvarstodrop (obs=);

    /*STEP 4: DROP THE VARIABLES*/
    /*create a character variable containing the list of variables to be dropped.
      A character variable can be up to 32767 characters, and can therefore hold a list
      of at least 992 SAS variables, maybe more*/
    /*if the sum of varsize is more than 32767 need to create additional observations
      in the dataset to hold the additional variables*/
    data varstodroplist (keep=varlist varsizesum);
       format varlist $32767.; /*this is the maximum size of a character variable*/
       retain varsizesum varlist;
       set allvarstodrop end=eof;
       varsizesum=sum(of varsize varsizesum);
       if varsizesum>32767 then do;
          output;
          putlog vname varsizesum;
          varsizesum=varsize;
          varlist=vname; /*reset*/
       end;
       else varlist=catx(' ',varlist,vname);
       if eof=1 then output;
    run;

    /*use call execute to repeat the dropping for each set of variable names*/
    data _null_;
       set varstodroplist;
       call execute("data &dsout(label='&mylabel' compress=&mycompress.);");
       call execute("set &dsout(drop=");
       call execute(varlist);
       call execute(");");
       call execute("run;");
    run;

    /*put the end time in the log*/
    data _null_;
       temptime=put(time(),hhmm5.);
       putlog "end time=" temptime;
    run;

APPENDIX B: CODE GENERATED BY THE FINAL DATA STEP CALL EXECUTE STATEMENTS

This code was created by one iteration of the final DATA step, using CALL EXECUTE statements. It is this code which drops the variables that were previously determined to have only missing values. The code is shown as it appears in the SAS log.
29 + data sas.acs2_5yr_blkgrp2(label='27-2 US Census American Survey estimates and margins of error for NYS block groups, processed by gdb2' compress=binary);
22 + set sas.acs2_5yr_blkgrp2(drop=
22 + B2529m3 B2529m32 B2529m33 B2529m34 B2529m35 B2529m36 B2529m37 B2529m38 B2529m39 B2529m4 B2529m4 B2529m42 B2529m43 B2529m44 B2529m45 B2529m46 B2529m47 B2529m48 B2529m49 B2529m5 B2529m5 B2529m52 B2529m53 B2529m54 B2529m55
222 + B2529m56 B2529m57 B2529m58 B2529m59 B2529m6 B2529m6 B2529m62 B2529m63 B2529m64 B2529m65 B2529m66 B2529m67 B2529m68 B2529m69 B2529m7 B2529m7 B2529m72 B2529m73 B2529m74 B2529m75 B2529m76 B2529m77 B2529m78 B2529m79 B2529m8
223 + B2529m8 B2529m82 B2529m83 B2529m84 B2529m85 B2529m86 B2529m87 B26e B26m B98e B98e2 B982e B982e2 B98e B982e B982e2 B982e3 B983e B983e2 B983e3 B983e4 B983e5 B983e6 B983e7 B984e B982e B982e2
224 + B982e3 B982e4 B982e5 B982e6 B982e7 B982e8 B982e9 B9822e B9822e2 B9822e3 B9822e4 B9822e5 B9822e6 B9822e7 B9822e8 B9822e9 B9822e B983e B9832e B98m B98m2 B982m B982m2 B98m B982m B982m2 B982m3 B983m
225 + B983m2 B983m3 B983m4 B983m5 B983m6 B983m7 B984m B982m B982m2 B982m3 B982m4 B982m5 B982m6 B982m7 B982m8 B982m9 B9822m B9822m2 B9822m3 B9822m4 B9822m5 B9822m6 B9822m7 B9822m8 B9822m9 B9822m B983m B9832m
226 + B9952PRe B9952PRe2 B9952PRe3 B9952PRe4 B9952PRe5 B9952PRe6 B9952PRe7 B9985e B9985e2 B9985e3 B9952PRm B9952PRm2 B9952PRm3 B9952PRm4 B9952PRm5 B9952PRm6 B9952PRm7 B9985m B9985m2 B9985m3 B9986e B9986e2 B9986e3 B9987e B9987e2
227 + B9987e3 B9987e4 B9987e5 B9988e B9988e2 B9988e3 B9988e4 B9988e5 B9989e B9989e2 B9989e3 B9986m B9986m2 B9986m3 B9987m B9987m2 B9987m3 B9987m4 B9987m5 B9988m B9988m2 B9988m3 B9988m4 B9988m5 B9989m B9989m2 B9989m3 B993e
228 + B993e2 B993e3 B9932e B9932e2 B9932e3 B9922e B9922e2 B9922e3 B99244e B99244e2 B99244e3 B99245e B99245e2 B99245e3 B99246e B99246e2 B99246e3 B992523e B992523e2 B992523e3 B993m B993m2 B993m3 B9932m B9932m2 B9932m3 B9922m B9922m2
229 + B9922m3 B99244m B99244m2 B99244m3 B99245m B99245m2 B99245m3 B99246m B99246m2 B99246m3 B992523m B992523m2 B992523m3 Bm B2m B99m B99m2 B99m3 B992m B992m2 B992m3 B992m B992m2 B992m3 B993m B993m2 B993m3 B995m
23 + B995m2 B995m3 B995m4 B995m5 B995m6 B995m7 B9952m B9952m2 B9952m3 B9952m4 B9952m5 B9952m6 B9952m7 B996m B996m2 B996m3 B997m B997m2 B997m3 B9972m B9972m2 B9972m3 B9972m4 B9972m5 B9972m6 B9972m7 B998m B998m2
23 + B998m3 B998m B998m2 B998m3 B998m4 B998m5 B9982m B9982m2 B9982m3 B9982m4 B9982m5 B9983m B9983m2 B9983m3 B9983m4 B9983m5 B9984m B9984m2 B9984m3 B9984m4 B9984m5 B9992m B9992m2 B9992m3 B992m B992m2 B992m3 B993m
232 + B993m2 B993m3 B994m B994m2 B994m3 B994m4 B994m5 B994m6 B994m7 B992m B992m2 B992m3 B994m B994m2 B994m3 B9942m B9942m2 B9942m3 B995m B995m2 B995m3 B996m B996m2 B996m3 B9962m B9962m2 B9962m3 B9962m4
233 + B9962m5 B9962m6 B9962m7 B9963m B9963m2 B9963m3 B9963m4 B9963m5 B997m B997m B997m B997m2 B997m3 B997m4 B997m5 B997m2 B997m3 B997m4 B997m5 B997m6 B997m7 B997m8 B997m9 B9972m B9972m B9972m B9972m2
234 + B9972m3 B9972m4 B9972m5 B9972m2 B9972m3 B9972m4 B9972m5 B9972m6 B9972m7 B9972m8 B9972m9 B999m B999m2 B999m3 B999m4 B999m5 B999m6 B999m7 B999m8 B9992m B9992m2 B9992m3 B9992m4 B9992m5 B9992m6 B9992m7 B9992m8 B9993m

235 + B9993m2 B9993m3 B9993m4 B9993m5 B9993m6 B9993m7 B9993m8 B9994m B9994m2 B9994m3 B9994m4 B9994m5 B9994m6 B9994m7 B9994m8 B992m B992m2 B992m3 B992m4 B992m5 B992m6 B992m7 B992m8 B992m B992m2 B992m3 B9922m B9922m2
236 + B9922m3 B9923m B9923m2 B9923m3 B99232m B99232m2 B99232m3 B99233m B99233m2 B99233m3 B99233m4 B99233m5 B99234m B99234m2 B99234m3 B99234m4 B99234m5 B9924m B9924m2 B9924m3 B99242m B99242m2 B99242m3 B99243m B99243m2 B99243m3 B9925m B9925m2
237 + B9925m3 B9925m B9925m2 B9925m3 B99252m B99252m2 B99252m3 B99253m B99253m2 B99253m3 B99254m B99254m2 B99254m3 B99255m B99255m2 B99255m3 B99256m B99256m2 B99256m3 B99258m B99258m2 B99258m3 B99259m B99259m2 B99259m3
238 + B99252m B99252m2 B99252m3 B99252m B99252m2 B99252m3 B992522m B992522m2 B992522m3 B992522m4 B992522m5 B992522m6 B992522m7 B99252m B99252m2 B99252m3 B99253m B99253m2 B99253m3 B99254m B99254m2 B99254m3 B99255m B99255m2 B99255m3 B99256m
239 + B99256m2 B99256m3 B99257m B99257m2 B99257m3 B99258m B99258m2 B99258m3 B99259m B99259m2
24 + );
24 + run;

NOTE: There were 5463 observations read from the data set SAS.ACS2_5YR_BLKGRP2.
NOTE: The data set SAS.ACS2_5YR_BLKGRP2 has 5463 observations and 6462 variables.
NOTE: Compressing data set SAS.ACS2_5YR_BLKGRP2 decreased size by 73.58 percent.
      Compressed is 494 pages; un-compressed would require 5495 pages.
NOTE: DATA statement used (Total process time):
      real time 7.93 seconds
      cpu time 7.28 seconds