Coders' Corner. Paper 81-26



Similar documents
Data Presentation. Paper Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs

How to Reduce the Disk Space Required by a SAS Data Set

Search and Replace in SAS Data Sets thru GUI

ing Automated Notification of Errors in a Batch SAS Program Julie Kilburn, City of Hope, Duarte, CA Rebecca Ottesen, City of Hope, Duarte, CA

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

A Method for Cleaning Clinical Trial Analysis Data Sets

Creating Dynamic Reports Using Data Exchange to Excel

A Macro to Create Data Definition Documents

Post Processing Macro in Clinical Data Reporting Niraj J. Pandya

Managing Tables in Microsoft SQL Server using SAS

Paper An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois

Salary. Cumulative Frequency

Managing very large EXCEL files using the XLS engine John H. Adams, Boehringer Ingelheim Pharmaceutical, Inc., Ridgefield, CT

THE POWER OF PROC FORMAT

Using SAS With a SQL Server Database. M. Rita Thissen, Yan Chen Tang, Elizabeth Heath RTI International, RTP, NC

Preparing your data for analysis using SAS. Landon Sego 24 April 2003 Department of Statistics UW-Madison

Top Ten Reasons to Use PROC SQL

Using Macros to Automate SAS Processing Kari Richardson, SAS Institute, Cary, NC Eric Rossland, SAS Institute, Dallas, TX

Tales from the Help Desk 3: More Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

Innovative Techniques and Tools to Detect Data Quality Problems

SAS Programming Tips, Tricks, and Techniques

CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases

SAS PROGRAM EFFICIENCY FOR BEGINNERS. Bruce Gilsen, Federal Reserve Board

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS

From Database to your Desktop: How to almost completely automate reports in SAS, with the power of Proc SQL

An Introduction to SAS/SHARE, By Example

EXST SAS Lab Lab #4: Data input and dataset modifications

Programming Tricks For Reducing Storage And Work Space Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA.

Effective Use of SQL in SAS Programming

An macro: Exploring metadata EG and user credentials in Linux to automate notifications Jason Baucom, Ateb Inc.

Report Customization Using PROC REPORT Procedure Shruthi Amruthnath, EPITEC, INC., Southfield, MI

Importing Excel File using Microsoft Access in SAS Ajay Gupta, PPD Inc, Morrisville, NC

You have got SASMAIL!

Data Cleaning 101. Ronald Cody, Ed.D., Robert Wood Johnson Medical School, Piscataway, NJ. Variable Name. Valid Values. Type

SUGI 29 Coders' Corner

B) Mean Function: This function returns the arithmetic mean (average) and ignores the missing value. E.G: Var=MEAN (var1, var2, var3 varn);

Tips and Tricks for Creating Multi-Sheet Microsoft Excel Workbooks the Easy Way with SAS. Vincent DelGobbo, SAS Institute Inc.

Using Pharmacovigilance Reporting System to Generate Ad-hoc Reports

Instant Interactive SAS Log Window Analyzer

Internet/Intranet, the Web & SAS. II006 Building a Web Based EIS for Data Analysis Ed Confer, KGC Programming Solutions, Potomac Falls, VA

OS/390 SAS/MXG Computer Performance Reports in HTML Format

Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

Ten Things You Should Know About PROC FORMAT Jack Shoemaker, Accordant Health Services, Greensboro, NC

The SET Statement and Beyond: Uses and Abuses of the SET Statement. S. David Riba, JADE Tech, Inc., Clearwater, FL

Using SAS to Control and Automate a Multi SAS Program Process. Patrick Halpin November 2008

From The Little SAS Book, Fifth Edition. Full book available for purchase here.

SAS ODS HTML + PROC Report = Fantastic Output Girish K. Narayandas, OptumInsight, Eden Prairie, MN

AN ANIMATED GUIDE: SENDING SAS FILE TO EXCEL

What You re Missing About Missing Values

SUGI 29 Posters. Mazen Abdellatif, M.S., Hines VA CSPCC, Hines IL, 60141, USA

PharmaSUG Paper AD11

Integrity Constraints and Audit Trails Working Together Gary Franklin, SAS Institute Inc., Austin, TX Art Jensen, SAS Institute Inc.

Overview. NT Event Log. CHAPTER 8 Enhancements for SAS Users under Windows NT

4 Other useful features on the course web page. 5 Accessing SAS

Applications Development ABSTRACT PROGRAM DESIGN INTRODUCTION SAS FEATURES USED

AN INTRODUCTION TO THE SQL PROCEDURE Chris Yindra, C. Y. Associates

The entire SAS code for the %CHK_MISSING macro is in the Appendix. The full macro specification is listed as follows: %chk_missing(indsn=, outdsn= );

Automating SAS Macros: Run SAS Code when the Data is Available and a Target Date Reached.

ABSTRACT INTRODUCTION

Oracle Database: SQL and PL/SQL Fundamentals

CDW DATA QUALITY INITIATIVE

FIRST STEP - ADDITIONS TO THE CONFIGURATIONS FILE (CONFIG FILE) SECOND STEP Creating the to send

We begin by defining a few user-supplied parameters, to make the code transferable between various projects.

Oracle Database: SQL and PL/SQL Fundamentals

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets

Everything you wanted to know about MERGE but were afraid to ask

SAS Comments How Straightforward Are They? Jacksen Lou, Merck & Co.,, Blue Bell, PA 19422

How To Create An Audit Trail In Sas

SAS Credit Scoring for Banking 4.3

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC

AN INTRODUCTION TO MACRO VARIABLES AND MACRO PROGRAMS Mike S. Zdeb, New York State Department of Health

ABSTRACT THE ISSUE AT HAND THE RECIPE FOR BUILDING THE SYSTEM THE TEAM REQUIREMENTS. Paper DM

Ad Hoc Advanced Table of Contents

Efficient Techniques and Tips in Handling Large Datasets Shilong Kuang, Kelley Blue Book Inc., Irvine, CA

Taming the PROC TRANSPOSE

Oracle SQL. Course Summary. Duration. Objectives

Data-driven Validation Rules: Custom Data Validation Without Custom Programming Don Hopkins, Ursa Logic Corporation, Durham, NC

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA

Using the Magical Keyword "INTO:" in PROC SQL

Oracle Database: SQL and PL/SQL Fundamentals NEW

Using Edit-Distance Functions to Identify Similar Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC

Health Services Research Utilizing Electronic Health Record Data: A Grad Student How-To Paper

PS004 SAS , using Data Sets and Macros: A HUGE timesaver! Donna E. Levy, Dana-Farber Cancer Institute

Choosing the Best Method to Create an Excel Report Romain Miralles, Clinovo, Sunnyvale, CA

Handling Missing Values in the SQL Procedure

ABSTRACT INTRODUCTION %CODE MACRO DEFINITION

Do It Yourself (DIY) Data: Creating a Searchable Data Set of Available Classrooms using SAS Enterprise BI Server

IBM SPSS Direct Marketing 23

Transcription:

Paper 81-26 Automating the Process of Listing the Most Frequent Values of Thousands of Variables in Large Datasets Haiping Luo, Dept. of Veterans Affairs, Washington, DC Philip Friend, Dept. of Agriculture, Washington, DC ABSTRACT Listing variable values with high frequency for large datasets which have thousands of variables can be a time-consuming process. Using common procedures like 'proc freq' and 'proc tabulate' involves individually identifying possible value range, data discretion and/or class group before running the procedure. This macro automates the process of listing the most frequent values of all or selected variables in a dataset. The value list can be tailored to reflect only those values with desirable frequencies and percentages, in the selected value ranges, and by certain groups. The resulting value list can serve as a useful reference to dataset users. A CASE CALLING FOR THE NEED In a process of recoding datasets from the Agriculture Resources Management Survey, we needed to know the range and the most frequent values of more than one thousand variables. There are some SAS procedures that can accomplish this task, but all of them require efforts to look into each target variable's features before running the procedure. 'Proc freq' can list the frequency count and percentage for each value of the target variable. But if the values are not very discrete, the 'freq' output table can be huge. Also this procedure can only output one variable s frequency count to a file. 'Proc tabulate' can list the value and its frequency count across variables. But it is suitable only to discrete variables with only a few values. Proc univariate, proc summary both work only for numeric variables. While these procedures list values, frequencies, and percentages variable by variable, extra coding efforts are needed to pull the output of each variable together into an integrated list for easy reference for a large dataset. When the datasets in hand are large with thousands of variables, the coding efforts can be tedious. This case triggered the development of macro, aimed at automating the process of listing target variables' most frequent values for datasets of any size. DESIGN The core process of has three steps: 1. Determine target variables and user-defined conditions; 2. Process the numeric variables - list their names, most frequent values, frequency counts, percentages, means, maximums, and minimums; 3. Process the character variables - list their names, most frequent values, frequency counts, and percentages. One twist to this seemingly simple process is to allow the user to run batch by setting parameters in the calling code or to run interactively by inputting parameter values in an input window. Another twist is to allow the flexibility of selecting all or part of the variables, selecting value ranges, determining by groups, and determining frequency and percentage filtering criteria. A third twist is to append each variable's output value list to a comprehensive list which includes all target variables and their value statistics. DETAIL OF THE CODE 1. COLLECT USER INPUT Macro needs these user inputs: input and output libnames and their physical paths; input and output datasets' names; and input variable filtering and sorting criteria. Two input methods are incorporated into the code: one is input by submitting parameter values in the ( ) string; the other is input by typing in parameter values in the pop up windows. The string method allows batch processing, while the pop-up-window method allows interactive processing. To allow the user to choose the run method, a macro variable screen is defined to represent the batch or interactive mode. An %if structure detects the value of &screen. When &screen is 1, a display macro will display pop-up windows to collect user s input, otherwise the parameter values in the () call will be used to run the macro: %macro getfreq(screen=1, ) *** starts do window conditioning on &screen; %if &screen=1 %then %do; ** define window; %WINDOW first color=cyan ; ** define a display macro; %MACRO INTRFACE; %DISPLAY first.fileinfo; %DISPLAY first.varinfo blank; %MEND INTRFACE; ** run the display macro %intrface; %INTRFACE; *** ends do window; After the input selection portion, the main code of the is wrapped in a nested macro called %freqmain. The contents of %freqmain are described in the following sections 2 through 4. 2. DETERMINE VARIABLE NUMBERS BY TYPE To determine how many numeric and character variables are in the user defined dataset, several data steps and procedures are used, together with %if structures to insert code conditionally: *** check variable type and collect variable numbers; data _inall; ** read in dataset with its conditions; set %IF &PATHIN NE %THEN %STR("&PATHIN.&indat"); %ELSE %STR(&LIBIN..&INDAT); (%if &varlist ne %then %do; keep=&varlist &byvar %if (&filtlgic) ne %then %do; where=(&filtlgic) ) end=lastobs; if lastobs then do; n=_n_; ** collect total observations; call symput('dataobs',n); drop n; ** sort if by group is defined; %LET byproc = %str(bygroup="&byvar"); %LET bylab = bygroup="frequency by This Group (bygroup)"; proc sort data=_inall; by &byvar;

proc format; value tp 1 = "Number" 2 = "Char"; proc format; value tpvar 1 = "No. of Numeric Variables" 2 = "No. of Character Variables"; ** keep only the type for each variable; proc contents noprint data=_inall out=_intype(keep=type); ** list the count of variables by type; proc freq data=_intype; tables type / list out=_intpsum; format type tpvar.; data _null_; set _intpsum; ** check if there are numeric and/or charactor variables; if type=1 then do; ** collect numeric var number; call symput('nvnumber',count); else if type=2 then do; ** collect charactor var number; call symput('cvnumber',count); 3. PROCESS NUMERIC VARIABLES If there are numeric variables in the defined dataset, a %do loop is used to process each numeric variable to get its value list and each value's frequency and percentage. The variable s maximum, minimum and mean values by the defined by group are also listed: %if %eval(&nvnumber)>0 %then %do; *** get numeric variables; title "Numeric Variables"; data _in2; set _inall; keep _numeric_ &byvar; ** get the name, of the numeric variabls; data _in4; set _in2(keep=_numeric_); proc contents noprint data=_in4 out=_in3(keep=name type varnum nobs); ** print a list of all numeric variables to run frequency for; title2 "Dataset &indat - variable information: number of selected numeric variables=&nvnumber"; proc print data=_in3; format type tp.; title2 "First 3 observations of the selected numeric variables in dataset &indat"; proc print data=_in4(obs=3) ; footnote; ** go through each variable to get frequency; %do i=1 %to &nvnumber; * starts do i; ** determine the variable name; data _null_; set _in3; if varnum = &i then do; call symput('nname',name); data _2n; set _in2(keep=&nname &byvar); length value 8 varname $ 32; ** assign the value of the variable to a common variable; value = &nname; ** assign the variable name as text to a common variable; varname = "&nname"; value = "Value of Frequecy Variable (value)" varname = "Frequency Variable (varname)"; The critical steps in the above code are to put the value of the processed variable into a common variable value, and the name of the processed variable into another common variable varname. These enable the frequency calculation and the output appendance to focus on these two common variables instead of individual input variables. ** get frequency for each value of the frequency variable; ** keep only those with frequency count > &dropfrq in the &byvar group; proc freq data=_2n order=freq noprint; tables varname*value / list out=_2out(where=(count gt &dropfrq)); by &byvar; ** get group mean, max, min, total obs number for the frequency variable; proc means data=_2n (keep=&nname &byvar) noprint; var &nname; by &byvar; output out=_2mean mean=mean max=max min=min n=totalobs nmiss=missobs; data _2mean; set _2mean(drop=_type freq_); length varname $ 32; varname = "&nname"; mean = "Group Mean of This Variable (mean)" max = "Group Maximum (max)" min = "Group Minimum (min)" totalobs = "Non-missing Obs in This By Group (totalobs)" missobs = "Missing Obs in This By Group (missobs)" varname = "Frequency Variable (varname)"; ** merge group means, etc. back to freq list; data _2out; merge _2out(in=a) _2mean; by varname &byvar ; if a; 2

&byproc; &bylab; ** create or append frequency list; %if &i=1 %then %do; data OUD.&outdat.n; set _2out; %else %do; proc datasets ; append base=oud.&outdat.n data=_2out (where=(percent gt &droppct or percent =.)); delete _2n _2out; quit; *end do i; ** print frequency list; title1 "&nvnumber Numeric variables (Dataset observations = &dataobs)"; title2 "Highest (>&droppct%) frequent (count > &dropfrq) values (by &byvar)"; title3 "of numeric variables in dataset &&indat (this output was saved as &pathout.&outdat.n)"; proc sort data=oud.&outdat.n; by varname %if &byvar ne %then %str(&byvar;); %else %str(;); proc print data=oud.&outdat.n ; by varname; var bygroup &byvar value count percent mean max min totalobs missobs; sum count; sumby varname; footnote; title; *** ends numeric variable process; 4. PROCESS CHARACTER VARIABLES The procedure for getting character variables frequent value list is similar to that for the numeric variables, except that there is no calculation of group means, maximums, and minimums. The following code presents in both the numeric and the character portions of the program, but it is more critical to the character portion. By default, SAS numeric variables have a length of eight bytes, while a character variable s length is either defined explicitly by the user or implicitly by the length of the first value ever read in. To prevent the common output variable value being truncated to the length of the first character variable in the input sequence, the following code obtains the maximal length of all character variables; the maximal length is then used to set the length of the common output variable value. ** get the name,, length, and number of observations of the character variables; data _ic4; set _ic2(keep=_character_); proc contents noprint data=_ic4 out=_ic3(keep=name type varnum nobs length); ** get the longest charactor value to set the length of the value variable; proc means data=_ic3(keep=length) noprint; output out=_ic32 max=max; data _null_; set _ic32; call symput('maxlen',max);... ** go through each variable to get frequency; %do i=1 %to &cvnumber; * starts do i; ** determine the variable name; data _null_; set _ic3; if varnum = &i then do; call symput('cname',name); data _2c; set _ic2(keep=&cname &byvar); ** set the value to the maximal length; length value $ &maxlen varname $ 32; ** assign the value of this variable; value = &cname; ** assign the variable name as text to varname; varname = "&cname"; value = "Value of Frequecy Variable (value)" varname = "Frequency Variable (varname)" ;... *end do i; After processing the character variables and performing some housekeeping chores, the definition of the %freqmain sub macro ends. The parent macro then calls %freqmain, and itself is closed by the %mend getfreq; statement. USAGE %Getfreq has been tested under SAS for Windows versions 6.12, 7, 8, and for IBM MVS/TSO version 6.09. The following code can call on the Windows platform: /* put this option at the beginning of your code*/ options sasautos="path-to-this-macrodirectory"; /*assume popup windows, works for V8*/ /*assume popup windows, works for V6.12, V7*/ () /*no windows, take all default values defined in the macro code*/ (screen=0) /*no windows, run the test dataset work.a */ (screen=0, pathin=, libin=work, indat=a, pathout=d:\temp\, outdat=getfreq, varlist=%str(x y datet ), dropfrq=1, droppct=0, filtlgic=%str(x>0 or y ne. ), byvar=%str(studname) ) 3

There is a detail explanation of the parameters in the code file. To control the frequency output, the most important parameters are &droppct and &dropfrq. Setting dropfrq=1 will drop all the values which only occur once. This will prevent the continuous value being included in the output file. For a dataset with millions of observations, only drop frequency count equals to one may not be enough. The &droppct parameter allows the values with low frequency percentage being dropped from the output. For example, droppct=5 will keep only those values which s frequencies are higher than 5 percent of the total observations in the calculation group. %Getfreq generates one dataset for numeric variables and one for character variables. The dataset(s) contains variables name, frequent values, the frequency count and the percentage of each value, and some group statistics if it is a numeric variable. The datasets are saved in the &pathout directory as getfreqn.sas7bdat for numeric variables and getfreqc.sas7bdat for character variables. The file extension assumes that the code was run under SAS version 7 or 8. %Getfreq also generates the following lists at the output window for numeric and/or character variables: Number of variables; Variable names and s; The first 3 observations; The most frequent values for each variable. The following is a sample output generated to the output window: A 3 Numeric variables (Dataset observations = 24) Highest (>0%) frequent (count > 1) values (by studname) of numeric variables in dataset a (this output was saved as d:\temp\getfreqn.*) Frequency Variable (varname)=x B Value of Frequency by Frequecy Percent of This Group Student name Variable Frequency Total (bygroup) (studname) (value) Count Frequency studname Alex Dickins 0 2 66.667 studname John Graves 1 2 66.667 studname Larry Lin 0 2 66.667 studname Linda Brooks 0 2 66.667 studname Lynne Miller 3 2 66.667 studname Mitch Tramp 1 3 100.000 studname Susan Williams 1 2 100.000 --------- VARNAME 15 (print out continues) Group Mean Non-missing Missing Obs of This Group Group Obs in This in This By Variable Maximum Minimum By Group Group (mean) (max) (min) (totalobs) (missobs) 1.00000 3 0 3 0 0.66667 1 0 3 0 0.33333 1 0 3 0 1.00000 3 0 3 0 2.33333 3 1 3 0 1.00000 1 1 3 0 1.00000 1 1 2 0 VARNAME C Input and output PATHS: or work; d:\temp\ Input dataset: a(where=( ) keep=( )) Output datasets: numeric freq - d:\temp\getfreqn, char. freq - d:\temp\getfreqc, sorted by studname, where=((count > 1) and (percent > 0)) Part A in the above list is a title. The title indicates that: the dataset name is a; there are 24 observations in it; there are 3 numeric variables in dataset a; this portion of the list is for variable x (frequency variable=x); this list displays the value of x which have frequency count more than 1 (count>1); the frequency is run by a variable studname ; the output file is saved as d:\temp\getfreqn.*. Part B is the contents of the list. There are 10 columns in the table. We can read the table this way: under the group variable(studname) s first value, Alex Dickins, the frequency variable x has a most frequent value, 0, which occurred twice, and accounted for 66.67 percent of all values of x under this student name Alex Dickins ; the mean value of x under Alex Dickins is 1, while the maximum is 3 and the minimum is 0. The last two columns show the total number of non-missing and missing observations for the group calculation. There is a summation by varname at the end of the table under column 4 Frequency Count. The value 15 indicates that 15 out of 24 total observations from the dataset are included in this most frequent value list for this variable x. Part C is a footnote for the list. It reports the input and output file paths, libnames, dataset names, variable filtering criteria, etc. The mainframe version of %Getfreq does not contain the input window and the output library libname OUD... statement. Assuming that a JCL statement defines the output files like this: //OUD DD DSN=USER1.SAS.DATA.GETFREQ,... The calling statement for the test dataset a is: (pathin=, libin=work, indat=a, pathout=oud, outdat=getfreq, varlist=, filtlgic=, dropfrq=1,droppct=0, byvar=studname ); The mainframe code also limited the line length to 80 characters, moved / from the first column to avoid confusion with JCL, and removed quotation marks from comment area to prevent error in reading. The output of the mainframe version of are the same as the Windows version. CONCLUSION Obtaining most frequent values of many variables in large datasets can be a tedious task. This macro automates the process and saves the frequent values to files. This macro allows a user to tailor the results to include all or part of variables, to report only the values with desirable frequencies and percentages, and to run and group results by certain variables. The resulting frequent value list can serve as a useful reference for dataset users. To obtain a complete copy of the most current version of, please go to these links: http://chinafoundation1.org/getfreq.sas http://chinafoundation1.org/getfreq-mainframe.sas Further development of will focus on improving efficiency, allowing more options, testing for other platforms, and making the execution more robust. 4

CONTACT INFORMATION Your comments and questions are valued and encouraged. Please contact the authors at: Author Names Haiping Luo (hpluo@yahoo.com) Philip Friend (pfriend@ers.usda.gov) Code Web URL http://chinafoundation1.org/getfreq.sas SAS Discussion Board http://clubs.yahoo.com/clubs/sas http://go.to/sas-net SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other Countries. indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies. 5