SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA

ABSTRACT

A codebook is a summary of a collection of data that reports significant features of the assembled data. It can be used to provide insights into the data collection. Survey data are often assembled into codebooks for rapid understanding of the questions asked. Such codebooks are typically limited to variable name, variable label, categorical variable values and their frequency counts, and simple descriptive statistics for continuous variables. A codebook provides more value to a statistician or data miner when it presents variables in a format suitable for exploratory data analysis. The %CODEBOOK macro described in this paper uses the Enterprise Miner heuristic rules for producing metadata to classify variables into nominal, ordinal, or interval measurement scales (footnote 1). It creates presentations of the variables appropriate to their measurement scale and may be used as an exploratory data analysis (EDA) tool for a first look at a set of data.

KEYWORDS

Codebook, exploratory data analysis, measurement scale, metadata, SAS Enterprise Miner

INTRODUCTION

A codebook is an abstract of a collection of data items that have been assembled for purposes of studying some topic of interest. Surveys are frequently summarized in codebook format so that interesting features of the data may be revealed for further analysis. The characteristics of the data may be restricted to a few descriptive attributes such as variable name and label, number of categories and frequency count per category for a categorical variable, or mean, standard deviation, and other descriptive statistics for a continuous variable. Many of these characteristics of the data may be obtained from a PROC CONTENTS listing.
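As a point of reference, the PROC CONTENTS listing mentioned above can be produced as in the following minimal sketch. It uses SASHELP.CLASS, a small sample table shipped with SAS, purely as a stand-in dataset; the output dataset name WORK.CLASS_META is a hypothetical name chosen for illustration and is not part of the %CODEBOOK macro.

```sas
/* Print variable-level metadata (name, type, length, format, label)
   in creation order for a sample table shipped with SAS. */
proc contents data=sashelp.class varnum;
run;

/* Capture the same attributes in a dataset for further processing.
   In the OUT= dataset, TYPE is 1 for numeric and 2 for character. */
proc contents data=sashelp.class
              out=work.class_meta(keep=name type length format label)
              noprint;
run;

proc print data=work.class_meta noobs;
run;
```

The listing gives names, types, lengths, formats, and labels, but no distributional information, which is the gap a codebook is meant to fill.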
Statisticians, data miners, and others who must rapidly assimilate the salient features of large ensembles of data require more descriptive measures than an enumeration of the data. We would like to see the extreme values of a variable, whether categorical or continuous; the distribution of the data, represented as histograms for categorical variables or box-and-whisker plots for continuous variables; and additional information appropriate to a variable's information content.

Footnote 1. The measurement scale of a variable refers to its comparison characteristics. For example, colors of the rainbow are labels that do not have an intrinsic ordering property. We do not know if blue is more or less than red, so colors are declared to be nominal. They can be enumerated but not ordered. If we attach a frequency to a color, then we can say that blue has a higher frequency than red, so the ordering by increasing frequency would be red followed by blue, and color ordered by frequency is said to be ordinal. But we do not yet know whether the difference between colors is meaningful, i.e., whether the computation blue - red is defined. Using frequency in terahertz (10^12 Hz), if blue has a frequency of 638 THz and red has a frequency of 428 THz, then the computation blue frequency - red frequency is defined (638 - 428 = 210 THz) and the scale is thus interval.

The process of creating a codebook can be automated by applying the Enterprise Miner rules of metadata creation to the variables in a dataset, and then producing a standardized report for each variable according to its metadata.

ENTERPRISE MINER RULES OF METADATA CREATION

The Enterprise Miner 4.3 rules of metadata creation [1], as adapted for the %CODEBOOK macro, are:

- If the measurement scale of a variable was specified at macro invocation, it is not changed.
- If the variable is character, or is numeric with one of the formats BINARY, HEX, IB, OCTAL, PD, PIB, PK, RB, SSN, Z, ZD, or S370, then the measurement level is set to nominal.
- If the number of distinct values of a numeric variable is greater than min(10, #{nonmissing observations in the sample} / 5), then the measurement level is set to interval.
- If the variable is assigned a numeric-type format (BEST, COMMA, COMMAX, DOLLAR, DOLLARX, E, FRACT, NEGPAREN, PERCENT, or a standard w.d number format, such as 5.1), then the measurement level is set to ordinal.
- For all other formats, including user-defined formats, the measurement level is set to nominal.

Applying the EM metadata rules to the data is a user-specified option, and the default is to apply them. If the EM rules are not applied, then the measurement scale for each variable must be supplied by the user.

PRESENTATION FORMAT

The presentation of each variable is determined by its metadata classification. All the variables within a given measurement scale are first summarized as a group in tabular form and are then described individually. Descriptive statistics are computed for interval-scaled variables as a group, and then each variable is analyzed further. Nominal and ordinal variables are likewise summarized first as a group and then individually. Nominal and ordinal variables share the same presentation format, since an ordinal variable may be considered to be a nominal variable with an ordering property.
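The distinct-value rule among the Enterprise Miner rules listed earlier can be sketched for a single numeric variable as follows. This is only an illustration under stated assumptions, not the macro's actual code: it uses the Home Equity dataset and library path that the paper gives for SAS 9.2, it counts over all observations rather than a metadata sample, it assumes the variable carries a standard numeric format, and the macro variables VAR and CLASS_LEVELS are hypothetical names introduced here.

```sas
/* Path taken from the paper for SAS 9.2; adjust for your installation. */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

%let var          = DEROG;  /* numeric variable to classify (illustrative)   */
%let class_levels = 10;     /* assumed to play the role of the 10 in the rule */

/* COUNT(DISTINCT x) and COUNT(x) both ignore missing values. */
proc sql noprint;
  select count(distinct &var), count(&var)
    into :n_distinct, :n_nonmiss
    from HMEQ.DMAHMEQ;
quit;

data _null_;
  n_distinct = &n_distinct;
  n_nonmiss  = &n_nonmiss;
  threshold  = min(&class_levels, n_nonmiss / 5);
  length level $8;
  if n_distinct > threshold then level = 'interval';
  else level = 'ordinal';   /* assumes a standard numeric format */
  put "&var: " n_distinct= threshold= level=;
run;
```

The result is written to the SAS log. The format-based rules (nominal for character variables and for the listed binary-style formats) would need an additional lookup of each variable's type and format, for example from a PROC CONTENTS OUT= dataset.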
Presentation Format for Interval-Scaled Variables

Interval-scaled variables are also called real numbers or continuous variables in mathematics or machine learning, respectively. We can perform arithmetic operations on them and take comfort that the difference between two interval-scaled variables measured on the same scale is meaningful (footnote 2).

Footnote 2. For example, two temperatures must be compared on the same scale, e.g., Fahrenheit or Celsius, to be meaningful, since a Fahrenheit degree of temperature is 5/9 of a Celsius degree. See [...] for a lucid discussion of the origins of the two temperature measurement scales.

The group summary of interval-scaled variables includes the variable name and simple descriptive statistics. An example from the Home Equity Loan Scoring dataset (footnote 3) illustrates the format.

[Table: Summary of Interval Data. Columns: Variable Name, Min, Max, Mean, Std Dev, N, % Missing, Skewness, Kurtosis. Rows: clage, clno, debtinc, delinq, loan, mortdue, ninq, value, yoj.]

An analyst can use the table to see which variables might be unusable due to a high percentage of missing values, which are skewed and need to be transformed, etc. The individual variable summary contains more detail:

[Table: Interval Data, shown for the variable clage. Columns: Variable Name, Metadata, Statistic, Box Plot. Metadata: Type: Num; Format; Length: 8; Label: "Age of oldest trade line in months". Basic Measures: N, N Missing, % Missing, Range, Mean, Std Dev, Skewness, Kurtosis. Quantiles: 100% Max, 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, 0% Min. The box plot is annotated with Max, P99, P95, P90, P75, AVG, P25, P5, and Min.]

The summary statistics are repeated in a different format for convenience, and a box-and-whisker plot (footnote 4) is placed adjacent to the quantiles of the interval-scaled variable's distribution as a visual representation of the data.

Footnote 3. For SAS 9.2, it may be found at C:\Program Files\SAS\SASFoundation\9.2\dmine\sample\dmahmeq.sas7bdat.

Footnote 4. A 0 represents a value between 1.5 and 3 times the interquartile range (IQR). A * represents a value greater than 3 times the IQR.
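Statistics of the kind shown in the interval summaries can be obtained with Base SAS procedures, as in the sketch below. This is not the %CODEBOOK implementation, only an approximation of its content: PROC MEANS reports the count of missing values (NMISS) rather than a percentage, so % Missing would have to be derived as NMISS / (N + NMISS). The library path is the one the paper gives for SAS 9.2.

```sas
/* Path taken from the paper for SAS 9.2; adjust for your installation. */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

/* Group summary: one row of descriptive statistics per interval variable. */
proc means data=HMEQ.DMAHMEQ
           n nmiss min max mean std skewness kurtosis maxdec=2;
  var clage clno debtinc delinq loan mortdue ninq value yoj;
run;

/* Individual summary: moments, quantiles, and extreme observations.
   The PLOTS option requests the procedure's plots, including a box plot. */
proc univariate data=HMEQ.DMAHMEQ plots;
  var clage;
run;
```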

Presentation Format for Nominal- and Ordinal-Scaled Variables

Nominal- and ordinal-scaled variables usually impart non-quantitative information on a different axis of meaning compared to interval-scaled variables and must be considered with respect to their different information content. The group summary of nominal- and ordinal-scaled variables includes the variable name and enumerative information that will be useful to a data miner in creating metadata. The group summary for two nominal Home Equity variables is shown below. Note that there are three sections of the summary, the first two of which are shown below.

[Table: Summary of Nominal Data. Columns: Variable Name, # of Obs, # Distinct Values, % Missing. Rows: JOB, REASON.]

[Table: Highest, Lowest Frequencies of Nominal Data. Columns: Variable Name, Value, Count, Percent. Values listed for JOB: Other, ProfEx, Office, Mgr, <Missing>, Self, Sales. Values listed for REASON: DebtCon, HomeImp, <Missing>. Each list appears twice, once for the highest and once for the lowest frequencies.]

In this example, we see that JOB has 6 distinct values and that they are of character type. Since there is no obvious ordering, JOB is a nominal variable. If we had prefixed each value with an integer, e.g., 1Mgr, 2ProfEx, 3Sales, etc., SAS would have used the lexicographic ordering to sort the values accordingly. But JOB would still be nominally scaled since it is a character variable. The REASON variable has two values (three, counting <Missing>), so it could be classified as binary in the Enterprise Miner measurement scale schema. We note that 4.23% of the values are missing, so these missing values must either be imputed or observations containing them must be omitted from the study if completely populated observations are required (footnote 5). Furthermore, the 10 highest and lowest frequencies are listed as a way of indicating which values are outliers in the distribution.

Footnote 5. An exception to this statement is the situation in which a data miner builds a model using an algorithm that is tolerant of missing data, such as a decision tree.
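Frequency listings similar to the nominal summaries above can be approximated with PROC FREQ, as in this sketch. It is an illustration rather than the macro's own code, and it uses the library path the paper gives for SAS 9.2. The MISSING option treats missing values as a level of their own, so they appear in the counts and percentages, and ORDER=FREQ lists values from most to least frequent.

```sas
/* Path taken from the paper for SAS 9.2; adjust for your installation. */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

/* One-way frequency tables, sorted by descending count,
   with missing values counted as a category. */
proc freq data=HMEQ.DMAHMEQ order=freq;
  tables job reason / missing;
run;
```

The default output also includes cumulative frequency and cumulative percent columns, which correspond to the Cum. Freq and Cum. Percent columns in the individual variable summaries.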

The third section of the summary presents information about each variable and histograms of each variable's values.

Nominal Data
(Columns: Variable Name, Metadata, Value, Histogram, Freq, Percent, Cum. Freq, Cum. Percent)

JOB (Type: Num; Format; Length: 6; Label: "Prof/exec sales mngr office self other")
    <Missing>   *
    Mgr         ***
    Office      ***
    Other       ********
    ProfEx      ****
    Sales
    Self        *

REASON (Type: Num; Format; Length: 7; Label: "Home improvement or debt consolidation")
    <Missing>   *
    DebtCon     *************
    HomeImp     ******

In addition to the metadata computed by the %CODEBOOK macro, the histogram of frequency counts provides a quick visual summary of the distribution of values.

The three summary sections for ordinal variables are equivalent to those of nominal variables.

[Table: Summary of Ordinal Data. Columns: Variable Name, # of Obs, # Distinct Values, % Missing. Rows: BAD, DEROG.]

Ordinal Data
(Columns: Variable Name, Metadata, Value, Histogram, Freq, Percent, Cum. Freq, Cum. Percent)

BAD (Type: Num; Format; Length: 8; Label: "Default or seriously delinquent")
    0           ****************
    1           ****

DEROG (Type: Num; Format; Length: 8; Label: "Number of major derogatory reports")
    <Missing>   **
    0           ***************
    1           *
    2           *

USING THE %CODEBOOK MACRO

The %CODEBOOK macro uses PROC IML and Base SAS procedures. The macro accepts the following parameters:

    %macro CODEBOOK( DSNIN            /* name of dataset for which codebook is to be produced */
                   , CLASS_LEVELS=10  /* [optional] threshold for defining a categorical variable */
                   , DROP=            /* [optional] names of variables to exclude from processing */
                   , EMRULES=Y        /* [optional] use Enterprise Miner rules for creating metadata */
                   , FREQ=            /* [optional] indicate that one observation represents &FREQ obs */
                   , INTERVAL=        /* [optional] names of interval variables */
                   , MAXVALUE=10      /* [optional] number of values to display in max, min freq report */
                   , METASAMPLE=2000  /* [optional] # of obs to sample from &DSNIN in creating metadata */
                   , NOMINAL=         /* [optional] names of [char or numeric] nominal variables */
                   , ORDINAL=         /* [optional] names of [char or numeric] ordinal variables */
                   , ODS_LISTING=     /* [optional] path for ODS LISTING file output */
                   , WEIGHT=          /* [optional] indicate the relative importance of an observation */
                   )

If we turn off the Enterprise Miner metadata rules processing, we can reproduce the results shown above by invoking %CODEBOOK with the following parameters.

    libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample' ;

    %CODEBOOK( HMEQ.DMAHMEQ, EMRULES=N,
               INTERVAL=clage clno debtinc delinq loan mortdue ninq value yoj,
               METASAMPLE=6000,
               NOMINAL=job reason,
               ORDINAL=bad derog )

Note that, since there are 5,960 observations in the Home Equity dataset, using the default sample size of 2,000 may not create a representative sample. If the sample is not large enough to represent the entire population, the %CODEBOOK macro may not produce a codebook that accurately summarizes the original data. For example, using the default setting of METASAMPLE=2000, the variable DEROG is classified as ordinal, but using all of the data results in its classification as interval.
If METASAMPLE=max, all of the observations in the dataset will be used to build the codebook. A frequency variable may be used to indicate that a single observation represents several observations, and a weight variable may be used to indicate the relative importance of observations. If a weight variable is specified, skewness and kurtosis are not computed for interval variables.
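The effect of the metadata sample size can be explored outside the macro with a sketch like the one below. The paper does not state how %CODEBOOK draws its sample, so simple random sampling with PROC SURVEYSELECT (part of SAS/STAT) is an assumption made here for illustration; the output dataset name WORK.METASAMPLE and the seed value are likewise hypothetical.

```sas
/* Path taken from the paper for SAS 9.2; adjust for your installation. */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

/* Draw a simple random sample of 2,000 observations (assumed method). */
proc surveyselect data=HMEQ.DMAHMEQ out=work.metasample
                  method=srs sampsize=2000 seed=20240101;
run;

/* Compare the number of distinct nonmissing DEROG values in the
   sample with the number in the full dataset. */
proc sql;
  select 'sample' as source, count(distinct derog) as n_distinct
    from work.metasample
  union all
  select 'full' as source, count(distinct derog) as n_distinct
    from HMEQ.DMAHMEQ;
quit;
```

If the two counts fall on opposite sides of the distinct-value threshold, the variable's inferred measurement scale will depend on the sample, which is the situation the text describes for DEROG.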

EXAMPLES OF USE

The simplest invocation of the %CODEBOOK macro is

    %CODEBOOK( DATASET )

which will create a sample of 2,000 observations and draw metadata inferences from the sample.

If there is a variable in the dataset that contains an extremely large number of distinct values, e.g., a 5-digit ZIP code stored in a character variable, %CODEBOOK will create a list of the individual values until the list requires more characters than the largest character variable allowed. If the list length is greater than 32,767 characters (footnote 6), it will be truncated and an error message will be printed at the end of the list. In this case, the variable can be put into a drop list so that it will be excluded from processing.

Any variables that are not of interest can be dropped to reduce the execution time required for codebook production, which can be lengthy for big datasets. The example below drops unwanted variables and stores the resulting output in an ODS listing file.

    %CODEBOOK( DATASET,
               DROP=account_id zip5 first_name last_name,
               ODS_LISTING= //server/path/codebook.txt )

If %CODEBOOK is invoked with user-specified names of interval-, nominal-, or ordinal-scaled variables, they supersede the measurement scale assignments produced by the Enterprise Miner metadata rules.

If &FREQ and/or &WEIGHT variables are specified as %CODEBOOK parameters, they will be reported according to their measurement scales if &EMRULES=Y (the default). They must be explicitly dropped if their exclusion is desired. For example, the macro invocation

    %CODEBOOK( HMEQ.DMAHMEQ, DROP=FREQUENCY WGT, FREQ=FREQUENCY, WEIGHT=WGT )

will produce a codebook that uses the FREQUENCY and WGT variables in producing summaries of the HMEQ.DMAHMEQ variables but will not report FREQUENCY or WGT in its results.

CONCLUSION

We have developed a SAS macro that produces a codebook summary of a set of data in a form suitable for exploratory data analysis.
Enterprise Miner metadata rules are applied to each variable in the dataset to classify variables as interval, nominal, or ordinal in measurement scale. There are options for user-specified overrides of the metadata rules, specification of sample size, frequency and weighting variables, and storing the results in an ODS-created listing file separate from the standard output. A data miner can use the codebook summary to perform initial exploration of a set of data before starting an Enterprise Miner-based modeling effort.

Footnote 6. The maximum length of a character variable is 32,767 characters as of this writing. The macro code may be easily modified to accommodate future changes.

REFERENCES

1. Enterprise Miner 4.3 Input Data Source, Data Tab, Metadata Sample, SAS Institute, Inc., Cary, NC.

ACKNOWLEDGMENTS

We thank Joseph Naraguma and Jay King for reviewing preliminary versions of this paper and contributing their ideas.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at: Ross Bettinger

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.