SAS Rule-Based Codebook Generation for Exploratory Data Analysis
Ross Bettinger, Senior Analytical Consultant, Seattle, WA
ABSTRACT

A codebook is a summary of a collection of data that reports significant features of the assembled data. It can be used to provide insights into the data collection. Survey data are often assembled into codebooks for rapid understanding of the questions asked, and such codebooks are typically limited to variable name, variable label, categorical variable values and their frequency counts, and simple descriptive statistics for continuous variables. Ideally, a codebook provides more value to a statistician or data miner when it presents variables in a format suitable for exploratory data analysis. The %CODEBOOK macro described in this paper uses the Enterprise Miner heuristic rules for producing metadata, and classifies variables according to those rules into nominal, ordinal, or interval measurement scales (see footnote 1). It creates presentations of the variables appropriate to their measurement scale and may be used as an exploratory data analysis (EDA) tool for a first look at a set of data.

KEYWORDS

Codebook, exploratory data analysis, measurement scale, metadata, SAS Enterprise Miner

INTRODUCTION

A codebook is an abstract of a collection of data items that have been assembled for purposes of studying some topic of interest. Surveys are frequently summarized in codebook format so that interesting features of the data may be revealed for further analysis. The characteristics of the data may be restricted to a few descriptive attributes such as variable name and label, number of categories and frequency count per category for a categorical variable, or mean, standard deviation, and other descriptive statistics for a continuous variable. Many of these characteristics of the data may be obtained from a PROC CONTENTS listing.
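As a point of reference, the basic attributes mentioned above can be listed with PROC CONTENTS. The sketch below is not part of the %CODEBOOK macro; it uses SASHELP.CLASS, a sample dataset shipped with SAS, as an illustrative stand-in, and the output dataset name WORK.CLASS_META is hypothetical.

```sas
/* List variable names, types, lengths, formats, and labels.        */
/* SASHELP.CLASS is an illustrative stand-in for the data of        */
/* interest; VARNUM lists variables in their creation order.        */
proc contents data=sashelp.class varnum;
run;

/* Save the same attributes to a dataset for further processing.    */
/* WORK.CLASS_META is a hypothetical name chosen for this sketch.   */
proc contents data=sashelp.class out=work.class_meta noprint;
run;
```

The saved attributes (name, type, length, format, label) cover only the enumerative part of a codebook, which is the limitation the next paragraph addresses.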
We statisticians, data miners, and others who must rapidly assimilate the salient features of large ensembles of data require more descriptive measures than a simple enumeration of the data. We would like to see extreme values of a variable, whether categorical or continuous; the distribution of the data represented as histograms for categorical variables or box-and-whisker plots for continuous variables; and additional information appropriate to a variable's information content.

Footnote 1: The measurement scale of a variable refers to its comparison characteristics. For example, colors of the rainbow are labels that do not have an intrinsic ordering property. We do not know if blue is more or less than red, so colors are declared to be nominal. They can be enumerated but not ordered. If we attach a frequency to a color, then we can say that blue has a higher frequency than red, so the ordering by increasing frequency would be red followed by blue, and color ordered by frequency is said to be ordinal. But we do not know yet whether the difference between colors is meaningful, i.e., whether the computation blue - red is defined. Using frequency in terahertz (10^12 Hertz), if blue has a frequency of 638 THz and red has a frequency of 428 THz, then the computation blue frequency - red frequency is defined and the ordering is thus interval.
The process of creating a codebook can be automated by applying the Enterprise Miner rules of metadata creation to the variables in a dataset, and then producing a standardized report for each variable according to its metadata.

ENTERPRISE MINER RULES OF METADATA CREATION

The Enterprise Miner 4.3 rules of metadata creation [1] as adapted for the %CODEBOOK macro are:

- If the measurement scale of a variable was specified at macro invocation, it is not changed.
- If the variable is character, or numeric with one of the formats BINARY, HEX, IB, OCTAL, PD, PIB, PK, RB, SSN, Z, ZD, or S370, then the measurement level is set to nominal.
- If the number of distinct values of a numeric variable is greater than min(10, (number of nonmissing observations in the sample) / 5), then the measurement level is set to interval.
- If the variable is assigned a numeric-type format (BEST, COMMA, COMMAX, DOLLAR, DOLLARX, E, FRACT, NEGPAREN, PERCENT, or a standard w.d number format, such as 5.1), then the measurement level is set to ordinal.
- For all other formats, including user-defined formats, the measurement level is set to nominal.

Applying the EM metadata rules to the data is a user-specified option, and the default is to apply them. If the EM rules are not applied, then the measurement scale for each variable must be supplied by the user.

PRESENTATION FORMAT

The presentation of each variable is determined by its metadata classification. All the variables within a given measurement scale are summarized in tabular form and are then described individually. Descriptive statistics are computed for interval-scaled variables as a group, and then each variable is further analyzed. Nominal and ordinal variables are also first summarized as a group and then individually. Nominal and ordinal variables share the same presentation format, since an ordinal variable may be considered to be a nominal variable with an ordering property.
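The metadata rules listed earlier in this section can be sketched as a DATA step that works on one row of metadata per variable. This is only an illustration of the rule order, not the macro's actual code. The dataset WORK.VAR_META, its columns, and the four example rows are hypothetical; TYPE follows the PROC CONTENTS convention (1 = numeric, 2 = character). Treating a blank format name as a standard w.d format is an assumption of this sketch, and the first rule (user-specified scales are left unchanged) is omitted.

```sas
/* Hypothetical per-variable metadata, one row per variable.        */
data work.var_meta;
   length name $32 fmt $32;
   infile datalines dsd missover;
   input name $ type fmt $ n_distinct n_nonmiss;
   datalines;
JOB,2,,6,1900
SSN_ID,1,SSN,1500,2000
LOAN,1,,540,2000
BAD,1,,2,2000
;
run;

/* Apply the rules in the order in which the paper lists them.      */
data work.var_level;
   set work.var_meta;
   length level $8;

   /* Threshold as stated in the rule: min(10, nonmissing obs / 5). */
   threshold = min(10, n_nonmiss / 5);

   /* Character variables and the listed special numeric formats.   */
   if type = 2
      or fmt in ('BINARY','HEX','IB','OCTAL','PD','PIB','PK','RB',
                 'SSN','Z','ZD')
      or fmt =: 'S370' then level = 'NOMINAL';

   /* Numeric variables with many distinct values.                  */
   else if n_distinct > threshold then level = 'INTERVAL';

   /* Numeric-type formats; a blank format name is ASSUMED here to  */
   /* stand for a standard w.d format.                              */
   else if fmt in (' ','BEST','COMMA','COMMAX','DOLLAR','DOLLARX',
                   'E','FRACT','NEGPAREN','PERCENT')
      then level = 'ORDINAL';

   /* All other formats, including user-defined formats.            */
   else level = 'NOMINAL';
run;

proc print data=work.var_level noobs;
   var name type fmt n_distinct threshold level;
run;
```

Because the rules are tested in sequence with IF/ELSE IF, a numeric variable reaches the ordinal rule only when its count of distinct values does not exceed the threshold, which is how a two-valued numeric variable such as BAD ends up on the ordinal scale.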
Presentation Format for Interval-Scaled Variables

Interval-scaled variables are also called real numbers or continuous variables in mathematics or machine learning, respectively. We can perform arithmetic operations on them and take comfort that the difference between two interval-scaled variables measured on the same scale is meaningful (see footnote 2).

Footnote 2: For example, two temperatures must be compared on the same scale, e.g., Fahrenheit or Celsius, to be meaningful, since a Fahrenheit degree of temperature is 5/9 of a Celsius degree. See [link missing in source] for a lucid discussion of the origins of the two temperature measurement scales.
The group summary of interval-scaled variables includes the variable name and simple descriptive statistics. An example from the Home Equity Loan Scoring dataset (see footnote 3) illustrates the format. The numeric values of the example output are not preserved in this transcription.

Summary of Interval Data

    Variable Name   Min   Max   Mean   Std Dev   N   % Missing   Skewness   Kurtosis
    clage
    clno
    debtinc
    delinq
    loan
    mortdue
    ninq
    value
    yoj

An analyst can use the table to see which variables might be unusable due to a high percentage of missing values, which are skewed and need to be transformed, etc. The individual variable summary contains more detail:

Interval Data

    Variable Name:  clage
    Metadata:       Type: Num; Format: (blank); Length: 8; Label: "Age of oldest trade line in months"
    Basic Measures: N, N Missing, % Missing, Range, Mean, Std Dev, Skewness, Kurtosis
    Quantiles:      100% Max, 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, 0% Min
    Box Plot:       vertical box plot (*--+--*) annotated with Max, P99, P95, P90, P75, AVG, P25, P5, Min

There is repetition of the summary statistics in a different format for convenience, and a box-and-whisker plot (see footnote 4) is placed adjacent to the quantiles of the interval-scaled variable's distribution as a visual representation of the data.

Footnote 3: For SAS 9.2, it may be found at C:\Program Files\SAS\SASFoundation\9.2\dmine\sample\dmahmeq.sas7bdat.

Footnote 4: A 0 represents a value between 1.5 and 3 times the interquartile range (IQR). A * represents a value greater than 3 times the IQR.
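For readers who want to reproduce statistics of this kind outside the macro, the following sketch uses standard Base SAS procedures. It is not the %CODEBOOK implementation, which the paper does not list. The library path is the SAS 9.2 sample location cited in the paper and will differ on other installations.

```sas
/* Sample-library location as given in the paper for SAS 9.2;       */
/* adjust for the local installation.                               */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

/* Group summary: one row of descriptive statistics per variable.   */
proc means data=HMEQ.DMAHMEQ
           n nmiss min max mean std skewness kurtosis maxdec=2;
   var clage clno debtinc delinq loan mortdue ninq value yoj;
run;

/* Individual summary for one variable: moments, quantiles, and,    */
/* through the PLOTS option, a box plot of the distribution.        */
proc univariate data=HMEQ.DMAHMEQ plots;
   var clage;
run;
```

PROC MEANS reports the count of missing values (NMISS) rather than a percentage, so the % Missing column of the codebook would need one further computation, NMISS / (N + NMISS).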
Presentation Format for Nominal- and Ordinal-Scaled Variables

Nominal- and ordinal-scaled variables usually impart non-quantitative information on a different axis of meaning compared to interval-scaled variables and must be considered with respect to their different information content.

The group summary of nominal- and ordinal-scaled variables includes the variable name and enumerative information that will be useful to a data miner in creating metadata. The group summary for two nominal Home Equity variables is shown below. Note that there are three sections of the summary, the first two of which are shown below. The counts and percentages of the example output are not preserved in this transcription.

Summary of Nominal Data

    Variable Name   # of Obs   # Distinct Values   % Missing
    JOB
    REASON

Highest, Lowest Frequencies of Nominal Data

    Variable Name   Value       Count   Percent
    JOB             Other
                    ProfEx
                    Office
                    Mgr
                    <Missing>
                    Self
                    Sales
                    Other
                    ProfEx
                    Office
                    Mgr
                    <Missing>
                    Self
                    Sales
    REASON          DebtCon
                    HomeImp
                    <Missing>
                    DebtCon
                    HomeImp
                    <Missing>

In this example, we see that JOB has 6 distinct values and that they are of character type. Since there is no obvious ordering, JOB is a nominal variable. If we had prefixed each value with an integer, e.g., 1Mgr, 2ProfEx, 3Sales, etc., SAS would have used the lexicographic ordering to sort the values accordingly. But JOB would still be nominally scaled since it is a character variable.

The REASON variable has two values (three, counting <Missing>), so it could be classified as binary in the Enterprise Miner measurement scale schema. We note that 4.23% of the values are missing, so these missing values must either be imputed or observations containing them must be omitted from the study if completely populated observations are required (see footnote 5). Furthermore, the 10 highest and lowest frequencies are listed as a way of indicating which values are outliers in the distribution.

Footnote 5: An exception to this statement is the situation in which a data miner builds a model using an algorithm that is tolerant of missing data, such as a decision tree.
The third section of the summary presents information about each variable and histograms of each variable's values. The Freq, Percent, Cum. Freq, and Cum. Percent values of the example output are not preserved in this transcription.

Nominal Data

    Variable Name   Value       Histogram       Freq   Percent   Cum. Freq   Cum. Percent
    JOB             <Missing>   *
                    Mgr         ***
                    Office      ***
                    Other       ********
                    ProfEx      ****
                    Sales
                    Self        *
    REASON          <Missing>   *
                    DebtCon     *************
                    HomeImp     ******

    Metadata for JOB:    Type: Num; Format: (blank); Length: 6; Label: "Prof/exec sales mngr office self other"
    Metadata for REASON: Type: Num; Format: (blank); Length: 7; Label: "Home improvement or debt consolidation"

In addition to the metadata computed by the %CODEBOOK macro, the histogram of frequency counts provides a quick visual summary of the distribution of values. The three summary sections for ordinal variables are equivalent to those of nominal variables.

Summary of Ordinal Data

    Variable Name   # of Obs   # Distinct Values   % Missing
    BAD
    DEROG

Ordinal Data

    Variable Name   Value       Histogram          Freq   Percent   Cum. Freq   Cum. Percent
    BAD             0           ****************
                    1           ****
    DEROG           <Missing>   **
                    0           ***************
                    1           *
                    2           *

    Metadata for BAD:   Type: Num; Format: (blank); Length: 8; Label: "Default or seriously delinquent"
    Metadata for DEROG: Type: Num; Format: (blank); Length: 8; Label: "Number of major derogatory reports"
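Summaries similar to the nominal and ordinal sections above can be approximated with Base SAS procedures. The sketch below is an illustration under the assumption that the Home Equity sample library is available at the path the paper cites; it is not the macro's own code.

```sas
/* Sample-library location as given in the paper for SAS 9.2;       */
/* adjust for the local installation.                               */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

/* NLEVELS reports the number of distinct values per variable;      */
/* ORDER=FREQ lists values from most to least frequent; MISSING     */
/* treats missing values as a category of their own.                */
proc freq data=HMEQ.DMAHMEQ nlevels order=freq;
   tables job reason bad derog / missing;
run;

/* Text histograms with frequency and percent columns.              */
proc chart data=HMEQ.DMAHMEQ;
   hbar job reason / missing;
   /* DISCRETE gives each numeric value its own bar.                */
   hbar bad derog / discrete missing;
run;
```

PROC CHART draws its bars with asterisks in the listing output, which is close in spirit to the histograms shown in the codebook.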
USING THE %CODEBOOK MACRO

The %CODEBOOK macro uses PROC IML and Base SAS procedures. The macro is invoked with the following parameters:

    %macro CODEBOOK( DSNIN            /* name of dataset for which codebook is to be produced          */
                   , CLASS_LEVELS=10  /* [optional] threshold for defining a categorical variable      */
                   , DROP=            /* [optional] names of variables to exclude from processing      */
                   , EMRULES=Y        /* [optional] use Enterprise Miner rules for creating metadata   */
                   , FREQ=            /* [optional] indicate that one observation represents &FREQ obs */
                   , INTERVAL=        /* [optional] names of interval variables                        */
                   , MAXVALUE=10      /* [optional] number of values to display in max, min freq report */
                   , METASAMPLE=2000  /* [optional] # of obs to sample from &DSNIN in creating metadata */
                   , NOMINAL=         /* [optional] names of [char | numeric] nominal variables        */
                   , ORDINAL=         /* [optional] names of [char | numeric] ordinal variables        */
                   , ODS_LISTING=     /* [optional] path for ODS LISTING file output                   */
                   , WEIGHT=          /* [optional] indicate the relative importance of an observation */
                   )

If we turn off the Enterprise Miner metadata rules processing, we can create the same results as shown above by invoking %CODEBOOK with the following parameters.

    libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample' ;

    %CODEBOOK( HMEQ.DMAHMEQ
             , EMRULES=N
             , INTERVAL=clage clno debtinc delinq loan mortdue ninq value yoj
             , METASAMPLE=6000
             , NOMINAL=job reason
             , ORDINAL=bad derog
             )

Note that, since there are 5,960 observations in the Home Equity dataset, using the default sample size of 2,000 may not create a representative sample. If the sample size is not large enough to represent the entire population, the %CODEBOOK macro may not produce a codebook that accurately summarizes the original data. For example, using the default setting of METASAMPLE=2000, the variable DEROG is classified as ordinal, but using all of the data results in its classification as interval. If METASAMPLE=max, all of the observations in the dataset will be used to build the codebook.

A frequency variable may be used to indicate that a single observation represents several observations, and a weight variable may be used to indicate the relative importance of observations. If a weight variable is specified, skewness and kurtosis are not computed for interval variables.
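The effect of the metadata sample size on a variable such as DEROG can be checked directly by comparing the number of distinct values in a sample with the number in the full dataset. The paper does not describe how %CODEBOOK draws its sample, so the simple random sampling below is an assumption; the seed value and the name WORK.META_SAMPLE are arbitrary choices for this sketch, and PROC SURVEYSELECT requires SAS/STAT.

```sas
/* Sample-library location as given in the paper for SAS 9.2;       */
/* adjust for the local installation.                               */
libname HMEQ 'C:/Program Files/SAS/SASFoundation/9.2/dmine/sample';

/* Draw a simple random sample of 2,000 observations.               */
/* The sampling method and seed are assumptions of this sketch.     */
proc surveyselect data=HMEQ.DMAHMEQ out=work.meta_sample
                  method=srs sampsize=2000 seed=20090101;
run;

/* Number of distinct DEROG values in the sample.                   */
title 'Distinct values of DEROG: sample of 2,000';
proc freq data=work.meta_sample nlevels;
   tables derog / noprint;
run;

/* Number of distinct DEROG values in the full dataset.             */
title 'Distinct values of DEROG: all observations';
proc freq data=HMEQ.DMAHMEQ nlevels;
   tables derog / noprint;
run;
title;
```

If the two counts fall on opposite sides of the distinct-value threshold in the metadata rules, the variable's measurement scale depends on the sample, which is the situation the paper reports for DEROG.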
EXAMPLES OF USE

The simplest invocation of the %CODEBOOK macro is

    %CODEBOOK( DATASET )

which will create a sample of 2,000 observations and draw metadata inferences from the sample.

If there is a variable in the dataset that contains an extremely large number of distinct values, e.g., a 5-digit ZIP code that is contained in a character variable, %CODEBOOK will create a list of the individual values until the list requires more characters than the largest character variable allowed. If the list length is greater than 32,767 characters (see footnote 6), it will be truncated with an error message printed at the end of the list. In this case, the variable can be put into a drop list so it will be excluded from processing.

Any variables that are not of interest can be dropped to reduce the execution time required for codebook production, which can be lengthy for big datasets. The example below drops unwanted variables and stores the resulting output in an ODS listing file.

    %CODEBOOK( DATASET
             , DROP=account_id zip5 first_name last_name
             , ODS_LISTING= //server/path/codebook.txt
             )

If %CODEBOOK is invoked with user-specified names of interval-, nominal-, or ordinal-scaled variables, they supersede the measurement scale assignments produced by the Enterprise Miner metadata rules.

If &FREQ and/or &WEIGHT variables are specified as %CODEBOOK parameters, they will be reported according to their measurement scales if &EMRULES=Y (the default). They must be explicitly dropped if their exclusion is desired. For example, the macro invocation

    %CODEBOOK( HMEQ.DMAHMEQ
             , DROP=FREQUENCY WGT
             , FREQ=FREQUENCY
             , WEIGHT=WGT
             )

will produce a codebook that uses the FREQUENCY and WGT variables in producing summaries of the HMEQ.DMAHMEQ variables but will not report FREQUENCY or WGT in its results.

CONCLUSION

We have developed a SAS macro that produces a codebook summary of a set of data in a form suitable for exploratory data analysis. Enterprise Miner metadata rules are applied to each variable in the dataset to classify variables as interval, nominal, or ordinal in measurement scale. There are options for user-specified overrides of the metadata rules, specification of sample size, frequency and weighting variables, and storing the results in an ODS-created listing file separate from the standard output. A data miner can use the codebook summary to perform initial exploration of a set of data before starting an Enterprise Miner-based modeling effort.

Footnote 6: The maximum length of a character variable is 32,767 characters as of this writing. The macro code may be easily modified to accommodate future changes.
REFERENCES

[1] Enterprise Miner 4.3 Input Data Source, Data Tab, Metadata Sample, SAS Institute, Inc., Cary, NC.

ACKNOWLEDGMENTS

We thank Joseph Naraguma and Jay King for reviewing preliminary versions of this paper and contributing their ideas.

Contact Information

Your comments and questions are valued and encouraged. Contact the author at: Ross Bettinger

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
More informationUsing SPSS, Chapter 2: Descriptive Statistics
1 Using SPSS, Chapter 2: Descriptive Statistics Chapters 2.1 & 2.2 Descriptive Statistics 2 Mean, Standard Deviation, Variance, Range, Minimum, Maximum 2 Mean, Median, Mode, Standard Deviation, Variance,
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationSAS Portfolio Analysis - Serving the Kitchen Sink Haftan Eckholdt, DayTrends, Brooklyn, NY
SAS Portfolio Analysis - Serving the Kitchen Sink Haftan Eckholdt, DayTrends, Brooklyn, NY ABSTRACT Portfolio management has surpassed the capabilities of financial software platforms both in terms of
More informationEXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA
EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA Michael A. Walega Covance, Inc. INTRODUCTION In broad terms, Exploratory Data Analysis (EDA) can be defined as the numerical and graphical examination
More informationCopyr i g ht 2014, SAS Ins titut e Inc. All rights res er ve d. WHAT S NEW IN SAS ANALYTICS 9.4
WHAT S NEW IN SAS ANALYTICS 9.4 AGENDA SAS ANALYTICS 13.1 SAS Enterprise Guide 6.1 SAS/STAT 13.1 SAS/ETS 13.1 SAS Enterprise Miner 13.1 4 SAS ANALYTICS 13.1 LATEST RELEASE OF SAS ANALYTIC SOFTWARE SAS
More informationPredicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS
Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT
More informationBNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
More informationData! Data! Data! I can t make bricks without clay. Sherlock Holmes, The Copper Beeches, 1892.
1. Data: exploratory data analysis Content 1.1 Introduction 1.2 Tables and diagrams 1.3 Describing univariate numerical data Ref: Pagano and Gauvreau, Chapters 2 and 3. Data! Data! Data! I can t make bricks
More informationDescriptive Statistics and Measurement Scales
Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample
More informationIn this presentation, you will be introduced to data mining and the relationship with meaningful use.
In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine
More informationPaper 159 2010. Exploring, Analyzing, and Summarizing Your Data: Choosing and Using the Right SAS Tool from a Rich Portfolio
Paper 159 2010 Exploring, Analyzing, and Summarizing Your Data: Choosing and Using the Right SAS Tool from a Rich Portfolio ABSTRACT Douglas Thompson, Assurant Health This is a high level survey of Base
More informationIBM SPSS Statistics for Beginners for Windows
ISS, NEWCASTLE UNIVERSITY IBM SPSS Statistics for Beginners for Windows A Training Manual for Beginners Dr. S. T. Kometa A Training Manual for Beginners Contents 1 Aims and Objectives... 3 1.1 Learning
More informationThe Complete Software Package to Developing a Complete Set of Ratio Edits
The Complete Software Package to Developing a Complete Set of Ratio Edits Roger Goodwin and Maria Garcia Roger.L.Goodwin@Census.gov Abstract We present documentation for running the GenBounds software
More informationAPPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
More informationCan SAS Enterprise Guide do all of that, with no programming required? Yes, it can.
SAS Enterprise Guide for Educational Researchers: Data Import to Publication without Programming AnnMaria De Mars, University of Southern California, Los Angeles, CA ABSTRACT In this workshop, participants
More informationIBM SPSS Data Preparation 22
IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release
More informationUsing Data Mining Techniques for Analyzing Pottery Databases
BAR-ILAN UNIVERSITY Using Data Mining Techniques for Analyzing Pottery Databases Zachi Zweig Submitted in partial fulfillment of the requirements for the Master s degree in the Martin (Szusz) Department
More informationWhat is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling
MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : msiris@cityu.edu.hk 1 Aims To introduce the basic concepts of data mining
More informationDeveloping Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1
Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. Developing
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More information2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
More informationSAS Credit Scoring for Banking 4.3
SAS Credit Scoring for Banking 4.3 Hot Fix 1 SAS Banking Intelligence Solutions ii SAS Credit Scoring for Banking 4.3: Hot Fix 1 The correct bibliographic citation for this manual is as follows: SAS Institute
More informationAn introduction to using Microsoft Excel for quantitative data analysis
Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing
More informationFrom Database to your Desktop: How to almost completely automate reports in SAS, with the power of Proc SQL
From Database to your Desktop: How to almost completely automate reports in SAS, with the power of Proc SQL Kirtiraj Mohanty, Department of Mathematics and Statistics, San Diego State University, San Diego,
More informationa. mean b. interquartile range c. range d. median
3. Since 4. The HOMEWORK 3 Due: Feb.3 1. A set of data are put in numerical order, and a statistic is calculated that divides the data set into two equal parts with one part below it and the other part
More informationSession Attribution in SAS Web Analytics
Session Attribution Session Attribution Session Attribution Session Attribution Session Attribution Session Attribution Session Attributi Technical Paper Session Attribution in SAS Web Analytics The online
More informationScatter Chart. Segmented Bar Chart. Overlay Chart
Data Visualization Using Java and VRML Lingxiao Li, Art Barnes, SAS Institute Inc., Cary, NC ABSTRACT Java and VRML (Virtual Reality Modeling Language) are tools with tremendous potential for creating
More informationA Microsoft Access Based System, Using SAS as a Background Number Cruncher David Kiasi, Applications Alternatives, Upper Marlboro, MD
AD006 A Microsoft Access Based System, Using SAS as a Background Number Cruncher David Kiasi, Applications Alternatives, Upper Marlboro, MD ABSTRACT In Access based systems, using Visual Basic for Applications
More informationAutomate Data Integration Processes for Pharmaceutical Data Warehouse
Paper AD01 Automate Data Integration Processes for Pharmaceutical Data Warehouse Sandy Lei, Johnson & Johnson Pharmaceutical Research and Development, L.L.C, Titusville, NJ Kwang-Shi Shu, Johnson & Johnson
More informationSAS Data Set Encryption Options
Technical Paper SAS Data Set Encryption Options SAS product interaction with encrypted data storage Table of Contents Introduction: What Is Encryption?... 1 Test Configuration... 1 Data... 1 Code... 2
More informationEXST SAS Lab Lab #4: Data input and dataset modifications
EXST SAS Lab Lab #4: Data input and dataset modifications Objectives 1. Import an EXCEL dataset. 2. Infile an external dataset (CSV file) 3. Concatenate two datasets into one 4. The PLOT statement will
More informationVisualizing Relationships and Connections in Complex Data Using Network Diagrams in SAS Visual Analytics
Paper 3323-2015 Visualizing Relationships and Connections in Complex Data Using Network Diagrams in SAS Visual Analytics ABSTRACT Stephen Overton, Ben Zenick, Zencos Consulting Network diagrams in SAS
More informationVisualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures
Introductory Statistics Lectures Visualizing Data Descriptive Statistics I Department of Mathematics Pima Community College Redistribution of this material is prohibited without written permission of the
More information2. Filling Data Gaps, Data validation & Descriptive Statistics
2. Filling Data Gaps, Data validation & Descriptive Statistics Dr. Prasad Modak Background Data collected from field may suffer from these problems Data may contain gaps ( = no readings during this period)
More informationABSTRACT INTRODUCTION EXERCISE 1: EXPLORING THE USER INTERFACE GRAPH GALLERY
Statistical Graphics for Clinical Research Using ODS Graphics Designer Wei Cheng, Isis Pharmaceuticals, Inc., Carlsbad, CA Sanjay Matange, SAS Institute, Cary, NC ABSTRACT Statistical graphics play an
More informationIntegrating SAS with JMP to Build an Interactive Application
Paper JMP50 Integrating SAS with JMP to Build an Interactive Application ABSTRACT This presentation will demonstrate how to bring various JMP visuals into one platform to build an appealing, informative,
More informationProject Request and Tracking Using SAS/IntrNet Software Steven Beakley, LabOne, Inc., Lenexa, Kansas
Paper 197 Project Request and Tracking Using SAS/IntrNet Software Steven Beakley, LabOne, Inc., Lenexa, Kansas ABSTRACT The following paper describes a project request and tracking system that has been
More information