1 Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation is the most crucial, most difficult, and longest part of the mining process. A lot of steps are involved. Consider the simple distribution analysis of the variables, the diagnosis and reduction of the influence of variables' multicollinearity, the imputation of missing values, and the construction of categories in variables. In this presentation, we use data mining models in different areas like marketing, insurance, retail and credit risk. We show how to implement data preparation through SAS Enterprise Miner, using different approaches. We use simple code routines and complex processes involving statistical insights, cluster variables, transform variables, graphical analysis, decision trees, and more. INTRODUCTION The process of knowledge discovery in data mining involves three main parts: data preprocessing, data selection and data information. Figure 1: Process of Knowledge Discovery This article focus on data selection and covers techniques used in analytical data preparation. Rather than addressing the ETL (Extraction, Transform and Load) process, our goal is to show the fundamental steps involved in preparing and selecting context relevant variables. This will ensure an improved data mining process as a consequence of a less polluted set of variables. It will prevent problems, such as multicollinearity or missing values. In addition, processing time will be shortened. The success of predictive models largely depends on input selection. Our goal is to show some techniques to reduce the input count without compromising the quality of the final model.
2 The data set used in this article is called t97nk. This data set was taken and adapted from the Association for Computing Machinery's (ACM) 1998 KDD - Cup competition. Details of the set and the competition are publicly available at the UCI Archive at This data set has 49 variables and rows, a binary Target. USING DATA SOURCE CREATION A data source is a link between an existing SAS Table and the project in SAS Enterprise Miner. Figure 2 shows how to use the Metadata Advisor Options to improve variable selection, during the creation of data source in data mining modelling. On the Advanced Advisor Options panel we have: Figure 2: Advanced Advisor Options 1) Reject variables with excessive number of missing values. The default here is 50%. At first, variables missing more than 50% of their values are not taken into acount. 2) Detect the number of levels of Numeric variables and assign Nominal role to those with class count below the selected threshold. The default is 20, but in this case I chose 2 as threshold. 3) Detect the number of levels of Character variables and assign a rejected role to those with class count above the selected threshold. The default is also 20, but I chose 100. This option helps to eliminate unwanted variables. Figure 3 shows the group of variables considering the settings chosen on the Advanced advisor Options panel. Figure 3: Data Source Wizard
3 Setting the Advanced Advisory Option allows you to see some descriptive statistics, such as Number of Levels, Minimum, Maximum, Mean, Standard Deviation, Skewness and Kurtosis, by checking the Statistics box as shown in Figure 4. Figure 4: Statistics Option in Data Source Wizard STATEXPLORE NODE: USING METRICS AND GRAPHS TO EXPLORE DATA The StatExplore node is a multipurpose tool that you use to examine variable distributions and statistics in your data sets. The StatExplore node generates summarization statistics. In this section we use the StatExplore node to find relevant information about the variables. Figure 4 shows the StatExplore node connected at the data source T97Nk. Figure 4: Using the StatExplore node in the flow
4 Select the StatExplore node and examine its Properties Panel. To have the Chi-Square between all variables and the Target change the option Interval Variables below Chi-Square Statistics to Yes as shown in Figure 5. As you know, the Chi-Square test is used for categorical variables. One way to include interval variables in this test is doing as mentioned above. This will generate chi-square statistics including the interval variables by binning the type of variables. The default is 5 bins. Figure 5: Panel Properties to StatExplore Node Figure 6 shows the chi-square test including the transformed interval variables. Figure 6: Chi_Square Plot Results
5 The chart shows the association between each variable and the Target in decreasing order of association. If you want to see these created bins, go to the results part and select View > Summary Statistics > Cell Chi-Squares. Figure 7 shows the bins to the variable Card_Prom_12. Figure 7: Card_Prom_12 Bins Figure 8 shows the results of each Chi-Square test with all the variables. Figure 8: Chi_Square Results
6 Figure 9 shows the Scale Mean Deviation plot. This plot shows the scale mean deviation between the overall mean of an interval input variable and its mean for each level of the target variable. Each variable is represented vertically by two squares, one blue and the other red. The blue is the scale mean deviation of the variable when the target is 0 and the red is the scale mean deviation of the same variable when the target is 1. Figure 9: Scale Mean Deviation Plot For example, let's assume there is a class target with level 0 and 1 and that there is an interval variable called x, the scaled mean deviation is calculated as follows: SMD(x, 0) = [mean(x, where target=0) - mean(x)] / mean(x) SMD(x, 1) = [mean(x, where target=1) - mean(x)] / mean(x) The results from the StatExplore node help us find the best set of variables which could be used in the process of data mining. TRANSFORM VARIABLES NODE The Transform Variables node allows the transformation of all variables. For interval variables, simple transformations, binning transformations and best transformations are available. For class variables, the grouping of rare levels and the creation of dummy indicators are possible. In general, with Transform Variables node, you can do both computed transformations and customized actions choosing the best transformation based on your choice.
7 Let s take a look at three very common automatic transformations. Figure 10: Transform Variable Node Figure 10 shows the Transform Variable node connected to the data source. For didactic purposes I renamed the node to Optimal. Figure 11: Transform Variable Properties Panel Figure 11 shows the Default Methods including the Optimal Binning for interval inputs. The optimal binning allows you to transform continuous variables into an ordered set of bins. The binning is done so that the log odds of the predicted categorical variable is monotonically increasing or decreasing. The optimal binning transformation is useful when there is a nonlinear relationship between an input variable and the target. Figure 12: Transform Variable Node
8 Figure 12 shows the Transform Variable node connected to the data source that I renamed to Best. Figure 13: Transform Variable Properties Panel Figure 13 shows the Default Methods including the Best option for interval inputs. The Best option perform several transformations and uses the transformation that has the best association with the target variable considering the Chi Squared test. Figure 14: Results of Computed Transformations Figure 14 shows some results after we select the Best option.
9 Figure 15: Transform Variable Node Figure 15 shows the Transform Variable node connected to the data source that I renamed to Max Normal. In this case, in Default Methods I changed from none to Maximum Normal as shown in Figure 16. Figure 16: Transform Variable Properties Panel Figure 17: Results of Computed Transformations Figure 17 shows the results. When the option Maximum Normal is chosen, the node will look for the best transformation to maximize the normality of the variable.
10 In addition to the computed transformation options, the Transform Variable node allows us to do customized transformations when selecting the Formulas option, as shown in Figure 18. Figure 18: Properties Panel This option shows graphics with the distribution of each variable as shown in Figure 19. Figure 19: Distribution of Each Variable In addition to providing the distribution of each variable graphically, if you select the Expression Builder from the options Create > Build, you can do customized transformations based on your requirement as shown in Figure 20.
11 Figure 20: Expression Builder INTERACTIVE BINNING NODE This node groups the values of each variable, both class and interval, in bins or buckets to be used in predictive modeling as a way to improve the predictive power of each input. In addition, this node can be used as a variable selection that applies Gini statistic as metric. This metric shows the predictive power of a characteristic, for example ability to separate high-risk applicants from low-risk ones. Through this you can configure the node so that the missing values are treated as a level for each variable. Figure 21: Interactive Binning Node
12 Figure 21 shows the Interactive Binning node results. In it, both rejected and approved variables are arranged after going through Gini statistics. Figure 22: Interactive Binning Results Figure 22 shows the results from Interactive Binning node results. These results show which variables were rejected and which variables were approved regarding the results from Gini statistics. Figure 23: Interactive Binning Options
13 If you want to change the ranges or the groups, this node allows us to interact with the results found, as shown in Figure 23. Figure 24: Interactive Binning Panel This can be done by selecting the square in Interactive Binning panel as in Figure 24. IMPUTE NODE: IMPUTING MISSING VALUES In the processing of Data Mining, databases often contain observations that have missing values for one or more variables. Missing values can result from data collection errors, incomplete customer responses, actual system and measurement failures, or from a revision of the data collection scope over time, such as tracking new variables that were not included in the previous data collection scheme. In SAS Enterprise Miner, models such as regressions and neural networks ignore the whole record that contains missing values. This reduces the size of the training data set. Less training data can substantially weaken the predictive power of these models. To overcome this obstacle of missing data, you can impute missing values before you fit the models. How should missing data values be treated? There is no single correct answer. Choosing the "best" missing value replacement technique inherently requires the researcher to make assumptions about the true (missing) data. For example, researchers often replace a missing value with the mean of the variable. This approach assumes that the variable's data distribution follows a normal population response. Replacing missing values with the mean, median, or another measure of central tendency is simple, but it can greatly affect a variable's sample distribution. You should use these replacement statistics carefully and only when the effect is minimal. The Imputation Node does impute a value based on various metrics like mean, mode, or median as well as Decision Trees (with or without Surrogates) to capture the essence of the variable with missing entries using other variables in the data set. Let s take a look in some way to do the imputation values through Impute node. Figure 25 shows the Impute node connected on the Transformation node renamed as Default.
14 Figure 25: Impute Node Figure 26: Impute Node Panel Properties Figure 26 shows the default option in the panel properties. If the variable has an interval role then the missing values will be replaced by the average of the variable. On the other hand, if the variable has a class role then the missing values will be replaced by the count or mode of the variable. Because a predicted response might be different for cases with a missing input value, a binary imputation indicator variable is often added to the training data. Adding this variable enables a model to adjust its predictions in cases where missingness itself is correlated with the target. Figure 27: Indicator Variables in Impute Node Panel Properties Figure 27 shows where to choose this option in Impute Properties Panel.
15 Figure 28: Indicator Variables in Impute Node Panel Properties Figure 28 shows the Impute node connected on the Transformation node renamed as Tree. On Properties Panel I replace both Default Input Method Class and Interval variables to Tree. Using this option the replacement values are estimated by analyzing each input as a target, and the remaining input and rejected variables are used as predictors. Because the imputed value for each input variable is based on the other input variables, this imputation technique may be more accurate than simply using the variable mean or median to replace the missing tree values. PRINCIPAL COMPONENT NODE The Principal Component node calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix or the correlation matrix of input variables. Principal components are calculated from the eigenvectors and are usually treated as the new set of input variables for successor nodes. This approach is useful for data interpretation and data dimension reduction. Figure 29 shows the Principal Components node connected with the previous nodes. Figure 29: Principal Components
16 Figure 30: Principal Components Matrix The Principal Components Matrix chart displays scattered plots for all pairs of the selected principal components. By default, the initial matrix chart shows the scattered plots for the first five principal components considering de Target variable as in Figure 30. On the Eigenvalue Plot, you can see some useful metrics like eigenvalues, proportional eigenvalues, cumulative proportional eigenvalue and log eigenvalue. You can select an item from the drop-down menu to change the values. Figure 31: Eigenvalue Plot
17 Figure 31 shows the Eigenvalue Plot with values. The vertical reference line indicates the number of principal components that will be used in the subsequent node. In this case we have 14 principal components. VARIABLE CLUSTERING NODE The Variable Clustering node is useful for data reduction, such as choosing the best variables or cluster components for analysis purposes. Variable Clustering removes collinearity, decreases variable redundancy, and helps to reveal the underlying structure of input variable in a data set. You can use both Correlation Matrix and Covariance Matrix as source of information to build the variables clusters. Using the Correlation Matrix, the SAS Enterprise Miner uses the correlation matrix of standardized variables. When the Cluster Component property is set to Principal Components, the source matrix provides eigenvalues. Using the Covariance Matrix the SAS Enterprise Miner uses the covariance matrix of raw variables. The Covariance property setting permits variables that have a large variance to have more effect on the cluster components than variables that have a small variance. The Covariance property setting permits variables that have a large variance to have more effect on the cluster components than variables that have a small variance. When the Cluster Component property is set to Principal Components, the source matrix provides eigenvectors. Figure 32 shows the Variable Clustering node connected with the previous nodes. Figure 32: Variable Clustering The Results window dendrogram uses a tree hierarchy to display how the clusters were formed. Figure 33 shows the dendogram.
18 Figure 33: Dendogram Figure 34 shows the Cluster Plot with four clusters that were created from the set of interval input variables in the data source. Figure 34: Cluster Plot The components in the Variable Selection table will be exported to the node that follows the Variable Clustering node in a process flow diagram. The R-Square with Next Cluster Component column of the table indicates the R-square scores with the nearest cluster. If the clusters are well separated, R-square score values should be low. Small values in the 1 R 2 Ratio column of the Variable Selection table also indicate good clustering.
19 VARIABLE SELECTION NODE The Variable Selection node enables you to evaluate the importance of input variables in predicting or classifying the target variable. To select the important inputs, the tool uses either an R- Squared or a Chi-Squared selection criterion. When the option Default is used this the uses target and model information to choose the selection method. If the target is binary and the model has greater than 400 degrees of freedom, the Chi-Square method is selected. Otherwise, the R-Square method is used. The R-Square method can be used with a binary as well as with an interval-scaled target. In the R-Square method, variable selection is performed in two steps. In the first step R-Square between the input and the target is calculated. All variables with a correlation above a specified threshold are selected in the first step. Those variables which are selected in the first step enter the second step of variable selection. In the second step, a sequential forward selection process is used. This process starts by selecting the input variable that has the highest correlation coefficient with the target. A regression equation is estimated with the selected input. The Chi-Square method can be used when the target is binary. When this criterion is selected, the selection process does not have two distinct steps, as in the case of the R-square criterion. Instead, a tree is constructed. The inputs selected in the construction of the tree are passed to the next node with the assigned Role of Input. Figure 35: Variable Selection Figure 35 shows the flow using the Variable Selection node with default options.
20 Figure 36: R 2 Values Figure 36 shows the R 2 Values Plot. This plot displays a histogram with ranked variable effects. The ranking uses the variable R 2, sorted from highest to lowest. CONCLUSION This paper shows how you can use the SAS Enterprise Miner 13.2 to the process of preparing data for analytical modeling. My goal here is to show some ways to do this. These approaches should be used as a reference and not as a road map to follow. ACKNOWLEDGMENTS The author thanks Djalma de Azevedo for the input and help in reviewing the text. REFERENCES Kattamuri S. Sarma. Predictive Modeling with SAS Enterprise Miner : Practical Solutions for Business Applications, Second Edition. Patricia B. Cerrito. Introduction to Data Mining Using SAS Enterprise Miner. SAS Enterprise Miner 13.2: Reference Help.
21 CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Ricardo Galante. SAS SAS Institute Brasil 9º andar - R. Leopoldo Couto Magalhães Júnior, Itaim Bibi, São Paulo - SP, SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration.
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
Application of SAS! Enterprise Miner in Credit Risk Analytics Presented by Minakshi Srivastava, VP, Bank of America 1 Table of Contents Credit Risk Analytics Overview Journey from DATA to DECISIONS Exploratory
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE Perhaps Microsoft has taken pains to hide some of the most powerful tools in Excel. These add-ins tools work on top of Excel, extending its power and abilities
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : firstname.lastname@example.org 1 Aims To introduce the basic concepts of data mining
Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly
STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members
Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. Developing
Medical Information Management & Mining You Chen Jan,15, 2013 You.email@example.com 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.
SAS Analytics Day Predictive Modeling of Titanic Survivors: a Learning Competition Linda Schumacher Problem Introduction On April 15, 1912, the RMS Titanic sank resulting in the loss of 1502 out of 2224
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
Table of contents: Access Data for Analysis Data file types Format assumptions Data from Excel Information links Add multiple data tables Create & Interpret Visualizations Table Pie Chart Cross Table Treemap
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 1: Descriptive Statistics Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research
Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...
Bill Burton Albert Einstein College of Medicine firstname.lastname@example.org April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Calculate counts, means, and standard deviations Produce
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis
Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX email@example.com Carole Jesse Cargill, Inc. Wayzata, MN firstname.lastname@example.org
FACT SHEET SAS ENTERPRISE MINER 5.3 Unearthing valuable insight profitable data mining results with less time and effort What does SAS Enterprise Miner do? SAS Enterprise Miner streamlines the data mining
Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through
White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................
Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional
Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing
TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY Before we begin: Turn on the sound on your computer. There is audio to accompany this presentation. Audio will accompany most of the online
The Basics of SAS Enterprise Miner 5.2 1.1 Introduction to Data Mining...1 1.2 Introduction to SAS Enterprise Miner 5.2...4 1.3 Exploring the Data Set... 14 1.4 Analyzing a Sample Data Set... 19 1.5 Presenting
SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,
Paper 10961-2016 Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables Vinoth Kumar Raja, Vignesh Dhanabal and Dr. Goutam Chakraborty, Oklahoma State
Entering and Formatting Data Using Microsoft Excel to Plot and Analyze Kinetic Data Open Excel. Set up the spreadsheet page (Sheet 1) so that anyone who reads it will understand the page (Figure 1). Type
Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way
Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating
SAS BI Dashboard 4.3 User's Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2010. SAS BI Dashboard 4.3: User s Guide. Cary, NC: SAS Institute
DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 email@example.com 1. Descriptive Statistics Statistics
Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,
ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/
SPSS Introduction Yi Li Note: The report is based on the websites below http://glimo.vub.ac.be/downloads/eng_spss_basic.pdf http://academic.udayton.edu/gregelvers/psy216/spss http://www.nursing.ucdenver.edu/pdf/factoranalysishowto.pdf
Paper 1863-2014 Internet Gambling Behavioral Markers: Using the Power of SAS Enterprise Miner 12.1 to Predict High-Risk Internet Gamblers Sai Vijay Kishore Movva, Vandana Reddy and Dr. Goutam Chakraborty;
Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide Olivia Parr-Rud From Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner. Full book available
Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2003. Data Mining Using SAS Enterprise
M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have
TIPS FOR DOING STATISTICS IN EXCEL Before you begin, make sure that you have the DATA ANALYSIS pack running on your machine. It comes with Excel. Here s how to check if you have it, and what to do if you
FACT SHEET Using Excel for descriptive statistics Introduction Biologists no longer routinely plot graphs by hand or rely on calculators to carry out difficult and tedious statistical calculations. These
Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses Salford Systems Data Mining 2006 March 27-31 2006 San Diego, CA By Dean Abbott Abbott Analytics
SAS Enterprise Guide for Educational Researchers: Data Import to Publication without Programming AnnMaria De Mars, University of Southern California, Los Angeles, CA ABSTRACT In this workshop, participants
SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA ABSTRACT A codebook is a summary of a collection of data that reports significant
Principles of Statistics STA-201-TE This TECEP is an introduction to descriptive and inferential statistics. Topics include: measures of central tendency, variability, correlation, regression, hypothesis
DHL Data Mining Project Customer Segmentation with Clustering Timothy TAN Chee Yong Aditya Hridaya MISRA Jeffery JI Jun Yao 3/30/2010 DHL Data Mining Project Table of Contents Introduction to DHL and the
MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain
Business Intelligence and Process Modelling F.W. Takes Universiteit Leiden Lecture 2: Business Intelligence & Visual Analytics BIPM Lecture 2: Business Intelligence & Visual Analytics 1 / 72 Business Intelligence
Hierarchical Clustering Analysis What is Hierarchical Clustering? Hierarchical clustering is used to group similar objects into clusters. In the beginning, each row and/or column is considered a cluster.
Paper 1808-2014 Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry Kittipong Trongsawad and Jongsawas Chongwatpol NIDA Business School, National Institute
Using Excel Jeffrey L. Rummel Emory University Goizueta Business School BBA Seminar Jeffrey L. Rummel BBA Seminar 1 / 54 Excel Calculations of Descriptive Statistics Single Variable Graphs Relationships
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal ank of Scotland, ridgeport, CT ASTRACT The credit card industry is particular in its need for a wide variety
Data used in this guide: studentp.sav (http://people.ysu.edu/~gchang/stat/studentp.sav) Organize and Display One Quantitative Variable (Descriptive Statistics, Boxplot & Histogram) 1. Move the mouse pointer
Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous
GeoGebra Statistics and Probability Project Maths Development Team 2013 www.projectmaths.ie Page 1 of 24 Index Activity Topic Page 1 Introduction GeoGebra Statistics 3 2 To calculate the Sum, Mean, Count,
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers