DATA MINING DEMYSTIFIED: TECHNIQUES AND INFORMATION FOR MODEL IMPROVEMENT
Nora Galambos, PhD, Senior Data Scientist
AIR Forum 2016, New Orleans, LA

Why Use Data Mining?
- Enables the extraction of information from large amounts of data.
- Incorporates analytic tools for data-driven decision making.
- Uses modeling techniques to apply results to future data.
- The goal is to develop a model, not only to find factors significantly associated with the outcomes.
- Incorporates statistics, pattern recognition, and mathematics.
- Few assumptions to satisfy relative to traditional hypothesis-driven methods.
- A variety of methods for different types of data and predictive needs.
- Able to handle a great volume of data with hundreds of predictors.

Data Mining Terminology
- Decision tree methods/algorithms: CHAID and CART
- Bagging and boosting
- Regression algorithms: LARS and LASSO
- Diagnostic measures: ROC curves, the Gini coefficient, and the Gini index
- Model testing and improvement: cross-validation

Receiver Operating Characteristic (ROC) Curves
- Used to discriminate between binary outcomes.
- Originally used during WWII in signal detection to distinguish between enemy and friendly aircraft on the radar screen. In the 1970s it was realized that signal detection theory could be useful in evaluating medical test results.
- ROC curves can evaluate how well a diagnostic test correctly separates results into those with and without a disease. They can also be used for other types of binary outcomes, such as models predicting retention and graduation.
- The area under the ROC curve is used to evaluate how well a model correctly discriminates between the outcomes.

ROC Curves
- Sensitivity: true positive rate, the probability of a positive result when the disease is present or the outcome occurs: a/(a+c).
- Specificity: true negative rate, the probability of a negative result when the disease is not present or the outcome does not occur: d/(b+d).
- False positive rate = 1 - specificity.

                Observed +   Observed -
  Predicted +       a            b
  Predicted -       c            d

- The greater the area under the ROC curve (true positive rate plotted against false positive rate), the better the discrimination between positive and negative outcomes.
- It is imperative to examine both the true positive and false positive rates, not just the percentage of correct predictions.
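The cell formulas above can be checked in a few lines of Python; the counts a, b, c, d are invented for illustration, not taken from the slides.

```python
# Sensitivity, specificity, and false positive rate from the 2x2 table
# (rows = predicted, columns = observed). Counts are made up.
a, b, c, d = 40, 10, 5, 45   # a: true +, b: false +, c: false -, d: true -

sensitivity = a / (a + c)             # true positive rate
specificity = d / (b + d)             # true negative rate
false_positive_rate = 1 - specificity

print(sensitivity, specificity, false_positive_rate)
```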

ROC Curve: Cutpoints
- The sensitivity and specificity vary with the cutpoint of the diagnostic test or the predicted probability of a logistic model.
- An ROC curve can be created by plotting the true and false positive rates at different model cutpoints. As the sensitivity increases, the specificity decreases.
- The area under the ROC curve (AUC) indicates how well the test discriminates between positive and negative outcome groups. AUC = 0.5 indicates no ability to discriminate between groups.
- Compare ROC curves to determine the model that does the best job of discriminating between groups.

[Figure: classification plot and corresponding ROC curve, true positive rate vs. false positive rate.]
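The cutpoint-sweeping idea can be sketched directly: with made-up scores and labels, walk down the predicted probabilities, record the (false positive rate, true positive rate) point at each cutpoint, and integrate the curve with the trapezoidal rule to get the AUC.

```python
# Build an ROC curve by sweeping the cutpoint over sorted predicted
# probabilities, then compute AUC with the trapezoidal rule.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]  # illustrative
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0  ]

P = sum(labels)              # positives
N = len(labels) - P          # negatives
points = [(0.0, 0.0)]
tp = fp = 0
for s, y in zip(scores, labels):     # scores already sorted descending
    if y == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / N, tp / P))  # (false positive rate, true positive rate)

# Trapezoidal rule: sum of (x_i - x_{i-1}) * (y_i + y_{i-1}) / 2
auc = sum((x1 - x0) * (y1 + y0) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(auc)
```

An AUC near 1 means the scores rank nearly all positives above the negatives; 0.5 means the ranking is no better than chance.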

Gini Coefficient
- The Lorenz curve was developed by the economist Max Lorenz in 1905 to represent the distribution of wealth, and the Italian statistician Corrado Gini developed the Gini coefficient in 1912 as a measure of income inequality.
- The x-axis is the cumulative population proportion; the y-axis is the cumulative income proportion.
- Each point on the Lorenz curve gives the proportion of income (y value) possessed by the corresponding proportion of the population. For example, on the curve to the right, 70% of the population has only 40% of the wealth.
- The closer the Lorenz curve is to the diagonal, the more equal the income distribution.

Gini Coefficient

Gini = A / (A + B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve.

Since A + B = (1/2)(base × height) = 1/2, and B can be approximated by the trapezoid rule,

  B = (1/2) Σᵢ (xᵢ − xᵢ₋₁)(yᵢ + yᵢ₋₁)

it follows that

  A = 1/2 − B = 1/2 − (1/2) Σᵢ (xᵢ − xᵢ₋₁)(yᵢ + yᵢ₋₁)

  G = 2A = 1 − 2B = 1 − Σᵢ (xᵢ − xᵢ₋₁)(yᵢ + yᵢ₋₁)

That is, the Gini coefficient is 1 minus twice the area under the Lorenz curve. In predictive modeling the Gini coefficient is used to evaluate the strength of a model, in which case a greater Gini coefficient is better.
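The trapezoid formula above translates directly into code; the Lorenz-curve points here are invented for illustration.

```python
# Gini coefficient from Lorenz-curve points via G = 1 - 2B, where B is
# the trapezoid-rule area under the curve. Points are illustrative:
# x = cumulative population share, y = cumulative income share.
lorenz = [(0.0, 0.0), (0.2, 0.05), (0.4, 0.15), (0.6, 0.30),
          (0.8, 0.55), (1.0, 1.00)]

B = sum((x1 - x0) * (y1 + y0) / 2
        for (x0, y0), (x1, y1) in zip(lorenz, lorenz[1:]))
gini = 1 - 2 * B
print(round(gini, 4))
```

For a perfectly equal distribution the Lorenz curve is the diagonal, B = 1/2, and the coefficient is 0; the more the curve bows away from the diagonal, the closer the coefficient gets to 1.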

Gini Index: An Impurity Measure
- The Gini index is used for selecting the best way to split data between nodes in decision trees by measuring the degree of impurity of the nodes.
- When there are only two categories, the maximum value of the Gini index is 0.5, indicating the greatest degree of impurity.

  Gini(t) = 1 − Σᵢ₌₀^(c−1) p(i|t)²

  where c is the number of classes, t represents the node, and p(i|t) is the probability of membership in class i at node t.

- A word of caution: sometimes the Gini coefficient is referred to as the Gini index.

[Graph: Gini index for two categories; x-axis: group 1 probability, y-axis: Gini index.]
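The impurity formula is short enough to verify directly; `gini_impurity` is a hypothetical helper written for this sketch, not a function from any particular package.

```python
# Gini impurity of a node from its class counts. For two classes the
# maximum is 0.5 at a 50/50 split; a pure node scores 0.
def gini_impurity(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini_impurity([50, 50]))   # most impure two-class node -> 0.5
print(gini_impurity([100, 0]))   # pure node -> 0.0
```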

Gain Chart
- Gain and lift are often used to evaluate models used for direct marketing, but they can be used to compare decision tree models for other uses as well.
- The diagonal line represents random response with no model.
- To evaluate the effectiveness of a model, the data are sorted in descending order of the probability of a positive response, often binned by deciles. The cumulative positive response rate at each decile is evaluated.
- In the example, 70% of the positive responses came from the first three deciles, or 30% of the subjects.

Jaffery T., Liu SX. Measuring Campaign Performance using Cumulative Lift and Gain Chart. SAS Global Forum 2009.

Lift Charts
- The cumulative lift chart evaluates how many times better the model performs compared to having no model at all.
- Consider the example on the previous slide at the third decile: 30% of the subjects accounted for 70% of the total responses, and 70/30 = 2.33, so the model performs 2.33 times better than having no model.
- Lift and gain charts are used to compare models and to determine whether using a particular model is worthwhile.
- If this example were a direct mailing, we would expect that 70% of the total responses could be obtained by sending the mailing to just 30% of the sample.

Jaffery T., Liu SX. Measuring Campaign Performance using Cumulative Lift and Gain Chart. SAS Global Forum 2009.
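The sort-then-bin procedure behind both charts can be sketched on simulated data (the scores and responses below are generated, not from the presentation): responses are made more likely when the model's probability is high, so the top deciles should show gain above the diagonal and lift above 1.

```python
# Cumulative gain and lift by decile on simulated scored data.
import random

random.seed(1)
n = 1000
data = []
for _ in range(n):
    p = random.random()                                 # predicted probability
    data.append((p, 1 if random.random() < p else 0))   # simulated response
data.sort(key=lambda t: t[0], reverse=True)             # best prospects first

total_resp = sum(y for _, y in data)
gains, lifts = [], []
for d in range(1, 11):
    top = data[: n * d // 10]                           # first d deciles
    gain = sum(y for _, y in top) / total_resp          # cumulative % of responses
    gains.append(gain)
    lifts.append(gain / (d / 10))                       # times better than no model
print([round(g, 2) for g in gains])
print([round(l, 2) for l in lifts])
```

By construction the gain curve ends at 100% (all responses captured once everyone is contacted), and lift decays toward 1 as the deciles accumulate.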

Linear Regression in a Data Mining Environment
- Linear regression is an available data mining modeling tool; however, it is important to be mindful of missing data and multicollinearity.
- Decision tree methods are able to handle missing values by combining them with another category or placing them in a category with other values; missing values can also be replaced using surrogate rules. Linear regression, on the other hand, will listwise-delete the missing values.
- When using data having dozens or even hundreds of potential predictors, it can happen that not much data remains. It is important to note the number of observations remaining in your model and to consider using an imputation method provided by the software package if listwise deletion is a serious problem.
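A tiny sketch of why listwise deletion bites with many predictors: any row with even one missing value is dropped entirely. The rows and field names below are invented for illustration.

```python
# Listwise deletion: a row with any missing value (None) is removed.
rows = [
    {"gpa": 3.2, "sat": 1200, "logins": 14},
    {"gpa": 3.8, "sat": None, "logins": 9},     # missing SAT -> dropped
    {"gpa": 2.9, "sat": 1100, "logins": None},  # missing logins -> dropped
    {"gpa": 3.5, "sat": 1300, "logins": 21},
]
complete = [r for r in rows if all(v is not None for v in r.values())]
print(len(rows), "->", len(complete))   # 4 -> 2
```

With hundreds of predictors, even 1-2% missingness per variable compounds quickly, which is why checking the remaining N (or imputing) matters.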

Clustering
- With a large volume of predictors, it would be difficult and time-consuming to evaluate all of the potential multicollinearity issues.
- Clustering can be used to group highly correlated variables. In each cluster, the variable with the highest correlation coefficient can be retained and entered into the modeling process, and the others are eliminated.

LASSO: Least Absolute Shrinkage and Selection Operator
- The LASSO algorithm shrinks the independent variable coefficients and sets some of them to zero to simplify the model.
- Ordinary least squares (OLS) regression equation: y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
- The sum of the squared residuals is minimized in OLS regression. LASSO, however, requires that this minimization be subject to the sum of the absolute values of the coefficients being less than or equal to some value t, referred to as a tuning parameter: Σⱼ |βⱼ| ≤ t
- The constraint forces some coefficients to be set to zero and results in a simpler model. As the tuning parameter increases to a very large value, the solution approaches the OLS regression model. It is important to test a range of values for t.
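One way to see the "shrink and zero out" behavior: in the special case of an orthonormal design, the LASSO solution has a closed form in which each OLS coefficient is soft-thresholded toward zero. The sketch below shows only that special case (the general problem needs an iterative solver), with made-up coefficients and `lam` playing the role of the tuning parameter.

```python
# Soft-thresholding: the closed-form LASSO solution when the predictors
# are orthonormal. Coefficients smaller than lam are set exactly to zero.
def soft_threshold(beta_ols, lam):
    sign = 1 if beta_ols > 0 else -1
    return sign * max(abs(beta_ols) - lam, 0.0)

ols = [2.5, -0.8, 0.3, -1.6]   # illustrative OLS coefficients
lam = 0.5
lasso = [soft_threshold(b, lam) for b in ols]
print(lasso)   # every coefficient shrinks; the smallest becomes exactly 0
```

At lam = 0 the OLS solution is returned unchanged, matching the slide's point that a loose constraint approaches OLS.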

LARS: Least Angle Regression
The LARS procedure is computationally efficient and begins like ordinary least squares regression by adding the predictor x_{j1} with the highest correlation with y. Next, the coefficient β_{j1} is increased (or decreased if β_{j1} is negative) until some other predictor x_{j2} is just as correlated with the residuals as x_{j1}. At this point LARS parts company with forward selection: instead of continuing along x_{j1}, LARS proceeds in a direction equiangular between the two predictors until a third variable x_{j3} earns its way into the most correlated set. LARS then proceeds equiangularly between x_{j1}, x_{j2}, and x_{j3}, i.e., along the "least angle direction," until a fourth variable enters, and so on.¹

¹ Efron B, et al. Least Angle Regression. 2002. Retrieved from http://web.stanford.edu/~hastie/papers/lars/leastAngle_2002.pdf

Bagging: Bootstrap Aggregating
- Bootstrapping is used to form M datasets by sampling with replacement.
- A separate tree is created using each of the M datasets.
- The improved final model is a combination or average of the M decision trees.
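The three steps above can be sketched with a deliberately trivial base learner: the data and the "model" (the bootstrap-sample mean, standing in for a fitted tree) are invented for illustration, but the resample-fit-average loop is exactly the bagging recipe.

```python
# Bagging sketch: M bootstrap samples (drawn with replacement), one
# simple "model" fit per sample, predictions averaged at the end.
import random

random.seed(7)
data = [2.1, 2.5, 3.0, 3.3, 3.6, 3.9, 2.8, 3.1]   # illustrative outcomes
M = 200                                            # number of bootstrap datasets

predictions = []
for _ in range(M):
    boot = [random.choice(data) for _ in data]     # sample with replacement
    predictions.append(sum(boot) / len(boot))      # the fitted "model"

bagged = sum(predictions) / M                      # average of the M models
print(round(bagged, 3))
```

With trees the averaging happens prediction-by-prediction, but the payoff is the same: the variance of the averaged model is lower than that of any single fit.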

AdaBoost: Adaptive Boosting
- The decision tree is fit to the entire training set, and misclassified observations are weighted to encourage correct classification. This is done repeatedly, with new weights assigned at each iteration, to improve the accuracy of the predictions.
- The result is a series of decision trees, each one adjusted with new weights based on the accuracy of the estimates or classifications of the previous tree.
- Although boosting generally results in an improved model, because the results at each stage are weighted and combined into a final model, there is no resulting tree diagram. Scoring code generated by the software package allows the model to be used to score new data for predicting outcomes.
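A single round of the reweighting step can be written out explicitly. The labels and the weak learner's predictions below are made up; the update rule (error rate, learner weight alpha, exponential reweighting, renormalization) is the standard AdaBoost step.

```python
# One AdaBoost round: misclassified points get up-weighted so the next
# tree concentrates on them. Labels are coded +1/-1; data is illustrative.
import math

y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, -1, 1, 1]   # this round's weak learner got 2 of 6 wrong

n = len(y_true)
w = [1 / n] * n                  # start with equal weights
err = sum(wi for wi, yt, yp in zip(w, y_true, y_pred) if yt != yp)
alpha = 0.5 * math.log((1 - err) / err)   # this learner's vote in the ensemble

w = [wi * math.exp(-alpha * yt * yp)      # shrink correct, grow incorrect
     for wi, yt, yp in zip(w, y_true, y_pred)]
total = sum(w)
w = [wi / total for wi in w]              # renormalize to sum to 1

print(round(err, 3), round(alpha, 3))
print([round(wi, 3) for wi in w])         # misclassified points now weigh more
```

A known property of this update: after renormalization the misclassified points carry exactly half the total weight, which is what forces the next learner to do better than chance on them.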

BFOS CART Method (Breiman, Friedman, Olshen, Stone): Classification and Regression Trees
- The method does an exhaustive search for the best binary split: it splits categorical predictors into two groups, or finds the optimal binary split in numerical measures.
- Each successive split is again split in two until no further splits are possible. The result is a tree of maximum possible size, which is then pruned back.
- For interval targets the variance is used to assess the splits; for nominal targets the Gini impurity measure is used.
- Pruning starts with the split that has the smallest contribution to the model.
- Missing data are assigned to the largest node of a split.
- Creates a set of nested binary decision rules to predict an outcome.
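The exhaustive binary search over a numeric predictor can be sketched directly; the toy data and the reuse of the Gini impurity from the earlier slide are illustrative, not the presenter's code.

```python
# CART-style exhaustive search for the best binary split on one numeric
# predictor, scoring each candidate cutpoint by weighted Gini impurity.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

x = [1, 2, 3, 4, 5, 6, 7, 8]                       # a numeric predictor
y = ["fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

best = None
for cut in sorted(set(x))[:-1]:                    # every candidate cutpoint
    left = [yi for xi, yi in zip(x, y) if xi <= cut]
    right = [yi for xi, yi in zip(x, y) if xi > cut]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    if best is None or score < best[1]:
        best = (cut, score)

print(best)   # the cutpoint with the lowest weighted impurity
```

CART applies this search to every predictor at every node, keeps the best split, and recurses on the two children, which is what produces the maximal tree that is later pruned.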

CHAID: Chi-squared Automatic Interaction Detection
- Unlike CART, with binary splits evaluated by misclassification measures, the CHAID algorithm uses the chi-square test (or the F test for interval targets) to determine significant splits and to find the independent variables with the strongest association with the outcome.
- A Bonferroni correction to the p-value is applied prior to the split.
- It may find multiple splits in continuous variables and allows splitting of data into more than two categories.
- As with CART, CHAID allows different predictors for different sides of a split.
- The CHAID algorithm halts when statistically significant splits are no longer found in the data.

Data Partitioning
- Partitioning is used to avoid over- or under-fitting. The data are divided into three parts: training, validation, and testing.
- The training partition is used to build the model.
- The validation partition is set aside and is used to test the accuracy and fine-tune the model. The prediction error is calculated using the validation data. An increase in the error in the validation set may be caused by over-fitting, in which case the model may need modification.
- Often 60% of the data is used for training and 40% for validation, or 40% for training, 30% for validation, and 30% for testing.
- Problem: What if the sample size is small? E.g., in predicting freshmen retention, the retention rate is usually around 90% and there are 3,000 freshmen. That means the training sample may have only around 180 students who were not retained with which to develop a model that may have 50 or more predictors.
- Using k-fold cross-validation is the answer.

K-fold Cross-validation for Evaluating Model Performance
Why use k-fold cross-validation?
- It works with limited data, and the initial steps are similar to traditional data analysis.
- The entire dataset is used to choose the predictors; cross-validation is used to evaluate the model, not to develop the model.
- The error is estimated by averaging the error of the K test samples.
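The fold-and-average procedure can be sketched with a placeholder model (the training-fold mean standing in for a fitted tree or regression); the outcome values are invented for illustration, and the per-fold error is the ASE used later in the presentation.

```python
# K-fold cross-validation sketch: hold out each fold in turn, fit on the
# other K-1 folds, score the held-out fold, and average the K errors.
K = 5
y = [3.1, 2.8, 3.6, 2.2, 3.9, 3.3, 2.5, 3.0, 3.7, 2.9]   # illustrative GPAs
n = len(y)

fold_errors = []
for k in range(K):
    test_idx = [i for i in range(n) if i % K == k]        # held-out fold
    train_idx = [i for i in range(n) if i % K != k]       # remaining folds
    model = sum(y[i] for i in train_idx) / len(train_idx) # fitted "model"
    sse = sum((y[i] - model) ** 2 for i in test_idx)
    fold_errors.append(sse / len(test_idx))               # fold ASE = SSE/N

cv_error = sum(fold_errors) / K
print(round(cv_error, 3))
```

Every observation is used for validation exactly once, which is what makes the method workable when, as on the previous slide, the rare class is too small to split off a fixed validation partition.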

5-Fold Cross Validation Plan
For each fold K = 1, ..., 5, the entire dataset was divided into 5 equal parts; one part serves as the validation set while the remaining four are used for training, with the validation part rotating across the folds.
Office of Institutional Research, Planning & Effectiveness

K-Fold Cross-Validation Results: Average Squared Error (ASE) for Five Data Mining Methods Predicting Freshmen GPA

           Gradient Boosting  BFOS-CART      CHAID          Decision Tree  Linear Regression
  Fold     Valid.  Train.     Valid. Train.  Valid. Train.  Valid. Train.  Valid. Train.
  1        0.333   0.363      0.394  0.427   0.444  0.355   0.421  0.335   0.374  0.396
  2        0.353   0.358      0.425  0.423   0.479  0.325   0.432  0.330   0.477  0.388
  3        0.377   0.351      0.429  0.432   0.508  0.312   0.472  0.325   0.515  0.363
  4        0.391   0.351      0.436  0.433   0.510  0.304   0.495  0.304   0.522  0.376
  5        0.422   0.343      0.525  0.393   0.511  0.345   0.515  0.312   0.561  0.371
  Average  0.375   0.353      0.442  0.422   0.490  0.328   0.467  0.321   0.490  0.379

ASE = (sum of squared errors)/N

- Gradient boosting had the smallest average ASE, followed by CART. Gradient boosting and BFOS-CART, on average, had the smallest differences between the validation and training errors.
- The CART method was chosen for the modeling process: it had a relatively low ASE, and gradient boosting, without an actual tree diagram, would make the results more difficult to explain.

Portion of CART Tree for Predicting Freshmen GPA

[Tree diagram. The displayed portion splits on HS GPA <= 92.0 and on LMS logins per non-STEM course in weeks 2-6 (>= 11.3 or missing vs. < 11.3). Later splits use SAT Critical Reading (cutpoints 540 and 570), SAT Math + Critical Reading (1360), SAT Math (680), logins per STEM course in weeks 2-6 (cutpoints 5.3 and 32.2), number of AP STEM courses (0 vs. >= 1), highest DFW rate among STEM courses (17%), and week-1 STEM and non-STEM course logins. Terminal nodes show predicted GPAs ranging from 1.59 (N = 13) to 3.63 (N = 46).]