THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

Introduction

Most data-mining projects involve classification problems, that is, assigning objects to classes, whether it be sifting the profitable from the unprofitable, detecting fraudulent cases, identifying repeat buyers, profiling high-value customers likely to attrite, or flagging high credit-risk applications. This paper proposes a new technique for solving these types of classification problems by hybridizing two popular classification tools, decision trees and logistic regression. CART and LOGIT, data-mining tools by Salford Systems, are used to demonstrate how to implement this new approach from both a theoretical and a practical perspective.

Strengths and Weaknesses of CART

CART (Classification and Regression Trees) is a state-of-the-art decision-tree tool that can investigate any classification task and provide a robust, accurate predictive model. This methodology is characterized by its ability to automate the modeling process, communicate via pictures, and handle complex data structures. The core features of Salford Systems' CART include:

- automatic separation of relevant from irrelevant predictors
- automatic interaction detection
- impervious to outliers
- unaffected by missing values (using surrogates)
- invariant to variable transformations (e.g., log, square root, etc.)

CART, and decision trees in general, however, are notoriously weak at capturing strong linear structure. While CART recognizes the structure, it cannot effectively represent it, producing very large trees in an attempt to represent very simple relationships. A second weakness of tree-based tools is that they sometimes produce very coarse-grained response images; that is, the tree may contain only a small number of terminal nodes. Given that the classification outcome or score is shared by all cases in a terminal node, a 12-node tree, for example, can only predict 12 different probabilities.
For some problems, this may be considered a classification success (e.g., identifying a large set of cases as "responders"); for other types of problems, however, it can be problematic. A final weakness, again only problematic for particular problem types, is that decision trees generate discontinuous responses; thus, a small change in x could lead to a large change in y.

Strengths and Weaknesses of Logit

Logistic regression (or Logit) is a traditional methodology that relies on classical statistical principles and, like CART, demonstrates remarkable accuracy in a broad range of contexts. In the STATLOG project, for example, variations on logistic discriminant analysis consistently ranked among the top performers, scoring highest in 5 of 21 problems. Logistic regression models effectively capture global linear structure in data and, given that non-linear structure can be reasonably approximated with linear structure, even incorrectly specified models can perform well. This technique also provides a smooth, continuous predicted probability of class membership, so that a small change in a predictor variable yields a small change in the predicted probability.

However, while Logit excels at handling linear and smooth curvilinear data structures, it often requires experts to hand-craft the models, and the results can be challenging to interpret, often understood only via simulation. In addition, while Logit permits some flexibility with transformations, polynomials and interactions, it is incumbent upon the analyst to correctly identify the best variable representations.

Juxtaposing the strengths and weaknesses of CART and Logit, a natural question arises: given that they excel at different tasks, can we capitalize on their strengths by combining them?

CART | LOGIT
Automatic analysis | Requires experts
Uses surrogates for missing values | Deletes records or imputes missing values
Unaffected by outliers | Sensitive to outliers
Discontinuous response: a small change in x could lead to a large change in y | Continuous smooth response: a small change in x leads to a small change in y
Coarse-grained response images: finite number of probabilities | Unique predicted probability for every record
High speed | Low speed to infeasible if too many inputs

Early Attempts to Hybridize

The first attempt to capitalize on the different strengths of the two techniques involved running logistic regression models in the terminal nodes of deliberately shallow decision trees.

[Figure: CART tree with a separate LOGIT model fitted in each terminal node]

An examination of how CART works explains why these first attempts were unsuccessful. As noted above, CART excels in the detection of local data structure. Once a data set is partitioned into two subsets at the root node, each half of the tree is then separately analyzed. As the partitioning continues, the analysis is always restricted to the node in focus. The discovery of patterns becomes progressively more localized, and the fit at one node is never adjusted to take into account the fit at another. In this manner, CART reaches its goal: to split the data into homogeneous subsets. The farther down the tree, the less the variability in the target (dependent) variable.

CART splits send cases with x ≤ c to the left and x > c to the right; thus, the variance in the predictor (independent) variables is also drastically reduced. For example, if x is normally distributed and the splitting cut point is at the mean of x, the variance in the two child nodes is reduced by about 64%. For subsequent mean splits, the variance reduction will always be greater than 50%. This reduction in predictor variable variance will also apply to correlated predictors.

By the time CART has declared a node as terminal, the information remaining in the node is insufficient to support further statistical analysis. The sample size in the terminal nodes is drastically reduced, as is the target and predictor variable variance. Thus, in a well-developed CART tree, no parametric model should be supportable within terminal nodes. Estimating Logits or other parametric models earlier in the tree, for example after just a few splits, has the same drawback as terminal node models, but the results are less extreme. At best, this latter approach provides a mechanism for identifying switching regressions, although this is not very successful in practice.
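The 64% figure quoted above is consistent with the variance of a half-normal distribution: if x is normal with variance σ² and is cut at its mean, each half has variance σ²(1 − 2/π), a reduction of 2/π, or roughly 64%. The following is a small simulation sketch of that single mean split. It is not part of the original analysis; the function name, sample size and seed are illustrative assumptions.

```python
import math
import random
import statistics


def variance_reduction_after_mean_split(n=100_000, seed=0):
    """Simulate one split of a standard normal predictor at its sample mean.

    Returns the fractional reduction in predictor variance, comparing the
    parent node with the size-weighted average variance of the two children.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    cut = statistics.fmean(x)

    # CART-style split: x <= cut goes left, x > cut goes right.
    left = [v for v in x if v <= cut]
    right = [v for v in x if v > cut]

    parent_var = statistics.pvariance(x)
    child_var = (
        len(left) * statistics.pvariance(left)
        + len(right) * statistics.pvariance(right)
    ) / n
    return 1.0 - child_var / parent_var


simulated = variance_reduction_after_mean_split()
theoretical = 2.0 / math.pi  # half-normal result, roughly 0.64
print(f"simulated reduction:   {simulated:.3f}")
print(f"theoretical reduction: {theoretical:.3f}")
```

The sketch covers only the first split; the claim about subsequent mean splits is not checked here.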
Proposed New Hybrid Approach

The key to running a successful hybrid is to run the Logit in the root node, thereby taking advantage of Logit's strength in detecting global structure, and to include CART terminal node dummies in the Logit model. This new hybrid approach is implemented as follows:

1. Use CART to assign every case to a terminal node. With CART surrogates, assignment is possible in every case, even those with some or all values missing.
2. Create a new categorical variable that is equal to the terminal node assignment. The new categorical variable will have as many levels as there are terminal nodes.
3. Feed the categorical variable into the Logit model (LOGIT will automatically expand an x-level categorical into x dummy variables). This first-run Logit model is then used as a baseline model (more on this below).
4. Add main-effects variables to the baseline CART-Logit model. The added variables constitute the hybrid component and can be tested as a group via a likelihood ratio test.

[Figure: CART tree whose terminal node dummies, together with additional variables, enter a single Logit run on the entire dataset]

The Logit formulas for the CART and hybrid models can be represented as follows:

CART only:

$$y = \beta_0 + \beta_1 \mathrm{NODE}_1 + \beta_2 \mathrm{NODE}_2 + \cdots + \beta_K \mathrm{NODE}_K$$

where NODE_i is a dummy variable for the i-th CART node, and

CART-Logit hybrid:

$$y = \beta_0 + \underbrace{\sum_{i=1}^{K} \beta_i \mathrm{NODE}_i}_{\text{CART node dummies}} + \underbrace{\sum_{j > K} \beta_j Z_j}_{\text{hybrid covariates}}$$

where the Z_j are the added main-effect covariates.
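The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the Salford Systems implementation: `assign_node` is a hypothetical stand-in for a fitted CART tree, the data are simulated, the Logit is fitted by plain gradient ascent, and the intercept is dropped so that the K node dummies are not collinear with it.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(rows, y, iters=200, lr=2.0):
    """Fit a Logit by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for r, t in zip(rows, y):
            resid = t - sigmoid(sum(b * v for b, v in zip(beta, r)))
            for j, v in enumerate(r):
                grad[j] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def log_likelihood(rows, y, beta):
    ll = 0.0
    for r, t in zip(rows, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, r)))
        ll += math.log(p) if t == 1 else math.log(1.0 - p)
    return ll


def assign_node(x1):
    # Step 1 (assumption): a fixed three-node partition stands in for CART.
    return 0 if x1 <= -0.5 else (1 if x1 <= 0.5 else 2)


rng = random.Random(1)
K = 3
node_effect = [-1.5, 0.0, 1.5]  # assumed node-level logits for the simulation
data = []
for _ in range(1000):
    x1, z = rng.gauss(0, 1), rng.gauss(0, 1)
    node = assign_node(x1)
    p = sigmoid(node_effect[node] + 0.6 * z)  # z has a global effect across nodes
    data.append((node, z, 1 if rng.random() < p else 0))

y = [t for _, _, t in data]
# Steps 2-3: expand the node assignment into K dummies (baseline model).
dummies = [[1.0 if node == k else 0.0 for k in range(K)] for node, _, _ in data]
# Step 4: add one main-effect covariate to the baseline (hybrid model).
hybrid_rows = [d + [z] for d, (_, z, _) in zip(dummies, data)]

beta_base = fit_logit(dummies, y)
beta_hyb = fit_logit(hybrid_rows, y)
ll_base = log_likelihood(dummies, y, beta_base)
ll_hyb = log_likelihood(hybrid_rows, y, beta_hyb)

# The baseline reproduces each node's observed response rate.
for k in range(K):
    ys = [t for (node, _, t) in data if node == k]
    print(k, round(sum(ys) / len(ys), 3), round(sigmoid(beta_base[k]), 3))

# Likelihood ratio test of the added covariate (1 degree of freedom).
lr_stat = 2.0 * (ll_hyb - ll_base)
print("LR statistic:", round(lr_stat, 2), "(5% critical value is about 3.84)")
```

In practice the tree and the Logit would come from dedicated software; the hand-written fitter is used here only to keep the sketch self-contained.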

The Logit model fit to the CART terminal node dummies converts the dummies into estimated probabilities; otherwise, it is an exact representation of the CART model. Each dummy represents the rules and interaction structure discovered by CART, albeit buried in a black box. The likelihood score on this model can then be used as a baseline score for further testing and model assessment. Note also that this simple hybrid model is an excellent way to incorporate sampling weights and recalibrate a CART tree.

The addition of main-effect variables to the baseline model then allows the now expanded model to capture effects common across all nodes (i.e., global structure). Because all strong effects have already been detected in the initial CART run, the effects detected across the terminal nodes are likely to be weak; nevertheless, a collection of weak effects can be very significant. A good starting point for expanding on the LOGIT component of the hybrid model is to:

- add variables already selected as important by CART
- add competitor variables in the root node that never actually appeared as splitters or surrogates in the CART tree
- add variables known to be important from other studies

A stepwise selection procedure can be used to pare down the variable list, and the pared-down list of main-effect variables can then be tested as a group via a likelihood ratio test.

In sum, by looking across nodes, Logit finds effects that CART cannot detect. Because these effects are not very strong, they are not detected by CART and not used as primary node splitters. Once the sample is split by CART, these effects become progressively more difficult to detect as the subsamples become increasingly homogeneous in the child nodes. While these effects may not be strong individually, collectively they can add enormous predictive power to the model.
Finessing the Hybrid Model

Other considerations that must be addressed in building a CART-Logit hybrid model are missing values, variable transformations and interactions. The simplest approach for handling missing values is, of course, to ignore the problem by dropping all records with missing values on model variables. Alternatively, CART-predicted probabilities can be assigned to those cases with missing values, while hybrid-predicted probabilities are assigned to all other cases. More complicated approaches include missing value imputation and adding missing-value dummy indicators to the model plus nesting for non-missing. Given that CART trees give good results, these more complicated approaches are usually not required.
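The fallback rule described above (CART probabilities for cases with missing values, hybrid probabilities otherwise) might be sketched as follows. All names and coefficient values are hypothetical placeholders chosen for illustration; they are not taken from any fitted model.

```python
import math


def score_case(node, covariates, node_prob, node_coef, cov_coef):
    """Return (probability, source) for one case.

    node       : terminal-node index assigned by the tree
    covariates : hybrid covariate values, with None where a value is missing
    node_prob  : hypothetical CART probability for each terminal node
    node_coef  : hypothetical hybrid Logit coefficient for each node dummy
    cov_coef   : hypothetical hybrid Logit coefficients for the covariates
    """
    if any(v is None for v in covariates):
        # A model variable is missing: fall back to the CART node probability.
        return node_prob[node], "cart"
    z = node_coef[node] + sum(c * v for c, v in zip(cov_coef, covariates))
    return 1.0 / (1.0 + math.exp(-z)), "hybrid"


# Made-up values for a three-node tree and a single covariate.
node_prob = [0.10, 0.45, 0.80]
node_coef = [-2.2, -0.2, 1.4]
cov_coef = [0.5]

complete = score_case(1, [0.3], node_prob, node_coef, cov_coef)
incomplete = score_case(1, [None], node_prob, node_coef, cov_coef)
print(complete, incomplete)
```

Because CART can always assign a terminal node through surrogates, the fallback branch never needs the covariates themselves.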

Variable transforms, such as logs and square roots, will also need to be considered. Given that interactions will already be captured in the CART terminal node dummies, the only interaction terms worth considering are those that capture node-specific effects (i.e., terminal node interactions with selected variables) and interactions with missing value indicators. Salford Systems' MARS, which automates the process of identifying optimal variable transformations and interactions, can be effectively utilized at this stage.

Assessing Node-Specific Logit Fit

CART segments the data into very different subsamples, so why expect that a single common Logit is valid? First, the terminal node dummies capture all of the complex interactions and non-commonality of the hyper-segments. Second, a simple test can be carried out to check whether the Logit developed in each node resulted in an improvement over the CART score. To test whether the hybrid model shows a lack of fit in any node or subset of nodes, perform a simple likelihood test node by node. If the CART likelihood is greater than the hybrid model likelihood, do not apply the hybrid model to that particular node.

Real World Results

In our experience, models have often been improved dramatically by the hybrid CART-Logit technique. In the following direct mail and financial market applications, the hybrid outperformed both CART and LOGIT alone:

Direct mail applications:
- response model for catalog
- response model for credit card offer
- response model for an insurance product

Financial applications:
- mortgage model
- loan delinquency model
- fraud detection model

While these real-world examples are valuable case studies, they constitute a small sample and the results cannot be shared due to confidentiality. Because of this, extensive experiments on artificial data sets were conducted. This analysis permitted a more accurate assessment of the possible benefits and flaws of the hybrid methodology.
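The node-by-node likelihood check described under Assessing Node-Specific Logit Fit could be sketched as below. The function names and the toy scores are assumptions made for illustration; which sample (training or hold-out) the comparison should be run on is left open here.

```python
import math
from collections import defaultdict


def node_log_likelihoods(nodes, y, probs):
    """Sum the Bernoulli log-likelihood of `probs` within each terminal node."""
    ll = defaultdict(float)
    for node, t, p in zip(nodes, y, probs):
        ll[node] += math.log(p) if t == 1 else math.log(1.0 - p)
    return dict(ll)


def choose_model_per_node(nodes, y, p_cart, p_hybrid):
    """Keep the hybrid score in a node unless CART has the higher likelihood."""
    ll_cart = node_log_likelihoods(nodes, y, p_cart)
    ll_hyb = node_log_likelihoods(nodes, y, p_hybrid)
    return {k: ("cart" if ll_cart[k] > ll_hyb[k] else "hybrid") for k in ll_cart}


# Toy, made-up scores for two terminal nodes (illustration only).
nodes = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 1, 0]
p_cart = [1 / 3, 1 / 3, 1 / 3, 2 / 3, 2 / 3, 2 / 3]  # node response rates
p_hybrid = [0.2, 0.2, 0.6, 0.4, 0.5, 0.8]

choice = choose_model_per_node(nodes, y, p_cart, p_hybrid)
print(choice)
```

In this toy data the hybrid scores fit node 0 better and the CART scores fit node 1 better, so the rule mixes the two models across nodes.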
Monte Carlo Test Results

For the Monte Carlo hybrid model assessment, samples of various sizes (2,000 to 100,000 records) were randomly drawn (from?). Each experiment was run 100 different times by resetting the random seed. The resulting models were assessed on the basis of fit and also performance (e.g., profit yielded if the model guides policy). Training and hold-out samples were used to assess possible over-fitting of the data. The specific Monte Carlo experiments are summarized below:

- simple Logit: one variable
- CART tree: one variable, highly non-linear
- hybrid process
- Logit: several variables (possibly missing)
- hybrid: several variables (possibly missing)
- highly non-linear smooth function (not Logit)
- complex Logit with informative missingness

The Monte Carlo results indicated that in smaller samples (n=2,000), LOGIT performed very well even when it was not the true model. In larger samples (n=20,000), the hybrid model dominated on out-of-sample performance measures. In addition, in the larger samples, both CART and the hybrid model manage problems with missing values, whereas the Logit model collapses. Finally, in larger samples with high frequencies of missing values, the hybrid outperforms the other models regardless of which model is true.

References

Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth.

Friedman, J. H. (1991), Multivariate Adaptive Regression Splines (with discussion), Annals of Statistics, 19 (March).

Michie, D., D. J. Spiegelhalter, and C. C. Taylor, eds. (1994), Machine Learning, Neural and Statistical Classification, London: Ellis Horwood Ltd.

Steinberg, D. and P. Colla (1995), CART: Tree-Structured Non-Parametric Data Analysis, San Diego, CA: Salford Systems.

CART is a registered trademark of California Statistical Software and licensed exclusively to Salford Systems. All other trademarks mentioned are the property of their respective owners.

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

CART 6.0 Feature Matrix

CART 6.0 Feature Matrix CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window

More information

Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association

Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association Addressing Analytics Challenges in the Insurance Industry Noe Tuason California State Automobile Association Overview Two Challenges: 1. Identifying High/Medium Profit who are High/Low Risk of Flight Prospects

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems

Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Data Mining Approaches to Modeling Insurance Risk. Dan Steinberg, Mikhail Golovnya, Scott Cardell. Salford Systems 2009

Data Mining Approaches to Modeling Insurance Risk. Dan Steinberg, Mikhail Golovnya, Scott Cardell. Salford Systems 2009 Data Mining Approaches to Modeling Insurance Risk Dan Steinberg, Mikhail Golovnya, Scott Cardell Salford Systems 2009 Overview of Topics Covered Examples in the Insurance Industry Predicting at the outset

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Handling attrition and non-response in longitudinal data

Handling attrition and non-response in longitudinal data Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

More information

Word Count: Body Text = 5,500 + 2,000 (4 Figures, 4 Tables) = 7,500 words

Word Count: Body Text = 5,500 + 2,000 (4 Figures, 4 Tables) = 7,500 words PRIORITIZING ACCESS MANAGEMENT IMPLEMENTATION By: Grant G. Schultz, Ph.D., P.E., PTOE Assistant Professor Department of Civil & Environmental Engineering Brigham Young University 368 Clyde Building Provo,

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds

Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds Combining Linear and Non-Linear Modeling Techniques: Getting the Best of Two Worlds Outline Who is EMB? Insurance industry predictive modeling applications EMBLEM our GLM tool How we have used CART with

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression

More information

REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,

More information

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully

More information

Smart Grid Data Analytics for Decision Support

Smart Grid Data Analytics for Decision Support 1 Smart Grid Data Analytics for Decision Support Prakash Ranganathan, Department of Electrical Engineering, University of North Dakota, Grand Forks, ND, USA Prakash.Ranganathan@engr.und.edu, 701-777-4431

More information

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

More information

Martian Chronicles: Is MARS better than Neural Networks? by Louise Francis, FCAS, MAAA

Martian Chronicles: Is MARS better than Neural Networks? by Louise Francis, FCAS, MAAA Martian Chronicles: Is MARS better than Neural Networks? by Louise Francis, FCAS, MAAA Abstract: A recently developed data mining technique, Multivariate Adaptive Regression Splines (MARS) has been hailed

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Customer and Business Analytic

Customer and Business Analytic Customer and Business Analytic Applied Data Mining for Business Decision Making Using R Daniel S. Putler Robert E. Krider CRC Press Taylor &. Francis Group Boca Raton London New York CRC Press is an imprint

More information

The Predictive Data Mining Revolution in Scorecards:

The Predictive Data Mining Revolution in Scorecards: January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

More information

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank
