Weight of Evidence Module




Formula Guide

The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values of continuous and categorical predictor variables into discrete categories automatically, and to assign to each category a unique WoE value. The recoding is conducted so as to produce the largest differences between the recoded groups with respect to their WoE values. Additional constraints are observed while the program determines solutions for the optimal binning of predictors.

Optimal Coding of Predictors

Specifically, the goal of the algorithms implemented in the automated WoE module is to identify the groupings of predictor variables that result in the greatest differences in WoE between groups. For continuous variables, the automated WoE module identifies the best recoding to weight-of-evidence values. For categorical predictors or interactions between coded predictors, users can combine groups with similar observed WoE to create new coded predictors with continuous weight-of-evidence values.

Continuous Variables

For continuous predictors, a default coding is first derived using the Classification and Regression Trees (C&RT) algorithm. When the default coding has fewer than 20 groups, STATISTICA explicitly searches through all possible combinations of default groups to achieve the smallest number of groups with the greatest Information Value (IV). When the number of groups is greater than 20, STATISTICA uses the modified CHAID approach: a modification of the CHAID algorithm in which the change in WoE is used as the criterion instead of the customary one.
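The explicit search described above can be illustrated with a minimal sketch. It is not STATISTICA's implementation: the function names, the made-up counts, the restriction to merging neighbouring bins, and the choice to fix the number of groups in advance are all assumptions made for illustration. It also assumes every bin contains at least one Good and one Bad, so that the logarithm is defined.

```python
from itertools import combinations
from math import log


def information_value(groups):
    """IV = sum over groups of (rel. freq. Goods - rel. freq. Bads) * ln(ratio)."""
    total_good = sum(good for good, _ in groups)
    total_bad = sum(bad for _, bad in groups)
    iv = 0.0
    for good, bad in groups:
        rel_good = good / total_good
        rel_bad = bad / total_bad
        iv += (rel_good - rel_bad) * log(rel_good / rel_bad)
    return iv


def merge_adjacent(bins, cuts):
    """Merge neighbouring bins; each index in `cuts` starts a new group."""
    edges = [0, *cuts, len(bins)]
    merged = []
    for lo, hi in zip(edges, edges[1:]):
        merged.append((sum(g for g, _ in bins[lo:hi]),
                       sum(b for _, b in bins[lo:hi])))
    return merged


def best_grouping(bins, n_groups):
    """Try every way of cutting the ordered bins into n_groups; keep max IV."""
    candidates = (
        merge_adjacent(bins, cuts)
        for cuts in combinations(range(1, len(bins)), n_groups - 1)
    )
    return max(candidates, key=information_value)


# Hypothetical default bins as (Goods, Bads) counts, ordered by predictor value.
default_bins = [(90, 10), (80, 20), (50, 50), (40, 60)]
print(best_grouping(default_bins, 2))
```

Merging bins can never increase IV, so in this sketch the trade-off between fewer groups and higher IV would be explored by calling `best_grouping` for several values of `n_groups` and comparing the results.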
Three types of constrained WoE recoding solutions are provided, subject to their existence:

Monotone solutions, where the WoE values of all adjacent recoded groups (intervals) either always increase (positive monotone relationship of predictor intervals to WoE) or always decrease (negative monotone relationship of predictor intervals to WoE).

Quadratic solutions, where the relationship between the coded value ranges (intervals) and WoE can have a single reversal, so that the resulting function is either U-shaped or inverse-U-shaped.

Cubic solutions, where the relationship between the coded value ranges (intervals) and WoE can have two reversals, so that the resulting function is S-shaped.

Copyright 2013 Version 1

Two types of unconstrained WoE recoding solutions are provided:

Custom coding is based on the default binning scheme, using either C&RT or 10 groups of approximately equal size.

The no restrictions coding is based on the custom solution after running either the exhaustive search or the CHAID algorithm.

Note that the initial bins may be adjusted before the algorithm is run to ensure that each bin satisfies the user-specified minimum N and minimum Bad N parameters.

Categorical Variables

For categorical (discrete) predictors, the default (original) grouping is further refined using the modified CHAID approach. Two types of unconstrained WoE recoding solutions are provided:

Custom coding is based on the default binning of the group.

The no restrictions coding is based on the default categorization provided by the modified CHAID algorithm.

Note that the initial bins may be adjusted before the algorithm is run to ensure that each bin satisfies the user-specified minimum N and minimum Bad N parameters.

Interactions

For pairs of coded predictors, the modified CHAID approach is implemented using interaction coding of the two-way interaction table or user-defined coding.

Statistics

Chi-square

This statistic is distributed according to a chi-square distribution with degrees of freedom equal to the difference between the number of parameters under the alternative hypothesis and the number of parameters under the null hypothesis.

Cramer's V

$$V = \sqrt{\frac{\chi^2}{N \, k}}$$

$N$ = total number of observations
$k$ = minimum of (row dimension minus 1) and (column dimension minus 1)
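As a hedged illustration of Cramer's V, the sketch below computes it for a small bins-by-outcome table. It assumes the Pearson form of the chi-square statistic, which the text above does not confirm; the function names and the counts are hypothetical.

```python
from math import sqrt


def pearson_chi_square(table):
    """Pearson chi-square for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """V = sqrt(chi2 / (N * k)), with k = min(rows - 1, columns - 1)."""
    n = sum(sum(row) for row in table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return sqrt(pearson_chi_square(table) / (n * k))


# Hypothetical table: rows are bins, columns are (Goods, Bads).
table = [[10, 20], [20, 10]]
print(pearson_chi_square(table), cramers_v(table))
```

V ranges from 0 (no association between bin membership and outcome) to 1 (bin membership determines the outcome).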

F-test

$$F = \frac{\sum_{i=1}^{K} n_i \left(\bar{Y}_{i\cdot} - \bar{Y}\right)^2 / (K - 1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2 / (N - K)}$$

$\bar{Y}_{i\cdot}$ = sample mean of the $i$-th group
$n_i$ = number of observations in the $i$-th group
$\bar{Y}$ = overall mean of the data
$K$ = number of groups
$Y_{ij}$ = $j$-th observation in the $i$-th out of $K$ groups
$N$ = overall sample size

Gini

Computed from the number of Bads and the number of Goods, where $N$ = total number of observations.

Information Value (IV)

$$IV = \sum_{\text{groups}} \left[ \left(\text{Rel. Freq. of Goods} - \text{Rel. Freq. of Bads}\right) \times \ln\left(\frac{\text{Rel. Freq. of Goods}}{\text{Rel. Freq. of Bads}}\right) \right]$$

The IV of a predictor is related to the sum of the (absolute) values for WoE over all groups. Thus, it expresses the amount of diagnostic information of a predictor variable for separating the Goods from the Bads.

Kolmogorov-Smirnov (KS) test

For each Good observation, the predicted probability of Bad is computed as the relative frequency of Bad cases in the bin in which the observation is placed. This process is repeated for all Bad observations. The KS test is then computed with the Good/Bad indicator as the grouping variable and the predicted probability of Bad as the response.
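The KS procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: it uses the usual two-sample KS statistic (the largest gap between the cumulative distributions of the two groups), and the function name and counts are hypothetical.

```python
from fractions import Fraction


def ks_statistic(bins):
    """Two-sample KS statistic for Goods vs. Bads, scored by bin bad rate.

    `bins` is a list of (Goods, Bads) counts. Every observation in a bin
    receives the same predicted probability of Bad: the bin's bad rate.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)

    # Pool bins that share the same predicted probability of Bad, so the
    # cumulative distributions are only compared at distinct score values.
    pooled = {}
    for good, bad in bins:
        p_bad = Fraction(bad, good + bad)
        g, b = pooled.get(p_bad, (0, 0))
        pooled[p_bad] = (g + good, b + bad)

    cum_good = cum_bad = 0
    d = 0.0
    for p_bad in sorted(pooled):
        good, bad = pooled[p_bad]
        cum_good += good
        cum_bad += bad
        d = max(d, abs(cum_good / total_good - cum_bad / total_bad))
    return d


# Hypothetical (Goods, Bads) counts per bin.
print(ks_statistic([(90, 10), (80, 20), (50, 50), (40, 60)]))
```

Because scores are constant within a bin, the statistic depends only on the bin counts, not on the individual observations.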

The significance level (p) of the KS statistic is computed from an approximation formula.

Logit Transformation (Log Odds)

$$\text{Logit} = \ln\left(\frac{\text{\# of Good}}{\text{\# of Bad}}\right)$$

Mean

Somers' D

Separate formulas are used depending on whether ties are present or not present. (Note: Cases are sorted for the calculation of Somers' D by the relative frequency of Bad, that is, the estimated probability of Bad.)

$t$ = total number of pairs with different responses (Good/Bad)
$n_c$ = number of pairs of cases where the case with the lower ordered response value has a lower predicted mean score than the case with the higher ordered response value
$n_d$ = number of pairs of cases where the case with the lower ordered response value has a higher predicted mean score than the case with the higher ordered response value
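The pair counts defined above can be illustrated with a short sketch. It assumes the widely used form D = (n_c - n_d) / t, which may differ from the formulas in the original guide, and it assumes that Good is the lower ordered response, Bad is the higher one, and the score is the bad rate of the bin. The function name and counts are hypothetical.

```python
from fractions import Fraction


def somers_d(bins):
    """Somers' D from (Goods, Bads) counts per bin, assuming D = (n_c - n_d) / t.

    A pair consists of one Good case and one Bad case. The pair is concordant
    when the Good case has the lower score (bin bad rate), discordant when it
    has the higher score, and tied when both cases have the same score.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    t = total_good * total_bad  # all pairs with different responses

    n_c = n_d = 0
    for good_i, bad_i in bins:          # bin supplying the Good case
        score_i = Fraction(bad_i, good_i + bad_i)
        for good_j, bad_j in bins:      # bin supplying the Bad case
            score_j = Fraction(bad_j, good_j + bad_j)
            pairs = good_i * bad_j
            if score_i < score_j:
                n_c += pairs
            elif score_i > score_j:
                n_d += pairs
    return (n_c - n_d) / t


# Hypothetical (Goods, Bads) counts per bin.
print(somers_d([(90, 10), (80, 20), (50, 50), (40, 60)]))
```

Pairs drawn from the same bin are always tied, so they add to t but to neither n_c nor n_d, which is why coarse binning pulls the value toward 0.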

Weight of Evidence (WoE)

$$WoE = \left[ \ln\left(\frac{\text{Relative Frequency of Goods}}{\text{Relative Frequency of Bads}}\right) \right] \times 100$$

Notes

The value of WoE will be 0 if the odds (Relative Frequency of Goods / Relative Frequency of Bads) are equal to 1. If the Relative Frequency of Bads in a group is greater than the Relative Frequency of Goods, the odds ratio will be less than 1 and the WoE will be a negative number; if the Relative Frequency of Goods is greater than the Relative Frequency of Bads in a group, the WoE value will be a positive number.

The WoE recoding of predictors is particularly well suited for subsequent modeling using Logistic Regression. Specifically, logistic regression will fit a linear regression equation of predictors (or WoE-coded continuous predictors) to predict the logit-transformed binary Goods/Bads dependent or Y variable. Therefore, by using WoE-coded predictors in logistic regression, the predictors are all prepared and coded to the same WoE scale, and the parameters in the linear logistic regression equation can be directly compared, for example, when using the new modeling tools for Marginal Stepwise Logistic Regression.
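A minimal sketch of the WoE formula and the sign behaviour described in the notes follows. The function name and the counts are hypothetical, and each group is assumed to contain at least one Good and one Bad.

```python
from math import log


def woe_values(groups):
    """WoE per group: ln(rel. freq. of Goods / rel. freq. of Bads) * 100."""
    total_good = sum(good for good, _ in groups)
    total_bad = sum(bad for _, bad in groups)
    return [
        100 * log((good / total_good) / (bad / total_bad))
        for good, bad in groups
    ]


# Hypothetical (Goods, Bads) counts for three recoded groups.
groups = [(10, 10), (30, 10), (10, 30)]
for counts, woe in zip(groups, woe_values(groups)):
    print(counts, round(woe, 2))
```

In this example the first group holds equal shares of the Goods and the Bads and gets a WoE of 0, the second is Good-heavy and gets a positive WoE, and the third is Bad-heavy and gets a negative WoE.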