The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models




January 13, 2013
StatSoft White Paper

Summary

Predictive modeling methods, based on machine learning algorithms that optimize model accuracy, have revolutionized numerous industries, from manufacturing through marketing to sales. Credit risk scores in the financial industry, however, are still built mostly with traditional statistical models that derive scorecards, where additive factors such as age, income, etc., make up the final credit score and determine the final decision on creditworthiness. This paper describes successful approaches for using modern predictive modeling algorithms to build better credit risk models and scorecards without sacrificing the critical benefits of traditional scorecards, such as reason scores (for denial of credit).

How Traditional Scorecards Are Built

The process of building a credit scorecard usually follows three steps:

- Data preparation and predictor coding
- Model building using logistic regression, and model validation
- Final deployment

These steps are briefly described below. They are also described in some detail in the StatSoft Electronic Statistics Textbook, at http://www.statsoft.com/textbook/credit-scoring/.

A Common Scorecard

A final scorecard may have the following structure and look like this:

The most important columns in this scorecard table are the first, second, and last. The first column identifies each predictor variable (e.g., Value of Savings). The second column shows the specific interval boundaries, or the (combined) categories for categorical predictors; for example, Value of Savings < 140, Value of Savings between 140 and 700, and so on. The last column shows the score points to be added for applicants who fall into the respective interval. For example, for an applicant with a Value of Savings between 140 and 700, a score value of 95 is added to the total credit score; if the Amount of Credit is between 6,600 and 10,000, a score of 86 is added, and so on.

For each credit applicant, a total credit score can be computed by summing the scores for the respective categories that person falls into; critical cutoff values are then used to determine the final credit decision. The analytic workflow leading to the final scorecard follows a logical procedure to generate a table such as the one shown above.

Data Preparation, Predictor Coding

After the available data have been retrieved and cleaned (e.g., bad codes removed, missing data replaced, duplicates deleted), the first step is to recode the continuous predictor variables, such as age, average account balance, and so on, into value ranges.
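The bin-and-sum logic above can be sketched in a few lines of Python. Note that the bin edges and most point values below are invented for illustration; only the 95 and 86 point values come from the example in the text, and this is not a real scorecard.

```python
import bisect

scorecard = {
    # predictor: (upper bin edges, points per bin) -- illustrative values
    "value_of_savings": ([140, 700], [70, 95, 120]),
    "amount_of_credit": ([6600, 10000], [110, 86, 60]),
}

def credit_score(applicant):
    """Sum the points for the bin each predictor value falls into."""
    total = 0
    for predictor, value in applicant.items():
        edges, points = scorecard[predictor]
        # bisect_right finds the bin index: value < edges[0] -> bin 0, etc.
        total += points[bisect.bisect_right(edges, value)]
    return total

applicant = {"value_of_savings": 300, "amount_of_credit": 8000}
print(credit_score(applicant))  # → 181  (95 + 86)
```

A final decision would then compare this total against a chosen cutoff value.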

The STATISTICA solution platform provides different methods to accomplish this coding, ranging from manual/visual (graphical) methods to fully automated optimal binning that maximizes the information of the resulting binned variable for modeling credit default probability.

Model Building, Logistic Regression

The next step is to build a linear model that relates the coded categories to credit default risk. The statistical method used for this purpose is logistic regression. The details of this method are described in the StatSoft Electronic Statistics Textbook (www.statsoft.com/textbook/).
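The information content of a binning is commonly summarized via weight of evidence (WoE), the log ratio of the share of good accounts to the share of bad accounts in each bin. A minimal sketch with invented good/bad counts (this is an illustration of the measure, not STATISTICA's implementation):

```python
import math

# (good_count, bad_count) per bin of a hypothetical predictor;
# all counts are made up for this sketch
bins = {"< 140": (40, 30), "140-700": (120, 40), ">= 700": (140, 30)}

total_good = sum(g for g, _ in bins.values())   # 300
total_bad = sum(b for _, b in bins.values())    # 100

def woe(good, bad):
    """Weight of evidence: ln(share of goods / share of bads) in a bin."""
    return math.log((good / total_good) / (bad / total_bad))

for label, (g, b) in bins.items():
    print(f"{label:>8}: WoE = {woe(g, b):+.3f}")
```

Bins with WoE near zero carry little information about default; optimal binning seeks cut points that spread the WoE values apart.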

In short, a best linear equation is estimated and scaled so that, for example, the odds of credit default double for every 10-point decrease in the resulting predicted score.

Final Deployment

The final deployment of a scorecard can be accomplished manually by adding up the scores for the different categories of each variable describing a credit applicant. In larger banks, the scoring is typically done automatically, through systems such as STATISTICA Enterprise Server or STATISTICA Live Score for real-time scoring. Along with the final credit score, a final credit decision is typically returned by comparing the credit score to appropriately chosen cutoffs. Further, for rejected credit applications, the reasons for the rejection (e.g., those predictors whose respective scores were below the norm or average) can be provided to the applicant.
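The scaling described above (default odds doubling for every 10-point decrease in score) can be sketched as a simple log-odds transformation. The anchor values here (score 600 at 1:50 default odds) are assumptions for illustration, not values from the paper:

```python
import math

def scale_score(log_odds_default, base_score=600, base_odds=1 / 50, pdo=10):
    """Map model log-odds of default to a scaled score in which the
    default odds double for every `pdo`-point decrease in the score.
    base_score/base_odds are illustrative anchors, not paper values."""
    factor = pdo / math.log(2)                        # points per doubling of odds
    offset = base_score + factor * math.log(base_odds)
    return offset - factor * log_odds_default

# An applicant at 1:50 default odds scores 600; doubling the odds to
# 2:50 lowers the score by exactly pdo = 10 points.
print(scale_score(math.log(1 / 50)), scale_score(math.log(2 / 50)))
```

The same transformation, applied per predictor bin, yields the per-bin point values seen in the scorecard table.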

Advantages of Common Scorecards

The types of credit scorecards briefly described here have several convenient properties that explain their popularity. Final credit scores are the simple sum of individual factors determined by classifying applicants into specific bins. It is easy to see which specific factors were most beneficial for a favorable credit decision, or detrimental and led to the rejection of a credit application. In addition, because the underlying statistical model is linear, it is easy to understand the inner workings and behavior of the credit risk model: in most cases, simple summaries can be derived, such as "the higher the education of an applicant, the lower the credit default risk," and so on.

Predictive Modeling Approaches

So why are some progressive financial institutions and banks using other, perhaps more complex, approaches to credit scoring? The answer is simple: predictive modeling methods often yield more accurate predictions of risk. With more accurate predictions of risk, more credit can be extended to more applicants while reducing, or at least not increasing, the overall default rate. Thus, the accuracy of the model in predicting credit default has a direct impact on profits.

Accuracy of Scorecards, and Profitability

Shown below is a summary evaluation of different predictive modeling scorecards compared to the existing traditional scorecard for a large StatSoft customer.

This graph (based on actual data) shows the enormous impact that a high-quality, accurate model of credit risk can have on the bottom line. Under the baseline model, approving approximately 55% of applications for credit resulted in a little over 10% of eventual credit defaults; in the graph, the profitability of that status-quo model is pegged at 100% profit. Moving to the right, different cutoff values for credit approval are shown for a powerful ensemble predictive model built via STATISTICA. Depending on where the approval threshold is set on the predicted probability of default, more credit can be approved (e.g., for almost 85% of all applicants in the High Approval Rate model) while either decreasing or maintaining a constant default rate. The effect on profitability is dramatic: over 80% greater profit resulted for all cutoff values considered. In short, and as expected, the more accurately a creditor can predict the risk of credit default, the more selectively that creditor can extend credit to only those who will pay back the money with all interest as agreed.

How Should Accurate Risk Models Be Built?

In short, the answer is: by using algorithms that can find and approximate any systematic relationship between predictor variables and credit default risk, regardless of whether those relationships are nonlinear, highly interactive (e.g., different models should be applied to different age groups), and so on. Such algorithms are called general approximators, because they can approximate any relationship, however complex, and leverage those relationships to make more accurate predictions.

Ensemble Models, Voting Models

The algorithms that have proven very successful, and that are refined in the STATISTICA platform, are those that apply voting (averaging) across predictions from different tree models, or that boost those models.
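The cutoff trade-off described above can be illustrated with a toy sweep over predicted default probabilities. All numbers below are invented; they are not the customer data behind the graph:

```python
# Each applicant is (predicted default probability, actually defaulted 0/1).
# Data are made up purely to illustrate the cutoff mechanics.
applicants = [
    (0.05, 0), (0.08, 0), (0.12, 0), (0.15, 1), (0.22, 0),
    (0.30, 0), (0.35, 1), (0.55, 1), (0.70, 1), (0.90, 1),
]

def sweep(applicants, cutoffs):
    """For each cutoff, approve everyone below it and report the
    approval rate and the realized default rate among the approved."""
    rows = []
    n = len(applicants)
    for c in cutoffs:
        approved = [(p, d) for p, d in applicants if p < c]
        approval_rate = len(approved) / n
        default_rate = (sum(d for _, d in approved) / len(approved)
                        if approved else 0.0)
        rows.append((c, approval_rate, default_rate))
    return rows

for cutoff, appr, dflt in sweep(applicants, [0.2, 0.4, 0.6]):
    print(f"cutoff={cutoff:.1f}  approve {appr:.0%}  default rate {dflt:.0%}")
```

A more accurate model concentrates the eventual defaulters at high predicted probabilities, so the same approval rate comes with a lower realized default rate, which is exactly the effect the graph quantifies.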
Boosting is the process of applying a learning algorithm repeatedly, each time focusing on the poorly predicted observations in the training sample, so that all homogeneous subgroups, groups of different and difficult-to-predict applicants, outliers, etc., are represented correctly. For example, in a recent international competition based on a credit risk scoring problem (the 2010 PAKDD Data Mining Competition), the winning (most accurate) solution was computed using the STATISTICA Boosting Trees (stochastic gradient boosting) algorithm. The paper describing the winning solution is available from StatSoft.

The details of effective predictive modeling algorithms are often complex. However, efficient implementations of these algorithms on capable hardware, such as those in the STATISTICA Credit Risk solutions, will solve even difficult problems against large databases in very little time.
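STATISTICA Boosting Trees implements stochastic gradient boosting; the core re-weighting idea can be illustrated with a minimal AdaBoost-style sketch on one-variable decision stumps. This is an illustration of the principle, not StatSoft code, and the toy data are invented:

```python
import math

def stump_predict(x, threshold, polarity):
    """A weak learner: predict +1 or -1 from a single split point."""
    return polarity if x > threshold else -polarity

def fit_stump(xs, ys, weights):
    """Pick the stump with the lowest weighted error."""
    best = (0.0, 1, float("inf"))          # (threshold, polarity, error)
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if err < best[2]:
                best = (threshold, polarity, err)
    return best

def boost(xs, ys, rounds=3):
    """Each round up-weights the observations the previous learners got
    wrong, then fits the next weak learner to the re-weighted sample."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []                           # (vote_weight, threshold, polarity)
    for _ in range(rounds):
        threshold, polarity, err = fit_stump(xs, ys, weights)
        err = max(err, 1e-10)               # avoid log(0) on a perfect fit
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learners in the ensemble."""
    vote = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if vote > 0 else -1

xs = [1, 2, 3, 4, 5, 6]                     # toy predictor values
ys = [-1, -1, -1, 1, 1, 1]                  # -1 = good account, 1 = default
model = boost(xs, ys)
print([predict(model, x) for x in xs])      # → [-1, -1, -1, 1, 1, 1]
```

Gradient boosting generalizes this scheme by fitting each new tree to the residual errors of the current ensemble rather than to a re-weighted sample.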

The Dreaded Black Box, and How to Open It

There is little controversy around the notion that predictive modeling techniques, as implemented in the STATISTICA solutions, will outperform traditional statistical modeling approaches in accuracy in most cases. However, a common criticism raised in this context is that these models are black boxes, making it nearly impossible to provide insight into the reasons for specific predictions, or into the general mechanisms and relationships that drive key outcomes such as credit default. On the surface, these concerns are valid, and perhaps sufficiently important to continue with the status quo and common scorecard modeling approaches. However, there are good solutions to this lack of interpretability and reason scores, and all of them are easily implemented through the STATISTICA solutions.

What-if (scenario) analysis. In the most general terms, for every prediction made through a black-box predictive model, the computer can run comprehensive what-if analyses to evaluate what the predicted credit default probability would have been if a different value had been observed for each predictor. For example, if an applicant for credit is 18 years old and credit is denied, the computer can quickly evaluate what the credit decision would have been had the applicant been 20 or 30 years old. Applying this basic approach systematically to every prediction, or group of predictions, provides detailed insight into the inner workings of even the most complex models, and into the specific predictors and interactions responsible for a specific credit decision.

Two-stage modeling. There are other, perhaps simpler, approaches that can also be used to arrive at interpretable models and predictions.
One can use the predictions of the most accurate black-box model as a starting point and drill down into the model to identify the specific interactions and predictor bins that drove the risk prediction (for example, using what-if scenarios or the various standard predictor-importance results available through STATISTICA). With that knowledge, better and more comprehensive common scorecards can then be built that include the relevant predictor interactions, pre-scoring segmentation of homogeneous groups of applicants, specific binning solutions, etc., identified in the first predictive modeling step.

Deploying Predictive Models

Risk models derived via predictive modeling tools are best deployed in an automated scoring solution such as STATISTICA Enterprise Server or STATISTICA Live Score. In fact, the STATISTICA Enterprise Decisioning Platform will manage the entire modeling lifecycle, from model development to single-click deployment for real-time or batch scoring, with version control and audit logs for regulatory compliance.
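The what-if (scenario) analysis described earlier works against any fitted model that exposes a scoring function: swap one predictor value and re-score. A self-contained sketch, where the scoring function, predictor names, and coefficients are all hypothetical stand-ins for a real black-box model:

```python
def predict_default_prob(applicant):
    """Stand-in for a fitted black-box model: a made-up scoring rule
    over two hypothetical predictors (not a real credit model)."""
    age, savings = applicant["age"], applicant["savings"]
    raw = 0.9 - 0.015 * min(age, 40) - 0.0004 * min(savings, 1000)
    return max(0.01, min(0.99, raw))        # clamp to a valid probability

def what_if(applicant, predictor, values, score=predict_default_prob):
    """Re-score the same applicant with one predictor swapped out."""
    results = {}
    for v in values:
        probe = dict(applicant, **{predictor: v})   # copy with one change
        results[v] = score(probe)
    return results

# The 18-year-old example from the text: how would the predicted
# default probability change at ages 20 and 30?
applicant = {"age": 18, "savings": 300}
print(what_if(applicant, "age", [18, 20, 30]))
```

Running such probes over every predictor, and over pairs of predictors, surfaces the interactions that a two-stage scorecard rebuild can then encode explicitly.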

Conclusion

Risk modeling is an area where the application of advanced predictive modeling tools implemented in the STATISTICA solution platforms can produce significant and rapid return on investment. The STATISTICA platform includes algorithms such as boosted trees, tree nets (random forests), and automatic neural network ensembles, to mention only a few of the very powerful and proven-to-be-effective implementations available. Combined with the ease and comprehensiveness of the features for model lifecycle management and integration with business rules and logic available in the STATISTICA Decisioning Platform, there is little reason not to adopt these methods, realize the revenue associated with more accurate risk prediction, and become more competitive in the complex financial services domain.

Further Reading

Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16, 199-231.
Seni, G., & Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Synthesis Lectures on Data Mining and Knowledge Discovery.
Harańczyk, G. (2010). Predictive Model Optimization by Searching Parameter Space. Winning entry to the PAKDD 2010 Credit Risk Modeling Competition. StatSoft Inc.
Hill, T., & Lewicki, P. (2006). Statistics: Methods and Applications. StatSoft. (Also available in electronic form at http://www.statsoft.com/textbook/.)
Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of Statistical Analysis and Data Mining Applications. Elsevier.
The Wall Street Journal MarketWatch (2012). Danske Bank Implemented STATISTICA to Maximize Efficiency and Minimize Operational Risk. http://www.statsoft.com/portals/0/downloads/danske-bank-marketwatch.pdf

For More Information

StatSoft, Inc.
2300 East 14th Street
Tulsa, OK 74104 USA
www.statsoft.com

Copyright StatSoft 2013