TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015
CONTENTS
Introduction
Random Forest Methodology
Transactional Data Mining Project
Conclusions
Q&A
INTRODUCTION
INTRODUCTION TEAM STRUCTURE
[Org chart: Antonio Horta-Osorio (CEO); divisions including Retail, Group Operations, Commercial Banking, Group Finance, Group Risk, Insurance, Customer Products & Marketing and Group Digital; teams including Group Financial Risk, Retail and Consumer Credit Risk, Analytics & Modelling, and Customer Analytics and Decisions.]

Analytics & Modelling
Has two group-level responsibilities: Model Validation, and Analytics & Model Development. We do the latter, with three focuses:
1) Build a centre of excellence for all analytics, linking up analytics teams across LBG and sharing best practices and knowledge around data and modelling techniques;
2) Act as a link to external entities (e.g. universities, bureaus, analytical software providers) to keep on top of the latest research;
3) Conduct proof-of-concept projects for new data and analytics solutions, overall to simplify and develop LBG's analytical capacity to be the best bank for customers.

Customer Analytics and Decisions
Responsible for the development and maintenance of the Retail and Consumer Finance risk and capital models, to support lending decisions in line with our risk appetite and capital management strategy.
RANDOM FOREST METHODOLOGY
RANDOM FORESTS OVERVIEW
[Diagram: a training sample with a 50% default rate is split by a decision tree into nodes with 90%, 20%, 10% and 40% default rates; many bootstrap samples each grow their own tree, and the votes are averaged into a probability of future default.]

Methodology cornerstones
Numerous iterations of a Decision Tree build
Each Decision Tree is different (trained on a different subset of the data)
Each Decision Tree can be unstable in itself, yet the Forest they form is stable and has been found to be one of the most accurate methods for prediction
Hundreds of decision trees form a forest
Average of votes: probability of future default
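A minimal sketch of the "forest as an average of tree votes" idea, using scikit-learn on synthetic data (the library, dataset and parameter values are illustrative assumptions, not the project's actual build):

```python
# Minimal sketch: a Random Forest's score is the average of its trees' votes
# (scikit-learn and the synthetic data below are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a credit dataset with a ~10% default rate.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)  # hundreds of trees
forest.fit(X, y)

# Each tree votes with its own probability; the forest averages those votes.
tree_average = np.mean([t.predict_proba(X[:5])[:, 1] for t in forest.estimators_], axis=0)
print(tree_average)
print(forest.predict_proba(X[:5])[:, 1])  # the forest's probability of default: same values
```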
DECISION TREES OVERVIEW
The aim is to decrease impurity at each split.
One measure is the Gini impurity criterion at each node: G = 1 − P(G)² − P(B)²
The decrease in Gini impurity shows how important a characteristic split is.

P(G)  P(B)  Gini impurity
1.0   0.0   0.00
0.9   0.1   0.18
0.8   0.2   0.32
0.7   0.3   0.42
0.6   0.4   0.48
0.5   0.5   0.50
0.4   0.6   0.48
0.3   0.7   0.42
0.2   0.8   0.32
0.1   0.9   0.18
0.0   1.0   0.00
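A small worked example of the Gini impurity formula and the weighted decrease achieved by a candidate split (the split proportions are illustrative; the impurity values mirror the table above):

```python
# Gini impurity for a binary node, G = 1 - P(G)^2 - P(B)^2, and the weighted
# decrease from a candidate split (proportions below are illustrative).
def gini(p_good):
    p_bad = 1.0 - p_good
    return 1.0 - p_good ** 2 - p_bad ** 2

parent = gini(0.5)                  # 0.50: the most impure node
left, right = gini(0.9), gini(0.8)  # 0.18 and 0.32, matching the table above

# Assume 60% of the parent's cases fall into the left child node.
decrease = parent - (0.6 * left + 0.4 * right)
print(parent, left, right, round(decrease, 3))  # decrease = 0.264
```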
DECISION TREES METHODOLOGY
Building Methodology
1. Split the population until a stop criterion is met:
   Too few observations in a tree node to split
   Perfectly pure node
   The split doesn't improve purity
   Reached a pre-set maximum tree depth / complexity
2. Evaluate the tree on an independent validation data set (not the same as the test data or hold-out sample)
3. Prune the tree back until performance is optimal on the independent validation set
4. Assign an outcome probability to each leaf node as per the occurrence of the outcome in that leaf node in the training data: the score

Scoring Methodology
1. Each new observation falls into one of the leaf nodes, where it gets the score that was assigned to that leaf at training (the outcome probability)

Split Search
Data driven and optimised
Considers all characteristics and observations in the dataset at each split
Considers essentially all possible splits of each characteristic
Selects the best local candidate at each split
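A minimal sketch of the build, prune and score flow above, using scikit-learn's DecisionTreeClassifier with cost-complexity pruning as a stand-in for the pruning step (the library, synthetic data and parameter grid are assumptions, not the project's tooling; ccp_alpha requires scikit-learn 0.22+):

```python
# Sketch of the build -> prune -> score flow with scikit-learn (illustrative only;
# ccp_alpha-based pruning stands in for classic post-pruning on a validation set).
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Grow the tree until stop criteria are met (minimum leaf size, maximum depth).
# 2.-3. Evaluate on the independent validation set and keep the pruning strength
#       that performs best there.
best_auc, best_tree = -1.0, None
for alpha in [0.0, 0.0001, 0.0005, 0.001, 0.005]:
    tree = DecisionTreeClassifier(min_samples_leaf=50, max_depth=10,
                                  ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, tree.predict_proba(X_valid)[:, 1])
    if auc > best_auc:
        best_auc, best_tree = auc, tree

# 4. Each leaf carries the training-sample outcome rate; scoring a new observation
#    returns the probability of the leaf it falls into.
print(best_auc, best_tree.predict_proba(X_valid[:5])[:, 1])
```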
DECISION TREES PROS AND CONS
RANDOM FORESTS METHODOLOGY
Building Methodology
1. Select k random observations (once per tree: a bootstrap sample)
2. Select m random characteristics and perform a split
3. Put all characteristics back into the bag
4. Select another m random characteristics and perform the next split
Building Methodology (cont'd)
5. Each tree is trained until a stop criterion is reached
6. No pruning is done

Fundamental Parameters
k: sample size (bag size)
m: number of features considered at each split
n: number of trees
d: maximum depth of a tree

Scoring Methodology
1. Each new observation falls into one of the leaf nodes and gets a score (as in a single decision tree)
2. Each tree produces a score for each observation and casts a vote (default / no default)
3. All votes are averaged, providing an outcome probability for the Forest as a whole: the final score
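As a hedged sketch, the fundamental parameters above map roughly onto scikit-learn's RandomForestClassifier arguments as follows (the project used R/Python Random Forest packages whose argument names may differ; max_samples requires scikit-learn 0.22+):

```python
# Hedged mapping of the slide's parameters onto scikit-learn's RandomForestClassifier
# (argument names in the packages actually used by the project may differ).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,     # n: number of trees
    max_features="sqrt",  # m: number of characteristics drawn at each split
    max_samples=0.8,      # k: bag size per tree (scikit-learn >= 0.22)
    max_depth=8,          # d: maximum tree depth; no pruning is done
    bootstrap=True,       # resample observations with replacement, once per tree
    n_jobs=-1,
    random_state=0,
)
# forest.fit(X_train, y_train)
# probability_of_default = forest.predict_proba(X_test)[:, 1]  # averaged votes
```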
RANDOM FORESTS PROS AND CONS
RANDOM FORESTS ANALOGY
Random Forest: an ensemble of Decision Trees
A Decision Tree is a supervised learning method, capable of observing associations in data
All decision trees are trained using the same methodology
All decision trees are trained on slightly different subsets of our data, developing an edge in scoring different types of observations
Empirical and theoretical evidence shows that the average of a lot of highly trained trees gives a more accurate and stable prediction than using a single model

Board of Medical Experts: an ensemble of Specialists
A Specialist is capable of learning from experience, books and practice
All specialists have the same brain structure and learning capabilities
All specialists specialise in different areas of a subject, with different degrees and different experiences
A board of specialists may provide a more balanced and accurate decision than a single generalist
RANDOM FORESTS GENERIC FRAMEWORK
Unsupervised Learning methods (no outcome)
  Clustering (k-means, k-medoids, hierarchical, density-based, etc.)
  Association Rules (Market Basket Analysis, Sequence Analysis)
Supervised Learning methods (binary or continuous outcome)
  Simple Classifiers (Logistic Regression, Decision Tree, Support Vector Machines, etc.)
  Ensemble Classifiers (Random Forest, Artificial Neural Networks, Gradient Boosting, etc.)
    Generally a combination of simple classifier units
    Type of simple classifiers (uniform, mixed)
    Diversification logic (subspaces of features, bags of observations, performance of other units, etc.)
    Voting logic (simple averaging, majority voting, confidence-enhanced voting, etc.)
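A tiny illustration of two of the voting logics above, applied to made-up tree outputs (the numbers are arbitrary):

```python
# Two voting logics on made-up probability votes from five trees for one observation.
import numpy as np

votes = np.array([0.9, 0.2, 0.7, 0.6, 0.8])  # each tree's probability of default

simple_average = votes.mean()          # averaging of votes -> 0.64
majority_share = (votes > 0.5).mean()  # share of trees voting "default" -> 0.8
print(simple_average, majority_share)
```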
TRANSACTIONAL DATA MINING PROJECT Csaba Főző, Lee Gregory, Sami Niemi, Olga Murumets
TRANSACTIONAL DATA MINING PROJECT OBJECTIVE
Problem statement
Most LBG credit risk models are built on traditional credit databases and techniques, such as account and customer characteristics and Logistic Regression based scorecards
The latest technical advancements in computation and Machine Learning are not used in most credit models, nor considered/evaluated

Objective: a proof of concept project, leveraging learnings from a Random Forest based Fraud Model, to assess the potential in:
1. Transactional data to enrich credit modelling datasets
2. Random Forests over Logistic Regression for estimating credit risk
TRANSACTIONAL DATA MINING SCOPE
In Scope
Applications to a credit product over 3 months
Same characteristics as used in the champion model, plus characteristics derived from transactions
Extraction and transformation of new data elements from transactional data sources (Transaction Categorization System)
The same bad definition will be used as in the champion model
Development and out-of-time test samples will be used for validation, aligned with the champion model time windows
Comparison against the live scorecard in place, on the basis of performance, transparency and stability
Data sourcing and preliminary feature selection in SQL (for transactional data) and in SAS (for customer data)
Data preparation and feature selection in SAS, R and Python
Model development using a Random Forest package in R and Python

Not in Scope
Implementation
Monitoring
Governance
TRANSACTIONAL DATA MINING PROJECT PHASES
[Phase overview: Regression with 12 chars → Phase 1a: RF on 12 chars → Phase 1b: RF on 1,500 chars → Phase 2: RF on 1,500 chars + transactional chars → Phase 3: RF on 1,500 chars + all transactional chars → Phase 4: RF on 1,500 chars + all transactional chars + complex derived chars]

1. Develop Random Forest methodology for Credit Risk: build a Random Forest model using the same training and test data as the champion model (12 characteristic variables). Extend the model build to include the full 1,500 credit risk characteristics considered during development of the champion model. Evaluate results on the hold-out test sample and compare with the champion model to evaluate Random Forests' predictive power.
2. Inclusion of Transactional Data: perform transaction data extraction to create characteristic variables. Extend the data model to better align with credit risk modelling. Extend the Random Forest and Logistic Regression model builds to include transaction data. Evaluate both training and test results across the two Random Forest models and the two Logistic Regression models (two models for each, based on two data sets: the original characteristic variables, and the original data plus transactional data).
3. Extension of Transactional Data from another credit product: use the data extraction methodology from Phase 2 to create characteristic variables based on further transactions. Extend the Random Forest methodology to use the champion model data and all transactions data. Evaluate performance across all existing models.
4. New Transaction Data Characteristics: design a methodology to apply Association Rule Mining to generate new characteristic variables from transaction data. Test these new variables using both the Random Forest and Logistic Regression methodologies (a sketch of such rule-derived characteristics follows below).
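As referenced in Phase 4, a hedged, dependency-light sketch of deriving rule-style characteristics from transaction categories; the categories, toy data, thresholds and helper function are purely illustrative assumptions, not the project's actual rule-mining code:

```python
# Hedged sketch of Association Rule Mining over transaction categories;
# the categories, data and thresholds are illustrative assumptions.
import pandas as pd

# One row per applicant, one boolean flag per transaction category seen pre-application.
baskets = pd.DataFrame({
    "gambling":      [True, False, True, False, True],
    "payday_loan":   [True, False, True, False, False],
    "salary_credit": [False, True, True, True, True],
})

def rule_stats(df, antecedent, consequent):
    """Support of {antecedent, consequent} and confidence of antecedent -> consequent."""
    support = (df[antecedent] & df[consequent]).mean()
    confidence = support / df[antecedent].mean()
    return support, confidence

support, confidence = rule_stats(baskets, "gambling", "payday_loan")
print(support, confidence)  # 0.4 and ~0.67 on this toy data

# Rules passing chosen support/confidence thresholds become new 0/1 characteristics
# ("applicant matches the rule") for the Random Forest / Logistic Regression builds.
```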
TRANSACTIONAL DATA MINING ISSUES
Computational restrictions
  Transactional data storage (wide datasets in volatile and permanent storage)
  Processing storage (memory, spool space)
  Characteristic transformations (SQL/SAS/R)
  Data transfer speed (different databases, local PC)
  System reliability
  Modelling tool (SAS, SAS EM, R)
  Processing speed (code optimization, indexing, partitioning, parallel processing)
Transparency and interpretability
  Measurement of variable importance while collinearity is present
  Impact of drivers on outcome probability
Random Forest methodology
  Different implementations and tools (SAS, R, Python, C#)
  Feature selection on 50K+ features, focussed on improving the POS model
  Stability (parameter optimization, independent validation)
TRANSACTIONAL DATA MINING TRANSACTIONAL DATA
Transaction types: 2 basic types; transaction purposes/channels; MCC code groups; beneficiaries
Time windows: weeks from time of application (0-6 weeks); months from time of application (0-12 months)
Measurements: volume; sum/avg/max/min value; avg/min/max balance

Combining transaction type with time interval and measurement yields a multitude of predictors: 40K+ characteristics
Infrastructure necessitated the reduction of these characteristics:
  Information Value macros cannot cope with this number of chars
  Chars populated very sparsely were removed (.01%)
  Chars with very low defaults were removed (125 and 1%)
  Marginal Information Value and Mean Decrease in Gini were used to select the final candidate chars
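As a hedged illustration of how such type × time-window × measurement characteristics might be generated (pandas, and all column names, windows and values below, are assumptions for the sketch, not the project's actual extraction code):

```python
# Hedged sketch of generating "transaction type x time window x measurement"
# characteristics with pandas; all column names, windows and values are illustrative.
import pandas as pd

txns = pd.DataFrame({
    "customer_id":              [1, 1, 1, 2, 2],
    "txn_type":                 ["debit", "debit", "credit", "debit", "credit"],
    "weeks_before_application": [1, 3, 5, 2, 10],
    "amount":                   [120.0, 45.0, 800.0, 60.0, 1500.0],
})

features = {}
for txn_type in ["debit", "credit"]:
    for weeks in [6, 12]:  # e.g. 0-6 weeks and a longer window before application
        window = txns[(txns["txn_type"] == txn_type)
                      & (txns["weeks_before_application"] <= weeks)]
        grouped = window.groupby("customer_id")["amount"]
        features[f"{txn_type}_{weeks}w_count"] = grouped.count()  # volume
        features[f"{txn_type}_{weeks}w_sum"] = grouped.sum()      # value
        features[f"{txn_type}_{weeks}w_max"] = grouped.max()      # value

# One row per customer; missing type/window combinations become 0.
characteristics = pd.DataFrame(features).fillna(0)
print(characteristics)
```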
TRANSACTIONAL DATA MINING MODEL PERFORMANCE

Characteristics | Model               | Characteristic set                                     | Hold-out Gini
12              | Logistic Regression | Champion characteristics                               | 51%
12              | Random Forest       | Champion characteristics                               | 54%
170             | Random Forest       | All credit risk chars (1.5K+ considered)               | 59%
100             | Random Forest       | Champion chars + transactional chars (15K+ considered) | 56%
10              | Logistic Regression | All credit risk & transactional chars                  | 53%
258             | Random Forest       | All credit risk & transactional chars                  | 60%
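For reference, a minimal sketch of how a hold-out Gini like those above can be computed from model scores, using the common relationship Gini = 2 × AUC − 1 (the labels and scores below are made up; this is not the project's actual validation code):

```python
# Gini is commonly computed from the ROC AUC as Gini = 2*AUC - 1;
# minimal sketch on made-up hold-out labels and model scores.
from sklearn.metrics import roc_auc_score

y_holdout = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores    = [0.1, 0.3, 0.8, 0.2, 0.25, 0.4, 0.1, 0.9, 0.3, 0.2]

gini = 2 * roc_auc_score(y_holdout, scores) - 1
print(f"Hold-out Gini: {gini:.0%}")
```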
TRANSACTIONAL DATA MINING BEST RANDOM FOREST MODEL
[Charts: validation Gini curve (cumulative bad % vs cumulative good %) for the Random Forest model against a random baseline, and the cumulative score distribution (cumulative good % and cumulative bad % by population % across score bands).]
CONCLUSION
CONCLUSION A CRUDE ESTIMATE OF PROFIT & LOSS IMPACT
Fixing predicted defaults: 4%+ extra lending potential
Fixing predicted non-defaults: 12%+ reduction in losses
CONCLUSION SUMMARY
Positive Results
Substantial improvement in credit risk model performance, which also translates into business growth / loss mitigation
Two successful applications of Random Forests in LBG (Fraud and Credit Risk)
More information can be used in modelling credit risk (# of predictors, interactions, non-linearity)
Model build is relatively fast
Importance of model drivers is available

Challenges
Implementation is difficult with current IT systems, though possible
Full model interpretation is less straightforward, though possible
Cultural change is always hard, though competitors give us some push
Q&A