TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015
CONTENTS
Introduction
Random Forest Methodology
Transactional Data Mining Project
Conclusions
Q&A
INTRODUCTION
INTRODUCTION TEAM STRUCTURE
[Org chart: Antonio Horta-Osorio (CEO); divisions including Retail, Group Operations, Commercial Banking, Group Finance, Group Risk, Insurance, Customer Products & Marketing and Group Digital; teams including Group Financial Risk, Retail and Consumer Credit Risk, Analytics & Modelling, and Customer Analytics and Decisions.]

Analytics & Modelling
Has two group-level responsibilities: Model Validation, and Analytics & Model Development. We do the latter, with three focuses:
1) Build a centre of excellence for all analytics, linking up analytics teams across LBG and sharing best practices and knowledge around data and modelling techniques;
2) Act as a link to external entities (e.g. universities, bureaus, analytical software providers) to keep on top of the latest research;
3) Conduct proof-of-concept projects for new data and analytics solutions, overall to simplify and develop LBG's analytical capacity to be the best bank for customers.

Customer Analytics and Decisions
Responsible for the development and maintenance of the Retail and Consumer Finance risk and capital models, to support lending decisions in line with our risk appetite and capital management strategy.
RANDOM FOREST METHODOLOGY
RANDOM FORESTS OVERVIEW
[Diagram: a training sample with a 50% default rate is split by a decision tree into nodes with 90%, 20%, 10% and 40% default rates; many bootstrap samples each grow their own tree, and the votes are averaged into a probability of future default.]

Methodology cornerstones
Numerous iterations of a Decision Tree build
Each Decision Tree is different (trained on a different subset of the data)
Each Decision Tree can be unstable in itself, yet the Forest they form is stable and has been found to be one of the most accurate methods for prediction
Hundreds of decision trees form a forest
Average of votes: probability of future default
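A minimal sketch of the "forest as an average of tree votes" idea, using scikit-learn on synthetic data (the library, dataset and parameter values are illustrative assumptions, not the project's actual build):

```python
# Minimal sketch: a Random Forest's score is the average of its trees' votes
# (scikit-learn and the synthetic data below are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a credit dataset with a ~10% default rate.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)  # hundreds of trees
forest.fit(X, y)

# Each tree votes with its own probability; the forest averages those votes.
tree_average = np.mean([t.predict_proba(X[:5])[:, 1] for t in forest.estimators_], axis=0)
print(tree_average)
print(forest.predict_proba(X[:5])[:, 1])  # the forest's probability of default: same values
```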
DECISION TREES OVERVIEW
The aim is to decrease impurity at each split.
One measure is the Gini impurity criterion at each node: G = 1 − P(G)² − P(B)²
The decrease in Gini impurity shows how important a characteristic split is.

P(G)  P(B)  Gini impurity
1.0   0.0   0.00
0.9   0.1   0.18
0.8   0.2   0.32
0.7   0.3   0.42
0.6   0.4   0.48
0.5   0.5   0.50
0.4   0.6   0.48
0.3   0.7   0.42
0.2   0.8   0.32
0.1   0.9   0.18
0.0   1.0   0.00
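A small worked example of the Gini impurity formula and the weighted decrease achieved by a candidate split (the split proportions are illustrative; the impurity values mirror the table above):

```python
# Gini impurity for a binary node, G = 1 - P(G)^2 - P(B)^2, and the weighted
# decrease from a candidate split (proportions below are illustrative).
def gini(p_good):
    p_bad = 1.0 - p_good
    return 1.0 - p_good ** 2 - p_bad ** 2

parent = gini(0.5)                  # 0.50: the most impure node
left, right = gini(0.9), gini(0.8)  # 0.18 and 0.32, matching the table above

# Assume 60% of the parent's cases fall into the left child node.
decrease = parent - (0.6 * left + 0.4 * right)
print(parent, left, right, round(decrease, 3))  # decrease = 0.264
```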
DECISION TREES METHODOLOGY
Building Methodology
1. Split the population until a stop criterion is met:
   Too few observations in a tree node to split
   Perfectly pure node
   The split doesn't improve purity
   Reached a pre-set maximum tree depth / complexity
2. Evaluate the tree on an independent validation data set (not the same as the test data or hold-out sample)
3. Prune the tree back until performance is optimal on the independent validation set
4. Assign an outcome probability to each leaf node as per the occurrence of the outcome in that leaf node in the training data: the score

Scoring Methodology
1. Each new observation falls into one of the leaf nodes, where it gets the score that was assigned to that leaf at training (the outcome probability)

Split Search
Data driven and optimised
Considers all characteristics and observations in the dataset at each split
Considers essentially all possible splits of each characteristic
Selects the best local candidate at each split
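A minimal sketch of the build, prune and score flow above, using scikit-learn's DecisionTreeClassifier with cost-complexity pruning as a stand-in for the pruning step (the library, synthetic data and parameter grid are assumptions, not the project's tooling; ccp_alpha requires scikit-learn 0.22+):

```python
# Sketch of the build -> prune -> score flow with scikit-learn (illustrative only;
# ccp_alpha-based pruning stands in for classic post-pruning on a validation set).
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Grow the tree until stop criteria are met (minimum leaf size, maximum depth).
# 2.-3. Evaluate on the independent validation set and keep the pruning strength
#       that performs best there.
best_auc, best_tree = -1.0, None
for alpha in [0.0, 0.0001, 0.0005, 0.001, 0.005]:
    tree = DecisionTreeClassifier(min_samples_leaf=50, max_depth=10,
                                  ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, tree.predict_proba(X_valid)[:, 1])
    if auc > best_auc:
        best_auc, best_tree = auc, tree

# 4. Each leaf carries the training-sample outcome rate; scoring a new observation
#    returns the probability of the leaf it falls into.
print(best_auc, best_tree.predict_proba(X_valid[:5])[:, 1])
```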
DECISION TREES PROS AND CONS
RANDOM FORESTS METHODOLOGY
Building Methodology
1. Select k random observations (once per tree: a bootstrap sample)
2. Select m random characteristics and perform a split
3. Put all characteristics back into the bag
4. Select another m random characteristics and perform the next split
Building Methodology (cont'd)
5. Each tree is trained until a stop criterion is reached
6. No pruning is done

Fundamental Parameters
k: sample size (bag size)
m: number of features considered at each split
n: number of trees
d: maximum depth of a tree

Scoring Methodology
1. Each new observation falls into one of the leaf nodes and gets a score (as in a single decision tree)
2. Each tree produces a score for each observation and casts a vote (default / no default)
3. All votes are averaged, providing an outcome probability for the Forest as a whole: the final score
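As a hedged sketch, the fundamental parameters above map roughly onto scikit-learn's RandomForestClassifier arguments as follows (the project used R/Python Random Forest packages whose argument names may differ; max_samples requires scikit-learn 0.22+):

```python
# Hedged mapping of the slide's parameters onto scikit-learn's RandomForestClassifier
# (argument names in the packages actually used by the project may differ).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,     # n: number of trees
    max_features="sqrt",  # m: number of characteristics drawn at each split
    max_samples=0.8,      # k: bag size per tree (scikit-learn >= 0.22)
    max_depth=8,          # d: maximum tree depth; no pruning is done
    bootstrap=True,       # resample observations with replacement, once per tree
    n_jobs=-1,
    random_state=0,
)
# forest.fit(X_train, y_train)
# probability_of_default = forest.predict_proba(X_test)[:, 1]  # averaged votes
```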
RANDOM FORESTS PROS AND CONS
RANDOM FORESTS ANALOGY
Random Forest: an ensemble of Decision Trees
A Decision Tree is a supervised learning method, capable of observing associations in data
All decision trees are trained using the same methodology
All decision trees are trained on slightly different subsets of our data, developing an edge in scoring different types of observations
Empirical and theoretical evidence shows that the average of a lot of highly trained trees gives a more accurate and stable prediction than using a single model

Board of Medical Experts: an ensemble of Specialists
A Specialist is capable of learning from experience, books and practice
All specialists have the same brain structure and learning capabilities
All specialists specialise in different areas of a subject, with different degrees and different experiences
A board of specialists may provide a more balanced and accurate decision than a single generalist
RANDOM FORESTS GENERIC FRAMEWORK
Unsupervised Learning methods (no outcome)
  Clustering (k-means, k-medoids, hierarchical, density-based, etc.)
  Association Rules (Market Basket Analysis, Sequence Analysis)
Supervised Learning methods (binary or continuous outcome)
  Simple Classifiers (Logistic Regression, Decision Tree, Support Vector Machines, etc.)
  Ensemble Classifiers (Random Forest, Artificial Neural Networks, Gradient Boosting, etc.)
    Generally a combination of simple classifier units
    Type of simple classifiers (uniform, mixed)
    Diversification logic (subspaces of features, bags of observations, performance of other units, etc.)
    Voting logic (simple averaging, majority voting, confidence-enhanced voting, etc.)
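A tiny illustration of two of the voting logics above, applied to made-up tree outputs (the numbers are arbitrary):

```python
# Two voting logics on made-up probability votes from five trees for one observation.
import numpy as np

votes = np.array([0.9, 0.2, 0.7, 0.6, 0.8])  # each tree's probability of default

simple_average = votes.mean()          # averaging of votes -> 0.64
majority_share = (votes > 0.5).mean()  # share of trees voting "default" -> 0.8
print(simple_average, majority_share)
```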
TRANSACTIONAL DATA MINING PROJECT Csaba Főző, Lee Gregory, Sami Niemi, Olga Murumets
TRANSACTIONAL DATA MINING PROJECT OBJECTIVE
Problem statement
Most LBG credit risk models are built on traditional credit databases and techniques, such as account and customer characteristics and Logistic Regression based scorecards
The latest technical advancements in computation and Machine Learning are not used in most credit models, nor considered/evaluated

Objective: a proof of concept project, leveraging learnings from a Random Forest based Fraud Model, to assess the potential in:
1. Transactional data to enrich credit modelling datasets
2. Random Forests over Logistic Regression for estimating credit risk
TRANSACTIONAL DATA MINING SCOPE
In Scope
Applications to a credit product over 3 months
Same characteristics as used in the champion model, plus characteristics derived from transactions
Extraction and transformation of new data elements from transactional data sources (Transaction Categorization System)
The same bad definition will be used as in the champion model
Development and out-of-time test samples will be used for validation, aligned with the champion model time windows
Comparison against the live scorecard in place, on the basis of performance, transparency and stability
Data sourcing and preliminary feature selection in SQL (for transactional data) and in SAS (for customer data)
Data preparation and feature selection in SAS, R and Python
Model development using a Random Forest package in R and Python

Not in Scope
Implementation
Monitoring
Governance
TRANSACTIONAL DATA MINING PROJECT PHASES
[Phase overview: Regression with 12 chars → Phase 1a: RF on 12 chars → Phase 1b: RF on 1,500 chars → Phase 2: RF on 1,500 chars + transactional chars → Phase 3: RF on 1,500 chars + all transactional chars → Phase 4: RF on 1,500 chars + all transactional chars + complex derived chars]

1. Develop Random Forest methodology for Credit Risk: build a Random Forest model using the same training and test data as the champion model (12 characteristic variables). Extend the model build to include the full 1,500 credit risk characteristics considered during development of the champion model. Evaluate results on the hold-out test sample and compare with the champion model to evaluate Random Forests' predictive power.
2. Inclusion of Transactional Data: perform transaction data extraction to create characteristic variables. Extend the data model to better align with credit risk modelling. Extend the Random Forest and Logistic Regression model builds to include transaction data. Evaluate both training and test results across the two Random Forest models and the two Logistic Regression models (two models for each, based on two data sets: the original characteristic variables, and the original data plus transactional data).
3. Extension of Transactional Data from another credit product: use the data extraction methodology from Phase 2 to create characteristic variables based on further transactions. Extend the Random Forest methodology to use the champion model data and all transactions data. Evaluate performance across all existing models.
4. New Transaction Data Characteristics: design a methodology to apply Association Rule Mining to generate new characteristic variables from transaction data. Test these new variables using both the Random Forest and Logistic Regression methodologies (a sketch of such rule-derived characteristics follows below).
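As referenced in Phase 4, a hedged, dependency-light sketch of deriving rule-style characteristics from transaction categories; the categories, toy data, thresholds and helper function are purely illustrative assumptions, not the project's actual rule-mining code:

```python
# Hedged sketch of Association Rule Mining over transaction categories;
# the categories, data and thresholds are illustrative assumptions.
import pandas as pd

# One row per applicant, one boolean flag per transaction category seen pre-application.
baskets = pd.DataFrame({
    "gambling":      [True, False, True, False, True],
    "payday_loan":   [True, False, True, False, False],
    "salary_credit": [False, True, True, True, True],
})

def rule_stats(df, antecedent, consequent):
    """Support of {antecedent, consequent} and confidence of antecedent -> consequent."""
    support = (df[antecedent] & df[consequent]).mean()
    confidence = support / df[antecedent].mean()
    return support, confidence

support, confidence = rule_stats(baskets, "gambling", "payday_loan")
print(support, confidence)  # 0.4 and ~0.67 on this toy data

# Rules passing chosen support/confidence thresholds become new 0/1 characteristics
# ("applicant matches the rule") for the Random Forest / Logistic Regression builds.
```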
TRANSACTIONAL DATA MINING ISSUES
Computational restrictions
  Transactional data storage (wide datasets in volatile and permanent storage)
  Processing storage (memory, spool space)
  Characteristic transformations (SQL/SAS/R)
  Data transfer speed (different databases, local PC)
  System reliability
  Modelling tool (SAS, SAS EM, R)
  Processing speed (code optimization, indexing, partitioning, parallel processing)
Transparency and interpretability
  Measurement of variable importance while collinearity is present
  Impact of drivers on outcome probability
Random Forest methodology
  Different implementations and tools (SAS, R, Python, C#)
  Feature selection on 50K+ features, focussed on improving the POS model
  Stability (parameter optimization, independent validation)
TRANSACTIONAL DATA MINING TRANSACTIONAL DATA
Transaction types: 2 basic types; transaction purposes/channels; MCC code groups; beneficiaries
Time windows: weeks from time of application (0-6 weeks); months from time of application (0-12 months)
Measurements: volume; sum/avg/max/min value; avg/min/max balance

Combining transaction type with time interval and measurement yields a multitude of predictors: 40K+ characteristics
Infrastructure necessitated the reduction of these characteristics:
  Information Value macros cannot cope with this number of chars
  Chars populated very sparsely were removed (.01%)
  Chars with very low defaults were removed (125 and 1%)
  Marginal Information Value and Mean Decrease in Gini were used to select the final candidate chars
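As a hedged illustration of how such type × time-window × measurement characteristics might be generated (pandas, and all column names, windows and values below, are assumptions for the sketch, not the project's actual extraction code):

```python
# Hedged sketch of generating "transaction type x time window x measurement"
# characteristics with pandas; all column names, windows and values are illustrative.
import pandas as pd

txns = pd.DataFrame({
    "customer_id":              [1, 1, 1, 2, 2],
    "txn_type":                 ["debit", "debit", "credit", "debit", "credit"],
    "weeks_before_application": [1, 3, 5, 2, 10],
    "amount":                   [120.0, 45.0, 800.0, 60.0, 1500.0],
})

features = {}
for txn_type in ["debit", "credit"]:
    for weeks in [6, 12]:  # e.g. 0-6 weeks and a longer window before application
        window = txns[(txns["txn_type"] == txn_type)
                      & (txns["weeks_before_application"] <= weeks)]
        grouped = window.groupby("customer_id")["amount"]
        features[f"{txn_type}_{weeks}w_count"] = grouped.count()  # volume
        features[f"{txn_type}_{weeks}w_sum"] = grouped.sum()      # value
        features[f"{txn_type}_{weeks}w_max"] = grouped.max()      # value

# One row per customer; missing type/window combinations become 0.
characteristics = pd.DataFrame(features).fillna(0)
print(characteristics)
```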
TRANSACTIONAL DATA MINING MODEL PERFORMANCE

Characteristics | Model               | Characteristic set                                     | Hold-out Gini
12              | Logistic Regression | Champion characteristics                               | 51%
12              | Random Forest       | Champion characteristics                               | 54%
170             | Random Forest       | All credit risk chars (1.5K+ considered)               | 59%
100             | Random Forest       | Champion chars + transactional chars (15K+ considered) | 56%
10              | Logistic Regression | All credit risk & transactional chars                  | 53%
258             | Random Forest       | All credit risk & transactional chars                  | 60%
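For reference, a minimal sketch of how a hold-out Gini like those above can be computed from model scores, using the common relationship Gini = 2 × AUC − 1 (the labels and scores below are made up; this is not the project's actual validation code):

```python
# Gini is commonly computed from the ROC AUC as Gini = 2*AUC - 1;
# minimal sketch on made-up hold-out labels and model scores.
from sklearn.metrics import roc_auc_score

y_holdout = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores    = [0.1, 0.3, 0.8, 0.2, 0.25, 0.4, 0.1, 0.9, 0.3, 0.2]

gini = 2 * roc_auc_score(y_holdout, scores) - 1
print(f"Hold-out Gini: {gini:.0%}")
```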
TRANSACTIONAL DATA MINING BEST RANDOM FOREST MODEL
[Charts: validation Gini curve (cumulative bad % vs cumulative good %) for the Random Forest model against a random baseline, and the cumulative score distribution (cumulative good % and cumulative bad % by population % across score bands).]
CONCLUSION
CONCLUSION A CRUDE ESTIMATE OF PROFIT & LOSS IMPACT
Fixing predicted defaults: 4%+ extra lending potential
Fixing predicted non-defaults: 12%+ reduction in losses
CONCLUSION SUMMARY
Positive Results
Substantial improvement in credit risk model performance, which also translates into business growth / loss mitigation
Two successful applications of Random Forests in LBG (Fraud and Credit Risk)
More information can be used in modelling credit risk (# of predictors, interactions, non-linearity)
Model build is relatively fast
Importance of model drivers is available

Challenges
Implementation is difficult with current IT systems, though possible
Full model interpretation is less straightforward, though possible
Cultural change is always hard, though competitors give us some push
Q&A