BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
Xavier Conort, xavier.conort@gear-analytics.com
Session Number: TBR14
Insurance has always been a data business
The industry has successfully used data in pricing thanks to:
- Decades of experience
- Highly trained resources: actuaries!
- Increasing computing power
More recently, innovative players in mature markets have started to use data in other areas such as marketing, fraud detection, claims management and service-provider management.
New users of predictive modelling:
- Internet
- Retail
- Telecommunications
- Accommodation
- Aviation and transport
Challenges faced:
- Shorter experience (most started in the last 10 years)
- No actuaries
- Data with a large number of rows, thousands of variables, and text
Solution found: Machine Learning. Traditional regression techniques (OLS or GLMs) were replaced by more versatile non-parametric techniques, and/or human input was replaced by tuning parameters optimized by the machine.
Spam detection, or how to deal with thousands of variables
Email text is converted into a document-term matrix with thousands of columns. One simple way to detect spam is to replace GLMs with regularized GLMs, which are GLMs where a penalty parameter is introduced in the loss function. This automatically restricts the feature space, whereas in traditional GLMs the selection of the most relevant predictors is performed manually.
The penalty effect in a regularized GLM
When fitting regularized GLMs, you introduce a penalty into the loss function (the deviance) to be minimized. The penalty is defined as
  P(β) = λ · [ (1 − α) · Σj βj² / 2 + α · Σj |βj| ]
where α = 1 gives the lasso penalty, and α = 0 the ridge penalty.
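As an illustration (not in the original deck), a minimal sketch of a lasso-penalized GLM for spam detection in R with the glmnet package; the document-term matrix x and the 0/1 spam labels y are assumed inputs:

# Sketch: lasso-penalized logistic regression for spam detection.
# 'x' (document-term matrix) and 'y' (0/1 spam labels) are assumed
# inputs; the penalty selects the relevant terms automatically.
library(glmnet)
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1: lasso
coef(fit, s = "lambda.min")                   # surviving (non-zero) terms
pred <- predict(fit, newx = x, s = "lambda.min", type = "response")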
Analytics which are now part of our day-to-day vocabulary
Analytics which make us buy more
Amazon revolutionized electronic commerce with "People who viewed this item also viewed...". By suggesting things customers are likely to want, Amazon turns single purchases into two or more. Netflix does something similar in their online movie business.
Analytics which help us connect with others
LinkedIn uses "People You May Know" and "Groups You May Like" to help you connect with others.
Analytics which remember our closest ones
From the free Machine Learning course @ ml-class.org by Andrew Ng
High value from data is yet to be captured
Two types of contributors to the predictive modelling field
From "Statistical Modeling: The Two Cultures" by Breiman (2001)

The Data Modelling Culture:
- Techniques: OLS, GLMs, GAMs, GLMMs, Cox
- Model validation: goodness-of-fit tests and residual examination
- Provides more insight into how nature is associating the response variable to the input variables. But if the model is a poor emulation of nature, the conclusions based on this insight may be wrong!

The Machine Learning Culture:
- The mechanism linking y to x is treated as unknown
- Techniques: regularized GLMs, neural nets, decision trees, ...
- Model validation: measured by predictive accuracy
- Sometimes considered black boxes (unfairly for some techniques), they often produce higher predictive power with less modelling effort

"All models are wrong, but some are useful." (George Box)
Actuarial modelling: a hybrid and practical approach
When fitting models, actuaries have two goals in mind: prediction and information. We use GLMs to keep things simple, but when necessary we have learnt to:
- Use GAMs and GEEs to relax some of the GLM assumptions (linearity, independence)
- Not rely fully on GLM goodness-of-fit tests, and test predictive power on cross-validation datasets
- Use GLMMs to evaluate credibility estimates for categories with little statistical material
- Use PCA or regularized regression to handle data with high dimensionality
- Integrate insights from Machine Learning techniques to improve the predictive power of GLMs
Interactions: the ugly side of GLMs
Two risk factors are said to interact when the effect of one factor varies depending on the level of the other factor:
- Latitude and longitude typically interact
- Gender and age are also known to interact in longevity or motor insurance
Unfortunately, GLMs do not automatically account for interactions, although they can incorporate them. How do smart actuaries detect potential interactions?
- Luck, intuition, descriptive analysis, experience and market practices help
- Machine Learning techniques based on decision trees
Decision trees are known to detect interactions
[Tree diagram: patients are classified as high or low risk through successive splits, "Is BP > 91?", "Is age <= 62.5?" and "Is ST present?", each node showing the proportions of high- and low-risk patients]
...but decision trees usually have lower predictive power than GLMs.
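As an illustration (not in the original deck), a tree like the one above can be grown in R with the rpart package; the data frame heart, with outcome risk and predictors bp, age and st, is an assumption:

# Sketch: growing a small classification tree with rpart.
# 'heart' (data frame with a factor outcome 'risk' and predictors
# bp, age, st) is an assumed dataset; splits are found automatically.
library(rpart)
tree <- rpart(risk ~ bp + age + st, data = heart, method = "class",
              control = rpart.control(maxdepth = 3))
plot(tree); text(tree)   # draw the fitted splits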
Random Forests will provide you with higher predictive power but less interpretability
A Random Forest is a collection of weak and independent decision trees, such that each tree has been trained on a bootstrapped dataset with a random selection of predictors (think of the wisdom of crowds).
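A minimal sketch in R with the randomForest package (not in the original deck), reusing the assumed heart data from the tree example:

# Sketch: a Random Forest; each of the 500 trees sees a bootstrap
# sample of rows and, at each split, a random subset of mtry predictors.
library(randomForest)
rf <- randomForest(risk ~ ., data = heart, ntree = 500, mtry = 2)
varImpPlot(rf)   # which predictors the crowd of trees relies on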
Boosted Regression Trees, or learn step by step, slowly
BRTs (also called Gradient Boosting Machines) combine boosting and decision-tree techniques:
- The boosting algorithm gradually increases the emphasis on poorly modelled observations. It minimizes a loss function (the deviance, as in GLMs) by adding, at each step, a new simple tree whose focus is only on the residuals
- The contribution of each tree is shrunk by a very small learning rate (< 1) to give more stable fitted values for the final model
- To further improve predictive performance, the process uses random subsets of the data to fit each new tree (bagging)
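To make the loop concrete, here is a minimal hand-rolled sketch in R under squared-error loss (an illustration only, not the production gbm implementation; the predictor data frame x and numeric response y are assumed inputs):

# Sketch of the boosting loop: each small rpart tree is fitted to the
# current residuals, on a random half of the data, and its contribution
# is shrunk by the learning rate. Use gbm/dismo in practice.
library(rpart)
boost <- function(x, y, n_trees = 500, learn_rate = 0.05, bag_frac = 0.5) {
  pred <- rep(mean(y), length(y))                 # start from the overall mean
  trees <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    d <- data.frame(x, r = y - pred)              # current residuals as target
    idx <- sample(nrow(d), floor(bag_frac * nrow(d)))   # bagging
    trees[[m]] <- rpart(r ~ ., data = d[idx, ],
                        control = rpart.control(maxdepth = 3))  # small tree
    pred <- pred + learn_rate * predict(trees[[m]], newdata = d) # shrunk update
  }
  list(init = mean(y), trees = trees, learn_rate = learn_rate)
}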
The Gradient Boosting Machine algorithm
Developed by Friedman (2001), who extended the work of Friedman, Hastie and Tibshirani (2000): three professors from Stanford who are also the developers of regularized GLMs, GAMs and many others!
Why do I love BRTs?
- BRTs can be fitted to a variety of response types (Gaussian, Poisson, Binomial)
- The BRT best fit (interactions included) is detected automatically by the machine
- BRTs learn non-linear functions without the need to specify them
- BRT outputs have some GLM flavour and provide insight into the relationship between the response and the predictors
- BRTs require little data cleaning, thanks to their ability to accommodate missing values and their immunity to monotone transformations of predictors, extreme outliers and irrelevant predictors
Links to BRT areas of application
- Orange's churn, up- and cross-sell at the 2009 KDD Cup: http://jmlr.csail.mit.edu/proceedings/papers/v7/miller09/miller09.pdf
- Yahoo! Learning to Rank Challenge: http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf
- Patients most likely to be admitted to hospital (Heritage Health Prize): only available to Kaggle's competitors
- Fraud detection: http://www.datamines.com/resources/papers/fraud%20comparison.pdf
- Fish species richness: http://www.stanford.edu/~hastie/papers/leathwick%20et%20al%202006%20MEPS%20.pdf
- Motor insurance: http://dl.acm.org/citation.cfm?id=2064113.2064457
A practical example
Objective: model the relationship between settlement delay, injury severity, legal representation and the finalized claim amount.

Variable                      | Description
------------------------------|-------------------------------------------------
Settled amount                | $10 to $4,490,000
5 injury codes (inj1 to inj5) | 1 (no injury), 2, 3, 4, 5, 6 (fatal), 9 (not recorded)
Accident month                | Coded 1 (7/89) through to 120 (6/99)
Reporting month               | Coded as accident month
Finalization month            | Coded as accident month
Operation time                | The settlement delay percentile rank (0-100)
Legal representation          | 0 (no), 1 (yes)

Data: 22,036 settled personal injury insurance claims from accidents occurring from 7/1989 through to 1/1999.
Why this dataset?
- It is publicly available: it was featured in the book by de Jong & Heller (GLMs for Insurance Data) and can be downloaded at http://www.afas.mq.edu.au/research/books/glms_for_insurance_data/data_sets
- It is insurance related, with highly skewed claim sizes
- Interactions are present
Software used
The entire analysis is done in R. R is a free software environment which provides a wide variety of statistical and graphical techniques. It has gained exponential popularity in both the business and academic worlds. You can download it for free @ www.r-project.org/
Two add-on packages (also freely available) were used:
- To train GAMs: Wood's package mgcv
- To train BRTs: dismo, a package which facilitates the use of BRTs in R. It calls Ridgeway's package gbm, which could also have been used to train the model but provides fewer diagnostic reports.
Assessing model performance
We assess model predictive performance using independent data (cross-validation):
- Partitioning the data into separate training and testing subsets: claims settled before 98 / claims settled in 98 and 99
- 5-fold cross-validation of the training set: randomly divide the training data into 5 subsets and make 5 different training sets, each comprising a unique combination of 4 subsets
- The (Gamma) deviance metric, which measures how much the predicted values μi differ from the observations yi for skewed data:
    D = 2 · Σi [ (yi − μi)/μi − ln(yi/μi) ]
  (the deviance is also the loss function minimized when fitting GLMs)
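As a sketch (not in the original deck), the holdout metric can be computed with a small helper in R; the helper is hypothetical and assumes the figures quoted later are per-observation averages:

# Sketch: mean Gamma deviance between observations y and predictions mu,
# the metric quoted on the results slides (hypothetical helper).
gamma_deviance <- function(y, mu) {
  2 * mean((y - mu) / mu - log(y / mu))
}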
A few data manipulations
To convert the injury codes into ordinal factors, we recoded injury level 9 into 0 and set missing values (for inj2 to inj5) to 0.
Other transformations:
- We capped inj2 to inj5 at 3 (too little statistical material for higher values)
- We computed the reporting delay and the log of the claim amounts
We split the data into a training set and a testing set: claims settled before 98 / claims settled in 98 and 99. We also formed 5 random subsets of the training set to perform 5-fold cross-validation.
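A sketch of these steps in R; the data frame claims and its column names (inj1...inj5, acc_mth, rep_mth, fin_mth, total), as well as month 102 (12/97) as the "settled before 98" cut-off, are assumptions based on the variable list above:

# Sketch of the data preparation; 'claims' and its column names are
# assumed, as is the cut-off month 102 derived from the coding 1 = 7/89.
for (v in paste0("inj", 1:5)) {
  x <- claims[[v]]
  x[is.na(x) | x == 9] <- 0                  # recode 9 / missing to 0
  if (v != "inj1") x <- pmin(x, 3)           # cap sparse high levels at 3
  claims[[v]] <- x
}
claims$rep_delay <- claims$rep_mth - claims$acc_mth   # reporting delay
claims$log_total <- log(claims$total)
training <- subset(claims, fin_mth <= 102)            # settled before 98
testing  <- subset(claims, fin_mth >  102)            # settled in 98/99
training$fold <- sample(rep(1:5, length.out = nrow(training)))  # 5 CV folds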
GLM trained

GLM <- glm(total ~ op_time + factor(legrep) + rep_delay +
             factor(inj1) + factor(inj2) + factor(inj3) +
             factor(inj4) + factor(inj5),
           family = Gamma(link = "log"), data = training)

A very simple GLM:
- No non-linear relationship, except the one introduced by the log link function
- No interactions
BRT trained

library(dismo)
BRT <- gbm.step(data = training,
                gbm.x = c(2:7, 11, 14),  # same predictors as for the GLM
                gbm.y = 12,              # log of claim amounts
                family = "gaussian",
                tree.complexity = 5,     # size of individual trees (usually 3 to 5)
                learning.rate = 0.005)   # lower (slower) is better but computationally
                                         # expensive; usually between 0.005 and 0.1

Note that a 3rd tuning parameter is sometimes required: the number of trees. In our case, the gbm.step routine computes the optimal number of trees (2900) automatically using 10-fold cross-validation. The output also reports the predictors' influence and a ranking of 2-way interactions.
BRT's partial dependence plots
[Figure: partial dependence plots, one per predictor]
The non-linear relationships are detected automatically. Partial dependence plots represent the effect of each predictor after accounting for the effects of the other predictors.
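Plots like these can be drawn with dismo's plotting helper; a sketch, assuming the BRT object fitted above:

# Sketch: draw the partial dependence plot of each predictor used by
# the fitted BRT ('BRT' is the gbm.step object from the previous slide).
gbm.plot(BRT, n.plots = 8, write.title = FALSE)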
Plot of interactions fitted by BRT
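dismo can also quantify and draw pairwise interactions; a sketch, assuming the same BRT object and that the first two predictors are the pair of interest:

# Sketch: rank the 2-way interactions found by the BRT and draw a
# perspective plot for one pair of predictors (dismo helpers).
int <- gbm.interactions(BRT)
int$rank.list                    # strongest pairwise interactions
gbm.perspec(BRT, x = 1, y = 2)   # joint effect of two predictors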
GLM trained with BRT's insight

GLM2 <- glm(total ~ (op_time + factor(legrep) + fast)^2 +
              op_time*factor(legrep)*fast + rep_delay +
              factor(inj1) + factor(inj2) + factor(inj3) +
              factor(inj4) + factor(inj5),
            family = Gamma(link = "log"), data = training)

- A non-linear relationship and an interaction are introduced (as did de Jong and Heller) to model the non-linear effect of op_time and its interaction with legrep
- We identified fast claims settlement (op_time <= 5) with a dummy variable, fast
Incorporate interactions & non-linear relationships with GAMs
Generalized Additive Models (GAMs) use the basic ideas of Generalized Linear Models. In GLMs, g(μ) is a linear combination of predictors:
  g(μ) = g(E[Y]) = α + β1·X1 + β2·X2 + ... + βN·XN,  with Y|{X} ~ exponential family
In GAMs, the linear predictor can also contain one or more smooth functions of covariates:
  g(μ) = βX + f1(X1) + f2(X2) + f3(X3, X4) + ...
To represent the functions f, the use of cubic splines is common. To avoid over-fitting, a penalized likelihood is maximized (equivalently, a penalized deviance is minimized). The optimal penalty parameter is obtained automatically via (generalized) cross-validation.
GAM trained with BRT insight

GAM <- gam(total ~ (op_time + factor(legrep) + fast)^2 +
             op_time*factor(legrep)*fast +
             te(op_time, rep_delay, bs = "cs") +
             factor(inj1) + factor(inj2) + factor(inj3) +
             factor(inj4) + factor(inj5),
           family = Gamma(link = "log"), data = training, gamma = 1.4)

The GAM framework allows us to incorporate an additional interaction, between op_time and rep_delay, which could not have been easily introduced in the GLM framework.
Transformation of BRT predictions
E(Y) ≠ exp(E(log Y)): exp(BRT's predictions) provides us only with the median of the claim size as a function of the predictors, not its mean. To relate the median to the mean and get predictions of the mean, we trained a GAM to model the claim size with:
- the BRT fitted values as the predictor
- a Gamma error and a log link
Another transformation would have consisted of adding half the variance of the log-transformed claim amounts (the lognormal correction exp(μ + σ²/2)). This generally doesn't provide good predictions, as the variance is unlikely to be constant and should be modelled as a function of the model predictors too.
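A sketch of this calibration step in R; the helper column brt_log, the prediction name pred_brt1, and the use of gbm's predict with the stored optimal tree count are assumptions:

# Sketch: map the BRT's log-scale fitted values to the mean claim size
# with a Gamma/log-link GAM ('brt_log' is an assumed helper column).
library(mgcv)
training$brt_log <- predict(BRT, training, n.trees = BRT$gbm.call$best.trees)
calib <- gam(total ~ s(brt_log), family = Gamma(link = "log"), data = training)
testing$brt_log <- predict(BRT, testing, n.trees = BRT$gbm.call$best.trees)
pred_brt1 <- predict(calib, newdata = testing, type = "response")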
5-fold cross-validations (lower Gamma deviance is better)
- GLM holdout Gamma deviance = 1.023
- BRT1 holdout Gamma deviance = 1.011
- GLM2 holdout Gamma deviance = 1.001
- GAM holdout Gamma deviance = 1.001
Blends:
- GLM + BRT1 holdout Gamma deviance = 1.002
- GLM2 + BRT1 holdout Gamma deviance = 0.993
- GLM2 + GAM holdout Gamma deviance = 0.999
Interactions matter! We see here that:
- incorporating an interaction between op_time and legrep significantly improves the GLM's fit
- a more complex model (GAM) doesn't improve predictive accuracy, so we are better off keeping things simple
- to further improve accuracy, we can simply blend GLM and BRT predictions
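A sketch of such a blend in R, reusing the gamma_deviance helper defined earlier; the names pred_glm2 and pred_brt1 and the 50/50 weighting are assumptions:

# Sketch: blend two models by averaging predictions on the response
# scale, then score the blend on the holdout set.
blend <- 0.5 * pred_glm2 + 0.5 * pred_brt1
gamma_deviance(testing$total, blend)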
[Figure: plot of deviance errors against 5-fold CV predicted values]
Predictions for 1998 and 1999
- GLM holdout Gamma deviance = 1.03
- BRT1 holdout Gamma deviance = 0.993
- GLM2 holdout Gamma deviance = 0.996
However, this omits the inflation effect. To model inflation, we trained the residuals of our previous models as a function of the settlement month and used the fitted trend to predict the in(de)flation in 98/99. After accounting for deflation:
- GLM holdout Gamma deviance = 0.927
- BRT1 holdout Gamma deviance = 0.926
- GLM2 holdout Gamma deviance = 0.906
- BRT1 + GLM2 holdout Gamma deviance = 0.894
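One possible implementation of that residual trend in R (a sketch; the smooth over finalization month and the column name fin_mth are assumptions, shown here for the GLM2 model):

# Sketch: smooth the log-scale residuals over finalization month and
# use the fitted trend to adjust the 98/99 predictions for in(de)flation.
training$log_resid <- log(training$total) - log(fitted(GLM2))
infl <- gam(log_resid ~ s(fin_mth), data = training)
adj <- exp(predict(infl, newdata = testing))     # estimated in(de)flation factor
pred_glm2_adj <- predict(GLM2, newdata = testing, type = "response") * adj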
Lessons from this example
1. "Make everything as simple as possible, but not simpler" (Einstein). Interactions matter! Omitting them can result in a loss of predictive accuracy.
2. Parametric models work better in the presence of small datasets, but the challenge is to incorporate the right model structure.
3. Machine Learning techniques are not all black boxes and can provide useful insights.
4. Predictions need to be adjusted to account for future trends, and this is true whatever the technique used.
5. Blends of different techniques usually improve accuracy.