BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
- Albert Gibson
- 10 years ago
1 BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
Xavier Conort
Session Number: TBR14
2 Insurance has always been a data business
The industry has successfully used data in pricing thanks to:
- Decades of experience
- Highly trained resources: actuaries!
- Increasing computing power
More recently, innovative players in mature markets have started to use data in other areas such as marketing, fraud detection, claims management, service provider management, etc.
3 New users of predictive modelling are
- Internet
- Retail
- Telecommunications
- Accommodation
- Aviation and transport
Challenges faced:
- Shorter experience (most started in the last 10 years)
- No actuaries
- Data with a large number of rows, thousands of variables, and text
Solution found: Machine Learning
- Traditional regression techniques (OLS or GLMs) were replaced by more versatile non-parametric techniques, and/or
- Human input was replaced by tuning parameters optimized by the machine
4 Spam detection or how to deal with thousands of variables
Email text is converted into a document-term matrix with thousands of columns.
One simple way to detect spam is to replace GLMs with regularized GLMs, which are GLMs where a penalty parameter is introduced in the loss function. This automatically restricts the feature space, whereas in traditional GLMs the most relevant predictors are selected manually.
5 The penalty effect in a regularized GLM
When fitting regularized GLMs, you add a penalty to the loss function (the deviance) being minimized. The penalty is controlled by a mixing parameter alpha: alpha = 1 is the lasso penalty, and alpha = 0 the ridge penalty.
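For reference, one common way to write a penalty with these two limiting cases is the elastic-net form used by the glmnet R package. This is an assumption about which parameterization the slide had in mind; here lambda is the overall penalty strength and beta the vector of coefficients.

```latex
\min_{\beta}\; \text{deviance}(\beta) \;+\; \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \;+\; \alpha\,\lVert \beta \rVert_1 \right]
```

With alpha = 1 only the absolute-value (lasso) term remains, which can set coefficients exactly to zero; with alpha = 0 only the squared (ridge) term remains, which shrinks coefficients without removing them.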
6 Analytics which are now part of our day-to-day vocabulary
7 Analytics which make us buy more
Amazon revolutionized electronic commerce with "People who viewed this item also viewed...".
- Because Amazon suggests things customers are likely to want, customers make two or more purchases instead of a single purchase.
Netflix does something similar in its online movie business.
8 Analytics which help us connect with others
LinkedIn uses "People You May Know" and "Group You May Like" to help you connect with others.
9 Analytics which remember our closest ones
From the free Machine Learning class (ml-class.org) by Andrew Ng
10 High value from data is yet to be captured
11 Two types of contributors to the predictive modelling field
From "Statistical Modeling: The Two Cultures" by Breiman (2001)
The Data Modelling Culture
- [Diagram: x -> OLS, GLMs, GAMs, GLMMs, Cox -> y]
- Model validation: goodness-of-fit tests and residual examination
- Provides more insight into how nature associates the response variables with the input variables. But if the model is a poor emulation of nature, the conclusions based on this insight may be wrong!
The Machine Learning Culture
- [Diagram: x -> unknown -> y, approximated by regularized GLMs, neural nets, decision trees, ...]
- Model validation: measured by predictive accuracy
- Sometimes considered black boxes (unfairly for some techniques), these methods often produce higher predictive power with less modelling effort
"All models are wrong, some are useful." George Box
12 Actuarial modelling: a hybrid and practical approach
When fitting models, actuaries have 2 goals in mind: prediction and information.
We use GLMs to keep things simple, but when necessary we have learnt to:
- Use GAMs and GEEs to relax some of the GLM assumptions (linearity, independence)
- Not rely fully on GLM goodness-of-fit tests, and test predictive power on cross-validation datasets
- Use GLMMs to evaluate credibility estimates for categories with little statistical material
- Use PCA or regularized regression to handle data with high dimensionality
- Integrate insights from Machine Learning techniques to improve the predictive power of GLMs
13 Interactions: the ugly side of GLMs
Two risk factors are said to interact when the effect of one factor varies depending on the level of the other factor.
- Latitude and longitude typically interact
- Gender and age are also known to interact in Longevity or Motor insurance
Unfortunately, GLMs do not automatically account for interactions, although they can incorporate them.
How do smart actuaries detect potential interactions?
- Luck, intuition, descriptive analysis, experience and market practices help
- Machine Learning techniques based on decision trees
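A small simulated sketch of the point that a GLM only captures an interaction when it is asked to. The data, the variable names (legal, delay, claim) and the coefficients are made up for illustration and are not the presentation's dataset.

```r
# Simulated illustration: two risk factors whose effects on a
# Gamma-distributed claim size interact.
set.seed(42)
n <- 2000
legal <- rbinom(n, 1, 0.5)          # hypothetical 0/1 risk factor
delay <- runif(n, 0, 100)           # hypothetical continuous risk factor

# The slope of delay depends on legal: this is the interaction.
mu <- exp(7 + 0.01 * delay + 0.5 * legal - 0.008 * delay * legal)
claim <- rgamma(n, shape = 2, rate = 2 / mu)   # mean of each draw is mu

# Main effects only: the GLM does not look for the interaction by itself.
main_only <- glm(claim ~ delay + factor(legal),
                 family = Gamma(link = "log"))

# Interaction requested explicitly with '*'.
with_inter <- glm(claim ~ delay * factor(legal),
                  family = Gamma(link = "log"))

dev_main <- deviance(main_only)
dev_inter <- deviance(with_inter)
inter_coef <- coef(with_inter)["delay:factor(legal)1"]
```

Comparing dev_main with dev_inter shows how much fit is lost when the interaction term is left out, and inter_coef should land near the simulated value of -0.008.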
14 Decision trees are known to detect interactions but usually have lower predictive power than GLMs
[Figure: example classification tree for high-risk versus low-risk cases]
- Root node (High 17%, Low 83%): Is BP > 91?
- Internal node (High 12%, Low 88%): Is age <= 62.5?
- Internal node (High 23%, Low 77%): Is ST present?
- Leaves: High 70% / Low 30% (classified as high risk), High 2% / Low 98% (classified as low risk), High 50% / Low 50%, High 11% / Low 89% (classified as low risk)
15 Random Forest will provide you with higher predictive power but less interpretability
A Random Forest is:
- a collection of weak and independent decision trees
- such that each tree has been trained on a bootstrapped dataset with a random selection of predictors
(think about the wisdom of crowds)
16 Boosted Regression Trees or learn step by step slowly
BRTs (also called Gradient Boosting Machines) combine boosting and decision tree techniques:
- The boosting algorithm gradually increases emphasis on poorly modelled observations. It minimizes a loss function (the deviance, as in GLMs) by adding, at each step, a new simple tree that focuses only on the residuals.
- The contribution of each tree is shrunk by a very small learning rate (< 1) to give more stable fitted values for the final model.
- To further improve predictive performance, the process uses random subsets of the data to fit each new tree (bagging).
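The three ingredients above (trees fitted to residuals, shrinkage, random subsets) can be sketched in a few lines of base R. This is a minimal illustration for squared-error loss with one-split trees on simulated data, not the gbm or dismo implementation; fit_stump, predict_stump, boost and predict_boost are hypothetical helper names.

```r
# Fit a one-split tree (stump) to residuals r using a single predictor x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  best <- list(sse = Inf)
  for (cut in cuts) {
    left <- x <= cut
    ml <- mean(r[left]); mr <- mean(r[!left])
    sse <- sum((r[left] - ml)^2) + sum((r[!left] - mr)^2)
    if (sse < best$sse) best <- list(sse = sse, cut = cut, left = ml, right = mr)
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

boost <- function(x, y, n_trees = 300, learning_rate = 0.1, bag_fraction = 0.5) {
  f0 <- mean(y)                        # start from a constant prediction
  fitted <- rep(f0, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    idx <- sample(length(y), size = floor(bag_fraction * length(y)))  # random subset
    resid <- y - fitted                # residuals = negative gradient of squared error
    stumps[[m]] <- fit_stump(x[idx], resid[idx])
    # Shrink the new tree's contribution by the learning rate.
    fitted <- fitted + learning_rate * predict_stump(stumps[[m]], x)
  }
  list(f0 = f0, stumps = stumps, learning_rate = learning_rate)
}

predict_boost <- function(model, x) {
  pred <- rep(model$f0, length(x))
  for (s in model$stumps) pred <- pred + model$learning_rate * predict_stump(s, x)
  pred
}

set.seed(1)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)     # made-up non-linear signal plus noise
model <- boost(x, y)
mse_boost <- mean((y - predict_boost(model, x))^2)
mse_mean <- mean((y - mean(y))^2)      # baseline: predict the mean everywhere
```

No functional form is specified for x, yet the sum of many shrunken stumps follows the sine curve. In practice the number of trees and the learning rate would be chosen by cross-validation rather than fixed as here.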
17 The Gradient Boosting Machine algorithm Developed by Friedman (2001) who extended the work of Friedman, Hastie, and Tibshirani (2000), 3 professors from Stanford who are also the developers of Regularized GLMs, GAMs and many others!!!
18 Why do I love BRTs?
- BRTs can be fitted to a variety of response types (Gaussian, Poisson, Binomial)
- The best BRT fit (interactions included) is automatically detected by the machine
- BRTs learn non-linear functions without the need to specify them
- BRT outputs have some GLM flavour and provide insight into the relationship between the response and the predictors
- BRTs avoid much data cleaning because of their:
  - ability to accommodate missing values
  - immunity to monotone transformations of predictors, extreme outliers and irrelevant predictors
19 Links to BRTs areas of application
- Orange's churn, up- and cross-sell at the 2009 KDD Cup
- Yahoo Learning to Rank Challenge: e11a.pdf
- Patients most likely to be admitted to hospital: Heritage Health Prize (only available to Kaggle's competitors)
- Fraud detection in
- Fish species richness: 006%20MEPS%20.pdf
- Motor insurance
20 A practical example
Objective: model the relationship between settlement delay, injury severity, legal representation and the finalized claim amount.
Variables and description:
- Settled amount: $10 to $4,490,000
- 5 injury codes (inj1, inj2, ..., inj5): 1 (no injury), 2, 3, 4, 5, 6 (fatal), 9 (not recorded)
- Accident month: coded 1 (7/89) through to 120 (6/99)
- Reporting month: coded as accident month
- Finalization month: coded as accident month
- Operation time: the settlement delay percentile rank (0-100)
- Legal representation: 0 (no), 1 (yes)
Data: settled personal injury insurance claims from accidents occurring from 7/1989 through to 1/1999.
21 Why this dataset?
- It is publicly available: it was featured in the book by de Jong & Heller (GLMs for insurance data). It can be downloaded at rance_data/data_sets
- It is insurance related, with highly skewed claim sizes
- Presence of interactions
22 Software used
The entire analysis is done in R.
- R is a free software environment which provides a wide variety of statistical and graphical techniques. It has gained rapidly growing popularity in both the business and academic worlds.
- You can download it for free.
2 add-on packages (also freely available) were used:
- To train GAMs: Wood's package mgcv.
- To train BRTs: dismo, a package which facilitates the use of BRTs in R. It calls Ridgeway's package (gbm), which could also have been used to train the model but provides fewer diagnostic reports.
23 Assessing model performance
We assess model predictive performance using:
- Independent data (cross-validation)
  - Partitioning the data into separate training and testing subsets: claims settled before 98 / claims settled in 98 and 99
  - 5-fold cross-validation of the training set: the training data are randomly divided into 5 subsets, giving 5 different training sets, each comprising a unique combination of 4 subsets (the remaining subset is held out)
- The deviance metric, which measures how much the predicted values differ from the observations for skewed data (the deviance is also the loss function minimized when fitting GLMs)
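A minimal base-R sketch of this procedure on simulated data: a holdout Gamma deviance function and a 5-fold cross-validation loop around a Gamma GLM. The toy data frame, the helper name gamma_deviance and the choice to report the mean deviance per observation are assumptions for illustration.

```r
# Mean Gamma deviance between observations y and predictions mu (both > 0).
gamma_deviance <- function(y, mu) {
  2 * mean((y - mu) / mu - log(y / mu))
}

set.seed(123)
n <- 1000
dat <- data.frame(x = runif(n))                            # toy predictor
dat$y <- rgamma(n, shape = 2, rate = 2 / exp(1 + dat$x))   # toy skewed response

# Randomly assign each row to one of 5 folds of equal size.
fold <- sample(rep(1:5, length.out = n))

# For each fold: train on the other 4 folds, score on the held-out fold.
cv_dev <- sapply(1:5, function(k) {
  fit <- glm(y ~ x, family = Gamma(link = "log"), data = dat[fold != k, ])
  pred <- predict(fit, newdata = dat[fold == k, ], type = "response")
  gamma_deviance(dat$y[fold == k], pred)
})
mean_cv_dev <- mean(cv_dev)
```

Lower values of mean_cv_dev indicate better out-of-sample fit; the same loop can be reused for any model that produces predictions on the response scale.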
24 A few data manipulations
To convert the injury codes into ordinal factors, we:
- recoded the injury level 9 into 0
- set missing values (for inj2, inj5) at 0
Other transformations:
- We capped inj2, and inj5 at 3 (too little statistical material for higher values).
- We computed the reporting delay and the log of the claim amounts.
- We split the data into a training set (claims settled before 98) and a testing set (claims settled in 98 and 99).
- We also formed 5 random subsets of the training set to perform 5-fold cross-validation.
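The recoding steps might look roughly like this in R. The helper recode_injury, the toy rows and the column names accmonth and repmonth are assumptions, as is computing the reporting delay as reporting month minus accident month; the presentation does not show its own preprocessing code.

```r
# Hypothetical helper mirroring the recoding described above.
recode_injury <- function(x, cap = NULL) {
  x[is.na(x)] <- 0                        # missing values set to 0
  x[x == 9] <- 0                          # "not recorded" (9) recoded to 0
  if (!is.null(cap)) x <- pmin(x, cap)    # optional cap for sparse high levels
  factor(x, ordered = TRUE)               # ordinal factor
}

toy <- data.frame(                        # made-up rows, not the real data
  total    = c(1200, 35000, 560),
  inj1     = c(1, 9, 4),
  inj2     = c(NA, 5, 2),
  accmonth = c(10, 25, 60),               # assumed column names
  repmonth = c(12, 25, 66)
)

toy$inj1 <- recode_injury(toy$inj1)
toy$inj2 <- recode_injury(toy$inj2, cap = 3)
toy$rep_delay <- toy$repmonth - toy$accmonth   # assumed definition, in months
toy$log_total <- log(toy$total)
```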
25 GLM trained
    # Gamma (capital G) is R's Gamma family; lowercase gamma() is the gamma function
    GLM <- glm(total ~ op_time + factor(legrep) + rep_delay +
                 factor(inj1) + factor(inj2) + factor(inj3) + factor(inj4) + factor(inj5),
               family = Gamma(link = "log"), data = training)
- Very simple GLM
- No non-linear relationship except for the one introduced by the log link function
- No interactions
26 BRT trained
    library(dismo)
    BRT <- gbm.step(data = training, gbm.x = c(2:7, 11, 14), gbm.y = 12,
                    family = "gaussian", tree.complexity = 5, learning.rate = 0.005)
- gbm.x: same predictors as for the GLM
- gbm.y: log of claim amounts
- tree.complexity: size of individual trees (usually 3 to 5)
- learning.rate: lower (slower) is better but computationally expensive (usually between ... and 0.1)
Note that a 3rd tuning parameter is sometimes required: the number of trees. In our case, the gbm.step routine computes the optimal number of trees (2900) automatically using 10-fold cross-validation.
Outputs shown: predictors' influence and ranking of 2-way interactions.
27 BRT's partial dependence plots
- Non-linear relationships are detected automatically
- The plots represent the effect of each predictor after accounting for the effects of the other predictors
28 Plot of interactions fitted by BRT
29 GLM trained with BRT's insight
    # Gamma (capital G) is R's Gamma family
    GLM2 <- glm(total ~ (op_time + factor(legrep) + fast)^2 + op_time*factor(legrep)*fast +
                  rep_delay + factor(inj1) + factor(inj2) + factor(inj3) + factor(inj4) + factor(inj5),
                family = Gamma(link = "log"), data = training)
- A non-linear relationship and an interaction are introduced (as de Jong and Heller did) to model the non-linear effect of op_time and its interaction with legrep
- We identified fast claims settlement (op_time <= 5) with a dummy variable fast
30 Incorporate interactions & non-linear relationship with GAMs
Generalized Additive Models (GAMs) use the basic ideas of Generalized Linear Models.
- In GLMs, $g(\mu)$ is a linear combination of predictors:
  $g(\mu) = g(E[Y]) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_N X_N$, with $Y \mid \{X\} \sim$ exponential family
- In GAMs, the linear predictor can also contain one or more smooth functions of covariates:
  $g(\mu) = \beta X + f_1(X_1) + f_2(X_2) + f_3(X_3, X_4) + \dots$
- To represent the functions $f$, cubic splines are commonly used.
- To avoid over-fitting, a penalized likelihood is maximized. The optimal penalty parameter is automatically obtained via cross-validation.
31 GAM trained with BRT insight
    library(mgcv)
    # family = Gamma (capital G) is the error distribution; the lowercase gamma = 1.4
    # argument is gam()'s multiplier that makes the fitted smooths smoother
    GAM <- gam(total ~ (op_time + factor(legrep) + fast)^2 + op_time*factor(legrep)*fast +
                 te(op_time, rep_delay, bs = "cs") +
                 factor(inj1) + factor(inj2) + factor(inj3) + factor(inj4) + factor(inj5),
               family = Gamma(link = "log"), data = training, gamma = 1.4)
The GAM framework allows us to incorporate an additional interaction between op_time and rep_delay which could not have been easily introduced in the GLM framework.
32 Transformation of BRTs predictions
E(Y) ≠ exp(E(log Y))
- exp(BRT predictions) provides only the expected median of the claim size as a function of the predictors.
- To relate the median to the mean and get predictions of the mean (and not the median), we trained a GAM to model the claim size with:
  - BRT fitted values as the predictor
  - a Gamma error and a log link
- Another transformation would have consisted of adding half the variance of the log-transformed claim amounts before exponentiating. This generally doesn't provide good predictions, as the variance is unlikely to be constant and should be modelled as a function of the model predictors too.
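A quick simulation makes the median-versus-mean gap concrete. It assumes log-normally distributed claim sizes with made-up parameters and a constant variance, which is the idealized case in which the variance / 2 adjustment works; as noted above, real claims data rarely satisfy this.

```r
set.seed(2024)
log_mean <- 8; log_sd <- 1.2                  # made-up parameters
y <- rlnorm(1e5, meanlog = log_mean, sdlog = log_sd)

# Back-transforming the mean of log(y) recovers the median, not the mean.
naive <- exp(mean(log(y)))

# Adding half the variance of log(y) before exponentiating recovers the mean
# when the log-scale variance really is constant.
adjusted <- exp(mean(log(y)) + var(log(y)) / 2)

results <- c(naive = naive, median = median(y),
             adjusted = adjusted, mean = mean(y))
```

Printing results shows naive sitting next to the sample median and adjusted next to the sample mean, which is roughly twice as large for these parameters.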
33 5-fold cross-validation
Lower Gamma (GA) deviance is better.
- GLM holdout GA deviance =
- BRT1 holdout GA deviance =
- GLM2 holdout GA deviance =
- GAM holdout GA deviance =
Interactions matter! We see here that:
- incorporating an interaction between op_time and legrep significantly improves the GLM's fit
- a more complex model (GAM) doesn't improve predictive accuracy, so we are better off keeping things simple
- to further improve accuracy, we could simply blend GLM and BRT predictions
Blends:
- GLM+BRT1 holdout GA deviance =
- GLM2+BRT1 holdout GA deviance =
- GLM2+GAM holdout GA deviance = 0.999
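A toy illustration of why blending can help. It assumes the blend is a simple average of two models' predictions (the presentation does not say how its blends were formed) and that the two models' errors are independent, which is built into the simulation; all numbers are made up.

```r
# Mean Gamma deviance between observations y and predictions mu (both > 0).
gamma_deviance <- function(y, mu) 2 * mean((y - mu) / mu - log(y / mu))

set.seed(7)
n <- 5000
mu <- exp(rnorm(n, mean = 8, sd = 0.5))       # made-up true expected claim sizes
y <- rgamma(n, shape = 2, rate = 2 / mu)      # observed claims with mean mu

pred_a <- mu * exp(rnorm(n, sd = 0.3))        # model A: truth distorted by its own error
pred_b <- mu * exp(rnorm(n, sd = 0.3))        # model B: truth distorted by an independent error
blend <- (pred_a + pred_b) / 2                # simple average (an assumption)

dev <- c(A = gamma_deviance(y, pred_a),
         B = gamma_deviance(y, pred_b),
         blend = gamma_deviance(y, blend))
```

Averaging cancels part of each model's independent error, so the blend's deviance falls below both individual deviances here. When two models make highly correlated errors the gain shrinks accordingly.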
34 Plot of deviance errors against 5cv predicted values
35 Predictions for 1998 and 1999
- GLM holdout GA deviance = 1.03
- BRT1 holdout GA deviance =
- GLM2 holdout GA deviance =
This, however, omits the inflation effect. To model inflation, we modelled the residuals of our previous models as a function of the settlement month and used the result to predict the in(de)flation in 98/99.
After accounting for deflation:
- GLM holdout GA deviance =
- BRT1 holdout GA deviance =
- GLM2 holdout GA deviance =
- BRT1 + GLM2 holdout GA deviance = 0.894
36 Lessons from this example
1. Make everything as simple as possible but not simpler (Einstein). Interactions matter! Omitting them can result in a loss of predictive accuracy.
2. Parametric models work better with small datasets, but the challenge is to incorporate the right model structure.
3. Machine Learning techniques are not all black boxes and can provide useful insights.
4. Predictions need to be adjusted to account for future trends, and this is true whatever the technique used.
5. Blends of different techniques usually improve accuracy.
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort [email protected] Motivation Location matters! Observed value at one location is
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Model Validation Techniques
Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Introduction to Predictive Modeling Using GLMs
Introduction to Predictive Modeling Using GLMs Dan Tevet, FCAS, MAAA, Liberty Mutual Insurance Group Anand Khare, FCAS, MAAA, CPCU, Milliman 1 Antitrust Notice The Casualty Actuarial Society is committed
How To Build A Predictive Model In Insurance
The Do s & Don ts of Building A Predictive Model in Insurance University of Minnesota November 9 th, 2012 Nathan Hubbell, FCAS Katy Micek, Ph.D. Agenda Travelers Broad Overview Actuarial & Analytics Career
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems
Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern
A Deeper Look Inside Generalized Linear Models
A Deeper Look Inside Generalized Linear Models University of Minnesota February 3 rd, 2012 Nathan Hubbell, FCAS Agenda Property & Casualty (P&C Insurance) in one slide The Actuarial Profession Travelers
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
In this presentation, you will be introduced to data mining and the relationship with meaningful use.
In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine
Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds
Combining Linear and Non-Linear Modeling Techniques: Getting the Best of Two Worlds Outline Who is EMB? Insurance industry predictive modeling applications EMBLEM our GLM tool How we have used CART with
THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
Predictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Regression Modeling Strategies
Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions
Penalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
Microsoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql [email protected] http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney
Combining GLM and datamining techniques for modelling accident compensation data Peter Mulquiney Introduction Accident compensation data exhibit features which complicate loss reserving and premium rate
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Why Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]
Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg
Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )
IST 557 Final Project
George Slota DataMaster 5000 IST 557 Final Project Abstract As part of a competition hosted by the website Kaggle, a statistical model was developed for prediction of United States Census 2010 mailing
Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Studying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-19-B &
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
Nonnested model comparison of GLM and GAM count regression models for life insurance data
Nonnested model comparison of GLM and GAM count regression models for life insurance data Claudia Czado, Julia Pfettner, Susanne Gschlößl, Frank Schiller December 8, 2009 Abstract Pricing and product development
SOA 2013 Life & Annuity Symposium May 6-7, 2013. Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting
SOA 2013 Life & Annuity Symposium May 6-7, 2013 Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting Moderator: Barry D. Senensky, FSA, FCIA, MAAA Presenters: Jonathan
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
The Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia
Predicting daily incoming solar energy from weather data
Predicting daily incoming solar energy from weather data ROMAIN JUBAN, PATRICK QUACH Stanford University - CS229 Machine Learning December 12, 2013 Being able to accurately predict the solar power hitting
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
Heritage Provider Network Health Prize Round 3 Milestone: Team crescendo s Solution
Heritage Provider Network Health Prize Round 3 Milestone: Team crescendo s Solution Rie Johnson Tong Zhang 1 Introduction This document describes our entry nominated for the second prize of the Heritage
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
Benchmarking of different classes of models used for credit scoring
Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering
IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons
Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE
ACTA UNIVERSITATIS AGRICULTURAE ET SILVICULTURAE MENDELIANAE BRUNENSIS Volume 62 41 Number 2, 2014 http://dx.doi.org/10.11118/actaun201462020383 GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE Silvie Kafková
We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap
