Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model



Xavier Conort xavier.conort@gear-analytics.com

Motivation

Location matters! The value observed at one location is influenced by the values observed at other locations in a geographic area. This is true for real estate, species distribution, insurance...

The classical approach consists of adding proxy variables (geodemographic, crime, weather, traffic...) to predictive models. Yet this is usually not enough to capture all of the geographical information available.

Idea: use modern techniques to incorporate latitude / longitude information as model inputs. Generalized Additive Models (GAMs), Spatial Simultaneous Autoregressive Error Models (SARs) and Boosted Regression Trees (BRTs / GBMs) are 3 practical tools to boost model accuracy.

The California Housing example

Using proxy variables alone (median income, housing density, average occupancy in each house) is not enough: it leads to underestimation of house values in coastal areas and overestimation in inland areas.

The California Housing Dataset

The data set was first introduced by Pace and Barry (1997) and is available at http://www.liaad.up.pt/~ltorgo/regression/cal_housing.html

It consists of aggregated data from each of 20,640 neighbourhoods (1990 census block groups) in California. The response variable Y is the median house value in each neighbourhood. There are a total of eight predictors, all numeric:
demographic variables (median income, housing density and average occupancy in each house),
the location of each neighbourhood (longitude and latitude), and
properties of the houses in the neighbourhood (average number of rooms and bedrooms).

Geo-spatial modelling: 3 different techniques as candidates

SARs are specifically designed to account for geographical effects. They are sometimes presented as a 2D extension of time series. Using neighbourhoods to specify the spatial correlation structure allows them to work on large datasets.

BRTs belong to the machine learning field. They are not specifically designed to account for spatial autocorrelation, but their ability to learn local interactions should make them a good candidate.

Generalized Additive Models (GAMs) are GLMs which can contain smooth functions of covariates. Using a smooth function of latitude-longitude as a predictor can be a practical way to incorporate spatial smoothing.

How do SARs work? Approach 1: specify the correlation structure

SARs assume that the response at each location i is a function not only of the explanatory variables at i, but of the values of the response at neighbouring locations j as well. The neighbourhood relationship is formally expressed in an n × n matrix of spatial weights W, with elements w_ij representing a measure of the connection between locations i and j.

SAR Error Model: Y = Xβ + u, with u = λWu + ε
λ = spatial lag coefficient
ε = normal error

See Sengupta's presentation at the 2010 CAS RPM Seminar.
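The error structure above can be made concrete with a small simulation. The presentation's analysis is done in R (spdep); the sketch below uses Python with numpy purely for illustration, and the coordinates, coefficients and 4-neighbour weight matrix are all made-up assumptions. Solving u = λWu + ε for u gives u = (I − λW)⁻¹ε:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Made-up coordinates and a row-standardised 4-nearest-neighbour weight matrix W
coords = rng.uniform(size=(n, 2))
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)            # a location is not its own neighbour
W = np.zeros((n, n))
for i in range(n):
    W[i, np.argsort(d[i])[:4]] = 0.25  # 4 neighbours, equal weights summing to 1

# SAR error model: Y = X beta + u, with u = lambda * W u + eps
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, lam = np.array([1.0, 2.0]), 0.6
eps = rng.normal(size=n)
u = np.linalg.solve(np.eye(n) - lam * W, eps)  # u = (I - lambda W)^(-1) eps
y = X @ beta + u
```

Estimating λ and β by maximum likelihood is what spdep's SAR error routines do; the point here is only the data-generating structure.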

The neighbourhood structure

The neighbourhood can be identified by a fixed number of nearest neighbours, or by Euclidean or great-circle distance (i.e. the distance along the Earth's surface) to define cells within or outside a respective neighbourhood. The neighbours can further be weighted to give closer neighbours higher weights and more distant neighbours lower weights.
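As an illustration of such a weighting scheme, here is a sketch (Python/numpy; the function name and toy coordinates are hypothetical) of a row-standardised k-nearest-neighbour weight matrix with inverse-distance weights:

```python
import numpy as np

def knn_weights(coords, k=4):
    """Row-standardised spatial weights: k nearest (Euclidean) neighbours,
    with closer neighbours receiving larger inverse-distance weights."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbour
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]
        w = 1.0 / d[i, nbrs]         # inverse-distance weighting
        W[i, nbrs] = w / w.sum()     # each row sums to 1
    return W

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.1, 2.0]])
W = knn_weights(coords, k=2)
```

For great-circle distance the Euclidean norm would be replaced by a haversine computation on latitude / longitude; the row-standardisation step stays the same.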

How do GAMs work? Approach 2: smooth geographical trends

GAMs use the basic ideas of Generalized Linear Models:
g(μ) = g(E[Y]) = α + β_1 X_1 + β_2 X_2 + ... + β_N X_N, with Y | {X} ~ exponential family

While in GLMs g(μ) is a linear combination of predictors, in GAMs the linear predictor can also contain one or more smooth functions of covariates:
g(μ) = βX + f_1(X_1) + f_2(X_2) + f_3(X_3, X_4) + ...

To represent the functions f, the use of cubic splines is common. To avoid over-fitting, a penalized likelihood is maximized instead of the ordinary likelihood used in GLMs. The optimal penalty parameter is automatically obtained via cross-validation.

See Guszcza's presentation at the 2010 CAS RPM Seminar.
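To make the penalization idea concrete, here is a minimal 1-D sketch (Python/numpy; the truncated-power basis, knot placement and penalty value are illustrative assumptions — mgcv uses thin-plate splines over latitude / longitude and chooses the penalty by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=200)

# Truncated-power cubic spline basis: 1, x, x^2, x^3, (x - k)_+^3 for each knot k
knots = np.linspace(0.05, 0.95, 10)
B = np.column_stack([x**p for p in range(4)] +
                    [np.clip(x - k, 0, None)**3 for k in knots])

def fit_pspline(B, y, lam):
    """Penalized least squares: ridge-type penalty on the knot coefficients
    only, leaving the polynomial part unpenalized."""
    P = np.zeros(B.shape[1])
    P[4:] = 1.0
    return np.linalg.solve(B.T @ B + lam * np.diag(P), B.T @ y)

beta = fit_pspline(B, y, lam=1e-4)
fitted = B @ beta
```

Larger values of lam shrink the knot coefficients toward zero, producing a smoother (closer to cubic-polynomial) fit; smaller values let the curve follow the data more closely.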

How do BRTs work? Approach 3: let the machine do it for you!

BRTs (also called Gradient Boosting Machines) combine boosting and decision tree techniques:
The boosting algorithm learns step by step, slowly and gradually increasing the emphasis on poorly modelled observations. It minimizes a loss function (the deviance, as in GLMs) by adding, at each step, a new simple tree whose focus is only on the residuals.
The contribution of each tree is shrunk by a small learning rate (< 1) to give more stable fitted values for the final model.
To further improve predictive performance, the process uses random subsets of the data to fit each new tree (as in bagging).
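The three ingredients above (fit each tree to the residuals, shrink it by a learning rate, subsample the data at each step) can be sketched from scratch. This is a toy one-feature version with decision stumps and squared-error loss (Python/numpy; all names and settings are illustrative, not gbm's implementation):

```python
import numpy as np

def fit_stump(x, residual):
    """Best single split on one feature minimising squared error."""
    best = (np.inf, None, 0.0, 0.0)
    for s in np.unique(x)[:-1]:
        left, right = residual[x <= s], residual[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    return best[1:]                      # (split, left value, right value)

def boost(x, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    rng = np.random.default_rng(seed)
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        # Fit the next stump to the residuals of a random subsample
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        s, lv, rv = fit_stump(x[idx], (y - pred)[idx])
        pred += learning_rate * np.where(x <= s, lv, rv)  # shrink each tree
        stumps.append((s, lv, rv))
    return pred, stumps

x = np.linspace(0, 1, 300)
y = np.where(x > 0.5, 2.0, 0.0) + np.random.default_rng(2).normal(scale=0.1, size=300)
pred, stumps = boost(x, y)
```

With deeper trees instead of stumps, the same loop also captures local interactions between features, which is what makes BRTs attractive for geographical effects.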

Boosted Regression Trees are good at detecting interactions: they are based on regression trees, which partition the feature space into a set of rectangles and thereby produce a multitude of local interactions. This makes particular sense when modelling geographical effects.

Other nice properties of BRTs

BRTs can be fitted to a variety of response types (Gaussian, Poisson, Binomial).
The best fit (interactions included) is detected automatically by the machine.
BRTs learn non-linear functions without the need to specify them.
BRT outputs have some GLM flavour and provide insight into the relationship between the response and the predictors.
BRTs require little data cleaning, thanks to their ability to accommodate missing values and their immunity to monotone transformations of predictors, extreme outliers and irrelevant predictors.

BRT example links

Orange's churn, up-, and cross-sell at the 2009 KDD Cup: http://jmlr.csail.mit.edu/proceedings/papers/v7/miller09/miller09.pdf
Yahoo Learning to Rank Challenge: http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf
Patients most likely to be admitted to hospital (Heritage Health Prize): only available to Kaggle's members
Fraud detection: http://www.data-mines.com/resources/papers/fraud%20comparison.pdf
Fish species richness: http://www.stanford.edu/~hastie/papers/leathwick%20et%20al%202006%20mepS%20.pdf

BRTs in those papers sometimes go by different names (Gradient Boosting Machine, TreeNet), but they are the same models.

Software used

The entire analysis in this presentation (including all graphics) was done in R.
GAMs: Wood's package (mgcv)
SARs: spdep, a spatial econometrics package developed by Bivand et al. We, however, wrote our own code for predictions.
BRTs: Ridgeway's package (gbm)
We plotted maps with Loecher's package (RgoogleMaps), which takes Google's maps as a background image to overlay plots within R.

Strategy to assess model performance

Many models can be made to perform well on their training data; the danger is that they overfit to specific features of that data that lack applicability in a wider sample, degrading performance when predicting on new datasets. Best practice consists in assessing predictive performance using independent data (cross-validation):
partitioning the data into separate training and testing subsets, or
k-fold cross-validation when data is scarce.

Here, we chose to partition the data into a training set (70% of examples) and a testing set (30%). We then reported error maps, error distributions, and the squared multiple correlation coefficient R^2 (the proportion of variance in Y that can be accounted for by the model), measured on the testing set, as in previous studies (with both log Y and Y as the response).
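A minimal sketch of the 70/30 hold-out and the test-set R^2 computation (Python/numpy, with ordinary least squares as a stand-in model; the function name and simulated data are illustrative):

```python
import numpy as np

def train_test_split_r2(X, y, fit, predict, train_frac=0.7, seed=0):
    """Hold out (1 - train_frac) of the data and report test-set R^2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    model = fit(X[tr], y[tr])
    pred = predict(model, X[te])
    ss_res = ((y[te] - pred)**2).sum()
    ss_tot = ((y[te] - y[te].mean())**2).sum()
    return 1.0 - ss_res / ss_tot     # proportion of variance explained

# Simulated example with OLS as the model
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

r2 = train_test_split_r2(
    X, y,
    fit=lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0],
    predict=lambda b, X: X @ b,
)
```

The same `fit` / `predict` pair can be swapped for any of the three techniques, so all models are scored on exactly the same held-out 30%.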

Residuals histograms (with logy as the response) Lower dispersion for SAR and BRT residual distributions

Residual histograms (with Y as the response): lower dispersion for the SAR and BRT residual distributions. To predict Y, we fitted a GAM to model the relationship between E(Y_i) and E(log(Y_i)). This works better than assuming that E(Y_i) = exp(E(log(Y_i)) + sigma^2/2) with sigma constant.
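As background for why a correction is needed at all: when log Y is modelled, the naive back-transform exp(E[log Y]) underestimates E[Y]. A quick numerical check under a lognormal assumption (Python/numpy; the parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 0.5
y = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

log_y = np.log(y)
naive = np.exp(log_y.mean())                        # exp(E[log Y]) -- biased low
corrected = np.exp(log_y.mean() + log_y.var() / 2)  # constant-sigma correction
true_mean = y.mean()                                # approximates exp(mu + sigma^2/2)
```

The constant-sigma correction is exact here only because sigma really is constant in the simulation; when it varies across observations, a flexible mapping such as the GAM used in the presentation is a pragmatic alternative.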

Map outputs (OLS vs GAMs): Ordinary Least Squares / Generalized Additive Models

Map outputs (BRTs vs SAR): BRTs / SAR with 4 nearest neighbours

Strategies to improve predictive accuracy

Fine-tune the number of nearest neighbours for the SAR.
Find a more appropriate transformation for the response than the log Y suggested by Pace & Barry in their 1997 paper.
Split the analysis of non-censored / censored house values.
Fit SARs on GAM residuals.
Fit SARs on BRT residuals.
Fit BRTs on SAR residuals.
If you are lazy, try BRTs*0.5 + SAR*0.5!

Ideas to boost your modelling in insurance

When only predictive accuracy matters:
Use BRTs to get high predictive power with low modelling effort.
If you want even better accuracy, combine BRTs with Random Forests (another off-the-shelf ML technique).
In the presence of spatial correlation, try SARs + BRTs.

When you want to keep control of the model structure, such as in pricing:
Fit GLMs / GAMs.
Fit BRTs for an easy benchmark and to highlight useful interactions.
If you have variables with many categories (such as districts), use GLMMs to get credibility estimates.
In the presence of spatial correlation, try GAMs, or SARs / BRTs on GLM residuals.

To learn more about Machine Learning, go to Stanford's online course ml-class.org by Andrew Ng.