# Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Save this PDF as:

Size: px
Start display at page:

Download "Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model"

## Transcription

1 Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort

2 Motivation Location matters! Observed value at one location is influenced by the observed values at other locations in a geographic area. True for real estate, species distribution, insurance Classical approach consists of adding different proxy variables (geodemographic, crime, weather, traffic ) in predictive models. Yet it is usually not enough to capture all the geographical information available Idea: use of modern techniques to incorporate latitude / longitude information as model inputs Generalized Additive Models (GAMs), Spatial Simultaneous Autoregressive Error Models (SARs) and Boosted Regression Trees (BRTs/GBMs) are 3 practical tools to boost models accuracy

3 The California Housing example Use of proxy variables (median income, housing density, average occupation in each house) is not enough and leads to an underestimation in coastal areas and overestimation in inland areas.

4 The California Housing Dataset The data set was first introduced by Pace and Barry (1997) and is l It consists of aggregated data from each of 20,460 neighbourhoods (1990 census block groups) in California The response variable Y is the median house value in each neighbourhood There are a total of eight predictors, all numeric. demographics variables (median income, housing density and average occupancy in each house). the location of each neighbourhood (longitude and latitude), and properties of the houses in the neighbourhood (average number of rooms and bedrooms).

5 Geo-spatial modelling: 3 different techniques as candidate SARs are specifically designed to account for geographical effects. Sometimes presented as an extension of time series in 2D. Use of neighbourhoods to specify the spatial correlation structure allow them to work on large datasets BRTs belong to the machine learning field. Not specifically designed to account for spatial autocorrelation but BRTs ability to learn local interactions should make them a good candidate Generalized Additive Models (GAMs) are GLMs which can contain smooth functions of covariates. Use of a smooth function of Latitude- Longitude as a predictor can be a practical way to incorporate spatial smoothing

6 How do SARs work? Approach 1: specify correlation structure SARs assume that the response at each location i is a function not only of the explanatory variables at i, but of the values of the response at neighbouring locations j as well The neighbourhood relationship is formally expressed in a n n matrix of spatial weights (W), with elements (w ij ) representing a measure of the connection between locations i and j SAR Error Model Y = X β + u with u = λw u + ε λ= Spatial Lag Coefficient ε= normal error See Sengupta s presentation in 2010 CAS RPM Seminar

7 The neighbourhood structure The neighbourhood can be identified by a fixed number of nearest neighbors, or by Euclidean or great circle distance (e.g. the distance along Earth s surface) to define cells within or outside a respective neighbourhood The neighbours can further be weighted to give closer neighbours higher weights and more distant neighbours lower weights.

8 How do GAMs work? Approach 2: smooth geographical trends GAMs use the basic ideas of Generalized Linear Models g(μ) g(e[y])=α+β 1 X 1 +β 2 X β N X N Y {X} ~ exponential family While in GLMs g(μ) is a linear combination of predictors, in GAMs the linear predictor can also contain one or more smooth functions of covariates g(μ) = β X + f 1 (X 1 ) + f 2 (X 2 ) + f 3 (X 3,X 4 )+... To represent the functions f, use of cubic splines is common To avoid over-fitting, a penalized Maximum Likelihood (ML) is minimized instead of the ML in GLMs. The optimal penalty parameter is automatically obtained via cross-validation See Guszcza s presentation in 2010 CAS RPM Seminar

9 How do BRTs work? Approach 3: let the machine do if for you! BRTs (also called Gradient Boosting Machine) use boosting and decision trees techniques: The boosting algorithm learns step by step slowly and gradually increases emphasis on poorly modelled observations. It minimizes a loss function (the deviance, as in GLMs) by adding, at each step, a new simple tree whose focus is only on the residuals The contributions of each tree are shrunk by setting a learning rate very small (and < 1) to give more stable fitted values for the final model To further improve predictive performance, the process uses random subsets of data to fit each new tree (bagging).

10 Boosted Regression trees are good to detect interactions as they are based on regression trees which partition the feature space into a set of rectangles and then produce a multitude of local interactions. This totally makes sense whilst modelling geographical effect. Gear Analytics

11 Other nice properties of BRTs BRTs can be fitted to a variety of response types (Gaussian, Poisson, Binomial) BRTs best fit (interactions included) is automatically detected by the machine BRTs learn non-linear functions without the need to specify them BRT outputs have some GLM flavour and provide insight on the relationship between the response and the predictors BRTs avoid doing much data cleaning because of their ability to accommodate missing values immunity to monotone transformations of predictors, extreme outliers and irrelevant predictors

12 BRTs examples links Orange s churn, up-, and cross-sell at 2009 KDD Cup Yahoo Learning to Rank Challenge Patients most likely to be admitted to hospital - Health Heritage Prize Only available to Kaggle s members Fraud detection in Fish species richness S%20.pdf BRTs in those papers are sometimes called a different way: Gradient Boosting Machine, TreeNet. But they are the same models.

13 Software used Entire analysis (including all graphics) in this presentation is done in R GAMs: Wood s package (mgcv). SARs: spdep, a spatial econometrics package developped by Bivand et al. We, however, wrote our own code for predictions. BRTs: Ridgeway s package (gbm) We plotted maps with Loecher s package (RgoogleMaps)which take Google s map as a background image to overlay plots within R

14 Strategy to assess models performance Many models can be made to perform well on their training data, the danger is that they overfit to specific features of that data that lack applicability in a wider sample, degrading model performance when predicting to new datasets. Best practice consists in assessing model predictive performance using independent data (cross-validation) Partitioning the data into separate training and testing subsets k-fold cross-validation when data is scarce Here, we chose to partition the data in a training set (70% of examples) and a testing set (30%) and then reported errors maps and errors distributions, and the squared multiple correlation coefficient R 2 (the proportion of variance in Y that can be accounted for by our model) measured on the testing set as previously chosen in previous studies (with both logy and Y as the response).

15 Residuals histograms (with logy as the response) Lower dispersion for SAR and BRT residual distributions

16 Residuals histograms (with Y as the response) Lower dispersion for SAR and BRT residual distributions To predict Y, we fitted a GAM to model the relationship between E(Yi) and E(log(Yi)). Works better than assuming that E(Yi)=exp(E(log(Yi)+sigma^2/ 2) with sigma constant.

17 Maps outputs (OLS vs GAMs) Ordinary Least Squares Generalized Additive Models

18 Maps outputs (BRTs vs SAR) BRTs SAR and 4 nearest neighbors

19 Strategies to improve predictive accuracy Fine tune the number of nearest neighbours for SAR Find a more appropriate transformation for the response than the logy suggested by Pace & Barry in their 1997 paper Split analysis of non censored/censored house values Fit SARs on GAM residuals Fit SARs on BRT residuals Fit BRTs on SAR residuals If you are lazy, try BRTs*0.5 + SAR*0.5!!!

20 Ideas to boost your modelling in insurance When only predictive accuracy matters Use BRTs to get high predictive power with low modelling effort If you want even better accuracy, combine BRTs with Random Forest (another off-the-shell ML technique) in presence of spatial correlation, try SARs+BRTs When you want to keep control on the model structure such as in pricing Fit GLMs / GAMs fit BRTs for an easy benchmark and highlight useful interactions If you have variables with many categories (such as districts) use GLMMs to get credibility estimates In presence of spatial correlation, try GAMs or SARs / BRTs on GLMs residuals To know more on Machine Learning, go to Stanford s online course ml-class.org by Andrew Ng

### BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully

### Regression Analysis Using ArcMap. By Jennie Murack

Regression Analysis Using ArcMap By Jennie Murack Regression Basics How is Regression Different from other Spatial Statistical Analyses? With other tools you ask WHERE something is happening? Are there

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Risk pricing for Australian Motor Insurance

Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model

### Predicting daily incoming solar energy from weather data

Predicting daily incoming solar energy from weather data ROMAIN JUBAN, PATRICK QUACH Stanford University - CS229 Machine Learning December 12, 2013 Being able to accurately predict the solar power hitting

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Actuarial. Modeling Seminar Part 2. Matthew Morton FSA, MAAA Ben Williams

Actuarial Data Analytics / Predictive Modeling Seminar Part 2 Matthew Morton FSA, MAAA Ben Williams Agenda Introduction Overview of Seminar Traditional Experience Study Traditional vs. Predictive Modeling

### Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds

Combining Linear and Non-Linear Modeling Techniques: Getting the Best of Two Worlds Outline Who is EMB? Insurance industry predictive modeling applications EMBLEM our GLM tool How we have used CART with

### exspline That: Explaining Geographic Variation in Insurance Pricing

Paper 8441-2016 exspline That: Explaining Geographic Variation in Insurance Pricing Carol Frigo and Kelsey Osterloo, State Farm Insurance ABSTRACT Generalized linear models (GLMs) are commonly used to

### Model Validation Techniques

Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

### Drug Store Sales Prediction

Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

### IST 557 Final Project

George Slota DataMaster 5000 IST 557 Final Project Abstract As part of a competition hosted by the website Kaggle, a statistical model was developed for prediction of United States Census 2010 mailing

### Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business

Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business 1 1. Introduction 2. Measuring Accuracy 3. Out-of Sample Predictions 4.

### Studying Auto Insurance Data

Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### UNIVERSITY OF WAIKATO. Hamilton New Zealand

UNIVERSITY OF WAIKATO Hamilton New Zealand Can We Trust Cluster-Corrected Standard Errors? An Application of Spatial Autocorrelation with Exact Locations Known John Gibson University of Waikato Bonggeun

### Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

### Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and

### Introduction to Predictive Modeling Using GLMs

Introduction to Predictive Modeling Using GLMs Dan Tevet, FCAS, MAAA, Liberty Mutual Insurance Group Anand Khare, FCAS, MAAA, CPCU, Milliman 1 Antitrust Notice The Casualty Actuarial Society is committed

Lecture 15: Loess Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the lecture Introduce nonparametric regression: Bivariate

### Classification and Regression Trees

Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning

### GLM III: Advanced Modeling Strategy 2005 CAS Seminar on Predictive Modeling Duncan Anderson MA FIA Watson Wyatt Worldwide

GLM III: Advanced Modeling Strategy 25 CAS Seminar on Predictive Modeling Duncan Anderson MA FIA Watson Wyatt Worldwide W W W. W A T S O N W Y A T T. C O M Agenda Introduction Testing the link function

### Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons

### Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

### 16 : Demand Forecasting

16 : Demand Forecasting 1 Session Outline Demand Forecasting Subjective methods can be used only when past data is not available. When past data is available, it is advisable that firms should use statistical

### CAB TRAVEL TIME PREDICTI - BASED ON HISTORICAL TRIP OBSERVATION

CAB TRAVEL TIME PREDICTI - BASED ON HISTORICAL TRIP OBSERVATION N PROBLEM DEFINITION Opportunity New Booking - Time of Arrival Shortest Route (Distance/Time) Taxi-Passenger Demand Distribution Value Accurate

### Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

### Step 5: Conduct Analysis. The CCA Algorithm

Model Parameterization: Step 5: Conduct Analysis P Dropped species with fewer than 5 occurrences P Log-transformed species abundances P Row-normalized species log abundances (chord distance) P Selected

### Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

### Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### GAM for large datasets and load forecasting

GAM for large datasets and load forecasting Simon Wood, University of Bath, U.K. in collaboration with Yannig Goude, EDF French Grid load GW 3 5 7 5 1 15 day GW 3 4 5 174 176 178 18 182 day Half hourly

### Better credit models benefit us all

Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### Prediction of Adjusted Net National Income Per Capita of Countries

Prediction of Adjusted Net National Income Per Capita of Countries Introduction to the Research Question The purpose of this study was to identify the best predictors for Adjusted Net National Income Per

### Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

### The primary goal of this thesis was to understand how the spatial dependence of

5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

### Linear Model Selection and Regularization

Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

### Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner

Paper 3361-2015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

### Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013

Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.

### Data Mining and Visualization

Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research

Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

### Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

### Neural Networks & Boosting

Neural Networks & Boosting Bob Stine Dept of Statistics, School University of Pennsylvania Questions How is logistic regression different from OLS? Logistic mean function for probabilities Larger weight

### Variable selection using random forests

Pattern Recognition Letters 31 (2010) January 25, 2012 Outline 1 2 Sensitivity to n and p Sensitivity to mtry and ntree 3 Procedure Starting example 4 Prostate data Four high dimensional classication datasets

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Investigation of Optimal Reimbursement Policy for Wifi Service by Public Libraries

Investigation of Optimal Reimbursement Policy for Wifi Service by Public Libraries Vinod Bakthavachalam August 28, 2014 Data Overview The purpose of this analysis is to investigate what the optimal reimbursement

### Appendix 1: Time series analysis of peak-rate years and synchrony testing.

Appendix 1: Time series analysis of peak-rate years and synchrony testing. Overview The raw data are accessible at Figshare ( Time series of global resources, DOI 10.6084/m9.figshare.929619), sources are

### Spatial autocorrelation analysis of residuals and geographically weighted regression

Spatial autocorrelation analysis of residuals and geographically weighted regression Materials: Use your project from the tutorial Temporally dynamic aspatial regression in SpaceStat Objective: You will

### 5. Multiple regression

5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

### INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### 1. χ 2 minimization 2. Fits in case of of systematic errors

Data fitting Volker Blobel University of Hamburg March 2005 1. χ 2 minimization 2. Fits in case of of systematic errors Keys during display: enter = next page; = next page; = previous page; home = first

### Practical. I conometrics. data collection, analysis, and application. Christiana E. Hilmer. Michael J. Hilmer San Diego State University

Practical I conometrics data collection, analysis, and application Christiana E. Hilmer Michael J. Hilmer San Diego State University Mi Table of Contents PART ONE THE BASICS 1 Chapter 1 An Introduction

### Computer exercise 4 Poisson Regression

Chalmers-University of Gothenburg Department of Mathematical Sciences Probability, Statistics and Risk MVE300 Computer exercise 4 Poisson Regression When dealing with two or more variables, the functional

### Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

Combining GLM and datamining techniques for modelling accident compensation data Peter Mulquiney Introduction Accident compensation data exhibit features which complicate loss reserving and premium rate

### Monotonicity Hints. Abstract

Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

### business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

Ronald Christensen Advanced Linear Modeling Multivariate, Time Series, and Spatial Data; Nonparametric Regression and Response Surface Maximization Second Edition Springer Preface to the Second Edition

### Regression Modeling Strategies

Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

### Traffic Driven Analysis of Cellular Data Networks

Traffic Driven Analysis of Cellular Data Networks Samir R. Das Computer Science Department Stony Brook University Joint work with Utpal Paul, Luis Ortiz (Stony Brook U), Milind Buddhikot, Anand Prabhu

### Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Dealing with continuous variables and geographical information in non life insurance ratemaking. Maxime Clijsters

Dealing with continuous variables and geographical information in non life insurance ratemaking Maxime Clijsters Introduction Policyholder s Vehicle type (4x4 Y/N) Kilowatt of the vehicle Age Age of the

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### An Introduction to Generalized Linear Mixed Models Using SAS PROC GLIMMIX

An Introduction to Generalized Linear Mixed Models Using SAS PROC GLIMMIX Phil Gibbs Advanced Analytics Manager SAS Technical Support November 22, 2008 UC Riverside What We Will Cover Today What is PROC

### SOA 2013 Life & Annuity Symposium May 6-7, 2013. Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting

SOA 2013 Life & Annuity Symposium May 6-7, 2013 Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting Moderator: Barry D. Senensky, FSA, FCIA, MAAA Presenters: Jonathan

### Machine Learning Methods for Demand Estimation

Machine Learning Methods for Demand Estimation By Patrick Bajari, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang Over the past decade, there has been a high level of interest in modeling consumer behavior

### Using Excel for Statistical Analysis

Using Excel for Statistical Analysis You don t have to have a fancy pants statistics package to do many statistical functions. Excel can perform several statistical tests and analyses. First, make sure

### Geostatistics Exploratory Analysis

Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt