Cross Validation. Dr. Thomas Jensen Expedia.com


About Me
- PhD from ETH
- Used to be a statistician at Link, now Senior Business Analyst at Expedia
- Manage a database with 720,000 hotels that are not on contract with Expedia
- Manage two algorithms:
  - One to assign a dollar value to each hotel in the database (cross validation is used to optimize this algorithm)
  - Another to forecast how well/badly we are doing in terms of room availability in a given hotel
- Responsible for metrics

Overview
- What is R² and why it can be misleading
- The issue of overfitting
- Definition of cross validation
- How to conduct cross validation
- Best practices when evaluating model fit

How to check if a model fit is good?

The R² statistic has become almost the universal standard measure of model fit for linear models. What is R²?

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The numerator is the model error and the denominator is the total variance in the dependent variable, so R² is one minus the ratio of model error to total variance. Hence the lower the error, the higher the R² value.
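
As a quick illustration, here is a minimal NumPy sketch of that calculation (the arrays y and f stand for hypothetical observed and fitted values; this is not code from the talk):

```python
import numpy as np

def r_squared(y, f):
    """R^2 = 1 - sum((y - f)^2) / sum((y - mean(y))^2)."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    model_error = np.sum((y - f) ** 2)            # sum of squared residuals
    total_variance = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - model_error / total_variance

# Example with made-up numbers:
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.7]))
```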

How to check if a model fit is good?

$$\sum_i (y_i - f_i)^2 = 18.568, \qquad \sum_i (y_i - \bar{y})^2 = 55.001$$
$$R^2 = 1 - \frac{18.568}{55.001} = 0.6624$$

A decent model fit!

How to check if a model fit is good?

$$\sum_i (y_i - f_i)^2 = 15.276, \qquad \sum_i (y_i - \bar{y})^2 = 55.001$$
$$R^2 = 1 - \frac{15.276}{55.001} = 0.72$$

Is this a better model? No, overfitting!

Overfitting

Left to their own devices, modeling techniques will overfit the data.

Classic example: multiple regression.
- Every time you add a variable to the regression, the model's R² goes up.
- Naïve interpretation: every additional predictive variable helps explain yet more of the target's variance. But that can't be true!
- Left to its own devices, multiple regression will fit too many patterns.
- This is one reason why modeling requires subject-matter expertise.
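
To see this effect concretely, here is a small sketch (not from the talk, assuming scikit-learn is available): appending pure-noise predictors to an ordinary least squares fit never lowers the in-sample R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 50
x = rng.standard_normal((n, 1))
y = 2.0 * x[:, 0] + rng.standard_normal(n)               # one real signal plus noise

for extra in [0, 5, 20]:
    X = np.hstack([x, rng.standard_normal((n, extra))])  # append pure-noise columns
    r2 = LinearRegression().fit(X, y).score(X, y)        # in-sample R^2
    print(f"{1 + extra:2d} predictors: R^2 = {r2:.3f}")
```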

Overfitting

Error on the dataset used to fit the model can be misleading: it doesn't predict future performance. Too much complexity can diminish a model's accuracy on future data. This is sometimes called the Bias-Variance Tradeoff.

[Figure: training vs. test error, with prediction error plotted against complexity (# nnet cycles); curves for train data and test data, ranging from high bias / low variance on the left to low bias / high variance on the right.]

Overfitting (cont.)

What are the consequences of overfitting? Overfitted models will have high R² values, but will perform poorly in predicting out-of-sample cases.

What is Cross Validation?

A method of assessing the accuracy and validity of a statistical model. The available data set is divided into two parts. Modeling of the data uses one part only. The model selected for this part is then used to predict the values in the other part of the data. A valid model should show good predictive accuracy.

Cross Validation: the ideal procedure
1. Divide the data into three sets: training, validation and test.
2. Find the optimal model on the training set, and use the validation set to check its predictive capability (the cross validation step).
3. See how well the selected model can predict the test set.
4. The test error gives an unbiased estimate of the predictive power of the model.
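
A minimal sketch of such a three-way split, assuming scikit-learn is available (the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))                              # hypothetical predictors
y = X @ np.array([1.0, 0.5, -0.7]) + rng.standard_normal(200)  # hypothetical target

# Carve off a held-out test set first, then split the remainder into
# training and validation sets (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
```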

K-fold Cross Validation

Since data is often scarce, there might not be enough to set aside for a validation sample. To work around this issue, k-fold CV works as follows:
1. Split the sample into k subsets of equal size.
2. For each fold, estimate a model on all the subsets except one.
3. Use the left-out subset to test the model, by calculating a CV metric of choice.
4. Average the CV metric across subsets to get the CV error.

This has the advantage of using all the data for estimating the model; however, finding a good value for k can be tricky. A small sketch of the loop follows.
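
A minimal sketch of the k-fold loop, assuming scikit-learn is available; the synthetic data and the linear model are illustrative choices, not the setup used in the talk:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 0.5, -0.7]) + rng.standard_normal(100)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on the k-1 training folds
    preds = model.predict(X[test_idx])                          # predict the left-out fold
    fold_errors.append(mean_squared_error(y[test_idx], preds))  # CV metric for this fold

cv_error = np.mean(fold_errors)  # average across folds gives the CV error
print(f"5-fold CV MSE: {cv_error:.3f}")
```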

K-fold Cross Validation Example

[Diagram: the data split into 5 numbered samples, with a different sample held out as the test fold in each round.]

1. Split the data into 5 samples.
2. Fit a model to the training samples and use the test sample to calculate a CV metric.
3. Repeat the process for the next sample, until all samples have been used to either train or test the model.

Cross Validation - Metrics

How do we determine if one model is predicting better than another model? The basic relation:

$$\text{Error}_i = y_i - f_i$$

the difference between the observed value (y) and the predicted value (f), when applying the model to unseen data.

Cross Validation Metrics

Mean Squared Error (MSE): $\frac{1}{n}\sum_i (y_i - f_i)^2$ = 7.96
Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n}\sum_i (y_i - f_i)^2}$ = 2.82
Mean Absolute Percentage Error (MAPE): $\frac{100}{n}\sum_i \left|\frac{y_i - f_i}{y_i}\right|$ = 120%
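
A minimal NumPy sketch of these three metrics (y and f are hypothetical arrays of observed and predicted values; the example numbers above are from the talk, not from this code):

```python
import numpy as np

def mse(y, f):
    """Mean Squared Error."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return np.mean((y - f) ** 2)

def rmse(y, f):
    """Root Mean Squared Error."""
    return np.sqrt(mse(y, f))

def mape(y, f):
    """Mean Absolute Percentage Error (in percent)."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return np.mean(np.abs((y - f) / y)) * 100
```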

Simulation Example

Simulated 8 variables from a standard Gaussian. The first three variables were related to the dependent variable; the other variables were random noise. The true model should contain the first three variables, while the other variables should be discarded. By using cross validation, we should be able to avoid overfitting.
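
A rough sketch of this kind of simulation (assumed details: 200 observations, coefficients of 2, -1.5 and 1 on the first three variables, and scikit-learn's cross_val_score; the presenter's actual code and settings are not shown in the slides):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
X = rng.standard_normal((n, 8))                   # 8 standard-Gaussian predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.standard_normal(n)  # only the first 3 matter

cv = KFold(n_splits=5, shuffle=True, random_state=42)
for cols, label in [(slice(0, 3), "first 3 variables"), (slice(0, 8), "all 8 variables")]:
    scores = cross_val_score(LinearRegression(), X[:, cols], y,
                             scoring="neg_root_mean_squared_error", cv=cv)
    print(f"{label}: CV RMSE = {-scores.mean():.3f}")
```

In a run like this, the smaller model typically matches or beats the full model on CV error, which is exactly the signal that the extra five variables are noise.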

Simulation Example

Best Practice for Reporting Model Fit
1. Use cross validation to find the best model.
2. Report the RMSE and MAPE statistics from the cross validation procedure.
3. Report the R² from the model as you normally would.

The added cross-validation information allows one to evaluate not only how much variance the model can explain, but also the predictive accuracy of the model. Good models should have high predictive AND explanatory power!

Conclusion

The take-home message: looking only at R² can lead to selecting models that have inferior predictive capability. Instead, market researchers should evaluate a model according to a combination of R² and cross validation metrics. This will produce more robust models with better predictive power.