Cross Validation. Dr. Thomas Jensen Expedia.com

Save this PDF as:

Size: px
Start display at page:

Transcription

1 Cross Validation Dr. Thomas Jensen Expedia.com

2 About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract with Expedia Manage two algorithms One to assign a dollar value to each hotel in the database Use Cross Validation to optimize the algorithms Another to forecast how well/bad we are doing in terms of room availability in a given hotel Responsible for metrics Cross Validation

3 Overview What is R2 and why can it be misleading The issue of overfitting Definition of cross validation How to conduct cross validation Best practices when evaluating model fit

4 How to check if a model fit is good? The R2 statistic has become the almost universally standard measure for model fit in linear models What is R2? R 2 = 1 y i f i 2 y i y 2 Model error Variance in the dependent variable It is the ratio of error in a model over the total variance in the dependent variable. Hence the lower the error, the higher the R2 value.

5 How to check if a model fit is good? y i f i 2 = y i y 2 = R 2 = R 2 = A decent model fit!

6 How to check if a model fit is good? y i f i 2 = y i y 2 = R 2 = R 2 = 0.72 Is this a better model? No, overfitting!

7 Overfitting Left to their own devices, modeling techniques will overfit the data. Classic Example: multiple regression Every time you add a variable to the regression, the model s R 2 goes up. Naïve interpretation: every additional predictive variable helps explain yet more of the target s variance. But that can t be true! Left to its own devices, Multiple Regression will fit too many patterns. A reason why modeling requires subject-matter expertise.

8 prediction error Overfitting Error on the dataset used to fit the model can be misleading ing vs Test Error high bias low variance <----- low bias high variance -----> Doesn t predict future performance. Too much complexity can diminish model s accuracy on future data. Sometimes called the Bias-Variance Tradeoff. train data test data complexity (# nnet cycles)

9 Overfitting Cont. What are the consequences of overfitting? Overfitted models will have high R2 values, but will perform poorly in predicting out-of-sample cases

10 What is Cross Validation A method of assessing the accuracy and validity of a statistical model. The available data set is divided into two parts. Modeling of the data uses one part only. The model selected for this part is then used to predict the values in the other part of the data. A valid model should show good predictive accuracy.

11 Cross Validation the ideal procedure 1. Divide data into three sets, training, validation and test sets Test Validation 2. Find the optimal model on the training set, and use the test set to check its predictive capability Test Cross validation step 3. See how well the model can predict the test set Model Validation 4. The validation error gives an unbiased estimate of the predictive power of a model

12 K-fold Cross Validation Since data is often scarce, there might not be enough to set aside for a validation sample To work around this issue k-fold CV works as follows: 1. Split the sample into k subsets of equal size 2. For each fold estimate a model on all the subsets except one 3. Use the left out subset to test the model, by calculating a CV metric of choice 4. Average the CV metric across subsets to get the CV error This has the advantage of using all data for estimating the model, however finding a good value for k can be tricky

13 K-fold Cross Validation Example Test Split the data into 5 samples Test 2. Fit a model to the training samples and use the test sample to calculate A CV metric Test 3. Repeat the process for the next sample, until all samples have been used to either train or test the model

14 Cross Validation - Metrics How do we determine if one model is predicting better than another model? The basic relation: Error i = y i f i The difference between observed (y) and predicted value (f), when applying the model to unseen data

15 Cross Validation Metrics Mean Squared Error (MSE) 1/n y i f i Root Mean Squared Error (RMSE) 1/n y i f i Mean Absolute Percentage Error (MAPE) (1/n y i f i y i )* %

16 Simulation Example Simulated 8 variables from a standard gaussian. The first three variables where related to the dependent variable, the other variables where random noise. The true model should have the first three variables, while the other variables should be discarded. By using Cross Validation, we should be able to avoid overfitting.

17 Simulation Example

18 Best Practice for Reporting Model Fit 1. Use Cross Validation to find the best model 2. Report the RMSE and MAPE statistics from the cross validation procedure 3. Report the R Squared from the model as you normally would. The added cross-validation information will allow one to evaluate not how much variance can be explained by the model, but also the predictive accuracy of the model. Good models should have a high predictive AND explanatory power!

19 Conclusion The take home message Only looking at R2 can lead to selecting models that do not have inferior predictive capability. Instead market researchers should evaluate a model according to a combination of R2 and cross validation metrics. This will produce more robust models with better predictive powers.

Lecture 13: Validation

Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model

Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection.

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise

Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business

Introduction to Predictive Models Book Chapters 1, 2 and 5. Carlos M. Carvalho The University of Texas McCombs School of Business 1 1. Introduction 2. Measuring Accuracy 3. Out-of Sample Predictions 4.

A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data

A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data Athanasius Zakhary, Neamat El Gayar Faculty of Computers and Information Cairo University, Giza, Egypt

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap

Statistical Learning: Chapter 5 Resampling methods (Cross-validation and bootstrap) (Note: prior to these notes, we'll discuss a modification of an earlier train/test experiment from Ch 2) We discuss 2

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

Measuring forecast accuracy

Rob J Hyndman Monday, 31 March 2014 Everyone wants to know how accurate their forecasts are. Does your forecasting method give good forecasts? Are they better than the competitor methods? There are many

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

L13: cross-validation

Resampling methods Cross validation Bootstrap L13: cross-validation Bias and variance estimation with the Bootstrap Three-way data partitioning CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU

Time Series Forecasting Real World Data: Auto Regression Using a Neural Network Forecaster with Weighted Windows

Time Series Forecasting Real World Data: Auto Regression Using a Neural Network Forecaster with Weighted Windows By Brad Morantz Ph.D. Copyright 2004, 2008 Time Series Forecasting Used when don t have

Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

Performance Measures in Data Mining

Performance Measures in Data Mining Common Performance Measures used in Data Mining and Machine Learning Approaches L. Richter J.M. Cejuela Department of Computer Science Technische Universität München

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

5. Fit the models to each phenotypic dataset using ouch and record the AICc and parameter estimates.

25 629 630 631 632 633 634 635 636 APPENDIX A. SUPPLEMENTARY FIGURES Fig. A1 is a schematic diagram of the simulation study design. Fig. A2 shows the distribution of selection opportunity, discriminability

Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34

Machine Learning Javier Béjar cbea LSI - FIB Term 2012/2013 Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34 Outline 1 Introduction to Inductive learning 2 Search and inductive learning

On Cross-Validation and Stacking: Building seemingly predictive models on random data

On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

Decompose Error Rate into components, some of which can be measured on unlabeled data

Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

Improving Demand Forecasting

Improving Demand Forecasting 2 nd July 2013 John Tansley - CACI Overview The ideal forecasting process: Efficiency, transparency, accuracy Managing and understanding uncertainty: Limits to forecast accuracy,

SINGULAR SPECTRUM ANALYSIS HYBRID FORECASTING METHODS WITH APPLICATION TO AIR TRANSPORT DEMAND

SINGULAR SPECTRUM ANALYSIS HYBRID FORECASTING METHODS WITH APPLICATION TO AIR TRANSPORT DEMAND K. Adjenughwure, Delft University of Technology, Transport Institute, Ph.D. candidate V. Balopoulos, Democritus

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

Mean of Ratios or Ratio of Means: statistical uncertainty applied to estimate Multiperiod Probability of Default

Mean of Ratios or Ratio of Means: statistical uncertainty applied to estimate Multiperiod Probability of Default Matteo Formenti 1 Group Risk Management UniCredit Group Università Carlo Cattaneo September

THE THREE "Rs" OF PREDICTIVE ANALYTICS

THE THREE "Rs" OF PREDICTIVE As companies commit to big data and data-driven decision making, the demand for predictive analytics has never been greater. While each day seems to bring another story of

Drug Store Sales Prediction

Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

Cross-Validation. Synonyms Rotation estimation

Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency

Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

Evidence to Action: Use of Predictive Models for Beach Water Postings

Evidence to Action: Use of Predictive Models for Beach Water Postings Canadian Society for Epidemiology and Biostatistics Caitlyn Paget, June 4 th 2015 Goal is to improve program delivery Can we improve

Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key

THE PREDICTIVE MODELLING PROCESS

THE PREDICTIVE MODELLING PROCESS Models are used extensively in business and have an important role to play in sound decision making. This paper is intended for people who need to understand the process

Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

Evaluation and Credibility. How much should we believe in what was learned?

Evaluation and Credibility How much should we believe in what was learned? Outline Introduction Classification with Train, Test, and Validation sets Handling Unbalanced Data; Parameter Tuning Cross-validation

Classification and Regression Trees

Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning

Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

Model Validation Techniques

Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

Click to edit Master title style. Introduction to Predictive Modeling

Click to edit Master title style Introduction to Predictive Modeling Notices This document is intended for use by attendees of this Lityx training course only. Any unauthorized use or copying is strictly

Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

Why do statisticians "hate" us?

Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data

Machine Learning Methods for Causal Effects. Susan Athey, Stanford University Guido Imbens, Stanford University

Machine Learning Methods for Causal Effects Susan Athey, Stanford University Guido Imbens, Stanford University Introduction Supervised Machine Learning v. Econometrics/Statistics Lit. on Causality Supervised

The Operational Value of Social Media Information. Social Media and Customer Interaction

The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia

Replicating portfolios revisited

Prepared by: Alexey Botvinnik Dr. Mario Hoerig Dr. Florian Ketterer Alexander Tazov Florian Wechsung SEPTEMBER 2014 Replicating portfolios revisited TABLE OF CONTENTS 1 STOCHASTIC MODELING AND LIABILITIES

Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

Demand Management Where Practice Meets Theory

Demand Management Where Practice Meets Theory Elliott S. Mandelman 1 Agenda What is Demand Management? Components of Demand Management (Not just statistics) Best Practices Demand Management Performance

Modeling Web Quality-of-Experience on Cellular Networks

Modeling Web Quality-of-Experience on Cellular Networks Athula Balachandran, Vaneet Aggarwal, Emir Halepovic, Jeffrey Pang, Srinivasan Seshan, Shobha Venkataraman, He Yan Carnegie Mellon University, AT&T

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

Smart ESG Integration: Factoring in Sustainability

Smart ESG Integration: Factoring in Sustainability Abstract Smart ESG integration is an advanced ESG integration method developed by RobecoSAM s Quantitative Research team. In a first step, an improved

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Introduction to nonparametric regression: Least squares vs. Nearest neighbors Patrick Breheny October 30 Patrick Breheny STA 621: Nonparametric Statistics 1/16 Introduction For the remainder of the course,

Optimizing Inventory Management Using Demand Metrics

Optimizing Inventory Management Using Demand Metrics Mark Chockalingam Managing Principal, Demand Planning Net Lexington, MA markc@demandplanning.net Abstract In this how-to session, we will first go through

FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT. Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly

FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly 1 1. INTRODUCTION and MOTIVATION 2. PROPOSED METHOD Random Forests Classification

Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

Classification and Regression by randomforest

Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

Ch.3 Demand Forecasting.

Part 3 : Acquisition & Production Support. Ch.3 Demand Forecasting. Edited by Dr. Seung Hyun Lee (Ph.D., CPL) IEMS Research Center, E-mail : lkangsan@iems.co.kr Demand Forecasting. Definition. An estimate

Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

SOA 2013 Life & Annuity Symposium May 6-7, 2013. Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting

SOA 2013 Life & Annuity Symposium May 6-7, 2013 Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting Moderator: Barry D. Senensky, FSA, FCIA, MAAA Presenters: Jonathan

Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA

Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA Introduction A common question faced by many actuaries when selecting loss development factors is whether to base the selected

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

FORECASTING DAILY SPOT FOREIGN EXCHANGE RATES USING WAVELETS AND NEURAL NETWORKS

FORECASTING DAILY SPOT FOREIGN EXCHANGE RATES USING WAVELETS AND NEURAL NETWORKS Amit Mitra Department of Mathematics & Statistics IIT Kanpur. 1/30 OVERVIEW OF THE TALK Introduction What are exchange rates

Bayesian data analysis: what it is and what it is not. Prof. Andrew Gelman Dept. of Statistics Columbia University

Bayesian data analysis: what it is and what it is not Prof. Andrew Gelman Dept. of Statistics Columbia University Talk for Columbia University Department of Computer Science, 15 Dec 2003 1 Themes Popular

Data Mining Classification: Alternative Techniques. Instance-Based Classifiers. Lecture Notes for Chapter 5. Introduction to Data Mining

Data Mining Classification: Alternative Techniques Instance-Based Classifiers Lecture Notes for Chapter 5 Introduction to Data Mining by Tan, Steinbach, Kumar Set of Stored Cases Atr1... AtrN Class A B

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

Optimizing Inventory Management Using Demand Metrics

Northeast Supply Chain Conference Optimizing Inventory Management Using Demand Metrics Principal, DemandPlanning.Net markc@demandplanning.net Abstract In this how-to session, we will first go through the

Business Analytics using Data Mining Project Report. Optimizing Operation Room Utilization by Predicting Surgery Duration

Business Analytics using Data Mining Project Report Optimizing Operation Room Utilization by Predicting Surgery Duration Project Team 4 102034606 WU, CHOU-CHUN 103078508 CHEN, LI-CHAN 102077503 LI, DAI-SIN

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

Stock Market Price Prediction Using Linear and Polynomial Regression Models

Stock Market Price Prediction Using Linear and Polynomial Regression Models Lucas Nunno University of New Mexico Computer Science Department Albuquerque, New Mexico, United States lnunno@cs.unm.edu Abstract

SAS Certificate Applied Statistics and SAS Programming

SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

Calculating Interval Forecasts

Calculating Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between

Section 6: Model Selection, Logistic Regression and more...

Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS CLARKE, Stephen R. Swinburne University of Technology Australia One way of examining forecasting methods via assignments

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

CART 6.0 Feature Matrix

CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window

HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

APPENDIX 15. Review of demand and energy forecasting methodologies Frontier Economics

APPENDIX 15 Review of demand and energy forecasting methodologies Frontier Economics Energex regulatory proposal October 2014 Assessment of Energex s energy consumption and system demand forecasting procedures

Applying Data Science to Sales Pipelines for Fun and Profit

Applying Data Science to Sales Pipelines for Fun and Profit Andy Twigg, CTO, C9 @lambdatwigg Abstract Machine learning is now routinely applied to many areas of industry. At C9, we apply machine learning

Predicting Flight Delays

Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

Data Mining III: Numeric Estimation

Data Mining III: Numeric Estimation Computer Science 105 Boston University David G. Sullivan, Ph.D. Review: Numeric Estimation Numeric estimation is like classification learning. it involves learning a

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

CHAPTER 11 FORECASTING AND DEMAND PLANNING

OM CHAPTER 11 FORECASTING AND DEMAND PLANNING DAVID A. COLLIER AND JAMES R. EVANS 1 Chapter 11 Learning Outcomes l e a r n i n g o u t c o m e s LO1 Describe the importance of forecasting to the value

Predicting Gold Prices

CS229, Autumn 2013 Predicting Gold Prices Megan Potoski Abstract Given historical price data up to and including a given day, I attempt to predict whether the London PM price fix of gold the following

The Voices of Influence iijournals.com

W W W. I I J O T. C O M OT S U M M E R 2 0 1 4 V O L U M E 9 N U M B E R 3 The Voices of Influence iijournals.com Predicting Intraday Trading Volume and Volume Percentages VENKATESH SATISH, ABHAY SAXENA,

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

K-nearest-neighbor: an introduction to machine learning

K-nearest-neighbor: an introduction to machine learning Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Outline Types of learning Classification: