Cross-Validation and Other Out-of-Sample Testing Strategies
1 Cross-Validation and Other Out-of-Sample Testing Strategies Simon J. Mason, International Research Institute for Climate Prediction, Earth Institute of Columbia University. AMS Short Course on Significance Testing, Model Evaluation and Alternatives, Seattle, January 11, 2004. Linking Science to Society
2 What is cross-validation? Livezey's talk on resampling highlighted two questions: Is the observed result good? Is the observed result right? Resampling techniques can address these questions by determining, respectively: the empirical distribution of the estimator under the null hypothesis (by permutation or bootstrap methods); and the sampling errors in the observed value of the estimator (by bootstrap or jackknife methods). Cross-validation addresses the second question (is the observed result right?).
3 What is cross-validation? Resampling techniques are frequently used in the context of forecast performance estimators. Cross-validation is a specialized resampling procedure that is designed specifically for application in model validation problems. Cross-validation is often, but incorrectly, used as a synonym for jackknifing. The two methods are distinct.
4 Why cross-validate? Model fitting does not provide a good estimate of actual forecast skill: the model-fit statistics tell us how well the model describes the data, not how well it predicts the data. Because the model knows in advance the relationships between the data, model-fit statistics almost inevitably over-estimate predictive skill. In effect, therefore, cross-validation is intended to eliminate errors in performance estimates.
5 Predictive v descriptive skill To estimate true predictive skill we need a set of forecasts that are independent of the data used to train the model. Specifically, the verification sample should be completely distinct from the training/calibration sample. Any leakage of information from the training sample to the verification sample will bias the predictive skill estimate.
6 Model validation designs 1. Leave at least one year out of the training sample. 2. Reconstruct the model using the new smaller training sample. 3. Forecast at least one of the years omitted. 4. Repeat at least step 3. Objective: mimic the complete lack of knowledge of future values in operational forecasting when hindcasting.
7 Leave-one-out cross-validation 1. Leave the first year out of the training sample. 2. Reconstruct the model using the new smaller training sample. 3. Forecast the omitted year. 4. Repeat, omitting subsequent years until a forecast has been made for each year.
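The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the slides: it assumes the model is a one-predictor least-squares regression, and the function names (`fit_line`, `leave_one_out_hindcasts`) and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_hindcasts(x, y):
    """Forecast each year from a model rebuilt without that year."""
    hindcasts = []
    for i in range(len(y)):
        # Steps 1-2: omit year i and reconstruct the model on the rest.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        # Step 3: forecast the omitted year.
        hindcasts.append(intercept + slope * x[i])
    # Step 4: the loop repeats until every year has a forecast.
    return hindcasts


# Hypothetical predictor and predictand values, one entry per year.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
hindcasts = leave_one_out_hindcasts(x, y)
```

The hindcasts, not the fitted values, are then compared with the observations to estimate predictive skill.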
8 What is cross-validation? Jackknife: Re-calculate the estimator using all possible subsets of the data in which one observation is missing. The jackknife is used to indicate the distribution of sampling errors in the estimator. Cross-validation: Re-calculate the estimated values using (all possible) subsets of the data (in which one observation is missing), and then recalculate the estimator. Cross-validation is used to obtain an unbiased value for the performance estimator.
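The distinction can be made concrete with a short Python sketch. It assumes, for illustration only, that the estimator is the Pearson correlation and that the cross-validated predictions come from a one-predictor regression; the function names and data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def jackknife_correlations(x, y):
    """Jackknife: one value of the estimator per omitted observation."""
    return [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]


def cross_validated_correlation(x, y):
    """Cross-validation: predict each omitted value, then compute the
    estimator once, from the predictions and the observations."""
    hindcasts = []
    for i in range(len(x)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        n = len(x_train)
        mean_x, mean_y = sum(x_train) / n, sum(y_train) / n
        slope = (sum((u - mean_x) * (v - mean_y)
                     for u, v in zip(x_train, y_train))
                 / sum((u - mean_x) ** 2 for u in x_train))
        hindcasts.append(mean_y + slope * (x[i] - mean_x))
    return pearson(hindcasts, y)


# Hypothetical paired series, one entry per year.
x = [0.3, -1.2, 0.8, 1.5, -0.4, 0.1, -0.9, 1.1]
y = [0.5, -0.7, 0.2, 1.0, 0.3, -0.6, -0.2, 0.9]
jackknifed = jackknife_correlations(x, y)       # a distribution of n values
cross_validated = cross_validated_correlation(x, y)  # a single value
```

The jackknife returns as many correlations as there are observations, which is why it describes sampling error; cross-validation returns one correlation, computed from out-of-sample predictions.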
9 Jackknife v leave-one-out cross-validation [Diagram, years 1951 to 1955. Jackknife: correlate all years but one, omitting each year in turn, then compare the correlations. Leave-one-out cross-validation: predict each omitted year from the remaining years, then correlate.]
10 Example Jackknife correlations between JFM temperature for San Diego (CD93) and Eastern North Dakota (CD16).
11 Example Cross-validated correlation between JFM temperature for San Diego (CD93) and Eastern North Dakota (CD16).
12 Example The jackknife correlations suggest that the true correlation is between and The cross-validated correlation suggests that the true correlation is 0.228; less than the sample correlation. Cross-validated correlations typically are in the lower percentiles of jackknife correlations (in this case ~25th percentile). Why?
13 Assumptions The only assumption with cross-validation is that there is no leakage of information about the omitted data to the training sample. The most difficult problem is to ensure that NONE of the information in the training sample is known to the verification sample. Unfortunately, there is usually some leakage:
14 Leakage Recalculate model parameters. Reselect predictors. Why?
15 Leakage Note that each year might involve a completely new model. The model to be used to produce a real forecast is not actually verified itself. Verification procedures verify the entire forecast process NOT the specific model.
16 Leakage Recalculate model parameters. Reselect predictors. Why? Recalculate the optimal number of modes. Why? Redefine climatology. Why? Avoid autocorrelation. Why? How?
17 Leave-k-out cross-validation [Diagram, years 1951 to 1955: each year is predicted in turn while it and its neighbouring years are omitted from the training period.] Ensure that the cross-validation window length is at least twice the decorrelation time.
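One way to build the training sample for leave-k-out cross-validation is sketched below in Python. The helper name `training_indices` is hypothetical, and the treatment of the ends of the record (the window is shifted so that it always contains k years) is an assumption, not something the slide specifies.

```python
def training_indices(n_years, target, k):
    """Indices kept for training when a window of k consecutive years
    containing the target year is omitted.

    The window is centred on the target where possible (k odd) and is
    shifted inwards at the ends of the record so that it always holds
    exactly k years. That edge handling is an assumption.
    """
    if not 1 <= k < n_years:
        raise ValueError("k must leave at least one training year")
    start = min(max(target - k // 2, 0), n_years - k)
    omitted = range(start, start + k)
    return [j for j in range(n_years) if j not in omitted]


# Five years (say 1951 to 1955, indexed 0 to 4) with a 3-year window.
for target in range(5):
    print(target, training_indices(5, target, 3))
```

Only the target year is forecast; the other years in the window are omitted solely to keep autocorrelated information out of the training sample, which is why the window should span at least twice the decorrelation time.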
18 Assumptions There were no obvious sources of leakage in the example. The cross-validated correlation was less than the sample correlation, and less than about 75% of the jackknife correlations. So which gives us the best estimate of predictive skill? Is the difference between descriptive and predictive skill real, and if so, why?
19 Cross-validation and sampling errors Use the jackknifed averages of JFM temperature for San Diego to provide cross-validated hindcasts of JFM temperature for the omitted year, and then correlate these hindcasts with the observed values
20 Cross-validation San Diego JFM temperature r =
21 Cross-validation Bias in cross-validated estimates of forecast skill, for simple linear regression given different sample sizes and population correlations. [Figure: bias plotted against sample size.]
22 Predictive v descriptive skill Why should there be a difference between predictive and descriptive skill? 1. Model parameters are optimized for the data in the training sample, not for future data. But the best estimate of the population correlation is obtained using the largest sample possible so the fitted model parameters should be unbiased. 2. Model variables are optimized for the data in the training sample, not for future data.
23 Cross-validation and bias Withholding ALL the information in the omitted year from the verification sample is effectively impossible: it is known a priori that the model climatology will shift away from the omitted value, and that modelled relationships will shift away from the direction implicit in the omitted observational pair. So even omitting information does not necessarily mean that this information is not known by the verification sample.
24 Cross-validation and bias Leave-one-out cross-validation can give pathological estimates of forecast performance. But the underestimation of performance may be most serious for the reference forecast strategy, and so it is possible that some estimates of skill may be overestimated. With leave-3-out cross-validation, the correlation between the observed and predicted JFM temperatures increases from -1.0 to about -0.6, and continues to increase (on average) with an increasing cross-validation window length, up to the point at which sampling errors become large.
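The pathological leave-one-out case can be checked directly. When the forecast for an omitted year is the mean of the remaining n - 1 years, the hindcast equals (S - y_i) / (n - 1), where S is the sum over all years. That is a decreasing linear function of the omitted observation y_i, so the correlation between hindcasts and observations is exactly -1 for any data. The Python sketch below illustrates this with hypothetical temperatures; the function names are not from the slides.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def loo_mean_hindcasts(y):
    """Hindcast each year as the mean of all the other years."""
    total = sum(y)
    n = len(y)
    return [(total - yi) / (n - 1) for yi in y]


# Hypothetical seasonal mean temperatures, one value per year.
y = [14.2, 13.1, 15.0, 12.8, 14.6, 13.9]
r = pearson(loo_mean_hindcasts(y), y)  # -1 up to rounding error
```

With wider windows the hindcast no longer depends linearly on the omitted observation alone, which is consistent with the correlation rising as the window length grows.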
25 Cross-validation and bias If the model is correct, leave-k-out cross-validation will always underestimate forecast performance, but the underestimation usually decreases as k increases (up to a point). It is therefore desirable to make k large.
26 Cross-validation Cross-validation is poorly designed for determining whether the model parameters are correctly estimated. But cross-validation (appropriately implemented) is well designed for determining whether the right model has been selected.
27 Multiplicity A multiplicity of candidate predictors results in positive biases in performance measures.
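A small simulation can illustrate the effect. The Python sketch below is a hypothetical experiment, not taken from the slides: the predictand and every candidate predictor are independent noise, so the true correlation is zero, yet selecting the best candidate after looking at the data produces a sample correlation that is at least as large as that of any single pre-chosen predictor. The function name, sample sizes, and seed are assumptions.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def best_of_noise(n_years, n_candidates, seed=0):
    """Return (|r| of the first candidate, largest |r| over all candidates)
    when every series is independent Gaussian noise."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_years)]
    correlations = []
    for _ in range(n_candidates):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_years)]
        correlations.append(abs(pearson(x, y)))
    return correlations[0], max(correlations)


# 30 years of data and 50 candidate predictors, all unrelated to y.
single, best = best_of_noise(30, 50)
print(f"pre-chosen predictor |r| = {single:.2f}, selected best |r| = {best:.2f}")
```

Increasing `n_candidates` while holding `n_years` fixed tends to push the selected correlation further from zero, which is the positive bias the slide refers to.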
28 Cross-validation and multiplicity Model selection criteria indicate improved model selection for leave-k-out over leave-1-out cross-validation.
29 Cross-validation and multiplicity although if the number of candidate predictors becomes too large it becomes impossible to avoid overfitting.
30 Cross-validation If cross-validation can be used effectively to determine whether the right model has been selected, why not use cross-validation as a selection procedure rather than verification procedure? Cross-validation as a model selection procedure involves selecting the model with the best predictive capability rather than the best descriptive capability.
31 Cross-validation But how can we now verify the forecasts? If we are picking the model that gives the best cross-validated skill estimates, we cannot then use cross-validation to estimate its predictive skill. We again are subject to the problem of picking the best results, and assuming that those results will be a good indication of true performance.
32 Retroactive forecasting [Diagram: forecasts for 1983, 1984 and 1985 are each made from a training period of earlier years beginning in 1981, with later years omitted.]
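Retroactive forecasting can be sketched in Python as follows. The example assumes a one-predictor least-squares regression and a training period that grows by one year after each forecast; whether the original design used a growing or a fixed-length period is not recoverable from the transcript, so that choice, the function names, and the data are all assumptions.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def retroactive_hindcasts(x, y, first_target):
    """Forecast each year from first_target onwards using only earlier
    years, so no future information reaches the training sample."""
    if first_target < 2:
        raise ValueError("need at least two training years")
    hindcasts = []
    for t in range(first_target, len(y)):
        intercept, slope = fit_line(x[:t], y[:t])  # years before t only
        hindcasts.append(intercept + slope * x[t])
    return hindcasts


# Hypothetical series; the first four years are used only for training.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.2, 1.9, 3.3, 3.8, 5.1, 6.2, 6.8]
hindcasts = retroactive_hindcasts(x, y, first_target=4)
```

The early years never receive a forecast, which is why this design needs a longer record than cross-validation.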
33 But what if we do not have enough data?
34 Two-deep cross-validation [Diagram, years 1951 to 1955: within each training period some years are predicted and others omitted, with the years colour-coded by role.] Use the orange years for model selection. Use the green years for model verification.
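A nested loop is one way to implement the two-deep design. In the Python sketch below, the outer loop withholds one year for verification, and the inner loop runs leave-one-out cross-validation on the remaining years to choose among candidate predictors. The one-predictor regression, the squared-error selection criterion, the single-year windows, and all names are assumptions made for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_squared_error(x, y):
    """Inner loop: leave-one-out mean squared error of the regression."""
    total = 0.0
    for i in range(len(y)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (intercept + slope * x[i] - y[i]) ** 2
    return total / len(y)


def two_deep_hindcasts(candidates, y):
    """Outer loop: verify on a year that played no part in selection."""
    hindcasts, chosen = [], []
    for i in range(len(y)):
        y_train = y[:i] + y[i + 1:]
        x_trains = [c[:i] + c[i + 1:] for c in candidates]
        # Model selection uses only the years left after omitting year i.
        best = min(range(len(candidates)),
                   key=lambda j: loo_squared_error(x_trains[j], y_train))
        intercept, slope = fit_line(x_trains[best], y_train)
        hindcasts.append(intercept + slope * candidates[best][i])
        chosen.append(best)
    return hindcasts, chosen


# Hypothetical candidate predictors and predictand, one entry per year.
candidates = [[3.0, 1.0, 4.0, 1.0, 5.0, 9.0], [0.2, 1.1, 1.9, 3.2, 3.9, 5.1]]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
hindcasts, chosen = two_deep_hindcasts(candidates, y)
```

Because the selected model may differ from year to year, the resulting skill estimate describes the whole selection-and-fitting procedure rather than any one model.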
35 Summary Cross-validation is designed to give an indication of the predictive skill of a model, rather than its descriptive skill. For reliable estimates of predictive skill, leakage needs to be avoided. Completely withholding information from the training sample is not a guarantee against leakage, although using large cross-validation windows helps. In practice, most leakage results from only considering the first of two sources of uncertainty: 1. Are the model parameters correct? 2. Is the model correct?
36 Summary Cross-validation will underestimate the performance of a correct model because it is not well-designed for measuring sampling errors in parameter estimates. It is better designed for identifying biases in performance measures that result from incorrect model specification. As such, cross-validation can be used as an effective method of model selection.
37 Recommendations Cross-validation should only be performed if the model variables and format are not known a priori. Otherwise jackknifing or bootstrapping would be more appropriate. Leave-one-out cross-validation should almost always be avoided, and large cross-validation windows should be used, if possible. Retroactive validation is preferable to cross-validation where sample sizes are sufficiently large. Otherwise two-deep cross-validation is an attractive option that has not been widely used.
38 Recommended readings
Barnston, A. G., and H. M. van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6.
Browne, M. W., 2000: Cross-validation methods. J. Math. Psych., 44.
Elsner, J. B., and C. P. Schmertmann, 1994: Assessing forecast skill through cross validation. Wea. Forecasting, 9.
Livezey, R. E., A. G. Barnston, and B. K. Neumeister, 1990: Mixed analog/persistence prediction of United States seasonal mean temperatures. Int. J. Climatol., 10.
Michaelsen, J., 1987: Cross-validation in statistical climate forecast models. J. Clim. Appl. Meteorol., 26.
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationCourse Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
More informationPower Analysis for Correlation & Multiple Regression
Power Analysis for Correlation & Multiple Regression Sample Size & multiple regression Subject-to-variable ratios Stability of correlation values Useful types of power analyses Simple correlations Full
More informationThe Null Hypothesis. Geoffrey R. Loftus University of Washington
The Null Hypothesis Geoffrey R. Loftus University of Washington Send correspondence to: Geoffrey R. Loftus Department of Psychology, Box 351525 University of Washington Seattle, WA 98195-1525 gloftus@u.washington.edu
More informationCorrelational Research
Correlational Research Chapter Fifteen Correlational Research Chapter Fifteen Bring folder of readings The Nature of Correlational Research Correlational Research is also known as Associational Research.
More informationCONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,
More informationPredicting daily incoming solar energy from weather data
Predicting daily incoming solar energy from weather data ROMAIN JUBAN, PATRICK QUACH Stanford University - CS229 Machine Learning December 12, 2013 Being able to accurately predict the solar power hitting
More informationEvaluating Data Mining Models: A Pattern Language
Evaluating Data Mining Models: A Pattern Language Jerffeson Souza Stan Matwin Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa K1N 6N5, Canada {jsouza,stan,nat}@site.uottawa.ca
More informationINTRODUCTION TO MULTIPLE CORRELATION
CHAPTER 13 INTRODUCTION TO MULTIPLE CORRELATION Chapter 12 introduced you to the concept of partialling and how partialling could assist you in better interpreting the relationship between two primary
More informationReview of Transpower s. electricity demand. forecasting methods. Professor Rob J Hyndman. B.Sc. (Hons), Ph.D., A.Stat. Contact details: Report for
Review of Transpower s electricity demand forecasting methods Professor Rob J Hyndman B.Sc. (Hons), Ph.D., A.Stat. Contact details: Telephone: 0458 903 204 Email: robjhyndman@gmail.com Web: robjhyndman.com
More informationQuantile Regression for Peak Demand Forecasting
Quantile Regression for Peak Demand Forecasting Charlie Gibbons Ahmad Faruqui July 1, 2014 Copyright 2013 The Brattle Group, Inc. Outline Approaches to forecasting peak demand Our empirical approach OLS
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationThe NCAA Basketball Betting Market: Tests of the Balanced Book and Levitt Hypotheses
The NCAA Basketball Betting Market: Tests of the Balanced Book and Levitt Hypotheses Rodney J. Paul, St. Bonaventure University Andrew P. Weinbach, Coastal Carolina University Kristin K. Paul, St. Bonaventure
More information4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4
4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression
More informationE10: Controlled Experiments
E10: Controlled Experiments Quantitative, empirical method Used to identify the cause of a situation or set of events X is responsible for Y Directly manipulate and control variables Correlation does not
More informationAim To help students prepare for the Academic Reading component of the IELTS exam.
IELTS Reading Test 1 Teacher s notes Written by Sam McCarter Aim To help students prepare for the Academic Reading component of the IELTS exam. Objectives To help students to: Practise doing an academic
More informationScatter Plot, Correlation, and Regression on the TI-83/84
Scatter Plot, Correlation, and Regression on the TI-83/84 Summary: When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. This page
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationEvent driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
More informationLocation matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is
More informationDATA MINING SPECIES DISTRIBUTION AND LANDCOVER. Dawn Magness Kenai National Wildife Refuge
DATA MINING SPECIES DISTRIBUTION AND LANDCOVER Dawn Magness Kenai National Wildife Refuge Why Data Mining Random Forest Algorithm Examples from the Kenai Species Distribution Model Pattern Landcover Model
More information