Introduction to nonparametric regression: Least squares vs. Nearest neighbors



Patrick Breheny
October 30
STA 621: Nonparametric Statistics

Introduction

For the remainder of the course, we will focus on nonparametric regression and classification. The regression problem involves modeling how the expected value of a response y changes in response to changes in an explanatory variable x:

E(y | x) = f(x)

Linear regression, as its name implies, assumes a linear relationship; namely, that f(x) = β0 + β1x.
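The slides contain no code, but as a minimal sketch (my own illustration, with made-up data-generating values), the parametric fit amounts to solving a least squares problem:

```python
# Minimal sketch: fit f(x) = b0 + b1*x by ordinary least squares with numpy.
# The true coefficients (2.0 and 0.5) are illustrative, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

X = np.column_stack([np.ones_like(x), x])         # design matrix [1, x]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares estimate
print(beta_hat)                                   # roughly [2.0, 0.5]
```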

Parametric vs. nonparametric approaches

This reduction of a complicated function to a simple form with a small number of unknown parameters is very similar to the parametric approach to estimation and inference involving the unknown distribution function. The nonparametric approach, in contrast, is to make as few assumptions about the regression function f as possible. Instead, we will try to use the data as much as possible to learn about the potential shape of f, allowing f to be very flexible, yet smooth.

Simple local models

One way to achieve this flexibility is by fitting a different, simple model separately at every point x0, in much the same way that we used kernels to estimate densities. As with kernel density estimates, this is done using only those observations close to x0 to fit the simple model. As we will see, it is possible to extend many of the same kernel ideas we have already discussed to smoothly blend these local models and construct a smooth estimate f̂ of the relationship between x and y.
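The slides describe this idea only in words; the sketch below is my own rough illustration of the simplest version, a local constant fit at each point by kernel-weighted averaging (the Gaussian kernel, the bandwidth h = 0.5, and the sine-curve data are all arbitrary choices for the example):

```python
# Rough sketch of a "simple local model": at each point x0, fit a local
# constant by kernel-weighted averaging of the nearby y values.
import numpy as np

def local_average(x0, x, y, h=0.5):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)  # Gaussian weights: near points dominate
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
f_hat = np.array([local_average(x0, x, y) for x0 in x])  # smooth estimate of f
```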

Introduction

Before we do so, however, let us get a general feel for the contrast between parametric and nonparametric classification by contrasting two simple, but very different, methods: the ordinary least squares regression model and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about the structure of the problem, but is quite stable. Nearest neighbors is virtually assumption-free, but its results can be quite unstable. Each method can be quite powerful in different settings and for different reasons.

Simulation settings

To examine which method is better in which setting, we will simulate data from a simple model in which y can take on one of two values: −1 or 1. The corresponding x values are derived from one of two settings:

Setting 1: x values are drawn from a bivariate normal distribution with different means for y = −1 and y = 1.

Setting 2: A mixture in which 10 sets of means for each class (−1, 1) are drawn; x values are then drawn by randomly selecting a mean from the appropriate class and then generating a random bivariate normal observation with that mean.

A fair competition between the two methods is then how well they do at predicting whether a future observation is −1 or 1 given its x values.
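To make the setup concrete, here is a hedged numpy sketch of the two data-generating processes; the slides do not specify the means, scales, or covariances, so every numeric value below is an illustrative assumption:

```python
# Illustrative sketch of the two simulation settings (all numeric values
# are assumptions; the slides do not specify them).
import numpy as np

rng = np.random.default_rng(2)

def setting1(n):
    # One bivariate normal per class, with different means.
    y = rng.choice([-1, 1], size=n)
    mu = np.where(y[:, None] == 1, [1.0, 1.0], [-1.0, -1.0])
    return mu + rng.normal(size=(n, 2)), y

def setting2(n, means_neg, means_pos):
    # Mixture: pick one of the 10 class-specific means at random, then draw
    # a bivariate normal observation centered at that mean.
    y = rng.choice([-1, 1], size=n)
    x = np.empty((n, 2))
    for i, yi in enumerate(y):
        means = means_pos if yi == 1 else means_neg
        x[i] = means[rng.integers(len(means))] + rng.normal(scale=0.5, size=2)
    return x, y

# Ten sets of means per class, themselves drawn at random:
means_neg = rng.normal([-1.0, -1.0], 1.0, size=(10, 2))
means_pos = rng.normal([1.0, 1.0], 1.0, size=(10, 2))
x_train, y_train = setting2(200, means_neg, means_pos)
```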

Linear model results

[Figure: least squares classification boundaries and classified points for settings 1 and 2]

Linear model remarks

The linear model seems to classify points reasonably in setting 1. In setting 2, on the other hand, there are some regions which seem questionable. For example, in the lower left-hand corner of the plot, does it really make sense to predict blue given that all of the nearby points are red?

Nearest neighbors

Consider, then, a completely different approach in which we don't assume a model, a distribution, a likelihood, or anything else about the problem: we just look at nearby points and base our prediction on the average of those points. This approach is called the nearest-neighbor method, and is defined formally as

ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i,

where N_k(x) is the neighborhood of x defined by its k closest points in the sample.
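As a hedged sketch (my own, not code from the slides), the formula translates directly into a few lines of numpy; the default k = 15 is arbitrary:

```python
# Direct translation of the nearest-neighbor rule: average the y values of
# the k closest training points; sign(yhat) then gives the predicted class.
import numpy as np

def knn_predict(x0, x_train, y_train, k=15):
    dist = np.linalg.norm(x_train - x0, axis=1)  # Euclidean distances to x0
    nbrs = np.argsort(dist)[:k]                  # indices of N_k(x0)
    return y_train[nbrs].mean()                  # yhat(x0)
```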

Nearest neighbor results

[Figure: k-nearest-neighbor classification boundaries and classified points for settings 1 and 2]

Nearest neighbor remarks

Nearest neighbors seems not to perform terribly well in setting 1, as its classification boundaries are unnecessarily complex and unstable. On the other hand, the method seemed perhaps better than the linear model in setting 2, where a complex, curved boundary seems to fit the data better. Furthermore, the choice of k plays a big role in the fit, and the optimal k might not be the same in settings 1 and 2.

Inference

Of course, it is potentially misleading to judge whether a method is better simply because it fits the sample better. What matters is how well its predictions generalize to new samples. Thus, consider generating 100 data sets of size 200, fitting each model, and then measuring how well each method does at predicting 10,000 new, independent observations.
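A sketch of that evaluation loop, reusing the illustrative setting1 and knn_predict functions from the earlier sketches (the least squares classifier, and a grid of k values, would be scored the same way):

```python
# Evaluation protocol: 100 training sets of size 200, each scored on
# 10,000 independent test observations; report average misclassification.
import numpy as np

n_reps, n_train, n_test, k = 100, 200, 10_000, 15
errors = []
for _ in range(n_reps):
    x_tr, y_tr = setting1(n_train)
    x_te, y_te = setting1(n_test)
    y_hat = np.array([np.sign(knn_predict(x0, x_tr, y_tr, k)) for x0 in x_te])
    errors.append(np.mean(y_hat != y_te))
print(np.mean(errors))  # estimated misclassification rate for this k
```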

Simulation results

[Figure: misclassification rate versus k for settings 1 and 2, with k ranging from 0 to 150 and misclassification rates from roughly 0.24 to 0.32; black line = least squares, blue line = nearest neighbors]

Remarks

In setting 1, linear regression was always better than nearest neighbors. In setting 2, nearest neighbors was usually better than linear regression; however, it wasn't always better: when k was too big or too small, the nearest-neighbor method performed poorly. In setting 1, the bigger k was, the better; in setting 2, there was a Goldilocks value of k (about 25) that proved optimal in balancing the bias-variance tradeoff.

Conclusions

Thus, fitting an ordinary linear model is rarely the best we can do. On the other hand, nearest neighbors is rarely stable enough to be ideal, even in modest dimensions, unless our sample size is very large (recall the curse of dimensionality).

Conclusions (cont'd)

These two methods stand at opposite ends of the methodological spectrum with regard to assumptions and structure. The methods we will discuss for the remainder of the course involve bridging the gap between these two extremes: making linear regression more flexible, adding structure and stability to nearest-neighbor ideas, or combining concepts from both. As with kernel density estimation, the main theme that emerges is the need to apply methods that bring the mix of flexibility and stability appropriate for the data.