Local classification and local likelihoods

1 Local classification and local likelihoods November 18

2 k-nearest neighbors

The idea of local regression can be extended to classification as well. The simplest way of doing so is called nearest neighbor classification. The idea is very simple: given a point $x_0$, we find the $k$ observations in the data set that are closest in distance to $x_0$ (these are $x_0$'s neighbors). The most likely classification at $x_0$ is then decided by majority vote among the $k$ neighbors.
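The majority-vote rule can be sketched in a few lines. This is a minimal illustration, assuming a single numeric predictor and absolute distance; the function name `knn_classify` and the toy data are hypothetical and are not the heart study data.

```python
from collections import Counter

def knn_classify(x0, xs, ys, k):
    """Classify x0 by majority vote among its k nearest neighbors (1-D predictor)."""
    # Rank observation indices by distance to the target point x0.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    neighbors = order[:k]
    # Tally the class labels of the k neighbors and return the most common one.
    votes = Counter(ys[i] for i in neighbors)
    return votes.most_common(1)[0][0]

# Made-up data for illustration: predictor values and binary class labels.
xs = [110, 119, 126, 132, 140, 151, 160, 172, 185, 200]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

print(knn_classify(128, xs, ys, k=3))  # neighbors 126, 132, 119 have labels 0, 1, 0
print(knn_classify(190, xs, ys, k=3))  # neighbors 185, 200, 172 all have label 1
```

With two classes, an odd $k$ avoids tied votes; this sketch does not handle ties in any principled way.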

3 Bias-variance tradeoff for nearest neighbors

The number of neighbors, $k$, is the smoothing parameter. A small value for $k$ will produce a classifier with very low bias but high variance; the situation is reversed for large $k$.

There is an important distinction between the bias-variance tradeoff for nearest neighbors and for kernel approaches, however. With kernel density classifiers with a constant bandwidth, the bias remains relatively constant across $x_0$, but the variance changes depending on the local density. With k-nearest neighbors, the variance remains relatively constant across $x_0$ because the same number of observations always goes into the classifier, but the bias changes depending on the local density, as the neighborhood must expand in low-density regions.
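The contrast can be made concrete with a small sketch. The helper names and the data below are made up for illustration: with $k$ fixed, the window half-width grows in sparse regions, whereas with a fixed bandwidth $h$ the number of observations in the window shrinks.

```python
def kth_neighbor_distance(x0, xs, k):
    """Distance from x0 to its k-th nearest observation (the k-NN window half-width)."""
    return sorted(abs(x - x0) for x in xs)[k - 1]

def count_in_window(x0, xs, h):
    """Number of observations within a fixed bandwidth h of x0."""
    return sum(1 for x in xs if abs(x - x0) <= h)

# Made-up data: densely sampled between 120 and 128, sparse above 160.
xs = [120, 121, 122, 123, 124, 125, 126, 127, 128, 160, 180, 200]

# Fixed k = 5: the neighborhood is narrow where data are dense, wide where sparse.
print(kth_neighbor_distance(124, xs, k=5), kth_neighbor_distance(180, xs, k=5))

# Fixed h = 5: the window holds many points where data are dense, few where sparse.
print(count_in_window(124, xs, h=5), count_in_window(180, xs, h=5))
```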

4 Cross-validation for the heart study

A simple measure of classification accuracy is the misclassification rate: the proportion of observations misclassified by the majority vote among the nearest neighbors.

[Figure: CV misclassification rate versus k]
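A cross-validated misclassification rate can be computed as sketched below. The slides do not say which form of cross-validation was used, so leave-one-out is an assumption here; the function names and the toy data are likewise hypothetical.

```python
from collections import Counter

def knn_classify(x0, xs, ys, k):
    """Majority vote among the k nearest neighbors of x0 (1-D predictor)."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    votes = Counter(ys[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_misclassification(xs, ys, k):
    """Leave-one-out CV: classify each point from all the others, count the errors."""
    errors = 0
    for i in range(len(xs)):
        # Hold out observation i and classify it using the remaining data.
        xs_train = xs[:i] + xs[i + 1:]
        ys_train = ys[:i] + ys[i + 1:]
        if knn_classify(xs[i], xs_train, ys_train, k) != ys[i]:
            errors += 1
    return errors / len(xs)

# Made-up data for illustration (not the heart study).
xs = [110, 119, 126, 132, 140, 151, 160, 172, 185, 200]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

for k in (1, 3):
    print(k, loo_misclassification(xs, ys, k))
```

Evaluating this rate over a grid of $k$ values gives a curve of the kind the slide plots.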

5 Heart study: Nearest-neighbor classification

Comparing nearest-neighbor classification with $k = 63$ to kernel density classification and logistic regression:

[Figure: Pr(CHD) versus systolic blood pressure for the kernel, logistic, and k-NN classifiers]

6 Improving on nearest-neighbors

Despite its simplicity, k-nearest neighbors actually performs pretty well in many problems. However, just like the local average, it can be improved upon in two ways. First, we can use kernels to allow the weight given to observation $i$ to vary continuously as a function of the distance between $x_i$ and $x_0$. Second, we can fit a local logistic regression model instead of majority voting: $\text{logit}\{\Pr(y_i = 1)\} = f(x_i)$.
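The first improvement, continuously varying weights, can be sketched as a kernel-weighted average of the class labels. The Gaussian kernel, the bandwidth `h`, and the function names are assumptions chosen for illustration; the slides do not specify a kernel.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (an assumed choice of K)."""
    return math.exp(-0.5 * u * u)

def kernel_weighted_prob(x0, xs, ys, h):
    """Estimate Pr(y = 1) at x0 as a kernel-weighted average of the 0/1 labels."""
    # Weights decay smoothly with distance from x0 rather than dropping to zero
    # outside a k-neighbor window.
    weights = [gaussian_kernel((x - x0) / h) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Made-up data for illustration (not the heart study).
xs = [110, 119, 126, 132, 140, 151, 160, 172, 185, 200]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

print(kernel_weighted_prob(115, xs, ys, h=10))
print(kernel_weighted_prob(190, xs, ys, h=10))
```

Classifying as 1 whenever the estimate exceeds 0.5 is the weighted analogue of the majority vote.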

7 The principle is the same as loess, although instead of minimizing the residual sum of squares, we maximize the log likelihood of the logistic regression model, fitting a new pair of regression coefficients at each target point $x_0$:

$$(\hat\alpha, \hat\beta) = \arg\max_{\alpha,\beta} \sum_i K_h(x_0, x_i)\, \ell(y_i, \hat\pi_i),$$

where the contribution of observation $i$ to the likelihood is once again weighted by the kernel, and

$$\hat\pi_i = \frac{e^{\alpha + x_i\beta}}{1 + e^{\alpha + x_i\beta}}, \qquad \ell(y_i, \hat\pi_i) = y_i \log(\hat\pi_i) + (1 - y_i) \log(1 - \hat\pi_i).$$
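The objective being maximized can be written out directly. The sketch below evaluates the kernel-weighted log likelihood for a given $(\alpha, \beta)$; the Gaussian kernel with $K_h(x_0, x_i) = K((x_i - x_0)/h)$, the function names, and the toy data are assumptions for illustration.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (an assumed choice of K)."""
    return math.exp(-0.5 * u * u)

def log_lik_term(y, eta):
    """l(y, pi) = y*log(pi) + (1 - y)*log(1 - pi), expressed via eta = logit(pi)."""
    # log(1 + exp(eta)), arranged to avoid overflow when |eta| is large.
    log1pexp = eta + math.log1p(math.exp(-eta)) if eta > 0 else math.log1p(math.exp(eta))
    return y * eta - log1pexp

def local_log_likelihood(alpha, beta, x0, xs, ys, h):
    """Kernel-weighted log likelihood at target point x0 for coefficients (alpha, beta)."""
    return sum(
        gaussian_kernel((x - x0) / h) * log_lik_term(y, alpha + x * beta)
        for x, y in zip(xs, ys)
    )

# Made-up data for illustration (not the heart study).
xs = [110, 119, 126, 132, 140, 151, 160, 172, 185, 200]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

# A positive slope fits these data better near x0 = 150 than the null model does.
print(local_log_likelihood(0.0, 0.0, 150, xs, ys, h=20))
print(local_log_likelihood(-7.5, 0.05, 150, xs, ys, h=20))
```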

8 Fitting via IRLS

Fitting of logistic regression models already proceeds according to an iteratively reweighted least squares (IRLS) algorithm, which easily incorporates the kernel weights. The weight given to observation $i$ in a given iteration of the IRLS algorithm is then the product of the weight coming from the quadratic approximation to the likelihood and the weight coming from the kernel ($w_i = w_{1i} w_{2i}$).
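A minimal sketch of kernel-weighted IRLS for a single predictor follows. It is not locfit's implementation: the Gaussian kernel, the centering of the predictor at $x_0$ (so that the intercept is the logit at $x_0$), the stopping rule, and all names are assumptions made for illustration.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (an assumed choice of K)."""
    return math.exp(-0.5 * u * u)

def expit(eta):
    """Inverse logit, arranged to avoid overflow when |eta| is large."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def local_logistic_fit(x0, xs, ys, h, max_iter=100, tol=1e-10):
    """Fit logit(pi_i) = a + b*(x_i - x0) by kernel-weighted IRLS; returns (a, b)."""
    zs = [x - x0 for x in xs]                      # predictor centered at x0
    w2 = [gaussian_kernel(z / h) for z in zs]      # kernel weights, fixed across iterations
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        s0 = s1 = s2 = t0 = t1 = 0.0
        for z, y, k in zip(zs, ys, w2):
            eta = a + b * z
            pi = expit(eta)
            w1 = max(pi * (1.0 - pi), 1e-12)       # weight from the quadratic approximation
            w = w1 * k                             # combined weight w_i = w_1i * w_2i
            u = eta + (y - pi) / w1                # working response
            s0 += w
            s1 += w * z
            s2 += w * z * z
            t0 += w * u
            t1 += w * z * u
        # Weighted least squares of u on (1, z): solve the 2x2 normal equations.
        det = s0 * s2 - s1 * s1
        a_new = (s2 * t0 - s1 * t1) / det
        b_new = (s0 * t1 - s1 * t0) / det
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

# Made-up data for illustration (not the heart study).
xs = [110, 119, 126, 132, 140, 151, 160, 172, 185, 200]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

a, b = local_logistic_fit(150, xs, ys, h=20)
print(expit(a))  # local estimate of Pr(y = 1) at x0 = 150
```

Repeating the fit over a grid of target points traces out the estimated probability curve. A kernel with compact support could leave a neighborhood perfectly separated, in which case the iterations would not settle; this sketch does not guard against that.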

9 Heart study

[Figure: Pr(CHD) versus systolic blood pressure for the kernel, logistic, k-NN, and local logit fits]

10 Cross-validation

The deviance (-2 times the log-likelihood) or average deviance is a natural loss function to use in calculating a cross-validation score:

$$\text{CV} = -\frac{2}{n} \sum_{i=1}^{n} \ell(y_i, \hat\pi_{(-i)}),$$

where $\hat\pi_{(-i)}$ is the estimate for observation $i$ from the fit that leaves it out.

Unfortunately, the many simplifications and closed forms that we derived in the previous lecture for cross-validation involving least squares regression and squared error loss no longer hold.

Deviance loss differs from misclassification (0-1) loss in the sense that the loss incurred by an incorrect prediction with $\hat\pi_i = 0.6$ is less than the loss incurred by an incorrect prediction with $\hat\pi_i = 0.95$, resulting in a smoother curve.
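The difference between the two losses is easy to check numerically. In the sketch below the function names are hypothetical, and thresholding the fitted probability at 0.5 is an assumed classification rule.

```python
import math

def deviance_loss(y, pi_hat):
    """Deviance contribution -2 * l(y, pi_hat) for one binary observation."""
    return -2.0 * (y * math.log(pi_hat) + (1 - y) * math.log(1.0 - pi_hat))

def zero_one_loss(y, pi_hat):
    """1 if thresholding pi_hat at 0.5 misclassifies y, else 0."""
    predicted = 1 if pi_hat > 0.5 else 0
    return int(predicted != y)

def cv_deviance(ys, loo_probs):
    """Average deviance, -(2/n) * sum of l(y_i, pi_hat_(-i)), from leave-one-out probabilities."""
    return sum(deviance_loss(y, p) for y, p in zip(ys, loo_probs)) / len(ys)

# Two incorrect predictions for an observation with y = 0.
for p in (0.6, 0.95):
    print(p, zero_one_loss(0, p), round(deviance_loss(0, p), 3))
```

Both predictions incur a 0-1 loss of 1, but the deviance loss is about 1.83 for the mild error and about 5.99 for the confident one.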

11 CV: Misclassification versus deviance

[Figure: two panels, each plotted against k: cross-validated misclassification rate, and CV using deviance]

12 Effective degrees of freedom and GCV

The identities derived in the past lecture involving degrees of freedom and generalized cross-validation do not hold exactly. However, it is common to proceed by analogy with linear regression, defining $\text{tr}(L)$ as the effective degrees of freedom and using GCV to select smoothing parameters:

$$\text{GCV} = \frac{1}{n} \, \frac{-2 \sum_{i=1}^{n} \ell(y_i, \hat\pi_i)}{(1 - \text{tr}(L)/n)^2}$$
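The GCV criterion is a one-line computation once the fitted probabilities and the effective degrees of freedom are available. In this sketch the function name is hypothetical and $\text{tr}(L)$ is assumed to be supplied by the caller rather than computed.

```python
import math

def gcv_deviance(ys, pi_hats, trace_L):
    """GCV = (1/n) * (-2 * sum of l(y_i, pi_hat_i)) / (1 - tr(L)/n)^2."""
    n = len(ys)
    # Log likelihood of the fitted probabilities (no observations left out).
    loglik = sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(ys, pi_hats)
    )
    return (-2.0 * loglik / n) / (1.0 - trace_L / n) ** 2

# Made-up fitted probabilities; the penalty grows with the effective degrees of freedom.
ys = [0, 1, 1, 0]
pi_hats = [0.5, 0.5, 0.5, 0.5]
print(gcv_deviance(ys, pi_hats, trace_L=0))
print(gcv_deviance(ys, pi_hats, trace_L=2))
```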

13 CV versus GCV approximation

[Figure: effective degrees of freedom and CV using deviance, each plotted against k; legend: CV, GCV]

14 R implementation

Fitting local logistic regression models in R can be accomplished using the locfit package. The syntax is the same, except that one can specify family="binomial" to obtain logistic regression:

    library(locfit)
    fit <- locfit(chd~lp(sbp,nn=.8,deg=2), family="binomial")

Confidence bands can be calculated via scb, with simul controlling whether to calculate simultaneous or pointwise bands:

    ci.simult <- scb(chd~lp(sbp,nn=.8,deg=2), family="binomial")
    ci.ptwise <- scb(chd~lp(sbp,nn=.8,deg=2), family="binomial", simul=FALSE)

15 Pointwise bands

[Figure: pointwise bands for $\hat f(x)$ and $\hat\pi(x)$ versus systolic blood pressure]

16 Simultaneous bands

[Figure: simultaneous bands for $\hat f(x)$ and $\hat\pi(x)$ versus systolic blood pressure]

17 Extensions

We have concentrated on local linear regression models for continuous data and local logistic regression models for binary data, but the idea is very general. Any independent and identically distributed data that can be modeled with a likelihood can be modeled nonparametrically by fitting a simple parametric model with continuously varying local weights. Poisson regression and Gamma regression are also implemented in locfit.
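As one illustration of the general recipe, the local likelihood for count data might be written as follows. This is a sketch under stated assumptions: the log link and the symbol $\mu_i$ are choices made here and are not taken from the slides.

```latex
(\hat\alpha, \hat\beta) = \arg\max_{\alpha,\beta} \sum_i K_h(x_0, x_i)\, \ell(y_i, \mu_i),
\qquad \mu_i = e^{\alpha + x_i \beta},
\qquad \ell(y_i, \mu_i) = y_i \log(\mu_i) - \mu_i - \log(y_i!)
```

Only the likelihood contribution $\ell$ changes relative to the logistic case; the kernel weighting and the refitting at each target point $x_0$ are the same.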
