Local classification and local likelihoods November 18
k-nearest neighbors
The idea of local regression can be extended to classification as well. The simplest way of doing so is called nearest-neighbor classification. The idea is very simple: given a point x0, we find the k observations in the data set that are closest in distance to x0 (these are x0's neighbors). The most likely classification at x0 is then decided by majority vote among the k neighbors.
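As a quick illustration, the majority-vote rule can be sketched in a few lines. This is a Python sketch with made-up toy data; the function name knn_classify is ours, not from any package:

```python
import numpy as np

def knn_classify(x0, X, y, k):
    """Classify x0 by majority vote among its k nearest neighbors.

    X: 1-D array of predictor values; y: 0/1 class labels."""
    dist = np.abs(X - x0)          # distance from x0 to every observation
    nn = np.argsort(dist)[:k]      # indices of the k closest points (x0's neighbors)
    # Majority vote: predict class 1 if more than half the neighbors are 1
    return int(y[nn].mean() > 0.5)

# Toy example: class 1 tends to occur at larger x
X = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(1.2, X, y, k=3))  # -> 0
print(knn_classify(6.8, X, y, k=3))  # -> 1
```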
Bias-variance tradeoff for nearest neighbors
The number of neighbors, k, is the smoothing parameter. A small value of k will produce a classifier with very low bias but high variance; the situation is reversed for large k. There is an important distinction between the bias-variance tradeoff for nearest neighbors and that for kernel approaches, however: with kernel density classifiers using a constant bandwidth, the bias remains relatively constant across x0, but the variance changes depending on the local density. With k-nearest neighbors, the variance remains relatively constant across x0 because the same number of observations always goes into the classifier, but the bias changes depending on the local density, as the neighborhood must expand in low-density regions.
Cross-validation for the heart study
A simple measure of classification accuracy is the misclassification rate: the proportion of observations misclassified by the majority vote among the nearest neighbors.
[Figure: CV misclassification rate (roughly 0.33 to 0.40) plotted against k, for k from 0 to 100]
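The cross-validated misclassification rate can be computed directly by leaving each observation out in turn. A minimal Python sketch, again with simulated toy data of our own choosing:

```python
import numpy as np

def loocv_misclass(X, y, k):
    """Leave-one-out CV estimate of the misclassification rate for k-NN."""
    errors = 0
    for i in range(len(X)):
        # Hold out observation i and classify it from the remaining data
        Xtr = np.delete(X, i)
        ytr = np.delete(y, i)
        nn = np.argsort(np.abs(Xtr - X[i]))[:k]
        pred = int(ytr[nn].mean() > 0.5)
        errors += (pred != y[i])
    return errors / len(X)

# Two overlapping classes; compare a few neighborhood sizes
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 30), rng.normal(2, 1, 30)])
y = np.concatenate([np.zeros(30, int), np.ones(30, int)])
for k in (1, 5, 15):
    print(k, loocv_misclass(X, y, k))
```

One would then choose k near the minimum of this curve, as in the figure above.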
Heart study: Nearest-neighbor classification
Comparing nearest-neighbor classification with k = 63 to kernel density classification and logistic regression:
[Figure: Pr(CHD) versus systolic blood pressure (100 to 220), showing kernel, logistic, and k-NN fits]
Improving on nearest neighbors
Despite its simplicity, k-nearest neighbors actually performs pretty well in many problems. However, just like the local average, it can be improved upon in two ways:
- Using kernels to allow the weight given to observation i to vary continuously as a function of the distance between x_i and x_0
- Fitting a local logistic regression model instead of majority voting: logit Pr(y_i = 1) = f(x_i)
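The first improvement can be sketched as replacing the hard vote with a kernel-weighted proportion. A Python sketch assuming a Gaussian kernel (both the kernel choice and the toy data are our own):

```python
import numpy as np

def kernel_classify(x0, X, y, h):
    """Kernel-weighted vote: the weight on y_i decays smoothly with |x_i - x0|."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)  # Gaussian kernel weights
    phat = np.sum(w * y) / np.sum(w)        # kernel-weighted proportion of class 1
    return phat

# Toy data: class 1 is more common at larger x
X = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(kernel_classify(1.2, X, y, h=1.0))  # near 0
print(kernel_classify(6.8, X, y, h=1.0))  # near 1
```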
The principle is the same as loess, although instead of minimizing the residual sum of squares, we maximize the log-likelihood of the logistic regression model, fitting a new pair of regression coefficients at each target point x_0:

(α̂, β̂) = arg max_{α,β} Σ_i K_h(x_0, x_i) ℓ(y_i, π_i),

where the contribution of observation i to the likelihood is once again weighted by the kernel, and

π_i = e^{α + x_i β} / (1 + e^{α + x_i β})
ℓ(y_i, π_i) = y_i log(π_i) + (1 − y_i) log(1 − π_i)
Fitting via IRLS
Fitting of logistic regression models already proceeds by an iteratively reweighted least squares (IRLS) algorithm, which easily incorporates the kernel weights. The weight given to observation i in a given iteration of the IRLS algorithm is then a product of the weight coming from the quadratic approximation to the likelihood and the weight coming from the kernel (w_i = w_{1i} w_{2i}).
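The kernel-weighted IRLS iteration can be sketched as follows. This is a minimal Python sketch under our own assumptions (Gaussian kernel, a local intercept-and-slope parameterization, simulated data, and a tiny ridge term added purely for numerical stability), not the locfit implementation:

```python
import numpy as np

def local_logit(x0, X, y, h, iters=15):
    """Locally weighted logistic fit at x0 via kernel-weighted IRLS/Newton."""
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)          # kernel weights w_{1i}
    Z = np.column_stack([np.ones_like(X), X - x0])  # local intercept and slope
    beta = np.zeros(2)
    for _ in range(iters):
        pi = 1 / (1 + np.exp(-(Z @ beta)))
        w = K * pi * (1 - pi)                       # w_i = w_{1i} * w_{2i}
        grad = Z.T @ (K * (y - pi))                 # kernel-weighted score
        H = Z.T @ (w[:, None] * Z) + 1e-6 * np.eye(2)  # small ridge for stability
        beta = beta + np.linalg.solve(H, grad)      # Newton / weighted LS update
    return 1 / (1 + np.exp(-beta[0]))               # pi-hat evaluated at x0 itself

# Simulated data with Pr(y = 1 | x) increasing in x
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 4, 60))
y = rng.binomial(1, 1 / (1 + np.exp(-(X - 2))))
print(local_logit(0.5, X, y, h=1.0), local_logit(3.5, X, y, h=1.0))
```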
Heart study
[Figure: Pr(CHD) versus systolic blood pressure (100 to 220), comparing kernel, logistic, k-NN, and local logit fits]
Cross-validation
The deviance (−2 times the log-likelihood) or average deviance is a natural loss function to use in calculating a cross-validation score:

CV = −(2/n) Σ_{i=1}^n ℓ(y_i, π̂_i^(−i))

Unfortunately, the many simplifications and closed forms that we derived for cross-validation involving least squares regression and squared error loss in the previous lecture no longer hold. This differs from misclassification (0-1) loss in the sense that the loss incurred by an incorrect prediction with π̂_i = 0.6 is less than the loss incurred by an incorrect prediction with π̂_i = 0.95, resulting in a smoother curve.
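A short Python sketch makes the last point concrete: the same wrong prediction costs more deviance the more confident it is. The function name and the clipping guard are our own choices:

```python
import numpy as np

def cv_deviance(pi_hat_loo, y):
    """Average deviance: -(2/n) times the summed held-out log-likelihood."""
    eps = 1e-12                              # guard against log(0)
    p = np.clip(pi_hat_loo, eps, 1 - eps)
    ll = y * np.log(p) + (1 - y) * np.log(1 - p)
    return -2 * ll.mean()

# Both predictions are wrong (true class is 0), but the confident one costs more
print(cv_deviance(np.array([0.6]), np.array([0])))   # ~1.83
print(cv_deviance(np.array([0.95]), np.array([0])))  # ~5.99
```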
CV: Misclassification versus deviance
[Figure: two panels versus k (100 to 500): cross-validated misclassification rate (0.33 to 0.37) and CV using deviance (1.28 to 1.36)]
Effective degrees of freedom and GCV
The identities derived in the past lecture involving degrees of freedom and generalized cross-validation do not exactly hold. However, it is common to proceed by analogy with linear regression, defining tr(L) as the effective degrees of freedom and using GCV to select smoothing parameters:

GCV = −(2/n) Σ_{i=1}^n ℓ(y_i, π̂_i) / (1 − tr(L)/n)²
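As a small worked example of the formula, the score is just the average deviance inflated by the degrees-of-freedom penalty. The numbers below are made up for illustration:

```python
import numpy as np

def gcv(loglik_contribs, trL):
    """GCV = -(2/n) * sum_i l(y_i, pi_hat_i) / (1 - tr(L)/n)^2."""
    n = len(loglik_contribs)
    return (-2 / n) * np.sum(loglik_contribs) / (1 - trL / n) ** 2

# Log-likelihood contributions of three fairly confident correct predictions
ll = np.log(np.array([0.8, 0.7, 0.9]))
print(gcv(ll, trL=0.0))  # with tr(L) = 0 this is just the average deviance
print(gcv(ll, trL=1.0))  # charging one effective degree of freedom inflates the score
```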
CV versus GCV approximation
[Figure: CV using deviance (1.28 to 1.38) versus k (100 to 500), comparing the CV and GCV curves; a top axis marks the effective degrees of freedom (28.6, 10.7, 7.4, 5.1, 3.6)]
R implementation
Fitting local logistic regression models in R can be accomplished using the locfit package. The syntax is the same, except one can specify family="binomial" to obtain logistic regression:

fit <- locfit(chd~lp(sbp,nn=.8,deg=2), family="binomial")

Confidence bands can be calculated via scb, with simul controlling whether to calculate simultaneous or pointwise bands:

ci.simult <- scb(chd~lp(sbp,nn=.8,deg=2), family="binomial")
ci.ptwise <- scb(chd~lp(sbp,nn=.8,deg=2), family="binomial", simul=FALSE)
Pointwise bands
[Figure: pointwise confidence bands for f̂(x) (−2 to 3) and π̂(x) (0 to 1) versus systolic blood pressure (100 to 220)]
Simultaneous bands
[Figure: simultaneous confidence bands for f̂(x) (−2 to 3) and π̂(x) (0 to 1) versus systolic blood pressure (100 to 220)]
Extensions
We have concentrated on local linear regression models for continuous data and local logistic regression models for binary data, but the idea is very general: any independent and identically distributed data that can be modeled with a likelihood can be modeled nonparametrically by fitting a simple parametric model with continuously varying local weights. Poisson regression and Gamma regression are also implemented in locfit.