Local classification and local likelihoods




Local classification and local likelihoods November 18

k-nearest neighbors
The idea of local regression can be extended to classification as well. The simplest way of doing so is called nearest-neighbor classification. The idea is very simple: given a point x_0, we find the k observations in the data set that are closest in distance to x_0 (these are x_0's neighbors). The classification at x_0 is then decided by majority vote among the k neighbors.
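As a concrete illustration, here is a minimal sketch of majority-vote nearest-neighbor classification in Python; the one-dimensional data and the choice of k are hypothetical, made up for the example:

```python
# Hypothetical sketch of k-nearest-neighbor classification by majority vote.
import numpy as np

def knn_classify(x0, X, y, k):
    """Classify x0 by majority vote among its k nearest neighbors."""
    dists = np.abs(X - x0)             # 1-d predictor; Euclidean in general
    neighbors = np.argsort(dists)[:k]  # indices of the k closest observations
    votes = y[neighbors]
    return int(votes.mean() >= 0.5)    # majority vote for 0/1 labels

X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(2.5, X, y, k=3))    # → 0
print(knn_classify(10.5, X, y, k=3))   # → 1
```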

Bias-variance tradeoff for nearest neighbors
The number of neighbors, k, is the smoothing parameter. A small value of k produces a classifier with very low bias but high variance; the situation is reversed for large k. There is an important distinction between the bias-variance tradeoff for nearest neighbors and for kernel approaches, however. With kernel density classifiers using a constant bandwidth, the bias remains relatively constant across x_0, but the variance changes depending on the local density. With k-nearest neighbors, the variance remains relatively constant across x_0, because the same number of observations always goes into the classifier, but the bias changes depending on the local density, as the neighborhood must expand in low-density regions.

Cross-validation for the heart study
A simple measure of classification accuracy is the misclassification rate: the proportion of observations misclassified by the majority vote among the nearest neighbors.

[Figure: CV misclassification rate (roughly 0.33 to 0.40) plotted against k from 0 to 100.]

Heart study: Nearest-neighbor classification
Comparing nearest-neighbor classification with k = 63 to kernel density classification and logistic regression:

[Figure: Pr(CHD) as a function of systolic blood pressure (100 to 220), overlaying the kernel, logistic, and k-NN fits.]

Improving on nearest neighbors
Despite its simplicity, k-nearest neighbors actually performs pretty well in many problems. However, just like the local average, it can be improved upon in two ways:
- Using kernels to allow the weight given to observation i to vary continuously as a function of the distance between x_i and x_0
- Fitting a local logistic regression model instead of taking a majority vote: logit{Pr(y_i = 1)} = f(x_i)

The principle is the same as loess, although instead of minimizing the residual sum of squares, we maximize the log-likelihood of the logistic regression model, fitting a new pair of regression coefficients at each target point x_0:

(α̂, β̂) = arg max_{α,β} Σ_i K_h(x_0, x_i) ℓ(y_i, π̂_i),

where the contribution of observation i to the likelihood is once again weighted by the kernel, and

π̂_i = exp(α̂ + x_i β̂) / (1 + exp(α̂ + x_i β̂)),
ℓ(y_i, π̂_i) = y_i log(π̂_i) + (1 − y_i) log(1 − π̂_i).
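This local likelihood fit can be sketched numerically. The following is not the locfit implementation, just a minimal Newton-type maximization of the kernel-weighted log-likelihood at a single target point, with an illustrative Gaussian kernel, an assumed bandwidth h, and simulated data:

```python
# Sketch: fit (alpha, beta) at one target point x0 by maximizing the
# kernel-weighted Bernoulli log-likelihood via Newton iterations.
# The Gaussian kernel, bandwidth, and simulated data are illustrative.
import numpy as np

def local_logistic(x0, X, y, h, iters=25):
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)          # kernel weights K_h(x0, x_i)
    Z = np.column_stack([np.ones_like(X), X - x0])  # local intercept + slope
    theta = np.zeros(2)                             # (alpha, beta)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-Z @ theta))
        grad = Z.T @ (w * (y - pi))                 # weighted score
        H = (Z * (w * pi * (1 - pi))[:, None]).T @ Z  # weighted information
        theta += np.linalg.solve(H, grad)
    alpha = theta[0]
    return 1.0 / (1.0 + np.exp(-alpha))             # fitted pi-hat at x0

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(X - 5)))).astype(float)
print(local_logistic(5.0, X, y, h=2.0))  # probability near 0.5 at the midpoint
```

Because the predictor is centered at x_0, the fitted intercept is the local logit estimate, so π̂(x_0) comes directly from α̂.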

Fitting via IRLS
Fitting of logistic regression models already proceeds by an iteratively reweighted least squares (IRLS) algorithm, which easily incorporates the kernel weights. The weight given to observation i in a given iteration of the IRLS algorithm is then a product of the weight coming from the quadratic approximation to the likelihood (w_1i) and the weight coming from the kernel (w_2i): w_i = w_1i w_2i.

Heart study
[Figure: Pr(CHD) as a function of systolic blood pressure (100 to 220), comparing the kernel, logistic, k-NN, and local logit fits.]

Cross-validation
The deviance (−2 times the log-likelihood), or the average deviance, is a natural loss function to use in calculating a cross-validation score:

CV = −(2/n) Σ_{i=1}^n ℓ(y_i, π̂_i^(−i)),

where π̂_i^(−i) denotes the fitted probability for observation i with that observation left out. Unfortunately, the many simplifications and closed forms that we derived for cross-validation involving least squares regression and squared error loss in the previous lecture no longer hold. The deviance differs from misclassification (0-1) loss in the sense that the loss incurred by an incorrect prediction with π̂_i = 0.6 is less than the loss incurred by an incorrect prediction with π̂_i = 0.95, resulting in a smoother CV curve.
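A short sketch of the deviance score: given held-out predicted probabilities π̂ for binary outcomes y, average the Bernoulli log-likelihood contributions and multiply by −2. The outcome and probability values below are made up for illustration:

```python
# Deviance-based CV score from hypothetical leave-one-out probabilities.
import numpy as np

def cv_deviance(y, pi_hat):
    ll = y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat)
    return -2.0 * ll.mean()

y = np.array([1, 0, 1, 1, 0])
pi_hat = np.array([0.8, 0.3, 0.6, 0.9, 0.4])   # hypothetical LOO predictions
print(round(cv_deviance(y, pi_hat), 3))        # → 0.683

# A confident wrong prediction is penalized much more than a mild one:
print(round(cv_deviance(np.array([0]), np.array([0.6])), 3))   # → 1.833
print(round(cv_deviance(np.array([0]), np.array([0.95])), 3))  # → 5.991
```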

CV: Misclassification versus deviance
[Figure: two panels plotting the cross-validated misclassification rate (left, roughly 0.33 to 0.37) and the CV deviance (right, roughly 1.28 to 1.36) against k from 100 to 500.]

Effective degrees of freedom and GCV
The identities derived in the past lecture involving degrees of freedom and generalized cross-validation do not exactly hold here. However, it is common to proceed by analogy with linear regression, defining tr(L) as the effective degrees of freedom and using GCV to select smoothing parameters:

GCV = [−(2/n) Σ_{i=1}^n ℓ(y_i, π̂_i)] / (1 − tr(L)/n)²
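The GCV analogue is straightforward to compute once the in-sample deviance and tr(L) are available; the outcomes, probabilities, and tr(L) value below are illustrative, not taken from the heart study:

```python
# Sketch of the GCV score: in-sample average deviance inflated by the
# effective degrees of freedom tr(L). All numbers are made up.
import numpy as np

def gcv(y, pi_hat, tr_L):
    n = len(y)
    dev = -2.0 * np.sum(y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))
    return (dev / n) / (1 - tr_L / n) ** 2

y = np.array([1, 0, 1, 1, 0])
pi_hat = np.array([0.8, 0.3, 0.6, 0.9, 0.4])   # hypothetical fitted values
print(round(gcv(y, pi_hat, tr_L=2.0), 3))      # → 1.896
```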

CV versus GCV approximation
[Figure: CV deviance and its GCV approximation plotted against k from 100 to 500; the effective degrees of freedom decrease from 28.6 to 3.6 across this range.]

R implementation
Fitting local logistic regression models in R can be accomplished using the locfit package. The syntax is the same as before, except one specifies family="binomial" to obtain logistic regression:

fit <- locfit(chd ~ lp(sbp, nn = 0.8, deg = 2), family = "binomial")

Confidence bands can be calculated via scb, with simul controlling whether to calculate simultaneous or pointwise bands:

ci.simult <- scb(chd ~ lp(sbp, nn = 0.8, deg = 2), family = "binomial")
ci.ptwise <- scb(chd ~ lp(sbp, nn = 0.8, deg = 2), family = "binomial", simul = FALSE)

Pointwise bands
[Figure: pointwise confidence bands for f̂(x) (left) and π̂(x) (right) as functions of systolic blood pressure, 100 to 220.]

Simultaneous bands
[Figure: simultaneous confidence bands for f̂(x) (left) and π̂(x) (right) as functions of systolic blood pressure, 100 to 220.]

Extensions
We have concentrated on local linear regression models for continuous data and local logistic regression models for binary data, but the idea is very general: any independent and identically distributed data that can be modeled with a likelihood can be modeled nonparametrically by fitting a simple parametric model with continuously varying local weights. Poisson regression and Gamma regression are also implemented in locfit.