Linear Classification
Volker Tresp, Summer 2015
Classification
Classification is the central task of pattern recognition. Sensors supply information about an object: to which class does the object belong (dog, cat, ...)?
Linear Classifiers
Linear classifiers separate classes by a linear hyperplane.
In high dimensions a linear classifier can often separate the classes.
Linear classifiers cannot solve the exclusive-or (XOR) problem.
In combination with basis functions, kernels, or a neural network, linear classifiers can form nonlinear class boundaries.
Binary Classification Problems
We will focus first on binary classification, where the task is to assign binary class labels $y_i = 1$ and $y_i = 0$ (or $y_i = 1$ and $y_i = -1$).
We already know the Perceptron. Now we learn about additional approaches:
I. Generative models for classification
II. Logistic regression
III. Classification via regression
Two Linearly Separable Classes
Two Classes that Cannot be Separated Linearly
The Classical Example of two Classes that cannot be Separated Linearly: XOR
Separability is not a Goal in Itself. With Overlapping Classes the Goal is the Best Possible Hyperplane
I. Generative Models for Classification
In a generative model one assumes a probabilistic data-generating process (likelihood model). Often generative models are complex and contain unobserved (latent, hidden) variables.
Here we consider a simple example: how data are generated in a binary classification problem.
First we have a model of how the classes are generated, $P(y)$. $y = 1$ could stand for a good customer and $y = 0$ could stand for a bad customer.
Then we have a model of how the attributes are generated given the class, $P(x \mid y)$. This could stand for
income, age, occupation ($x$) given the customer is a good customer ($y = 1$), or
income, age, occupation ($x$) given the customer is not a good customer ($y = 0$).
Using Bayes' formula, we then derive $P(y \mid x)$: the probability that a given customer is a good customer ($y = 1$) or a bad customer ($y = 0$), given that we know the customer's income, age, and occupation.
How is the Data Generated?
We assume that the observed classes $y_i$ are generated with probability $\kappa_1$, where $0 \le \kappa_1 \le 1$:
$$P(y_i = 1) = \kappa_1, \qquad P(y_i = 0) = \kappa_0 = 1 - \kappa_1$$
In the next step, a data point $x_i$ is generated from $P(x_i \mid y_i)$. (Note that $x_i = (x_{i,1}, \ldots, x_{i,M})^T$, which means that $x_i$ does not contain the bias $x_{i,0}$.)
We now have a complete model: $P(y_i)\, P(x_i \mid y_i)$.
Bayes' Theorem
To classify a data point $x_i$, i.e., to determine $y_i$, we apply Bayes' theorem and get
$$P(y_i \mid x_i) = \frac{P(x_i \mid y_i)\, P(y_i)}{P(x_i)}$$
with
$$P(x_i) = P(x_i \mid y_i = 1)\, P(y_i = 1) + P(x_i \mid y_i = 0)\, P(y_i = 0)$$
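As a minimal sketch of this classification rule (assuming the two class-conditional densities are available as Python callables; the function and argument names are illustrative, not from the slides):

```python
def posterior_class1(x, p_x_given_y1, p_x_given_y0, kappa1):
    """P(y=1 | x) via Bayes' theorem.

    p_x_given_y1, p_x_given_y0: callables returning the class-conditional
    densities P(x | y=1) and P(x | y=0); kappa1 is the prior P(y=1).
    """
    kappa0 = 1.0 - kappa1
    numerator = p_x_given_y1(x) * kappa1
    evidence = numerator + p_x_given_y0(x) * kappa0   # P(x)
    return numerator / evidence
```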
ML Estimates for the Class Probabilities
The maximum-likelihood estimators for the prior class probabilities are
$$\hat{P}(y_i = 1) = \hat{\kappa}_1 = N_1 / N \qquad \text{and} \qquad \hat{P}(y_i = 0) = \hat{\kappa}_0 = N_0 / N = 1 - \hat{\kappa}_1$$
where $N_1$ and $N_0$ are the numbers of training data points for class 1 and class 0, respectively.
Class-specific Distributions
To model $P(x_i \mid y_i)$ one can choose an application-specific distribution. A popular choice is a Gaussian distribution
$$P(x_i \mid y_i = l) = \mathcal{N}\!\left(x_i; \mu^{(l)}, \Sigma\right)$$
with
$$\mathcal{N}\!\left(x_i; \mu^{(l)}, \Sigma\right) = \frac{1}{(2\pi)^{M/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\left(x_i - \mu^{(l)}\right)^T \Sigma^{-1} \left(x_i - \mu^{(l)}\right)\right)$$
Note that the two Gaussian distributions have different modes (centers) but the same covariance matrix. This has been shown to often work well.
Maximum-likelihood Estimators for Modes and Covariances
One obtains the maximum-likelihood estimators for the modes
$$\hat{\mu}^{(l)} = \frac{1}{N_l} \sum_{i:\, y_i = l} x_i$$
and an unbiased estimator for the shared covariance matrix
$$\hat{\Sigma} = \frac{1}{N - M} \sum_{l=0}^{1} \sum_{i:\, y_i = l} \left(x_i - \hat{\mu}^{(l)}\right)\left(x_i - \hat{\mu}^{(l)}\right)^T$$
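A minimal NumPy sketch of these estimators (assuming X is the N×M data matrix without the bias column and y is a NumPy array of 0/1 labels; the function name is illustrative):

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """Fit class priors, class means, and a shared (pooled) covariance.

    X: (N, M) data matrix without the bias column, y: (N,) array of 0/1 labels.
    """
    N, M = X.shape
    kappa1 = y.mean()                              # \hat{kappa}_1 = N_1 / N
    mu0 = X[y == 0].mean(axis=0)                   # class-0 mode
    mu1 = X[y == 1].mean(axis=0)                   # class-1 mode
    R0 = X[y == 0] - mu0                           # class-centered residuals
    R1 = X[y == 1] - mu1
    Sigma = (R0.T @ R0 + R1.T @ R1) / (N - M)      # pooled covariance, slide's 1/(N - M) normalization
    return kappa1, mu0, mu1, Sigma
```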
Expanding the Quadratic Terms in the Exponent
Note that
$$\left(x_i - \mu^{(l)}\right)^T \Sigma^{-1} \left(x_i - \mu^{(l)}\right) = x_i^T \Sigma^{-1} x_i - 2\, \mu^{(l)T} \Sigma^{-1} x_i + \mu^{(l)T} \Sigma^{-1} \mu^{(l)}$$
and therefore
$$\left(x_i - \mu^{(0)}\right)^T \Sigma^{-1} \left(x_i - \mu^{(0)}\right) = \left(x_i - \mu^{(1)}\right)^T \Sigma^{-1} \left(x_i - \mu^{(1)}\right) - 2\left(\mu^{(0)} - \mu^{(1)}\right)^T \Sigma^{-1} x_i + \mu^{(0)T} \Sigma^{-1} \mu^{(0)} - \mu^{(1)T} \Sigma^{-1} \mu^{(1)}$$
A Posteriori Distribution
It follows that
$$P(y_i = 1 \mid x_i) = \frac{P(x_i \mid y_i = 1)\, P(y_i = 1)}{P(x_i \mid y_i = 1)\, P(y_i = 1) + P(x_i \mid y_i = 0)\, P(y_i = 0)} = \frac{1}{1 + \frac{P(x_i \mid y_i = 0)\, P(y_i = 0)}{P(x_i \mid y_i = 1)\, P(y_i = 1)}}$$
$$= \frac{1}{1 + \frac{\kappa_0}{\kappa_1} \exp\left(\left(\mu^{(0)} - \mu^{(1)}\right)^T \Sigma^{-1} x_i - \frac{1}{2} \mu^{(0)T} \Sigma^{-1} \mu^{(0)} + \frac{1}{2} \mu^{(1)T} \Sigma^{-1} \mu^{(1)}\right)}$$
$$= \mathrm{sig}\left(w_0 + x_i^T w\right) = \mathrm{sig}\left(w_0 + \sum_{j=1}^{M} x_{i,j}\, w_j\right)$$
with
$$w = \Sigma^{-1}\left(\mu^{(1)} - \mu^{(0)}\right)$$
$$w_0 = \log(\kappa_1 / \kappa_0) + \frac{1}{2} \mu^{(0)T} \Sigma^{-1} \mu^{(0)} - \frac{1}{2} \mu^{(1)T} \Sigma^{-1} \mu^{(1)}$$
Recall: $\mathrm{sig}(\mathrm{arg}) = 1/(1 + \exp(-\mathrm{arg}))$.
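A small sketch (continuing the illustrative names from the fitting sketch above) that turns the fitted quantities into the weights of this linear classifier:

```python
import numpy as np

def generative_to_linear(kappa1, mu0, mu1, Sigma):
    """Weights (w0, w) such that P(y=1 | x) = sig(w0 + x^T w)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    w0 = (np.log(kappa1 / (1.0 - kappa1))
          + 0.5 * mu0 @ Sigma_inv @ mu0
          - 0.5 * mu1 @ Sigma_inv @ mu1)
    return w0, w

def sig(arg):
    """Logistic function sig(arg) = 1 / (1 + exp(-arg))."""
    return 1.0 / (1.0 + np.exp(-arg))

# For new points Xnew (one row per data point): posterior = sig(w0 + Xnew @ w)
```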
Comments
This specific generative model leads to linear class boundaries.
The solution is analogous to Fisher's linear discriminant analysis, where one projects the data into a space in which data from the same class have small variance and the distance between the class modes is maximized. In other words, one gets the same result from an optimization criterion without assuming Gaussian distributions.
Although we have used Bayes' formula, the analysis was frequentist. A Bayesian analysis with a prior distribution on the parameters is also possible.
If the two class-specific Gaussians have different covariance matrices, the approach is still feasible, but one would need to estimate two covariance matrices and the decision boundaries are no longer linear.
Special Case: Naive Bayes
If one assumes that the covariance matrices are diagonal, one obtains a naive Bayes classifier
$$P(x_i \mid y_i = l) = \prod_{j=1}^{M} \mathcal{N}\!\left(x_{i,j};\, \mu_j^{(l)}, \sigma_j^2\right)$$
The naive Bayes classifier has considerably fewer parameters but completely ignores class-specific correlations between features; this is sometimes considered to be naive.
Naive Bayes classifiers are popular in text analysis, often with more than 10000 features (key words). For example, the classes might be SPAM and no-SPAM and the features are keywords in the texts. Instead of a Gaussian distribution, a Bernoulli distribution is employed, with $P(\text{word}_j \mid \text{SPAM}) = \gamma_{j,S}$ and $P(\text{word}_j \mid \text{no-SPAM}) = \gamma_{j,N}$.
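A minimal sketch of the Gaussian variant above (diagonal covariance with variances shared across classes; X, y and the function names are illustrative, not from the slides):

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Per-class means and shared per-feature variances (diagonal covariance).

    X: (N, M) data matrix, y: (N,) array of 0/1 labels.
    """
    kappa = np.array([np.mean(y == 0), np.mean(y == 1)])         # class priors
    mu = np.stack([X[y == l].mean(axis=0) for l in (0, 1)])      # (2, M) class means
    resid = np.concatenate([X[y == l] - mu[l] for l in (0, 1)])  # class-centered data
    var = (resid ** 2).mean(axis=0)                              # shared sigma_j^2
    return kappa, mu, var

def naive_bayes_log_joint(X, kappa, mu, var):
    """log P(y=l) + sum_j log N(x_j; mu_j^(l), sigma_j^2) for l = 0, 1; predict by argmax."""
    log_joint = []
    for l in (0, 1):
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu[l]) ** 2 / var).sum(axis=1)
        log_joint.append(np.log(kappa[l]) + ll)
    return np.stack(log_joint, axis=1)
```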
II. Logistic Regression
The generative model motivates
$$P(y_i = 1 \mid x_i) = \mathrm{sig}\left(x_i^T w\right)$$
(now we include the bias: $x_i = (x_{i,0} = 1, x_{i,1}, \ldots, x_{i,M-1})^T$), with $\mathrm{sig}(\cdot)$ as defined before (the logistic function).
One now optimizes the likelihood of the conditional model
$$L(w) = \prod_{i=1}^{N} \mathrm{sig}\left(x_i^T w\right)^{y_i} \left(1 - \mathrm{sig}\left(x_i^T w\right)\right)^{1 - y_i}$$
Log-Likelihood
The log-likelihood function is
$$\ell(w) = \sum_{i=1}^{N} y_i \log\left(\mathrm{sig}\left(x_i^T w\right)\right) + (1 - y_i) \log\left(1 - \mathrm{sig}\left(x_i^T w\right)\right)$$
$$= \sum_{i=1}^{N} y_i \log\left(\frac{1}{1 + \exp(-x_i^T w)}\right) + (1 - y_i) \log\left(\frac{1}{1 + \exp(x_i^T w)}\right)$$
$$= -\sum_{i=1}^{N} \left[\, y_i \log\left(1 + \exp(-x_i^T w)\right) + (1 - y_i) \log\left(1 + \exp(x_i^T w)\right) \right]$$
Adaptation
The derivative of the log-likelihood with respect to the parameters is
$$\frac{\partial \ell}{\partial w} = \sum_{i=1}^{N} y_i\, x_i \frac{\exp(-x_i^T w)}{1 + \exp(-x_i^T w)} - (1 - y_i)\, x_i \frac{\exp(x_i^T w)}{1 + \exp(x_i^T w)}$$
$$= \sum_{i=1}^{N} y_i\, x_i \left(1 - \mathrm{sig}(x_i^T w)\right) - (1 - y_i)\, x_i\, \mathrm{sig}(x_i^T w) = \sum_{i=1}^{N} \left(y_i - \mathrm{sig}(x_i^T w)\right) x_i$$
A gradient-based optimization of the parameters to maximize the log-likelihood is
$$w \leftarrow w + \eta\, \frac{\partial \ell}{\partial w}$$
Typically one uses a Newton-Raphson optimization procedure.
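A minimal gradient-ascent sketch of this update (illustrative only; the slides point to Newton-Raphson as the usual choice, and eta and n_iter are arbitrary assumed settings):

```python
import numpy as np

def sig(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, eta=0.1, n_iter=1000):
    """Maximize the log-likelihood by simple gradient ascent.

    X: (N, M) design matrix that already contains the bias column x_{i,0} = 1,
    y: (N,) array of 0/1 labels, eta: learning rate.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sig(X @ w))   # sum_i (y_i - sig(x_i^T w)) x_i
        w = w + eta * grad / len(y)     # scaled by N for a stable step size
    return w
```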
Cross-entropy Cost Function
In information theory, the cross entropy between a true distribution $p$ and an approximating distribution $q$ is defined as
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
The (negative log-)likelihood cost function is identical to the cross-entropy cost function and is written for $y_i \in \{0, 1\}$ as
$$\mathrm{cost} = -\sum_{i=1}^{N} \left[\, y_i \log\left(\mathrm{sig}(x_i^T w)\right) + (1 - y_i) \log\left(1 - \mathrm{sig}(x_i^T w)\right) \right]$$
$$= \sum_{i=1}^{N} y_i \log\left(1 + \exp(-x_i^T w)\right) + (1 - y_i) \log\left(1 + \exp(x_i^T w)\right)$$
$$= \sum_{i=1}^{N} \log\left(1 + \exp\left((1 - 2 y_i)\, x_i^T w\right)\right)$$
... and for $y_i \in \{-1, 1\}$:
$$\mathrm{cost} = \sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i\, x_i^T w\right)\right)$$
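A short sketch of the last $\{0, 1\}$ form of this cost; np.logaddexp(0, z) evaluates log(1 + exp(z)) in a numerically stable way (variable names are illustrative):

```python
import numpy as np

def cross_entropy_cost(w, X, y):
    """Cross-entropy cost sum_i log(1 + exp((1 - 2 y_i) x_i^T w)).

    X includes the bias column; y contains 0/1 labels.
    """
    z = (1 - 2 * y) * (X @ w)
    return np.logaddexp(0.0, z).sum()
```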
Logistic Regression in Medical Statistics
Logistic regression has become one of the most important tools in medicine to analyse the outcome of treatments.
$y = 1$ means that the patient has the disease. $x_1 = 1$ might represent the fact that the patient was exposed (e.g., carries a genetic variant) and $x_1 = 0$ might mean that the patient was not exposed. The other inputs are often typical confounders (age, sex, ...).
Logistic regression then permits the prediction of the outcome for any patient.
Of great interest, of course, is whether $w_1$ is significantly positive (i.e., whether the exposure was harmful).
Log-Odds
The logarithm of the odds is defined as
$$\log(\mathrm{odds}(x_i)) = \log \frac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i)}$$
For logistic regression,
$$\log(\mathrm{odds}(x_i)) = \log \frac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i)} = \log \frac{\dfrac{1}{1 + \exp(-x_i^T w)}}{\dfrac{\exp(-x_i^T w)}{1 + \exp(-x_i^T w)}} = \log \frac{1}{\exp(-x_i^T w)} = x_i^T w$$
Log Odds Ratio
The log odds ratio evaluates the effect of the treatment (valid for any patient):
$$w_1 = \log(\mathrm{odds}(x_1 = 1)) - \log(\mathrm{odds}(x_1 = 0)) = \log \frac{\mathrm{odds}(x_1 = 1)}{\mathrm{odds}(x_1 = 0)}$$
This is the logarithm of the so-called odds ratio (OR). If $w_1 = 0$, then the exposure does not have an effect; the odds ratio is commonly used in case-control studies!
What we have calculated is the log-odds for a patient with properties $x_i$ to get the disease.
We get a nice interpretation of the parameters in logistic regression:
$w_0$ is the log odds of getting the disease when the patient does not get the treatment (and all other inputs are zero). This is the only term that changes when the class mix is changed.
$w_1$ is the log odds ratio associated with the treatment. This is independent of the class mix and of the other input factors (thus we cannot model interactions).
$w_2, \ldots, w_{M-1}$ model the log odds ratios associated with the confounders (age, sex, ...).
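As a purely hypothetical numerical illustration (the numbers are made up, not from the slides): if the fitted coefficient is $\hat{w}_1 = 0.69$, then the odds ratio is $\mathrm{OR} = e^{0.69} \approx 2$, i.e., the exposure roughly doubles the odds of getting the disease, independently of the values of the confounders.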
Logistic Regression as a Generalized Linear Model (GLM)
Consider a Bernoulli distribution with $P(y = 1) = \theta$ and $P(y = 0) = 1 - \theta$, with $0 \le \theta \le 1$.
In the theory of the exponential family of distributions, one sets $\theta = \mathrm{sig}(\eta)$. Now we get valid probabilities for any $\eta \in \mathbb{R}$! $\eta$ is called the natural parameter and $\mathrm{sig}(\cdot)$ the inverse parameter mapping for the Bernoulli distribution.
This is convenient: if we make $\eta$ a linear function of the inputs, we obtain a Generalized Linear Model (GLM)
$$P(y_i = 1 \mid x_i, w) = \mathrm{sig}(x_i^T w)$$
Thus logistic regression is the GLM for the Bernoulli likelihood model.
III. Classification via Regression
Linear regression:
$$f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j x_{i,j} = x_i^T w$$
We define as target $y_i = 1$ if pattern $x_i$ belongs to class 1 and $y_i = 0$ if pattern $x_i$ belongs to class 0.
We calculate the weights $w_{LS} = (X^T X)^{-1} X^T y$ as the least-squares solution, exactly as in linear regression.
For a new pattern $z$ we calculate $f(z) = z^T w_{LS}$ and assign the pattern to class 1 if $f(z) > 1/2$; otherwise we assign it to class 0.
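A minimal sketch of this procedure (assuming the design matrices X and Z already contain the bias column of ones; names are illustrative):

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least-squares weights w_LS = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)   # solving is preferable to an explicit inverse

def predict_ls_classifier(Z, w_ls):
    """Assign class 1 where f(z) = z^T w_LS > 1/2, else class 0."""
    return (Z @ w_ls > 0.5).astype(int)
```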
Bias
Asymptotically, the LS solution converges to the posterior class probabilities, although a linear function is typically not able to represent $P(y = 1 \mid x)$. The resulting class boundary can still be sensible.
One can expect good class boundaries in high dimensions and/or in combination with basis functions, kernels, and neural networks; in high dimensions, consistency can sometimes be achieved.
Classification via Regression with Linear Functions
Classification via Regression with Radial Basis Functions
Performance
Although the approach might seem simplistic, the performance can be excellent (in particular in high dimensions and/or in combination with basis functions, kernels, and neural networks). The calculation of the optimal parameters can be very fast!