Probabilistic Linear Classification: Logistic Regression. Piyush Rai, IIT Kanpur




Probabilistic Linear Classification: Logistic Regression
Piyush Rai, IIT Kanpur
Probabilistic Machine Learning (CS772A), Jan 18, 2016

Probabilistic Classification

Given: $N$ labeled training examples $\{x_n, y_n\}_{n=1}^N$, with $x_n \in \mathbb{R}^D$ and $y_n \in \{0, 1\}$.
$X$: the $N \times D$ feature matrix; $y$: the $N \times 1$ label vector.
$y_n = 1$: positive example, $y_n = 0$: negative example.

Goal: learn a classifier that predicts the binary label $y$ for a new input $x$.

We want a probabilistic model that can also predict the label probabilities:
$p(y_n = 1 \mid x_n, w) = \mu_n$ and $p(y_n = 0 \mid x_n, w) = 1 - \mu_n$,
where $\mu_n \in (0, 1)$ is the probability of $y_n$ being 1.

Note: the features $x_n$ are assumed fixed (given); only the labels $y_n$ are being modeled. $w$ is the model parameter (to be learned).

How do we define $\mu_n$ (we want it to be a function of $w$ and the input $x_n$)?

Logistic Regression

Logistic regression defines $\mu$ using the sigmoid function:
$\mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$

The sigmoid computes a real-valued score $w^\top x$ for the input $x$ and squashes it into $(0, 1)$ to turn this score into a probability (of $x$'s label being 1).

Thus we have
$p(y = 1 \mid x, w) = \mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$
$p(y = 0 \mid x, w) = 1 - \mu = 1 - \sigma(w^\top x) = \frac{1}{1 + \exp(w^\top x)}$

Note: if we assume $y \in \{-1, +1\}$ instead of $y \in \{0, 1\}$, then $p(y \mid x, w) = \frac{1}{1 + \exp(-y\, w^\top x)}$.
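A minimal NumPy sketch of these label probabilities (the helper names `sigmoid` and `label_probs` are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # mu = sigma(z) = 1 / (1 + exp(-z)) = exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + np.exp(-z))

def label_probs(w, x):
    # p(y = 1 | x, w) = sigmoid(w^T x) and p(y = 0 | x, w) = 1 - p(y = 1 | x, w)
    p1 = sigmoid(w @ x)
    return p1, 1.0 - p1

w = np.array([2.0, -1.0])
x = np.array([0.5, 0.3])
print(label_probs(w, x))  # score w^T x = 0.7, so p(y=1|x,w) is about 0.67
```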

Logistic Regression: A Closer Look

What is the underlying decision rule in logistic regression?

At the decision boundary, both classes are equiprobable. Thus:
$p(y = 1 \mid x, w) = p(y = 0 \mid x, w)$
$\Rightarrow \frac{\exp(w^\top x)}{1 + \exp(w^\top x)} = \frac{1}{1 + \exp(w^\top x)}$
$\Rightarrow \exp(w^\top x) = 1$
$\Rightarrow w^\top x = 0$

Thus the decision boundary of logistic regression is nothing but a linear hyperplane, just like the Perceptron, the Support Vector Machine (SVM), etc.

Therefore $y = 1$ if $w^\top x \geq 0$, otherwise $y = 0$.
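As a tiny check of this rule: thresholding the probability at 0.5 is the same as thresholding the score at 0, since $\sigma(0) = 0.5$. A sketch (the helper `predict` is illustrative):

```python
import numpy as np

def predict(w, x):
    # y = 1 iff w^T x >= 0, equivalently iff sigmoid(w^T x) >= 0.5
    return int(w @ x >= 0.0)

w = np.array([2.0, -1.0])
print(predict(w, np.array([0.5, 0.3])))   # score  0.7 -> predicts 1
print(predict(w, np.array([-0.5, 0.3])))  # score -1.3 -> predicts 0
```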

Interpreting the Probabilities

Recall that $p(y = 1 \mid x, w) = \mu = \frac{1}{1 + \exp(-w^\top x)}$.

Note that the score $w^\top x$ is also a measure of the distance of $x$ from the hyperplane (the score is positive for positive examples, negative for negative examples).

A high positive score $w^\top x$ means a high probability of label 1; a high negative score $w^\top x$ means a low probability of label 1 (i.e., a high probability of label 0).

Logistic Regression: Parameter Estimation

Recall that each label $y_n$ is binary with probability $\mu_n$. Assume a Bernoulli likelihood:
$p(y \mid X, w) = \prod_{n=1}^N p(y_n \mid x_n, w) = \prod_{n=1}^N \mu_n^{y_n} (1 - \mu_n)^{1 - y_n}$
where $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$.

Negative log-likelihood:
$NLL(w) = -\log p(y \mid X, w) = -\sum_{n=1}^N \big( y_n \log \mu_n + (1 - y_n) \log(1 - \mu_n) \big)$

Plugging in $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$ and simplifying, we get (verify yourself):
$NLL(w) = -\sum_{n=1}^N \big( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \big)$

To do MLE for $w$, we minimize the negative log-likelihood $NLL(w)$ w.r.t. $w$.

Important note: $NLL(w)$ is convex in $w$, so the global minimum can be found.
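A small sketch of this simplified NLL in NumPy (the function name `neg_log_likelihood` is illustrative; `np.logaddexp` is used only as a numerically stable way to evaluate $\log(1 + \exp(\cdot))$):

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    # NLL(w) = -sum_n ( y_n * w^T x_n - log(1 + exp(w^T x_n)) )
    scores = X @ w  # shape (N,)
    # np.logaddexp(0, s) = log(1 + exp(s)), computed stably
    return -np.sum(y * scores - np.logaddexp(0.0, scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(neg_log_likelihood(np.zeros(3), X, y))  # equals 5 * log(2) at w = 0
```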

MLE Estimation for Logistic Regression

We have $NLL(w) = -\sum_{n=1}^N \big( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \big)$.

Taking the derivative of $NLL(w)$ w.r.t. $w$:
$\frac{\partial NLL(w)}{\partial w} = \frac{\partial}{\partial w}\Big[ -\sum_{n=1}^N \big( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \big) \Big] = -\sum_{n=1}^N \Big( y_n x_n - \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}\, x_n \Big)$

We cannot get a closed-form estimate for $w$ by setting this derivative to zero.

One solution: iterative minimization via gradient descent. The gradient is:
$g = \frac{\partial NLL(w)}{\partial w} = \sum_{n=1}^N (\mu_n - y_n)\, x_n = X^\top (\mu - y)$

Intuitively, a large error on $x_n$ (i.e., a large $y_n - \mu_n$) means a large contribution (positive or negative) of $x_n$ to the gradient.
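The gradient $g = X^\top(\mu - y)$ is essentially a one-liner; a minimal sketch (the helper name `nll_gradient` is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_gradient(w, X, y):
    # g = X^T (mu - y), with mu_n = sigmoid(w^T x_n)
    mu = sigmoid(X @ w)
    return X.T @ (mu - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(nll_gradient(np.zeros(3), X, y))  # gradient of NLL at w = 0
```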

MLE Estimation via Gradient Descent

Gradient descent (GD), or steepest descent:
$w_{t+1} = w_t - \eta_t g_t$
where $\eta_t$ is the learning rate (or step size) and $g_t$ is the gradient at step $t$.

GD can converge slowly and is also sensitive to the step size. Several ways to remedy this (a sketch of the basic GD loop follows this list):
1. Choose the optimal step size $\eta_t$ by line search.
2. Add a momentum term to the updates: $w_{t+1} = w_t - \eta_t g_t + \alpha_t (w_t - w_{t-1})$.
3. Use methods such as conjugate gradient.
4. Use second-order methods (e.g., Newton's method) to exploit the curvature of the objective function $NLL(w)$; these require the Hessian matrix.

Also see: "A comparison of numerical optimizers for logistic regression" by Tom Minka.
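A minimal sketch of the plain GD loop with a small fixed step size (the routine `fit_logistic_gd` and the synthetic data are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.01, n_iters=5000):
    # Plain gradient descent on NLL(w): w <- w - lr * X^T (mu - y).
    # A small fixed step size is used here; line search, momentum, or
    # second-order methods usually converge faster.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        w = w - lr * (X.T @ (mu - y))
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=100) < sigmoid(X @ true_w)).astype(float)
print(fit_logistic_gd(X, y))  # should land roughly near true_w
```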

MLE Estimation via Newton's Method

Update via Newton's method: $w_{t+1} = w_t - H_t^{-1} g_t$, where $H_t$ is the Hessian matrix at step $t$.

The Hessian is the second derivative of the objective function ($NLL(w)$ in this case):
$H = \frac{\partial^2 NLL(w)}{\partial w\, \partial w^\top}$

Recall that the gradient is $g = \sum_{n=1}^N (\mu_n - y_n)\, x_n = X^\top (\mu - y)$. Thus
$H = \frac{\partial g}{\partial w^\top} = \frac{\partial}{\partial w^\top} \sum_{n=1}^N (\mu_n - y_n)\, x_n = \sum_{n=1}^N x_n \frac{\partial \mu_n}{\partial w^\top}$

Using the fact that $\frac{\partial \mu_n}{\partial w} = \frac{\partial}{\partial w} \Big( \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)} \Big) = \mu_n (1 - \mu_n)\, x_n$, we have
$H = \sum_{n=1}^N \mu_n (1 - \mu_n)\, x_n x_n^\top = X^\top S X$
where $S$ is a diagonal matrix with its $n$-th diagonal element equal to $\mu_n (1 - \mu_n)$.

MLE Estimation via Newton's Method

Update via Newton's method:
$w_{t+1} = w_t - H_t^{-1} g_t$
$= w_t - (X^\top S_t X)^{-1} X^\top (\mu_t - y)$
$= w_t + (X^\top S_t X)^{-1} X^\top (y - \mu_t)$
$= (X^\top S_t X)^{-1} \big[ (X^\top S_t X) w_t + X^\top (y - \mu_t) \big]$
$= (X^\top S_t X)^{-1} X^\top \big[ S_t X w_t + y - \mu_t \big]$
$= (X^\top S_t X)^{-1} X^\top S_t \big[ X w_t + S_t^{-1} (y - \mu_t) \big]$
$= (X^\top S_t X)^{-1} X^\top S_t\, \hat{y}_t$

Interpreting the solution found by Newton's method: it basically solves an Iteratively Reweighted Least Squares (IRLS) problem
$\arg\min_w \sum_{n=1}^N S_{tn} (\hat{y}_{tn} - w^\top x_n)^2$

Note that the (redefined) response vector $\hat{y}_t = X w_t + S_t^{-1} (y - \mu_t)$ changes in each iteration.
Each term in the objective has weight $S_{tn}$ (which also changes in each iteration); $S_{tn}$ is the $n$-th diagonal element of $S_t$.
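A compact sketch of the Newton/IRLS update (assuming $X^\top S X$ stays invertible and the classes are not perfectly separable; the routine name and synthetic data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iters=10):
    # Newton / IRLS updates: w <- w - H^{-1} g, with
    # g = X^T (mu - y) and H = X^T S X, S = diag(mu_n (1 - mu_n)).
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y)
        H = X.T @ (X * (mu * (1.0 - mu))[:, None])  # X^T S X without forming S explicitly
        w = w - np.linalg.solve(H, g)  # solve H d = g rather than inverting H
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=100) < sigmoid(X @ true_w)).astype(float)
print(fit_logistic_newton(X, y))  # typically converges in a few iterations
```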

MAP Estimation for Logistic Regression

The MLE estimate of $w$ can lead to overfitting. Solution: use a prior on $w$.

Just like in the linear regression case, let's put a Gaussian prior on $w$:
$p(w) = \mathcal{N}(0, \lambda^{-1} I_D) \propto \exp\big(-\tfrac{\lambda}{2} w^\top w\big)$

MAP objective: the MLE objective plus $\log p(w)$. This leads to the objective (the negative of the log posterior, ignoring constants):
$NLL(w) + \tfrac{\lambda}{2} w^\top w$

Estimation of $w$ proceeds the same way as for MLE, except that now we have
Gradient: $g = X^\top (\mu - y) + \lambda w$
Hessian: $H = X^\top S X + \lambda I_D$

We can now apply iterative optimization (gradient descent, Newton's method, etc.).

Note: MAP estimation for logistic regression is equivalent to regularized logistic regression.
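Relative to the MLE sketches above, the MAP estimate only changes the gradient and Hessian; a minimal Newton-style sketch (`lam` plays the role of $\lambda$; the routine name is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_map(X, y, lam=1.0, n_iters=10):
    # Newton's method on the regularized objective NLL(w) + (lam/2) w^T w:
    # gradient g = X^T (mu - y) + lam * w, Hessian H = X^T S X + lam * I.
    D = X.shape[1]
    w = np.zeros(D)
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y) + lam * w
        H = X.T @ (X * (mu * (1.0 - mu))[:, None]) + lam * np.eye(D)
        w = w - np.linalg.solve(H, g)
    return w
```

The extra $\lambda I_D$ term also keeps the Hessian strictly positive definite, so the Newton step stays well defined even when the data happen to be linearly separable.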

Fully Bayesian Estimation for Logistic Regression

What about the full posterior on $w$? It is not as easy to estimate as in the linear regression case.

Reason: the likelihood (logistic-Bernoulli) and the prior (Gaussian) are not conjugate, so we need to approximate the posterior in this case.

A crude approximation is the Laplace approximation: approximate the posterior by a Gaussian with mean equal to the MAP estimate and covariance equal to the inverse Hessian:
$p(w \mid X, y) \approx \mathcal{N}(w_{MAP}, H^{-1})$

We will see other ways of approximating the posterior later during the semester.

Derivation of the Laplace Approximation

The posterior is $p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$. Let's write it as
$p(w \mid X, y) = \frac{\exp(-E(w))}{Z}$
where $E(w) = -\log p(y \mid X, w)\, p(w)$ and $Z$ is the normalizer.

Expand $E(w)$ around its minimum $w^* = w_{MAP}$ using a second-order Taylor expansion:
$E(w) \approx E(w^*) + (w - w^*)^\top g + \tfrac{1}{2} (w - w^*)^\top H (w - w^*) = E(w^*) + \tfrac{1}{2} (w - w^*)^\top H (w - w^*)$
(because the gradient $g = 0$ at $w^*$).

Thus the posterior
$p(w \mid X, y) \approx \frac{\exp(-E(w^*)) \exp\big(-\tfrac{1}{2} (w - w^*)^\top H (w - w^*)\big)}{Z}$

Using $\int p(w \mid X, y)\, dw = 1$, we get $Z = \exp(-E(w^*))\, (2\pi)^{D/2}\, |H|^{-1/2}$. Thus
$p(w \mid X, y) \approx \mathcal{N}(w^*, H^{-1})$
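A sketch of the Laplace approximation on top of a MAP fit such as the `fit_logistic_map` routine sketched earlier: the posterior over $w$ is approximated by a Gaussian whose mean is the MAP estimate and whose covariance is the inverse Hessian at that point (the function name `laplace_posterior` is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_posterior(X, w_map, lam=1.0):
    # Gaussian approximation p(w | X, y) ~= N(w_map, H^{-1}),
    # with H the Hessian of the regularized NLL evaluated at the MAP estimate
    # (this Hessian depends on X and w_map but not on the labels y).
    mu = sigmoid(X @ w_map)
    H = X.T @ (X * (mu * (1.0 - mu))[:, None]) + lam * np.eye(X.shape[1])
    return w_map, np.linalg.inv(H)  # posterior mean and covariance
```

The returned covariance can then be used, e.g., to draw approximate posterior samples of $w$ or to approximate the predictive distribution for a new input.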

Multinomial Logistic Regression

Logistic regression can be extended to handle $K > 2$ classes.

In this case, $y_n \in \{0, 1, 2, \ldots, K-1\}$ and the label probabilities are defined as
$p(y_n = k \mid x_n, W) = \frac{\exp(w_k^\top x_n)}{\sum_{l=1}^K \exp(w_l^\top x_n)} = \mu_{nk}$

$\mu_{nk}$: the probability that example $n$ belongs to class $k$. Also, $\sum_{l=1}^K \mu_{nl} = 1$.
$W = [w_1\ w_2\ \ldots\ w_K]$ is a $D \times K$ weight matrix (column $k$ for class $k$).

Likelihood for the multinomial (or multinoulli) logistic regression model:
$p(y \mid X, W) = \prod_{n=1}^N \prod_{l=1}^K \mu_{nl}^{y_{nl}}$
where $y_{nl} = 1$ if the true class of example $n$ is $l$, and $y_{nl} = 0$ otherwise.

We can do MLE/MAP/fully Bayesian estimation for $W$ similarly to the binary case.

Decision rule: $y = \arg\max_{l=1,\ldots,K} w_l^\top x$, i.e., predict the class whose weight vector gives the largest score (or, equivalently, the largest probability).
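A small sketch of the softmax probabilities and the corresponding decision rule (the helper names are illustrative; subtracting the maximum score is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax_probs(W, x):
    # mu_k = exp(w_k^T x) / sum_l exp(w_l^T x); W is D x K, one column per class
    scores = W.T @ x
    scores = scores - scores.max()  # subtract max for numerical stability; result unchanged
    p = np.exp(scores)
    return p / p.sum()

def predict_class(W, x):
    # Decision rule: the class whose weight vector gives the largest score/probability
    return int(np.argmax(W.T @ x))

W = np.array([[1.0, -0.5,  0.2],
              [0.3,  0.8, -1.0]])  # D = 2 features, K = 3 classes
x = np.array([0.5, 1.0])
print(softmax_probs(W, x), predict_class(W, x))
```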

Next class: Generalized Linear Models