Logistic Regression. Vibhav Gogate, The University of Texas at Dallas. Some slides from Carlos Guestrin, Luke Zettlemoyer, and Dan Weld.




Generative vs. Discriminative Classifiers. Want to learn h: X → Y, where X are the features and Y the target classes. Generative classifier, e.g., Naïve Bayes: assume some functional form for P(X|Y) and P(Y); estimate the parameters of P(X|Y) and P(Y) directly from training data; use Bayes rule to calculate P(Y|X=x). This is a generative model: indirect computation of P(Y|X) through Bayes rule, P(Y|X) ∝ P(X|Y) P(Y). As a result, it can also generate a sample of the data, since P(X) = Σ_y P(y) P(X|y). Discriminative classifiers, e.g., Logistic Regression: assume some functional form for P(Y|X); estimate the parameters of P(Y|X) directly from training data. This is a discriminative model: it directly learns P(Y|X), but cannot generate a sample of the data, because P(X) is not available.

Optimization: Linear Regression. (Point-wise) squared error. Problem: minimize the error.
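The objective on this slide did not survive transcription; a minimal sketch of the standard point-wise squared-error objective for a univariate linear hypothesis h_w(x) = w_0 + w_1 x (the symbols are mine, not the slide's):

```latex
J(w_0, w_1) \;=\; \sum_{j=1}^{m} \Big( y^{j} - \big(w_0 + w_1 x^{j}\big) \Big)^{2}
```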

Question: How do we solve the optimization problem? Set the derivative of J to zero and solve.

Question: How do we solve the optimization problem? Homework: Prove this! (Messy; algebraic manipulation.)

Multivariate Linear Regression. Assume a dummy attribute x_0 = 1 for all examples. The prediction is an inner product, or dot product (yields a number). X is an m-by-n matrix.
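As a concrete illustration of the closed-form least-squares solution the preceding slides refer to, here is a minimal NumPy sketch (the function name and toy data are mine); it assumes X already contains the dummy x_0 = 1 column:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least-squares fit: w = (X^T X)^{-1} X^T y.

    X : (m, n) design matrix with a leading column of ones (dummy x_0 = 1).
    y : (m,) vector of targets.
    """
    # np.linalg.lstsq solves the normal equations in a numerically stable way.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Tiny illustrative example (hypothetical data).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # dummy x_0 = 1 plus one feature
y = np.array([1.0, 3.0, 5.0])
print(fit_linear_regression(X, y))  # approximately [1., 2.]
```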

Overfitting. MLE estimate: some weights are large because of chance (coincidental regularities). Regularize!! Penalize high weights (complex hypotheses). Minimize cost = Loss + Complexity. p = 1: L1 regularization (Lasso); p = 2: L2 regularization.
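Written out, the regularized cost the slide describes (Loss + Complexity) takes the following form, where λ, the penalty strength, is my notation rather than the slide's:

```latex
J_{\text{reg}}(w) \;=\;
\underbrace{\sum_{j=1}^{m} \big( y^{j} - w^{\top} x^{j} \big)^{2}}_{\text{Loss}}
\;+\;
\underbrace{\lambda \sum_{i} |w_i|^{p}}_{\text{Complexity}},
\qquad p=1 \text{ (L1/Lasso)}, \quad p=2 \text{ (L2)}.
```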

Regularization: L1 vs. L2.

Gradient Descent. A closed-form solution is not always possible; in that case, we can use the following iterative approach. Algorithm: Gradient Descent, with a learning rate parameter.
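A minimal sketch of the gradient descent loop for the squared-error objective above, assuming NumPy arrays; the step size eta and the iteration count are illustrative choices, not values from the slides:

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent for J(w) = sum_j (y^j - w . x^j)^2.

    X : (m, n) design matrix (with the dummy x_0 = 1 column).
    y : (m,) targets.  eta is the learning rate.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ w            # y^j - w . x^j for every example
        grad = -2.0 * (X.T @ residual)  # gradient of J with respect to w
        w = w - eta * grad              # step against the gradient
    return w
```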

Gradient Descent: Example.

Gradient Descent in 1-D. Remember: the derivative is the slope of the line that is tangent to the function. Question: What if the learning rate is small? (Slow convergence.) Question: What if the learning rate is large? (Fail to converge; may even diverge.) Rule: w ← w − (learning rate) · dJ(w)/dw.

Back to Classification. P(edible|X) = 1 on one side of the decision boundary, P(edible|X) = 0 on the other.

Logistic Regression. Learn P(Y|X) directly! Assume a particular functional form. A hard threshold, with P(Y|X) jumping between 0 and 1 at the decision boundary, is not differentiable.

Logistic Regression. Learn P(Y|X) directly! Assume a particular functional form: the logistic function, aka sigmoid.
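A small sketch of this functional form, using the common convention P(Y=1|x, w) = σ(w_0 + w·x); decks differ on which class gets the exponential, so treat the sign convention here as an assumption:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, w0):
    """P(Y=1 | X=x, w) = sigmoid(w0 + w . x)  (one common sign convention)."""
    return sigmoid(w0 + np.dot(w, x))
```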

Functional Form: Two Classes. The functional form implies a classification rule: assign the label Y=0 if P(Y=0|X) > P(Y=1|X), i.e., if their ratio exceeds 1. Take logs and simplify: a linear classification rule! Y=0 if the RHS > 0.
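A sketch of the simplification the slide refers to, under a sign convention consistent with this slide's "Y=0 if the RHS > 0" (the class ratio equals the exponential of the linear term):

```latex
\frac{P(Y{=}0 \mid X)}{P(Y{=}1 \mid X)} \;=\; \exp\!\Big(w_0 + \sum_i w_i X_i\Big) \;>\; 1
\quad\Longleftrightarrow\quad
w_0 + \sum_i w_i X_i \;>\; 0,
```

so the label Y=0 is assigned exactly when the linear right-hand side is positive: a hyperplane decision boundary.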

How to learn the weights? Evaluation function: maximize the conditional log-likelihood, where j indexes the examples and W is the weight vector. Note that we are actually just computing P(Y|X); W appears in P(Y|X,W) only to show that the probability is computed using W.
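Written out, the conditional log-likelihood being maximized (j indexes the training examples):

```latex
l(W) \;=\; \ln \prod_{j} P\big(y^{j} \mid x^{j}, W\big) \;=\; \sum_{j} \ln P\big(y^{j} \mid x^{j}, W\big).
```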

How to learn the weights?

How to learn the weights? Why? If the domain of a variable Y is {0,1}, then any function f(Y) can be written as f(Y) = Y·f(Y=1) + (1−Y)·f(Y=0).

How to learn the weights? Remember: the log of this is −ln(denominator).
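Putting the last two slides together gives the usual expanded form, assuming the convention where P(Y=1|x,W) carries the exponential (so that, as the slide notes, ln P(Y=0|x,W) is minus the log of the shared denominator):

```latex
l(W) \;=\; \sum_{j} \Big[\, y^{j}\big(w_0 + \textstyle\sum_i w_i x_i^{j}\big)
\;-\; \ln\!\Big(1 + \exp\big(w_0 + \textstyle\sum_i w_i x_i^{j}\big)\Big) \Big].
```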

How to learn the weights? Bad news: there is no closed-form solution that maximizes l(W). Good news: l(W) is a concave function of W! No local optima. Concave functions are easy to optimize using gradient ascent. Update w_i as follows: w_i ← w_i + (learning rate) × (partial derivative of l(W) w.r.t. w_i). Notice that unlike gradient descent, in gradient ascent we are interested in the maximum value, and therefore we have a + sign on the RHS of the update rule instead of a − sign.

How to learn the weights? The term inside the parentheses is the prediction error (the difference between the observed value and the predicted probability); the update is scaled by the learning rate.
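A minimal gradient-ascent training loop built from that gradient, assuming P(Y=1|x, w) = σ(w·x) with the bias folded into a leading column of ones; the function name, eta, and n_iters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood l(w).

    X : (m, n) features with a leading column of ones (bias w_0).
    y : (m,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(X @ w)     # predicted P(Y=1 | x^j, w) for each example
        grad = X.T @ (y - p1)   # dl/dw_i = sum_j x_i^j (y^j - p1^j): the prediction error
        w = w + eta * grad      # '+' because we ascend to maximize l(w)
    return w
```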

Large parameters. (Plots of the sigmoid for a = 1, a = 5, a = 10: the larger a is, the steeper the curve.) The maximum likelihood solution prefers higher weights: higher weights give higher likelihood to (properly classified) examples close to the decision boundary and a larger influence of the corresponding features on the decision, which can cause overfitting!!! Regularization: penalize high weights.

That's all for MCLE. How about MCAP? One common approach is to define priors on W: a normal distribution with zero mean and identity covariance. This pushes the parameters towards zero (regularization) and helps avoid very large weights and overfitting. MAP estimate: W* = arg max_W ln [ P(W) Π_j P(y^j | x^j, W) ].

MCAP as Regularization. Weight update rule: the quadratic penalty drives weights towards zero; it adds a negative linear term to the gradient. It penalizes high weights, just like we did in linear regression.
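With a zero-mean Gaussian prior on W, the MCAP weight update the slide describes becomes (λ, determined by the prior variance, and η, the learning rate, are my notation):

```latex
w_i \;\leftarrow\; w_i \;+\; \eta \left[ \sum_{j} x_i^{j}\Big( y^{j} - \hat{P}\big(Y^{j}{=}1 \mid x^{j}, w\big) \Big) \;-\; \lambda\, w_i \right],
```

where the extra −λ w_i term is exactly the negative linear contribution that drives the weights towards zero.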

Naïve Bayes vs. Logistic Regression. Learning h: X → Y, where X are the features and Y the target classes.
Generative (Naïve Bayes): assume a functional form for P(X|Y) (assuming conditional independence) and for P(Y); estimate the parameters from training data; Gaussian NB for continuous features; Bayes rule to calculate P(Y|X=x), P(Y|X) ∝ P(X|Y) P(Y); indirect computation; can also generate a sample of the data.
Discriminative (Logistic Regression): assume a functional form for P(Y|X), with no assumptions on P(X|Y); estimate the parameters from training data; handles discrete and continuous features; directly calculates P(Y|X=x); can't generate a data sample.

Remember Gaussian Naïve Bayes? Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ). P(X_i = x | Y = y_k) = N(μ_ik, σ_ik), where N(μ_ik, σ_ik) = (1 / (σ_ik √(2π))) exp(−(x − μ_ik)² / (2σ_ik²)). P(Y|X) ∝ P(X|Y) P(Y).

Gaussian Naïve Bayes vs. Logistic Regression. Learning h: X → Y, where X are real-valued features and Y the target classes. Generative: assume a functional form for P(X|Y) (assume the X_i are conditionally independent given Y) and for P(Y); estimate the parameters from training data. Gaussian NB for continuous features: model P(X_i | Y = y_k) as a Gaussian N(μ_ik, σ_i) and model P(Y) as Bernoulli(π, 1−π); use Bayes rule to calculate P(Y|X=x), P(Y|X) ∝ P(X|Y) P(Y). What can we say about the form of P(Y=1|X)? Cool!!!!

Derive the form of P(Y|X) for continuous X_i. Up to now, all the arithmetic has been only for Naïve Bayes models. Looks like a setting for w_0? Can we solve for w_i? Yes, but only in the Gaussian case.

Ratio of class-conditional probabilities: (the log of) the ratio is a linear function of X! The coefficients are expressed in terms of the original Gaussian parameters!

Derive the form of P(Y|X) for continuous X_i: just like Logistic Regression!!!
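A sketch of the derivation these slides walk through, for GNB with P(Y=1) = π and class-independent variances σ_i (my symbols, following the usual presentation):

```latex
P(Y{=}1 \mid X)
\;=\; \frac{1}{1 + \dfrac{P(Y{=}0)\,P(X \mid Y{=}0)}{P(Y{=}1)\,P(X \mid Y{=}1)}}
\;=\; \frac{1}{1 + \exp\!\Big( \ln\frac{1-\pi}{\pi}
+ \sum_i \Big[ \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^{2}}\, X_i
+ \frac{\mu_{i1}^{2}-\mu_{i0}^{2}}{2\sigma_i^{2}} \Big] \Big)},
```

which is exactly the logistic-regression form, with w_i = (μ_{i0} − μ_{i1})/σ_i² and w_0 collecting the remaining terms.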

Gaussian Naïve Bayes vs. Logistic Regression. A set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) corresponds to a set of Logistic Regression parameters. (One can go both ways; we only did one direction.) Representation equivalence, but only in a special case!!! (GNB with class-independent variances.) But what's the difference??? LR makes no assumptions about P(X|Y) in learning! They optimize different functions and obtain different solutions.

Naïve Bayes vs. Logistic Regression. Generative vs. discriminative classifiers: asymptotic comparison (# training examples → infinity). When the model is correct, GNB (with class-independent variances) and LR produce identical classifiers. When the model is incorrect, LR is less biased (it does not assume conditional independence), and therefore LR is expected to outperform GNB. [Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]. Generative vs. discriminative classifiers: non-asymptotic analysis of the convergence rate of parameter estimates (n = # of attributes in X), i.e., the size of the training data needed to get close to the infinite-data solution. Naïve Bayes needs O(log n) samples; Logistic Regression needs O(n) samples. GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.

Naïve Bayes vs. Logistic Regression: some experiments on UCI data sets.

What you should know about Logistic Regression (LR). Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function. In general, NB and LR make different assumptions. NB: features independent given the class, an assumption on P(X|Y). LR: a functional form for P(Y|X), no assumption on P(X|Y). LR is a linear classifier: the decision rule is a hyperplane. LR is optimized by conditional likelihood: no closed-form solution, but the objective is concave, so gradient ascent finds the global optimum. Maximum conditional a posteriori corresponds to regularization. Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.