CS 688 Pattern Recognition Lecture 4. Linear Models for Classification



CS 688 Pattern Recognition, Lecture 4: Linear Models for Classification. Topics: probabilistic generative models; probabilistic discriminative models.

Generative Approach. By Bayes' theorem, the posterior class probabilities are

p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}.

Two-class case: the posterior can be written as a logistic sigmoid function.
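Written out for the two-class case, this is the standard form from the textbook these slides follow (Bishop, PRML Chapter 4):

p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a), \qquad a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}.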

K > 2 classes: the posterior takes the form of a softmax function. Let's assume the class-conditional densities are Gaussian with the same covariance matrix, p(x \mid C_k) = N(x \mid \mu_k, \Sigma). Two-class case first. We can show the following result:
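The result referred to here is the standard one for shared-covariance Gaussians (Bishop, PRML Section 4.2.1): the posterior is a logistic sigmoid acting on a linear function of x,

p(C_1 \mid x) = \sigma(w^T x + w_0), \qquad w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\tfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.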


We have shown that p(C_1 \mid x) = \sigma(w^T x + w_0): the argument of the sigmoid is linear in x. Decision boundary: the set of points where p(C_1 \mid x) = p(C_2 \mid x) = 0.5, i.e., the hyperplane w^T x + w_0 = 0, which is linear in x.

K > 2 classes: We can show the following result (written out below). The cancellation of the quadratic terms is due to the assumption of shared covariances. If we allow each class-conditional density to have its own covariance matrix, the cancellations no longer occur, and we obtain a quadratic function of x, i.e., a quadratic discriminant.
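For K > 2 classes with shared covariance, the standard result (Bishop, PRML Section 4.2.1) is that the posteriors are softmax functions of linear discriminants a_k(x):

p(C_k \mid x) = \frac{\exp(a_k(x))}{\sum_j \exp(a_j(x))}, \qquad a_k(x) = w_k^T x + w_{k0}, \qquad w_k = \Sigma^{-1} \mu_k, \qquad w_{k0} = -\tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).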

Maximum likelihood solution. We have a parametric functional form for the class-conditional densities, p(x \mid C_k) = N(x \mid \mu_k, \Sigma). We can estimate the parameters and the prior class probabilities using maximum likelihood. Consider the two-class case with a shared covariance matrix. Training data: \{x_n, t_n\}, n = 1, \dots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2; write \pi = p(C_1), so p(C_2) = 1 - \pi.

Maximum likelihood solution. For a data point x_n from class C_1 we have t_n = 1 and therefore p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, N(x_n \mid \mu_1, \Sigma). For a data point x_n from class C_2 we have t_n = 0 and therefore p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, N(x_n \mid \mu_2, \Sigma). Assuming observations are drawn independently, we can write the likelihood function as follows:

p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(t_n \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} [\pi\, N(x_n \mid \mu_1, \Sigma)]^{t_n}\, [(1 - \pi)\, N(x_n \mid \mu_2, \Sigma)]^{1 - t_n}, \qquad \mathbf{t} = (t_1, \dots, t_N)^T.

Maximum likelihood solution. We want to find the values of the parameters that maximize the likelihood function, i.e., fit the model that best describes the observed data:

p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} [\pi\, N(x_n \mid \mu_1, \Sigma)]^{t_n}\, [(1 - \pi)\, N(x_n \mid \mu_2, \Sigma)]^{1 - t_n}.

As usual, we consider the log of the likelihood:

\ln p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \sum_{n=1}^{N} \{ t_n \ln \pi + t_n \ln N(x_n \mid \mu_1, \Sigma) + (1 - t_n) \ln(1 - \pi) + (1 - t_n) \ln N(x_n \mid \mu_2, \Sigma) \}.

We first maximize the log likelihood with respect to \pi. The terms that depend on \pi are \sum_n \{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \}.

Maximum likelihood solution. Setting the derivative with respect to \pi to zero gives

\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N_1 + N_2},

where N_1 and N_2 are the numbers of points in classes C_1 and C_2. Thus, the maximum likelihood estimate of \pi is the fraction of points in class C_1. The result can be generalized to the multiclass case: the maximum likelihood estimate of p(C_k) is given by the fraction of points in the training set that belong to C_k. We now maximize the log likelihood with respect to \mu_1. The terms that depend on \mu_1 are

\sum_{n=1}^{N} t_n \ln N(x_n \mid \mu_1, \Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const}.

Maximum likelihood solution. Setting the derivative with respect to \mu_1 to zero gives

\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n.

Thus, the maximum likelihood estimate of \mu_1 is the sample mean of all the input vectors assigned to class C_1. By maximizing the log likelihood with respect to \mu_2 we obtain the similar result

\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n.

Maximizing the log likelihood with respect to \Sigma, we obtain the maximum likelihood estimate

\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2,

where S_k is the sample covariance matrix of the data points assigned to class C_k. Thus, the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance matrices associated with each of the classes. This result extends to K classes.
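A minimal NumPy sketch of these closed-form estimates, assuming a (N, D) data matrix X and 0/1 labels t; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML estimates for the two-class, shared-covariance Gaussian model.
    X: (N, D) array of inputs; t: (N,) array of 0/1 labels (1 -> C1, 0 -> C2)."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                   # prior p(C1): fraction of points in C1
    mu1 = (t[:, None] * X).sum(axis=0) / N1       # sample mean of class C1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2 # sample mean of class C2
    X1c = X[t == 1] - mu1
    X2c = X[t == 0] - mu2
    S1 = X1c.T @ X1c / N1                         # per-class sample covariances
    S2 = X2c.T @ X2c / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average of S1 and S2
    return pi, mu1, mu2, Sigma
```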

Probabilistic Discriminative Models. Two-class case: p(C_1 \mid x) = \sigma(w^T x). Multiclass case: p(C_k \mid x) = \frac{\exp(w_k^T x)}{\sum_j \exp(w_j^T x)} (a bias term can be absorbed into w by appending a constant feature to x). Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood. Advantages: fewer parameters to be determined; improved predictive performance, especially when the class-conditional density assumptions give a poor approximation of the true distributions.

Probabilistic Discriminative Models. Two-class case: p(C_1 \mid x) = \sigma(w^T x). In the terminology of statistics, this model is known as logistic regression. Assuming x is M-dimensional, how many parameters do we need to estimate? How many parameters did we estimate to fit Gaussian class-conditional densities (generative approach)?
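For comparison, the usual count (Bishop, PRML Section 4.3.2) for an M-dimensional x: logistic regression has M adjustable parameters (M + 1 with an explicit bias). The generative approach with shared-covariance Gaussians requires 2M parameters for the two means, M(M+1)/2 for the shared covariance matrix, and one for the prior, i.e.

\frac{M(M+5)}{2} + 1

parameters in total, which grows quadratically with M.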

Logistic Regression. We use maximum likelihood to determine the parameters of the logistic regression model. Training data: \{x_n, t_n\}, n = 1, \dots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2. We want to find the values of w that maximize the posterior probabilities associated with the observed data. Likelihood function:

L(w) = \prod_{n=1}^{N} P(C_1 \mid x_n)^{t_n} \left( 1 - P(C_1 \mid x_n) \right)^{1 - t_n} = \prod_{n=1}^{N} y(x_n)^{t_n} \left( 1 - y(x_n) \right)^{1 - t_n}, \qquad y(x_n) = \sigma(w^T x_n).

Logistic Regression. We consider the negative logarithm of the likelihood (the cross-entropy error function):

E(w) = -\ln L(w) = -\sum_{n=1}^{N} \{ t_n \ln y(x_n) + (1 - t_n) \ln(1 - y(x_n)) \}.

Logistic Regression. We compute the derivative of the error function with respect to w (the gradient). For this we need the derivative of the logistic sigmoid function:

\frac{\partial \sigma(a)}{\partial a} = \frac{\partial}{\partial a} \frac{1}{1 + e^{-a}} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \left( 1 - \frac{1}{1 + e^{-a}} \right) = \sigma(a) \left( 1 - \sigma(a) \right).
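With y_n = \sigma(w^T x_n), applying the chain rule to E(w) gives the gradient in its standard form (Bishop, PRML Section 4.3.2):

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n.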

Logistic Regression. The gradient of E at w gives the direction of steepest increase of E at w. We need to minimize E, so we update w by moving in the opposite direction of the gradient:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E(w^{(\tau)}).

This technique is called gradient descent. It can be shown that E is a convex function of w; thus, it has a unique minimum. An efficient iterative technique exists to find the optimal parameters w (Newton-Raphson optimization). Batch vs. on-line learning: the computation of the above gradient requires processing the entire training set (batch technique). If the data set is large, this can be costly. For real-time applications in which data become available as a continuous stream, we may want to update the parameters as data points are presented to us (on-line technique).
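A minimal NumPy sketch of the batch gradient-descent procedure described above; the function names and the default step size are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_batch(X, t, eta=0.01, n_iters=1000):
    """Batch gradient descent for two-class logistic regression.
    X: (N, D) inputs (append a column of ones for a bias term);
    t: (N,) array of 0/1 targets."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = sigmoid(X @ w)      # predicted P(C1 | x_n) for all n
        grad = X.T @ (y - t)    # gradient of the cross-entropy error, sum_n (y_n - t_n) x_n
        w -= eta * grad         # step against the gradient; eta must be small enough to converge
    return w
```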

On-line learning. After the presentation of each data point n, we compute the contribution of that data point to the gradient (the stochastic gradient):

\nabla E_n(w) = (y_n - t_n)\, x_n, \qquad y_n = \sigma(w^T x_n).

The on-line updating rule for the parameters becomes:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E_n(w^{(\tau)}).

\eta > 0 is called the learning rate. Its value needs to be chosen carefully to ensure convergence. Multiclass Logistic Regression. Multiclass case: we use maximum likelihood to determine the parameters of the logistic regression model. Training data: \{x_n, t_n\}, n = 1, \dots, N, where t_n = (0, \dots, 0, 1, 0, \dots, 0)^T with a 1 in position k denotes class C_k (1-of-K coding). We want to find the values of w_1, \dots, w_K that maximize the posterior probabilities associated with the observed data. Likelihood function:

L(w_1, \dots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} P(C_k \mid x_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_k(x_n)^{t_{nk}}.

Multiclass Logistic Regression. We consider the negative logarithm of the likelihood:

E(w_1, \dots, w_K) = -\ln L(w_1, \dots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(x_n).

Multiclass Logistic Regression. We compute the gradient of the error function with respect to one of the parameter vectors, w_j.

Multiclass Logistic Regression. Thus, we need to compute the derivatives of the softmax function y_k = \frac{e^{a_k}}{\sum_j e^{a_j}}. For j = k:

\frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k} \frac{e^{a_k}}{\sum_j e^{a_j}} = \frac{e^{a_k} \sum_j e^{a_j} - e^{a_k} e^{a_k}}{\left( \sum_j e^{a_j} \right)^2} = y_k - y_k^2 = y_k (1 - y_k).

For j \neq k:

\frac{\partial y_k}{\partial a_j} = \frac{\partial}{\partial a_j} \frac{e^{a_k}}{\sum_{j'} e^{a_{j'}}} = -\frac{e^{a_k}\, e^{a_j}}{\left( \sum_{j'} e^{a_{j'}} \right)^2} = -y_k\, y_j.

Multiclass Logistic Regression. Compact expression:
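Combining the two cases, and then applying the chain rule to E, gives the standard compact expressions (Bishop, PRML Section 4.3.4):

\frac{\partial y_k}{\partial a_j} = y_k \left( I_{kj} - y_j \right), \qquad \nabla_{w_j} E(w_1, \dots, w_K) = \sum_{n=1}^{N} \left( y_j(x_n) - t_{nj} \right) x_n,

where I_{kj} are the elements of the identity matrix.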

Multiclass Logistic Regression. It can be shown that E is a convex function of the parameters w_1, \dots, w_K; thus, it has a unique minimum. For a batch solution, we can use the Newton-Raphson optimization technique. On-line solution (stochastic gradient descent): after the presentation of data point n, update each parameter vector as

w_j^{(\tau+1)} = w_j^{(\tau)} - \eta \left( y_j(x_n) - t_{nj} \right) x_n.
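A minimal NumPy sketch of this on-line (stochastic gradient descent) solution for the multiclass model; the function names, default learning rate, and epoch count are illustrative, not from the slides:

```python
import numpy as np

def softmax(a):
    a = a - a.max()                  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def fit_softmax_sgd(X, T, eta=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent for multiclass logistic regression.
    X: (N, D) inputs (append a ones column for bias terms);
    T: (N, K) 1-of-K coded targets."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = T.shape[1]
    W = np.zeros((K, D))             # one weight vector w_k per class (rows of W)
    for _ in range(n_epochs):
        for n in rng.permutation(N):             # present data points in random order
            y = softmax(W @ X[n])                # posteriors p(C_k | x_n)
            W -= eta * np.outer(y - T[n], X[n])  # w_j <- w_j - eta * (y_j - t_nj) x_n
        # in practice eta may be decreased over epochs to aid convergence
    return W
```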