CSCI567 Machine Learning (Fall 2014)




CSCI567 Machine Learning (Fall 2014)
Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu)
September 22, 2014

Outline
1 Administration
2 Logistic Regression - continued
3 Multiclass classification

Administration
A few announcements:
Homework 1: due 9/24 (see the homework sheets for detailed submission information)
Revised lecture slides are on Blackboard and the course website

Outline
1 Administration
2 Logistic Regression - continued
  Logistic regression
  Numerical optimization
  Gradient descent
  Gradient descent for logistic regression
  Newton method
3 Multiclass classification

Logistic classification: setup for two classes
Input: x ∈ R^D
Output: y ∈ {0, 1}
Training data: D = {(x_n, y_n), n = 1, 2, ..., N}
Model of the conditional distribution:
p(y = 1 | x; b, w) = σ[g(x)], where g(x) = b + Σ_d w_d x_d = b + w^T x
Linear decision boundary: g(x) = b + w^T x = 0

Maximum likelihood estimation
Cross-entropy error (negative log-likelihood):
E(b, w) = -Σ_n {y_n log σ(b + w^T x_n) + (1 - y_n) log[1 - σ(b + w^T x_n)]}
Numerical optimization:
Gradient descent: simple, scalable to large-scale problems
Newton method: fast but not scalable

Shorthand notation
For convenience, append 1 to x: x ← [1 x_1 x_2 ... x_D], and append b to w: w ← [b w_1 w_2 ... w_D].
The cross-entropy is then
E(w) = -Σ_n {y_n log σ(w^T x_n) + (1 - y_n) log[1 - σ(w^T x_n)]}
NB. We are not using x̃ and w̃ (as in several textbooks) for cosmetic reasons.
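The cross-entropy above can be sketched in a few lines of NumPy. The function names and toy data here are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(a):
    # the logistic function sigma(a) = 1 / (1 + e^{-a})
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y):
    # E(w) = -sum_n { y_n log sigma(w^T x_n) + (1 - y_n) log[1 - sigma(w^T x_n)] }
    # X is (N, D+1) with a leading column of 1s (the appended feature);
    # w is (D+1,) with the bias b as its first entry; y is (N,) in {0, 1}
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# two toy samples: the first column of X is the appended 1
X = np.array([[1.0, -2.0],
              [1.0,  3.0]])
y = np.array([0.0, 1.0])
print(cross_entropy(np.array([0.0, 1.0]), X, y))  # small: this w fits both labels
print(cross_entropy(np.zeros(2), X, y))           # 2 log 2: p = 0.5 everywhere
```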

How do we find the optimal parameters for logistic regression?
We will minimize the error function
E(w) = -Σ_n {y_n log σ(w^T x_n) + (1 - y_n) log[1 - σ(w^T x_n)]}
However, this function is complex and we cannot find a simple closed-form solution as we did in Naive Bayes, so we need to use numerical methods. Numerical methods are messier, in contrast to cleaner analytic solutions. In practice, we often have to tune a few optimization parameters: patience is necessary.

An overview of numerical methods
We describe two:
Gradient descent (our focus in lecture): simple, especially effective for large-scale problems
Newton method: classical and powerful method
Gradient descent is often referred to as a first-order method, as it requires computing only the gradients (i.e., the first-order derivatives) of the function. In contrast, Newton method is often referred to as a second-order method.

Gradient descent example: min f(θ) = 0.5(θ_1^2 - θ_2)^2 + 0.5(θ_1 - 1)^2
We compute the gradients
∂f/∂θ_1 = 2(θ_1^2 - θ_2)θ_1 + θ_1 - 1   (1)
∂f/∂θ_2 = -(θ_1^2 - θ_2)   (2)
and use the following iterative procedure for gradient descent:
1 Initialize θ_1^(0) and θ_2^(0), and set t = 0
2 do
  θ_1^(t+1) ← θ_1^(t) - η [2((θ_1^(t))^2 - θ_2^(t)) θ_1^(t) + θ_1^(t) - 1]   (3)
  θ_2^(t+1) ← θ_2^(t) + η ((θ_1^(t))^2 - θ_2^(t))   (4)
  t ← t + 1   (5)
3 until f(θ^(t)) does not change much
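The procedure above can be sketched directly in Python; the step size, starting point, and stopping tolerance are illustrative choices:

```python
import numpy as np

def f(theta):
    t1, t2 = theta
    return 0.5 * (t1**2 - t2)**2 + 0.5 * (t1 - 1)**2

def grad_f(theta):
    t1, t2 = theta
    return np.array([2.0 * (t1**2 - t2) * t1 + t1 - 1.0,   # df/dtheta_1
                     -(t1**2 - t2)])                        # df/dtheta_2

eta = 0.1                       # step size (illustrative choice)
theta = np.array([0.0, 0.0])    # theta^(0)
for t in range(10000):
    new_theta = theta - eta * grad_f(theta)
    if abs(f(new_theta) - f(theta)) < 1e-12:  # "does not change much"
        break
    theta = new_theta

print(theta)  # approaches the unique minimizer (1, 1), where f = 0
```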

Gradient descent: general form for minimizing f(θ)
θ^(t+1) ← θ^(t) - η ∂f/∂θ
Remarks:
η is often called the step size: literally, how far our update goes along the direction of the negative gradient. Note that this is for minimizing a function, hence the subtraction (-η).
With a suitable choice of η, the iterative procedure converges to a stationary point where ∂f/∂θ = 0.
A stationary point is only a necessary condition for a minimum.

Seeing it in action
Choosing the right η is important: a small η is too slow, while a large η is too unstable.
[Two contour plots of f with gradient descent trajectories, one for small η and one for large η]

How do we do this for logistic regression?
Simple fact: derivatives of σ(a)
dσ(a)/da = d/da [1/(1 + e^{-a})] = e^{-a}/(1 + e^{-a})^2 = [1/(1 + e^{-a})][1 - 1/(1 + e^{-a})] = σ(a)[1 - σ(a)]
d log σ(a)/da = 1 - σ(a)
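These identities are easy to sanity-check numerically; the snippet below compares them against centered finite differences (the test point a and width h are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, h = 0.7, 1e-6  # arbitrary test point, finite-difference width

# d sigma/da should equal sigma(a)[1 - sigma(a)]
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
analytic = sigmoid(a) * (1 - sigmoid(a))
print(numeric, analytic)

# d log sigma/da should equal 1 - sigma(a)
numeric_log = (np.log(sigmoid(a + h)) - np.log(sigmoid(a - h))) / (2 * h)
print(numeric_log, 1 - sigmoid(a))
```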

Gradients of the cross-entropy error function
∂E(w)/∂w = -Σ_n {y_n [1 - σ(w^T x_n)] x_n - (1 - y_n) σ(w^T x_n) x_n}   (6)
         = Σ_n {σ(w^T x_n) - y_n} x_n   (7)
Remarks:
e_n = σ(w^T x_n) - y_n is called the error for the nth training sample.
Stationary point (in this case, the optimum): Σ_n σ(w^T x_n) x_n = Σ_n y_n x_n
Intuition: on average, the error is zero.
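The final form of the gradient, Σ_n {σ(w^T x_n) - y_n} x_n, vectorizes naturally; the helper name grad_E and the random toy data below are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_E(w, X, y):
    # sum_n { sigma(w^T x_n) - y_n } x_n, computed for all samples at once
    errors = sigmoid(X @ w) - y   # e_n, the per-sample errors
    return X.T @ errors

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # 1 appended to each x
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
w = np.array([0.1, -0.2, 0.3])
print(grad_E(w, X, y))
```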

Gradient descent for logistic regression
Choose a proper step size η > 0, then iteratively update the parameters following the negative gradient to minimize the error function:
w^(t+1) ← w^(t) - η Σ_n {σ(w^(t)T x_n) - y_n} x_n
Remarks:
The step size needs to be chosen carefully to ensure convergence.
The step size can be adaptive (i.e., varying from iteration to iteration); for example, we can use techniques such as line search.
There is a variant called stochastic gradient descent, also popularly used (later in this semester).
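Putting the update rule into a loop gives a complete, if bare-bones, trainer; the step size, iteration count, and toy data are illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, eta=0.1, iters=5000):
    # batch gradient descent with a fixed step size eta
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * (X.T @ (sigmoid(X @ w) - y))
    return w

# toy 1-D problem: negative x -> class 0, positive x -> class 1
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])   # append the 1 feature
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = fit_logistic_gd(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(float)
print(preds)  # matches y on this easy, separable data
```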

Intuition for Newton method
Approximate the true function with an easy-to-solve optimization problem.
[Figures: f(x) and its quadratic approximation f_quad(x) around x_k, with the step from x_k to x_k + d_k]

Approximation
Taylor expansion of the cross-entropy function:
E(w) ≈ E(w^(t)) + (w - w^(t))^T ∇E(w^(t)) + (1/2)(w - w^(t))^T H^(t) (w - w^(t))
where ∇E(w^(t)) is the gradient and H^(t) is the Hessian matrix, both evaluated at w^(t).
Example for a scalar function:
sin(θ) ≈ sin(0) + θ cos(0) + (1/2) θ^2 [-sin(0)] = θ
where d sin(θ)/dθ = cos(θ) and H = d cos(θ)/dθ = -sin(θ).

So what is the Hessian matrix?
The matrix of second-order derivatives:
H = ∂^2 E(w) / ∂w ∂w^T, in other words H_ij = ∂/∂w_j (∂E(w)/∂w_i)
So the Hessian matrix is in R^{D×D}, where w ∈ R^D.

Optimizing the approximation
Minimize the approximation
E(w) ≈ E(w^(t)) + (w - w^(t))^T ∇E(w^(t)) + (1/2)(w - w^(t))^T H^(t) (w - w^(t))
and use the solution as the new estimate of the parameters:
w^(t+1) ← arg min_w (w - w^(t))^T ∇E(w^(t)) + (1/2)(w - w^(t))^T H^(t) (w - w^(t))
The quadratic function minimization has a closed form; thus we have
w^(t+1) ← w^(t) - (H^(t))^{-1} ∇E(w^(t))
i.e., the Newton method.
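A minimal sketch of this update for logistic regression. It uses the standard result H = Σ_n σ_n(1 - σ_n) x_n x_n^T, which the slides leave as a homework exercise, plus a tiny ridge term (an extra safeguard, not from the slides) to keep H invertible:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_newton(X, y, iters=20, ridge=1e-6):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        g = X.T @ (p - y)                      # gradient of E(w)
        S = p * (1 - p)                        # sigma_n (1 - sigma_n)
        H = X.T @ (S[:, None] * X) + ridge * np.eye(X.shape[1])
        w -= np.linalg.solve(H, g)             # w <- w - H^{-1} grad E(w)
    return w

# toy 1-D data with overlapping classes, so the optimum is finite
x = np.array([-2.0, -1.0, -0.3, 0.3, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
w = fit_logistic_newton(X, y)
print(w)  # the gradient at this w is (numerically) zero
```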

Contrast gradient descent and Newton method
Similar: both are iterative procedures.
Different: Newton method requires second-order derivatives, but it does not have the magic η to be set.

Other important things about the Hessian
Our cross-entropy error function is convex:
∂E(w)/∂w = Σ_n {σ(w^T x_n) - y_n} x_n   (8)
H = ∂^2 E(w)/∂w ∂w^T = (homework)   (9)
For any vector v,
v^T H v = (homework) ≥ 0
Thus H is positive (semi)definite, and the cross-entropy error function is convex, with only one global optimum.

Good about Newton method: fast!
Suppose we want to minimize f(x) = x^2 + 2x and our current estimate is x^(t) = 1. What is the next estimate?
x^(t+1) ← x^(t) - [f''(x^(t))]^{-1} f'(x^(t)) = x^(t) - (1/2)(2x^(t) + 2) = -1
Namely, the next step (of iteration) immediately lands on the global optimum, x = -1! (In optimization, this is called a superlinear convergence rate.)
In general, the better our approximation, the faster the Newton method is in solving our optimization problem.
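The claim is easy to verify: for a quadratic, a single Newton step lands exactly on the stationary point, from any starting estimate (the helper name newton_step is illustrative):

```python
def newton_step(x, fprime, fsecond):
    # x^(t+1) = x^(t) - f'(x^(t)) / f''(x^(t))
    return x - fprime(x) / fsecond(x)

# f(x) = x^2 + 2x, so f'(x) = 2x + 2 and f''(x) = 2; the minimizer is x = -1
x0 = 1.0
x1 = newton_step(x0, lambda x: 2 * x + 2, lambda x: 2.0)
print(x1)  # -1.0, the global optimum, reached in one step
```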

Bad about Newton method: not scalable!
Computing and inverting the Hessian matrix can be very expensive for large-scale problems where the dimensionality D is very large.
Newton method does not guarantee convergence if your starting point is far away from the optimum.
NB. There are fixes and alternatives, such as quasi-Newton (quasi-second-order) methods.

Outline
1 Administration
2 Logistic Regression - continued
3 Multiclass classification
  Use binary classifiers as building blocks
  Multinomial logistic regression

Multiclass classification: setup
Suppose we need to predict multiple classes/outcomes C_1, C_2, ..., C_K:
Weather prediction: sunny, cloudy, raining, etc.
Optical character recognition: 10 digits + 26 characters (lower and upper cases) + special characters, etc.
Methods studied so far:
Nearest neighbor classifier
Naive Bayes
Gaussian discriminant analysis
Logistic regression

Logistic regression for predicting multiple classes? Easy
The approach of one versus the rest. For each class C_k, change the problem into binary classification:
1 Relabel training data with label C_k as positive (or "1")
2 Relabel all the rest of the data as negative (or "0")
This step is often called 1-of-K encoding: only one entry is nonzero and everything else is zero.
Example: for class C_2, the data go through the following change:
(x_1, C_1) → (x_1, 0), (x_2, C_3) → (x_2, 0), ..., (x_n, C_2) → (x_n, 1), ...
Train K binary classifiers using logistic regression to differentiate the two classes.
When predicting on x, combine the outputs of all binary classifiers.
1 What if all the classifiers say negative?
2 What if multiple classifiers say positive?
Take-home exercise: there are different combination strategies. Can you think of any?
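One-versus-the-rest can be sketched on a toy 3-class problem. The names, hyperparameters, and the "take the most confident classifier" combination rule below are all illustrative choices, since the slides leave the combination strategy as an exercise:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, eta=0.05, iters=5000):
    # plain gradient-descent logistic regression, labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * (X.T @ (sigmoid(X @ w) - y))
    return w

def one_vs_rest(X, labels, K):
    # train K classifiers: class k relabeled 1, everything else relabeled 0
    return [fit_binary(X, (labels == k).astype(float)) for k in range(K)]

def predict_ovr(ws, X):
    # one simple combination strategy: take the most confident classifier
    scores = np.column_stack([sigmoid(X @ w) for w in ws])
    return scores.argmax(axis=1)

# three well-separated clusters in 2-D; first column of X is the appended 1
pts = np.array([[0.0, 0.0], [0.5, 0.5],
                [4.0, 0.0], [4.5, 0.5],
                [0.0, 4.0], [0.5, 4.5]])
X = np.column_stack([np.ones(len(pts)), pts])
labels = np.array([0, 0, 1, 1, 2, 2])
ws = one_vs_rest(X, labels, K=3)
print(predict_ovr(ws, X))  # recovers the labels on this easy data
```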

Yet another easy approach
The approach of one versus one. For each pair of classes C_k and C_k', change the problem into binary classification:
1 Relabel training data with label C_k as positive (or "1")
2 Relabel training data with label C_k' as negative (or "0")
3 Disregard all other data
Ex: for classes C_1 and C_2,
(x_1, C_1), (x_2, C_3), (x_3, C_2), ... → (x_1, 1), (x_3, 0), ...
Train K(K - 1)/2 binary classifiers using logistic regression to differentiate each pair of classes.
When predicting on x, combine the outputs of all binary classifiers. There are K(K - 1)/2 votes!
Take-home exercise: can you think of any good combination strategies?
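A sketch of one-versus-one on the same kind of toy data; the majority-vote combination rule is one illustrative choice among those the exercise asks about:

```python
import numpy as np
from itertools import combinations

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, eta=0.05, iters=5000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * (X.T @ (sigmoid(X @ w) - y))
    return w

def one_vs_one(X, labels, K):
    # K(K-1)/2 classifiers; for pair (k, kp) all other classes are disregarded
    clfs = {}
    for k, kp in combinations(range(K), 2):
        mask = (labels == k) | (labels == kp)
        clfs[(k, kp)] = fit_binary(X[mask], (labels[mask] == k).astype(float))
    return clfs

def predict_ovo(clfs, X, K):
    # one simple combination strategy: majority vote over all pairwise outputs
    votes = np.zeros((X.shape[0], K))
    for (k, kp), w in clfs.items():
        winner = np.where(sigmoid(X @ w) > 0.5, k, kp)
        votes[np.arange(X.shape[0]), winner] += 1
    return votes.argmax(axis=1)

pts = np.array([[0.0, 0.0], [0.5, 0.5],
                [4.0, 0.0], [4.5, 0.5],
                [0.0, 4.0], [0.5, 4.5]])
X = np.column_stack([np.ones(len(pts)), pts])
labels = np.array([0, 0, 1, 1, 2, 2])
clfs = one_vs_one(X, labels, K=3)
print(predict_ovo(clfs, X, K=3))  # recovers the labels
```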

Contrast these two approaches
Pros and cons of each approach:
One versus the rest: only needs to train K classifiers. This makes a huge difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes?
One versus one: each classifier only needs to train on a smaller subset of the data (only the samples labeled with those two classes are involved). This makes a huge difference if you have a lot of data to go through.
Bad about both of them: combining classifier outputs seems to be a bit tricky.
Any other good methods?

Multinomial logistic regression
Intuition: from the decision rule of our Naive Bayes classifier,
y = arg max_c p(y = c | x) = arg max_c log p(x | y = c) p(y = c)   (10)
  = arg max_c [log π_c + Σ_k z_k log θ_ck] = arg max_c w_c^T x   (11)
Essentially, we are comparing
w_1^T x, w_2^T x, ..., w_C^T x   (12)
with one for each category.

First try
So, can we define the following conditional model?
p(y = c | x) = σ[w_c^T x]
This would not work, at least for the reason that
Σ_c p(y = c | x) = Σ_c σ[w_c^T x] ≠ 1
as each summand can be any number (independently) between 0 and 1. But we are close.

Definition of multinomial logistic regression
Model: for each class C_k, we have a parameter vector w_k and model the posterior probability as
p(C_k | x) = e^{w_k^T x} / Σ_k' e^{w_k'^T x}
This is called the softmax function.
Decision boundary: assign x the label with the maximum posterior:
arg max_k P(C_k | x) ⇔ arg max_k w_k^T x
Note: the notation has changed to denote the classes as C_k instead of just c.
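The softmax model is a few lines of NumPy. Subtracting the maximum score before exponentiating is a standard numerical safeguard (not on the slide) and does not change the result; the parameter matrix below is an illustrative example:

```python
import numpy as np

def softmax_posteriors(W, x):
    # p(C_k | x) = exp(w_k^T x) / sum_k' exp(w_k'^T x); rows of W are the w_k.
    # Subtracting the max score avoids overflow and leaves the ratios unchanged.
    scores = W @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 1.0])
p = softmax_posteriors(W, x)
print(p, p.sum())                      # a valid distribution: sums to 1
print(p.argmax() == (W @ x).argmax())  # True: same label as arg max_k w_k^T x
```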