CSC 411: Lecture 07: Multiclass Classification




CSC 411: Lecture 07: Multiclass Classification
Class based on Raquel Urtasun & Rich Zemel's lectures
Sanja Fidler, University of Toronto, Feb 1, 2016
Urtasun, Zemel, Fidler (UofT) CSC 411: 07-Multiclass Classification Feb 1, 2016 1 / 17

Today
Multi-class classification with:
- Least-squares regression
- Logistic regression
- K-NN
- Decision trees

Discriminant Functions for K > 2 classes
First idea: Use K - 1 classifiers, each solving a two-class problem of separating points in a class C_k from points not in that class.
Known as the 1-vs-all (or 1-vs-the-rest) classifier.
PROBLEM: More than one good answer for the green region!
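As a minimal sketch (not from the slides), the ambiguity can be resolved in practice by picking the one-vs-rest classifier with the highest score; the score matrix below is hypothetical toy data:

```python
import numpy as np

def one_vs_rest_predict(scores):
    # scores: (N, K) matrix where scores[n, k] is the confidence of the
    # k-th one-vs-rest classifier that sample n belongs to class k.
    # Break "more than one good answer" ties by taking the argmax.
    return np.argmax(scores, axis=1)

# Toy example: 3 samples, 3 one-vs-rest classifiers.
scores = np.array([[ 0.9, -0.2, 0.1],
                   [ 0.4,  0.5, 0.3],   # two classifiers both say "yes"
                   [-0.1, -0.3, 0.8]])
print(one_vs_rest_predict(scores))  # -> [0 1 2]
```

The second sample falls in an ambiguous region (two classifiers fire), and the argmax picks the more confident one.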

Discriminant Functions for K > 2 classes
Another simple idea: Introduce K(K - 1)/2 two-way classifiers, one for each possible pair of classes.
Each point is classified according to a majority vote amongst the discriminant functions.
Known as the 1-vs-1 classifier.
PROBLEM: Two-way preferences need not be transitive!
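A small sketch of the 1-vs-1 vote, with a deliberately non-transitive set of pairwise winners (the `cycle` table is a made-up example) to show the failure mode the slide warns about:

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(pairwise_winner, K):
    # pairwise_winner(i, j) returns the winning class (i or j) for pair i < j.
    votes = np.zeros(K, dtype=int)
    for i, j in combinations(range(K), 2):
        votes[pairwise_winner(i, j)] += 1
    return np.argmax(votes), votes

# Non-transitive preferences: 0 beats 1, 1 beats 2, but 2 beats 0.
cycle = {(0, 1): 0, (1, 2): 1, (0, 2): 2}
pred, votes = one_vs_one_predict(lambda i, j: cycle[(i, j)], K=3)
print(votes)  # -> [1 1 1], a three-way tie with no clear winner
```

Every class gets exactly one vote, so the majority vote cannot decide; this is the intransitivity problem in action.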

K-Class Discriminant
We can avoid these problems by considering a single K-class discriminant comprising K functions of the form
    y_k(x) = w_k^T x + w_k0
and then assigning a point x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
Note that w_k is now a vector, not the k-th coordinate.
The decision boundary between class C_j and class C_k is given by y_j(x) = y_k(x), and thus it is a (D - 1)-dimensional hyperplane defined by
    (w_k - w_j)^T x + (w_k0 - w_j0) = 0
What about the binary case? Is this different? What is the shape of the overall decision boundary?

K-Class Discriminant
The decision regions of such a discriminant are always singly connected and convex.
In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins the pair is also within the object.
Which object is convex?

K-Class Discriminant
The decision regions of such a discriminant are always singly connected and convex.
Consider two points x_A and x_B that lie inside decision region R_k.
Any convex combination x̂ of those points will also be in R_k:
    x̂ = λ x_A + (1 - λ) x_B

Proof
Take a convex combination point, i.e., λ ∈ [0, 1]:
    x̂ = λ x_A + (1 - λ) x_B
From the linearity of the classifier y_k(x):
    y_k(x̂) = λ y_k(x_A) + (1 - λ) y_k(x_B)
Since x_A and x_B are in R_k, it follows that y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for all j ≠ k.
Since λ and 1 - λ are non-negative, y_k(x̂) > y_j(x̂) for all j ≠ k, so x̂ is inside R_k.
Thus R_k is singly connected and convex.

Example

Multi-class Classification with Linear Regression
From before we have y_k(x) = w_k^T x + w_k0, which can be rewritten as
    y(x) = W^T x̃
where the k-th column of W is [w_k0, w_k^T]^T, and x̃ is [1, x^T]^T.
Training: How can I find the weights W with the standard sum-of-squares regression loss?
1-of-K encoding: For multi-class problems (with K classes), instead of using t = k (target has label k) we often use a 1-of-K encoding, i.e., a vector of K target values containing a single 1 for the correct class and zeros elsewhere.
Example: For a 4-class problem, we would write a target with class label 2 as:
    t = [0, 1, 0, 0]^T
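The 1-of-K encoding above can be sketched in a few lines; the helper name `one_hot` is just illustrative:

```python
import numpy as np

def one_hot(labels, K):
    # Convert 0-based integer class labels to 1-of-K target vectors.
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# The slide's 4-class example: class label 2 (0-based index 1).
print(one_hot([1], K=4))  # -> [[0. 1. 0. 0.]]
```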

Multi-class Classification with Linear Regression
Sum-of-squares loss:
    l(W) = Σ_{n=1..N} || W^T x̃^(n) - t^(n) ||^2 = || X̃ W - T ||_F^2
where the n-th row of X̃ is [x̃^(n)]^T, and the n-th row of T is [t^(n)]^T.
Setting the derivative w.r.t. W to 0, we get:
    W = (X̃^T X̃)^{-1} X̃^T T
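A minimal sketch of the closed-form solution on hypothetical toy data; `np.linalg.lstsq` solves the same normal equations W = (X̃^T X̃)^{-1} X̃^T T, but more stably than forming the inverse explicitly:

```python
import numpy as np

def fit_linear_multiclass(X, T):
    # X: (N, D) inputs; T: (N, K) 1-of-K targets.
    # Prepend a bias column to form X_tilde, then solve the least-squares
    # problem whose closed form is W = (X~^T X~)^{-1} X~^T T.
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    W, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)
    return W  # shape (D+1, K); column k holds [w_k0, w_k]

# Hypothetical linearly separable 2-class toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
T = np.eye(2)[labels]
W = fit_linear_multiclass(X, T)
pred = np.argmax(np.hstack([np.ones((100, 1)), X]) @ W, axis=1)
print((pred == labels).mean())  # high accuracy on this separable toy set
```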

Multi-class Logistic Regression
Associate a set of weights with each class, then use a normalized exponential output
    p(C_k | x) = y_k(x) = exp(z_k) / Σ_j exp(z_j)
where the activations are given by
    z_k = w_k^T x
The function exp(z_k) / Σ_j exp(z_j) is called the softmax function.
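A small sketch of the softmax; subtracting the max activation first is a standard numerical-stability trick (not on the slides) that leaves the result unchanged, since exp(z_k - c) / Σ_j exp(z_j - c) = exp(z_k) / Σ_j exp(z_j):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability, then normalize.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # a probability vector: non-negative, sums to 1
```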

Multi-class Logistic Regression
The likelihood:
    p(T | w_1, ..., w_K) = Π_{n=1..N} Π_{k=1..K} p(C_k | x^(n))^{t_k^(n)}
with
    p(C_k | x) = y_k(x) = exp(z_k) / Σ_j exp(z_j)
where the n-th row of T is the 1-of-K encoding of example n and
    z_k = w_k^T x + w_k0
What assumptions have I used to derive the likelihood?
Derive the loss by computing the negative log-likelihood:
    E(w_1, ..., w_K) = -log p(T | w_1, ..., w_K) = -Σ_{n=1..N} Σ_{k=1..K} t_k^(n) log[ y_k(x^(n)) ]
This is known as the cross-entropy error for multiclass classification.
How do we obtain the weights?

Training Multi-class Logistic Regression
How do we obtain the weights?
    E(w_1, ..., w_K) = -log p(T | w_1, ..., w_K) = -Σ_{n=1..N} Σ_{k=1..K} t_k^(n) log[ y_k(x^(n)) ]
Do gradient descent, where the derivatives follow from the chain rule:
    ∂E/∂z_k^(n) = Σ_j (∂E/∂y_j^(n)) (∂y_j^(n)/∂z_k^(n)),  with  ∂y_j^(n)/∂z_k^(n) = y_j^(n) (δ(k, j) - y_k^(n))
and ∂z_k^(n)/∂w_k,j = x_j^(n). Putting these together:
    ∂E/∂w_k,j = Σ_{n=1..N} (y_k^(n) - t_k^(n)) x_j^(n)
The derivative is the error times the input.
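The gradient-descent update can be sketched as below; the learning rate, iteration count, and toy data are all illustrative choices, not from the slides:

```python
import numpy as np

def train_softmax_regression(X, T, lr=0.1, iters=500):
    # Batch gradient descent on the cross-entropy loss.
    # Gradient for weight w_{k,j}: sum_n (y_k(n) - t_k(n)) * x_j(n),
    # i.e., "the error times the input".
    N, D = X.shape
    K = T.shape[1]
    X_tilde = np.hstack([np.ones((N, 1)), X])  # prepend bias feature
    W = np.zeros((D + 1, K))
    for _ in range(iters):
        Z = X_tilde @ W
        Z -= Z.max(axis=1, keepdims=True)      # stabilize the softmax
        Y = np.exp(Z)
        Y /= Y.sum(axis=1, keepdims=True)
        W -= lr * X_tilde.T @ (Y - T) / N      # gradient step
    return W

# Hypothetical linearly separable 2-class toy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
labels = (2 * X[:, 0] - X[:, 1] > 0).astype(int)
T = np.eye(2)[labels]
W = train_softmax_regression(X, T)
pred = np.argmax(np.hstack([np.ones((200, 1)), X]) @ W, axis=1)
print((pred == labels).mean())  # near-perfect on this separable toy set
```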

Softmax for 2 Classes
Let's write the probability of one of the classes:
    p(C_1 | x) = y_1(x) = exp(z_1) / Σ_j exp(z_j) = exp(z_1) / (exp(z_1) + exp(z_2))
I can equivalently write this as
    p(C_1 | x) = y_1(x) = exp(z_1) / (exp(z_1) + exp(z_2)) = 1 / (1 + exp(-(z_1 - z_2)))
So the logistic is just a special case that avoids using redundant parameters.
Rather than having two separate sets of weights for the two classes, combine them into one:
    z = z_1 - z_2 = w_1^T x - w_2^T x = w^T x
The over-parameterization of the softmax is because the probabilities must add to 1.
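The equivalence is easy to check numerically; the activation values below are arbitrary:

```python
import numpy as np

z1, z2 = 1.3, -0.4
softmax_p1 = np.exp(z1) / (np.exp(z1) + np.exp(z2))   # 2-class softmax
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(z1 - z2)))         # logistic of z = z1 - z2
print(softmax_p1, sigmoid_p1)  # identical up to floating point
```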

Multi-class K-NN
K-NN can directly handle multi-class problems.

Multi-class Decision Trees
Decision trees can directly handle multi-class problems.
How is this decision tree constructed?