PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION




Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical and computational properties. We now discuss an analogous class of models for solving classification problems. The goal in classification is to take an input vector x and assign it to one of K discrete classes C_k, where k = 1, ..., K. In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class.

The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. In this chapter we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector x and hence are defined by (D-1)-dimensional hyperplanes within the D-dimensional input space. Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable.

Discriminant Functions A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k. In this chapter we restrict attention to linear discriminants. To simplify the discussion, we consider first the case of two classes.

Two classes The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector, y(x) = w^T x + w0, where w is called a weight vector and w0 is a bias. The negative of the bias is sometimes called a threshold. An input vector x is assigned to class C_1 if y(x) >= 0 and to class C_2 otherwise.

The corresponding decision boundary is therefore defined by the relation y(x) = 0, which corresponds to a (D-1)-dimensional hyperplane within the D-dimensional input space. Consider two points x_A and x_B, both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have w^T (x_A - x_B) = 0, hence the vector w is orthogonal to every vector lying within the decision surface, and so w determines the orientation of the decision surface. Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by w^T x / ||w|| = -w0 / ||w||.
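This geometry can be checked numerically. The sketch below uses an arbitrary, made-up weight vector w and bias w0 (they are illustrative values, not taken from the text) and verifies that w is orthogonal to any vector lying within the decision surface, and that the normal distance from the origin is -w0 / ||w||.

```python
import numpy as np

# Illustrative (assumed) parameters for y(x) = w^T x + w0
w = np.array([3.0, 4.0])
w0 = -5.0

def y(x):
    return w @ x + w0

# Two points chosen to lie on the decision surface y(x) = 0
x_a = np.array([3.0, -1.0])
x_b = np.array([-1.0, 2.0])
assert abs(y(x_a)) < 1e-12 and abs(y(x_b)) < 1e-12

# w is orthogonal to any vector lying within the decision surface
print(w @ (x_a - x_b))             # 0.0

# normal distance from the origin: -w0 / ||w||
print(-w0 / np.linalg.norm(w))     # 1.0
```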

Multiple classes Now consider the extension of linear discriminants to K > 2 classes. We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties (Duda and Hart, 1973), as we now show. Consider the use of K-1 classifiers, each of which solves a two-class problem of separating points in a particular class C_k from points not in that class. This is known as a one-versus-the-rest classifier, and it leaves regions of input space that are ambiguously classified.

An alternative is to introduce K(K-1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions, but this approach also suffers from ambiguous regions. We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form y_k(x) = w_k^T x + w_k0, and then assigning a point x to class C_k if y_k(x) > y_j(x) for all j not equal to k.

The decision boundary between class C_k and class C_j is therefore given by y_k(x) = y_j(x), and hence corresponds to a (D-1)-dimensional hyperplane defined by (w_k - w_j)^T x + (w_k0 - w_j0) = 0. The decision regions of such a discriminant are always singly connected and convex.

To see this, consider two points x_A and x_B, both of which lie inside decision region R_k. Any point on the line joining them can be written as x_hat = lambda x_A + (1 - lambda) x_B with 0 <= lambda <= 1; by the linearity of the discriminant functions, y_k(x_hat) > y_j(x_hat) for all j not equal to k, and so x_hat also lies in R_k.
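A minimal sketch of such a single K-class discriminant follows; the weights are illustrative values chosen by hand, not fitted parameters.

```python
import numpy as np

# A single K-class discriminant y_k(x) = w_k^T x + w_k0, assigning x to
# the class with the largest y_k. Weights are assumed for illustration.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])        # row k holds w_k (K = 3, D = 2)
w0 = np.array([0.0, 0.5, -0.5])     # biases w_k0

def classify(x):
    return int(np.argmax(W @ x + w0))

# Two points in the same decision region, and a point on the line joining
# them: by linearity the intermediate point gets the same class (convexity).
x_a = np.array([2.0, 0.0])
x_b = np.array([4.0, 2.0])
x_hat = 0.3 * x_a + 0.7 * x_b

print(classify(x_a), classify(x_b), classify(x_hat))   # 0 0 0
```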

Least squares for classification Each class C_k is described by its own linear model, so that y_k(x) = w_k^T x + w_k0, where k = 1, ..., K. These can be grouped together as y(x) = W^T x, where for convenience we absorb the biases by taking x = (1, x_1, ..., x_D)^T and letting the k-th column of W be (w_k0, w_k1, ..., w_kD)^T. We now determine the parameter matrix W by minimizing a sum-of-squares error function, as we did for regression in Chapter 3.

Consider a training data set {x_n, t_n}, where n = 1, ..., N and the targets t_n use a 1-of-K coding scheme, and define a matrix T whose n-th row is the vector t_n^T, together with a matrix X whose n-th row is the augmented input vector x_n^T.

Pseudo-inverse Setting the derivative of the sum-of-squares error with respect to W to zero gives the solution W = (X^T X)^{-1} X^T T = X† T, where X† is the Moore-Penrose pseudo-inverse of the matrix X.
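The pseudo-inverse solution can be sketched in a few lines. The toy 1-D, two-class data set below is an assumption for illustration only.

```python
import numpy as np

# Least-squares classification via the pseudo-inverse: W = pinv(X) T,
# with one row per augmented input and 1-of-K target rows.
X = np.array([[0.0], [1.0], [3.0], [4.0]])   # toy inputs (assumed)
labels = np.array([0, 0, 1, 1])
K = 2

T = np.eye(K)[labels]                        # n-th row is t_n^T (1-of-K)
X_aug = np.hstack([np.ones((len(X), 1)), X]) # prepend x0 = 1 for the bias

W = np.linalg.pinv(X_aug) @ T                # Moore-Penrose pseudo-inverse
Y = X_aug @ W                                # model outputs y(x) per input
pred = Y.argmax(axis=1)                      # assign to the largest output
print(pred)                                  # [0 0 1 1]
```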

Linearity An interesting property of least-squares solutions with multiple target variables is that if every target vector in the training set satisfies some linear constraint a^T t_n + b = 0 for some constants a and b, then the model prediction for any value of x will satisfy the same constraint, a^T y(x) + b = 0.

Outliers Least squares is highly sensitive to outliers, unlike logistic regression. The sum-of-squares error function penalizes predictions that are "too correct", in that they lie a long way on the correct side of the decision boundary. In Chapter 7 we shall consider several alternative error functions for classification, and we shall see that they do not suffer from this difficulty.

On the left is the result of using a least-squares discriminant. On the right is the result of using logistic regression.

Fisher's linear discriminant One way to view a linear classification model is in terms of dimensionality reduction. Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using y = w^T x. If we place a threshold on y and classify y >= -w0 as class C_1, and otherwise class C_2, then we obtain our standard linear classifier discussed in the previous section. In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension.

The simplest measure of class separation, when projected onto w, is the separation of the projected class means, m'_2 - m'_1 = w^T (m_2 - m_1), where m_k is the mean of class C_k. This expression can be made arbitrarily large simply by increasing the magnitude of w, so we could constrain w to have unit length. Even so, classes that are well separated in the original space may still have considerable overlap when projected onto the line joining their means; the Fisher criterion therefore also requires a small variance within each projected class.

The least-squares approach to the determination of a linear discriminant was based on the goal of making the model predictions as close as possible to a set of target values. By contrast, the Fisher criterion was derived by requiring maximum class separation in the output space. It is interesting to see the relationship between these two approaches. In particular, we shall show that, for the two-class problem, the Fisher criterion can be obtained as a special case of least squares.
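Maximizing the Fisher criterion for two classes gives the standard result that the projection direction is proportional to S_W^{-1}(m_2 - m_1), where S_W is the within-class scatter matrix. A sketch on synthetic data (the Gaussian samples below are assumptions for illustration):

```python
import numpy as np

# Fisher's two-class discriminant: project onto w ~ S_W^{-1} (m2 - m1).
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))   # class 1 samples
X2 = rng.normal([3.0, 1.0], [1.0, 0.3], size=(100, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(S_W, m2 - m1)        # direction only; scale is irrelevant

# Projecting onto w separates the classes in one dimension
y1, y2 = X1 @ w, X2 @ w
print(y1.mean() < y2.mean())             # True
```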

Fisher and least squares

Fisher's discriminant for multiple classes

The perceptron algorithm Another example of a linear discriminant model is the perceptron of Rosenblatt (1962), which occupies an important place in the history of pattern recognition algorithms. It corresponds to a two-class model in which the input vector x is first transformed using a fixed nonlinear transformation to give a feature vector φ(x), and this is then used to construct a generalized linear model of the form y(x) = f(w^T φ(x)), where f is a nonlinear activation function given by a step function.

The perceptron algorithm The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern x_n it tries to minimize the quantity -w^T φ_n t_n. The perceptron criterion is therefore given by E_P(w) = -Σ_{n in M} w^T φ_n t_n, where M denotes the set of all misclassified patterns. We now apply the stochastic gradient descent algorithm to this error function; we can set the learning rate parameter η equal to 1 without loss of generality, because rescaling w does not change which patterns are misclassified.

The perceptron algorithm The perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data are linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
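A minimal sketch of the perceptron updates described above, with η = 1, identity features φ(x) = x plus a bias component, and a made-up linearly separable toy set (so termination is guaranteed):

```python
import numpy as np

# Perceptron learning with targets t_n in {-1, +1}. Toy data assumed.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])

Phi = np.hstack([np.ones((len(X), 1)), X])   # phi_n with bias feature
w = np.zeros(Phi.shape[1])

# Cycle through the patterns, applying w <- w + phi_n t_n to mistakes
converged = False
while not converged:
    converged = True
    for phi_n, t_n in zip(Phi, t):
        if (w @ phi_n) * t_n <= 0:           # misclassified (or on boundary)
            w = w + phi_n * t_n
            converged = False

print(all((Phi @ w) * t > 0))                # True: every pattern now correct
```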

Probabilistic Generative Models

Continuous inputs

Maximum likelihood solution

Probabilistic Generative Models Illustration of the role of nonlinear basis functions in linear classification models. The left plot shows the original input space (x1, x2) together with data points from two classes, labelled red and blue. Two Gaussian basis functions φ1(x) and φ2(x) are defined in this space, with centres shown by the green crosses and contours shown by the green circles.

Logistic regression
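Logistic regression models the posterior as p(C_1 | φ) = σ(w^T φ); as a minimal sketch, it can be fitted by batch gradient descent on the cross-entropy error, whose gradient takes the simple form Σ_n (y_n - t_n) φ_n. The toy data and learning rate below are assumptions for illustration.

```python
import numpy as np

# Logistic regression fitted by gradient descent on cross-entropy error.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0.0], [1.0], [3.0], [4.0]])   # toy inputs (assumed)
t = np.array([0.0, 0.0, 1.0, 1.0])

Phi = np.hstack([np.ones((len(X), 1)), X])   # bias feature phi_0 = 1
w = np.zeros(2)
eta = 0.1                                    # learning rate (assumed)

for _ in range(2000):
    y = sigmoid(Phi @ w)
    w -= eta * Phi.T @ (y - t)               # gradient: sum_n (y_n - t_n) phi_n

print((sigmoid(Phi @ w) > 0.5).astype(int))  # [0 0 1 1]
```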

Iterative reweighted least squares The error function can be minimized by an efficient iterative technique based on the Newton-Raphson optimization scheme, which uses a local quadratic approximation to the log likelihood function.
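For logistic regression the Newton-Raphson update takes the IRLS form w_new = w - (Φ^T R Φ)^{-1} Φ^T (y - t), with R = diag(y_n(1 - y_n)). A sketch follows; the toy data are deliberately non-separable (so the maximum likelihood solution is finite) and are assumed for illustration.

```python
import numpy as np

# IRLS for logistic regression via Newton-Raphson steps.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])   # assumed toy inputs
t = np.array([0.0, 0.0, 1.0, 1.0, 0.0])             # non-separable labels

Phi = np.hstack([np.ones((len(X), 1)), X])
w = np.zeros(2)

for _ in range(10):
    y = sigmoid(Phi @ w)
    R = np.diag(y * (1 - y))               # weighting matrix
    grad = Phi.T @ (y - t)                 # gradient of cross-entropy error
    H = Phi.T @ R @ Phi                    # Hessian
    w = w - np.linalg.solve(H, grad)       # Newton-Raphson step

print(np.abs(Phi.T @ (sigmoid(Phi @ w) - t)).max() < 1e-6)   # True
```

A handful of iterations suffices here because Newton-Raphson converges quadratically near the optimum.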


Multiclass logistic regression
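For K classes the posterior probabilities are given by a softmax over K linear functions, p(C_k | φ) = exp(a_k) / Σ_j exp(a_j) with a_k = w_k^T φ, and the gradient of the cross-entropy error with respect to w_k is Σ_n (y_nk - t_nk) φ_n. A minimal gradient-descent sketch; the toy data, learning rate, and iteration count are assumptions.

```python
import numpy as np

# Multiclass logistic regression fitted by batch gradient descent.
def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))   # stabilized
    return E / E.sum(axis=1, keepdims=True)

X = np.array([[-1.0], [-0.8], [0.0], [0.2], [1.0], [1.2]])  # assumed
labels = np.array([0, 0, 1, 1, 2, 2])
T = np.eye(3)[labels]                              # 1-of-K target coding

Phi = np.hstack([np.ones((len(X), 1)), X])         # bias feature
W = np.zeros((2, 3))                               # column k holds w_k

for _ in range(5000):
    Y = softmax(Phi @ W)
    W -= 0.1 * Phi.T @ (Y - T)                     # gradient step

print((Phi @ W).argmax(axis=1))                    # [0 0 1 1 2 2]
```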

Probit regression We have seen that, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities. This suggests that it might be worth exploring other types of discriminative probabilistic model.

Probit regression
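Probit regression replaces the logistic sigmoid with the cumulative distribution function of a standard Gaussian, p(t = 1 | a) = Φ(a), which can be written via the error function as Φ(a) = (1 + erf(a / sqrt(2))) / 2. A small sketch of this inverse link:

```python
import math

# Probit inverse link: the standard Gaussian CDF, via the error function.
def probit(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

for a in [-2.0, 0.0, 2.0]:
    print(a, probit(a))

# Like the logistic sigmoid, probit(0) = 0.5 and probit(-a) = 1 - probit(a);
# unlike the sigmoid, its tails decay like exp(-a^2 / 2), which makes probit
# regression more sensitive to mislabelled points far from the boundary.
```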
