Pattern Recognition and Machine Learning. Chapter 4: Linear Models for Classification




Representing the target values for classification. If there are only two classes, we typically use a single real-valued output with target values of 1 for the positive class and 0 (or sometimes -1) for the other class. For probabilistic class labels, the output of the model can represent the probability the model gives to the positive class. If there are N classes we often use a vector of N target values containing a single 1 for the correct class and zeros elsewhere. For probabilistic labels we can then use a vector of class probabilities as the target vector.
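To make the 1-of-N coding concrete, here is a minimal NumPy sketch; the function name and example labels are purely illustrative.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as 1-of-N target vectors."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Example: 4 points, 3 classes -> rows like [1, 0, 0]
labels = np.array([0, 2, 1, 2])
print(one_hot(labels, 3))
```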

Three approaches to classification. (1) Use discriminant functions directly, without probabilities: convert the input vector into one or more real values so that a simple operation (like thresholding) can be applied to get the class. (2) Infer conditional class probabilities $p(\text{class} = C_k \mid \mathbf{x})$, then make a decision that minimizes some loss function. (3) Compare the probability of the input under separate, class-specific generative models, e.g. fit a multivariate Gaussian to the input vectors of each class and see which Gaussian makes a test data vector most probable.

Discriminant functions. The planar decision surface in data space for the simple linear discriminant function: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 \geq 0$.

Discriminant functions for N > 2 classes. One possibility is to use N two-way discriminant functions, each of which discriminates one class from the rest. Another possibility is to use N(N-1)/2 two-way discriminant functions, each of which discriminates between two particular classes. Both of these methods have problems.

Problems with multi-class discriminant functions

A simple solution. Use K discriminant functions $y_1, \dots, y_K$ and pick the max. This is guaranteed to give consistent and convex decision regions if the $y_k$ are linear: $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and $y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$ imply, for $0 \le \alpha \le 1$, that $$y_k\big(\alpha\mathbf{x}_A + (1-\alpha)\mathbf{x}_B\big) > y_j\big(\alpha\mathbf{x}_A + (1-\alpha)\mathbf{x}_B\big).$$
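A minimal sketch of this "pick the max of K linear discriminants" rule, with an illustrative weight matrix and bias vector (one column/entry per class); because each $y_k$ is linear, the resulting decision regions are convex as argued above.

```python
import numpy as np

def decide(X, W, w0):
    """Assign each row of X to the class whose linear discriminant
    y_k(x) = w_k^T x + w_k0 is largest."""
    scores = X @ W + w0          # shape (N, K)
    return np.argmax(scores, axis=1)

# Toy example with K = 3 classes in 2-D
W = np.array([[1.0, -1.0, 0.0],
              [0.0,  1.0, -1.0]])   # shape (D, K)
w0 = np.array([0.0, 0.0, 0.5])
X = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
print(decide(X, W, w0))
```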

Using least squares for classification. This is not the right thing to do and it doesn't work as well as other methods (as we will see later), but it is easy: it reduces classification to least-squares regression, and we already know how to do regression. We can just solve for the optimal weights with some matrix algebra.

Least squares for classification. Given a training set $\{\mathbf{x}_n, \mathbf{t}_n\}$ with 1-of-N target vectors, minimize the sum-of-squares error function $$E_D(\tilde{\mathbf{W}}) = \tfrac{1}{2}\,\mathrm{Tr}\big\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\big\},$$ where the nth row of the matrix $\mathbf{T}$ is $\mathbf{t}_n^T$.
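A sketch of least-squares classification along these lines: build one-hot targets T, append a bias column to the inputs, solve the least-squares problem in closed form, and classify by the largest output. The names are illustrative, and `lstsq` is used rather than an explicit pseudo-inverse for numerical stability.

```python
import numpy as np

def fit_least_squares_classifier(X, labels, num_classes):
    """Fit W_tilde minimizing ||X_tilde W_tilde - T||^2 with one-hot T."""
    N = X.shape[0]
    T = np.zeros((N, num_classes))
    T[np.arange(N), labels] = 1.0
    X_tilde = np.hstack([np.ones((N, 1)), X])          # prepend bias column
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)
    return W_tilde

def predict(X, W_tilde):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Tiny synthetic check with two well-separated classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
W = fit_least_squares_classifier(X, labels, 2)
print((predict(X, W) == labels).mean())
```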

Problems with using least squares for classification. [Figure: decision boundaries from logistic regression vs. least-squares regression.] If the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being "too correct".

Another example where least squares regression gives poor decision surfaces

Fisher's linear discriminant. A simple linear discriminant function is a projection of the data down to 1-D, so choose the projection that gives the best separation of the classes. What do we mean by best separation? An obvious direction to choose is the direction of the line joining the class means, but if the main direction of variance in each class is not orthogonal to this line, this will not give good separation. Fisher's method chooses the direction that maximizes the ratio of between-class variance to within-class variance. This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

Fisher's linear discriminant. The two-class linear discriminant acts as a projection $y = \mathbf{w}^T\mathbf{x}$ followed by a threshold (classify as $C_1$ if $y \ge -w_0$). In which direction $\mathbf{w}$ should we project? One which separates the classes well. Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre.

Fisher's linear discriminant. A natural idea would be to project in the direction of the line connecting the class means. However, this is problematic if the classes have variance in this direction (i.e. are not clustered around the mean). Fisher criterion: maximize the ratio of inter-class separation (between) to intra-class variance (within).

A picture showing the advantage of Fisher's linear discriminant. When projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Math of Fisher's linear discriminant. What linear transformation is best for discrimination? The projection onto the vector separating the class means seems sensible, but we also want small variance within each class. The projection is $y = \mathbf{w}^T\mathbf{x}$, with projected class means $m_1, m_2$ and within-class scatters $$s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2 .$$ Fisher's objective function is the ratio of between-class to within-class scatter: $$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}.$$

More math of Fisher's linear discriminant. Written in terms of the original data, $$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\,\mathbf{w}},$$ with between-class and within-class scatter matrices $$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T, \qquad \mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T.$$ Optimal solution: $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$.
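A minimal NumPy sketch of the optimal direction $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$ derived above; the function name is illustrative, and a linear solve is used instead of forming the matrix inverse.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the samples
    of each class. Returns w proportional to S_W^{-1} (m2 - m1).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# The projected values y = X @ w can then be thresholded to classify.
```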

Probabilistic Generative Models. Model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and the class priors $p(C_k)$, then use Bayes' theorem to compute the posterior probabilities $p(C_k \mid \mathbf{x})$.

Logistic Sigmoid Function: $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$.

Probabilistic Generative Models (K classes). Bayes' theorem gives the normalized exponential (softmax): $$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,p(C_k)}{\sum_j p(\mathbf{x} \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\big(p(\mathbf{x} \mid C_k)\,p(C_k)\big).$$ If $a_k \gg a_j$ for all $j \neq k$, then $p(C_k \mid \mathbf{x}) \approx 1$ and $p(C_j \mid \mathbf{x}) \approx 0$.
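A small sketch of this normalized exponential; subtracting the maximum activation before exponentiating leaves the result unchanged but avoids overflow, and the second example illustrates the $a_k \gg a_j$ limiting behaviour.

```python
import numpy as np

def softmax(a):
    """Posterior p(C_k | x) from activations a_k = ln p(x|C_k) p(C_k)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # moderate differences
print(softmax([50.0, 1.0, 0.1]))     # a_1 >> a_j: close to [1, 0, 0]
```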

Continuous inputs. Model the class-conditional densities as Gaussians, $p(\mathbf{x} \mid C_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma})$; here, all classes share the same covariance matrix.

Two-Class Problem. With two classes the posterior is a logistic sigmoid of a linear function of the input, $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$, with $$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}.$$

The posterior when the covariance matrices are different for different classes. The decision surface is planar when the covariance matrices are the same and quadratic when they are not.

Maximum likelihood solution. Denoting the prior $p(C_1) = \pi$ and the targets $t_n \in \{0, 1\}$, the likelihood function is $$p(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \big[\pi\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\big]^{t_n}\big[(1 - \pi)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\big]^{1 - t_n}.$$

Maximum likelihood solution. Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust.
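A sketch, under the shared-covariance Gaussian assumptions above, of the standard closed-form maximum likelihood estimates (prior, class means, pooled covariance) and the resulting linear posterior parameters; variable names are illustrative.

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML fit of the two-class shared-covariance Gaussian model.

    X: (N, D) inputs; t: (N,) binary targets (1 = class C1, 0 = class C2).
    Returns (pi, mu1, mu2, Sigma) and the posterior parameters (w, w0)
    such that p(C1 | x) = sigmoid(w @ x + w0).
    """
    t = np.asarray(t)
    N = len(t)
    X1, X2 = X[t == 1], X[t == 0]
    pi = len(X1) / N
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled (shared) covariance: weighted average of per-class ML covariances
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    Sigma = pi * S1 + (1 - pi) * S2
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return (pi, mu1, mu2, Sigma), (w, w0)
```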

Discrete features. When the features are binary, $x_i \in \{0, 1\}$, and treated as independent given the class (naive Bayes), the class-conditional distributions are $$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}(1 - \mu_{ki})^{1 - x_i},$$ which gives activations $$a_k(\mathbf{x}) = \sum_{i=1}^{D} \big\{x_i \ln \mu_{ki} + (1 - x_i)\ln(1 - \mu_{ki})\big\} + \ln p(C_k),$$ again linear functions of the input values.
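A sketch of this linear activation for binary features, assuming the Bernoulli parameters $\mu_{ki}$ and the class priors have already been estimated (names illustrative); pushing these activations through a softmax gives the posteriors.

```python
import numpy as np

def discrete_feature_activations(x, mu, priors):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k).

    x: (D,) binary feature vector; mu: (K, D) Bernoulli parameters;
    priors: (K,) class priors. Returns the (K,) vector of activations.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return (np.log(mu) @ x + np.log(1.0 - mu) @ (1.0 - x)) + np.log(priors)
```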

Probabilistic Discriminative Models. What is it? Use the functional form of the generalized linear model explicitly and determine its parameters directly, rather than deriving them from class-conditional densities. This is the difference from discriminant functions and from probabilistic generative models.

Fixed Basis Functions. The input space can first be transformed by a fixed set of nonlinear basis functions $\boldsymbol{\phi}(\mathbf{x})$; the linear classification models are then applied in this feature space.

Logistic Regression. To reduce the complexity of the probabilistic model, the logistic sigmoid is used to model the posterior directly: $p(C_1 \mid \boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$. For Gaussian class-conditional densities in an M-dimensional feature space, this would require M(M+5)/2 + 1 parameters (2M for the means, M(M+1)/2 for the shared covariance, and 1 for the prior); logistic regression needs only M parameters.

The Likelihood. $$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\,\{1 - y_n\}^{1 - t_n}, \qquad y_n = p(C_1 \mid \boldsymbol{\phi}_n) = \sigma(\mathbf{w}^T\boldsymbol{\phi}_n).$$

The Gradient of the Log Likelihood. Error function (the negative log likelihood, a cross-entropy): $$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \big\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big\}.$$ Gradient of the error function: $$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\,\boldsymbol{\phi}_n.$$
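A sketch of the cross-entropy error and its gradient $\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$, as could be used in a simple gradient-based update; the names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_and_grad(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], grad = Phi^T (y - t).

    Phi: (N, M) design matrix of basis-function values; t: (N,) binary targets.
    """
    y = sigmoid(Phi @ w)
    eps = 1e-12                                   # guard against log(0)
    E = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    grad = Phi.T @ (y - t)
    return E, grad

# One gradient-descent step with learning rate eta:  w_new = w - eta * grad
```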

Iterative Reweighted Least Squares. Newton-Raphson update: $\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$, where $\mathbf{H}$ is the Hessian matrix $\mathbf{H} = \boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}$ and $\mathbf{R}$ is an $N \times N$ diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$.

Iterative Reweighted Least Squares. The Newton-Raphson update then takes the form of a weighted least-squares problem: $$\mathbf{w}^{(\text{new})} = (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{R}\,\mathbf{z}, \qquad \mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t}).$$
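A compact sketch of the IRLS loop described above (illustrative names; a small constant keeps the weights $R_{nn}$ strictly positive so the reweighted system stays solvable).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """Fit logistic regression weights by iterative reweighted least squares.

    Phi: (N, M) design matrix; t: (N,) binary targets.
    Each step solves the weighted least-squares problem
    w_new = (Phi^T R Phi)^{-1} Phi^T R z  with  z = Phi w - R^{-1}(y - t).
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = y * (1.0 - y) + 1e-10            # diagonal of R, kept positive
        z = Phi @ w - (y - t) / R
        # Weighted normal equations: (Phi^T R Phi) w = Phi^T R z
        A = Phi.T @ (R[:, None] * Phi)
        b = Phi.T @ (R * z)
        w = np.linalg.solve(A, b)
    return w
```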

Multiclass Logistic Regression. For the generative models discussed earlier, the posteriors are $p(C_k \mid \boldsymbol{\phi}) = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$ with $a_k = \mathbf{w}_k^T\boldsymbol{\phi}$; here we instead use maximum likelihood to determine the parameters $\{\mathbf{w}_k\}$ directly.

Multiclass Logistic Regression. Maximum likelihood method: with 1-of-K targets $t_{nk}$ and $y_{nk} = p(C_k \mid \boldsymbol{\phi}_n)$, the cross-entropy error is $E = -\sum_n \sum_k t_{nk}\ln y_{nk}$, with gradient $\nabla_{\mathbf{w}_j} E = \sum_n (y_{nj} - t_{nj})\,\boldsymbol{\phi}_n$. The IRLS algorithm can again be used to do the $\mathbf{w}$ updating work.
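A compact sketch of multiclass (softmax) logistic regression. For brevity it uses plain gradient descent on the cross-entropy error with the gradient $\boldsymbol{\Phi}^T(\mathbf{Y} - \mathbf{T})$ shown above rather than a full IRLS/Newton step; the names and step-size choice are illustrative.

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)     # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_multiclass_logreg(Phi, T, eta=0.1, n_iter=500):
    """Gradient descent on E(W) = -sum_{n,k} t_nk ln y_nk.

    Phi: (N, M) design matrix; T: (N, K) one-hot targets.
    The gradient w.r.t. W (shape M x K) is Phi^T (Y - T).
    """
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(n_iter):
        Y = softmax_rows(Phi @ W)
        # Scaling by 1/N keeps the step size roughly independent of N
        W -= eta / N * (Phi.T @ (Y - T))
    return W
```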