Linear Models for Classification




Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML)

Approaches to classification
Discriminant function: directly assigns each data point x to a particular class C_i.
Probabilistic discriminative approach: model the conditional class distribution p(C_i|x) directly; this allows separation of inference and decision.
Generative approach: model the class likelihoods p(x|C_i) and priors p(C_i), and use Bayes' theorem to get the posteriors: p(C_i|x) ∝ p(x|C_i) p(C_i).
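As a toy illustration of the generative approach (with made-up numbers, not from the slides), Bayes' theorem turns likelihoods and priors into posteriors:

```python
# Hypothetical illustration: class likelihoods p(x|C_i) evaluated at some fixed
# point x, and priors p(C_i). All numbers are assumed values for illustration.
likelihoods = {"C1": 0.30, "C2": 0.05}   # p(x | C_i)
priors      = {"C1": 0.40, "C2": 0.60}   # p(C_i)

evidence = sum(likelihoods[c] * priors[c] for c in likelihoods)            # p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in likelihoods}
print(posteriors)   # {'C1': 0.8, 'C2': 0.2} -> assign x to C1
```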

Linear discriminant functions: y(x) = w^T x + w_0

Multiple classes: combining several two-class discriminants leads to the problem of ambiguous regions.

Multiple classes: consider instead a single K-class discriminant, made up of K linear functions y_k(x) = w_k^T x + w_k0, and assign x to class C_k if y_k(x) > y_j(x) for all j ≠ k. This implies decision regions that are singly connected and convex.
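As a minimal sketch (not from the original slides), the assignment rule is just an argmax over the K linear scores; here W and w0 are assumed to collect the w_k and w_k0:

```python
import numpy as np

# K-class linear discriminant: y_k(x) = w_k^T x + w_k0, assign x to argmax_k y_k(x).
# W is a (D, K) weight matrix and w0 a (K,) bias vector, assumed given
# (e.g. fitted by least squares).
def classify(X, W, w0):
    """X: (N, D) data matrix. Returns the index of the winning class per point."""
    scores = X @ W + w0               # (N, K) matrix of y_k(x) values
    return np.argmax(scores, axis=1)
```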

Least squares for classification: too sensitive to outliers.

Least squares for classification: also problematic because the target values clearly have a non-Gaussian distribution, while least squares implicitly assumes Gaussian noise on the targets.

Fisher's linear discriminant: a linear classification model acts like a 1-D projection of the data, y = w^T x, so we need to find a decision threshold along this 1-D projection (line). The simplest measure of class separation is the distance between the projected class means: m_2 - m_1 = w^T(m_2 - m_1). If the classes have non-diagonal covariances, a better idea is the Fisher criterion: J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2), where s_1^2 denotes the variance of class 1 in the 1-D projection. Maximising J(w) seeks a large separation between the projected class means together with a small variance within each class.
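A minimal sketch of how the Fisher direction can be computed, assuming two per-class data matrices X1 and X2 (the interface is illustrative); maximising J(w) gives w ∝ S_W^{-1}(m_2 - m_1):

```python
import numpy as np

def fisher_direction(X1, X2):
    """X1: (N1, D) points from class 1, X2: (N2, D) points from class 2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix: summed (unnormalised) scatter of the two classes
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Direction maximising the Fisher criterion J(w)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)
```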

Fisher's linear discriminant: [figure comparing projection onto the line joining the class means with projection onto the Fisher discriminant direction]

The Perceptron: [diagram: basis functions Φ_1(x), ..., Φ_4(x) with weights w_1, ..., w_4 feeding the weighted sum w^T φ(x), followed by the step activation function f(·) taking values +1/-1]. A non-linear transformation in the form of a step function is applied to the weighted sum of the input features, f(w^T φ(x)). This is inspired by the way neurons appear to function, mimicking the action potential.

The perceptron criterion: we'd like a weight vector w such that w^T φ(x_i) > 0 for x_i ∈ C_1 (say, t_i = +1) and w^T φ(x_i) < 0 for x_i ∈ C_2 (t_i = -1). Thus we want w^T φ(x_i) t_i > 0 for all i; data points for which this does not hold are misclassified. The perceptron criterion tries to minimise the 'magnitude' of misclassification, i.e. it minimises -w^T φ(x_i) t_i summed over the misclassified points (the set of which is denoted by M): E_P(w) = -Σ_{i∈M} w^T φ(x_i) t_i. Why not just count the number of misclassified points? Because that count is a piecewise constant function of w, so its gradient is zero almost everywhere, making optimisation hard.

Learning by gradient descent: w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η φ(x_i) t_i (if x_i is misclassified). We can show that after this update the error due to x_i is reduced: -w^(τ+1)T φ(x_i) t_i = -w^(τ)T φ(x_i) t_i - (φ(x_i) t_i)^T φ(x_i) t_i < -w^(τ)T φ(x_i) t_i (having set η = 1, which can be done without loss of generality since rescaling w does not change the sign of w^T φ(x)).
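A minimal sketch of the resulting perceptron learning loop, assuming a design matrix Phi of basis-function values and ±1 labels t (the function name and interface are illustrative, not from the slides):

```python
import numpy as np

def train_perceptron(Phi, t, eta=1.0, n_epochs=100):
    """Phi: (N, M) design matrix, t: (N,) labels in {+1, -1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:        # misclassified point
                w += eta * phi_n * t_n        # w <- w + eta * phi(x_n) * t_n
                mistakes += 1
        if mistakes == 0:                     # converged (only if data are separable)
            break
    return w
```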

Perceptron convergence: the perceptron convergence theorem guarantees an exact solution in a finite number of steps for linearly separable data, but there is no convergence for non-separable data.

Gaussian Discriminant Analysis: a generative approach, with the class-conditional densities (likelihoods) modelled as Gaussians. For the case of two classes, we have p(C_1|x) = σ(a), where a = ln[ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] and σ(a) = 1 / (1 + exp(-a)) is the logistic sigmoid.

Gaussian Discriminant Analysis: in the Gaussian case we get p(C_1|x) = σ(w^T x + w_0), with w = Σ^{-1}(μ_1 - μ_2) and w_0 = -½ μ_1^T Σ^{-1} μ_1 + ½ μ_2^T Σ^{-1} μ_2 + ln[p(C_1)/p(C_2)]. The assumption of equal covariance matrices leads to linear decision boundaries.

Gaussian Discriminant Analysis Allowing for unequal covariance matrices for different classes leads to quadratic decision boundaries

Parameter estimation for GDA: assuming equal covariance matrices and writing π = p(C_1), the likelihood is p(t, X | π, μ_1, μ_2, Σ) = Π_n [π N(x_n|μ_1, Σ)]^{t_n} [(1-π) N(x_n|μ_2, Σ)]^{1-t_n}. The maximum likelihood estimators are π = N_1/N, μ_1 = (1/N_1) Σ_n t_n x_n, μ_2 = (1/N_2) Σ_n (1-t_n) x_n, and Σ given by a weighted average of the two per-class sample covariances.
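A minimal sketch of these estimators, assuming 0/1 labels t and a data matrix X (names and interface are illustrative):

```python
import numpy as np

def gda_mle(X, t):
    """X: (N, D) data matrix, t: (N,) labels in {0, 1} (1 means class C1)."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                               # prior p(C1)
    mu1 = X[t == 1].mean(axis=0)
    mu2 = X[t == 0].mean(axis=0)
    # Shared covariance: weighted average of the per-class sample covariances
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma
```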

Logistic Regression: an example of a probabilistic discriminative model. Rather than learning p(x|C_i) and p(C_i), it attempts to learn p(C_i|x) directly. Advantages: fewer parameters, and better performance if the assumptions made in the class-conditional density formulation are inaccurate. We have seen how the class posterior for a two-class setting can be written as a logistic sigmoid acting on a linear function of the feature vector φ: p(C_1|φ) = σ(w^T φ). This model is called logistic regression, even though it is a model for classification, not regression!

Parameter learning: if we let y_n = p(C_1|φ_n) = σ(w^T φ_n), then the likelihood function is p(t|w) = Π_n y_n^{t_n} (1 - y_n)^{1 - t_n}, and we can define a corresponding error, known as the cross-entropy: E(w) = -ln p(t|w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }.

Parameter learning: the derivative of the sigmoid is dσ/da = σ(1 - σ). Using this, we can obtain the gradient of the error function with respect to w: ∇E(w) = Σ_n (y_n - t_n) φ_n. Thus the contribution to the gradient from point n is given by the 'error' between the model prediction and the actual class label, (y_n - t_n), times the basis function vector for that point, φ_n. This could be used for sequential learning by gradient descent, exactly as for least-squares linear regression.
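A minimal sketch of batch gradient descent for logistic regression using this gradient, assuming a design matrix Phi and 0/1 labels t (illustrative, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(Phi, t, eta=0.1, n_iters=1000):
    """Phi: (N, M) design matrix, t: (N,) labels in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)          # model predictions y_n
        grad = Phi.T @ (y - t)        # gradient of the cross-entropy error
        w -= eta * grad
    return w
```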

Nonlinear basis functions