Linear Discrimination: Linearly Separable Systems, Pairwise Separation. Steven J Zeil.

Steven J Zeil
Old Dominion Univ.
Fall 200

Outline:
1. Discriminant-Based Classification
   - Linearly Separable Systems
   - Pairwise Separation
2. Posteriors
3. Logistic Discrimination

Discriminant-Based Classification

Likelihood-based: Assume a model for p(x | C_i). Use Bayes' rule to calculate P(C_i | x) and take g_i(x) = log P(C_i | x).

Discriminant-based: Assume a model directly for the discriminants g_i(x | Φ_i).

Vapnik: Estimating the class densities is a harder problem than estimating the class discriminants. It does not make sense to solve a hard problem in order to solve an easier one.

Linear discriminant:

    g_i(x | w_i, w_i0) = w_i^T x + w_i0 = ∑_{j=1}^{d} w_ij x_j + w_i0

Advantages:
- Simple: O(d) space and computation
- Knowledge extraction: the magnitudes of the weights indicate how much each attribute contributes
- Optimal when the p(x | C_i) are Gaussian with a shared covariance matrix
- Useful when classes are (almost) linearly separable
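The linear discriminant above is just a dot product plus a bias per class. As a minimal sketch (not from the slides; the array shapes and function name are my own, chosen for illustration):

```python
import numpy as np

def linear_discriminants(x, W, w0):
    """Evaluate g_i(x) = w_i^T x + w_i0 for every class i.

    x  : (d,)   feature vector
    W  : (K, d) one weight vector w_i per class (row i)
    w0 : (K,)   bias term w_i0 per class
    """
    return W @ x + w0            # shape (K,): one score per class

# Example: choose the class with the largest discriminant
x = np.array([1.0, 2.0])
W = np.array([[0.5, -1.0],
              [0.2,  0.3]])
w0 = np.array([0.1, -0.4])
print(np.argmax(linear_discriminants(x, W, w0)))   # predicted class index
```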

More General Linear Models

    g_i(x | w_i, w_i0) = ∑_{j=1}^{d} w_ij x_j + w_i0

We can replace the x_j on the right by any linearly independent set of basis functions.

For two classes, use a single discriminant

    g(x) = g_1(x) - g_2(x) = w^T x + w_0

and choose C_1 if g(x) > 0, otherwise C_2.

Geometric Interpretation

Rewrite x as

    x = x_p + r (w / ||w||)

where x_p is the projection of x onto the hyperplane g(x) = 0, w is normal to the hyperplane, and r = g(x) / ||w|| is the (signed) distance from x to that hyperplane.

Linearly Separable Systems

For multiple classes with g_i(x | w_i, w_i0) = w_i^T x + w_i0 and the w_i normalized,

    choose C_i if g_i(x) = max_{j=1..K} g_j(x)

Pairwise Separation

If the classes are not linearly separable in this way, compute discriminants between each pair of classes:

    g_ij(x | w_ij, w_ij0) = w_ij^T x + w_ij0

Choose C_i if g_ij(x) > 0 for all j ≠ i.
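To make the pairwise rule concrete, here is a small sketch assuming the pairwise weight vectors have already been trained and stored with the convention g_ij(x) = -g_ji(x); the parameter names and shapes are my own, not from the slides:

```python
import numpy as np

def pairwise_classify(x, W_pair, w0_pair):
    """Pairwise separation.

    W_pair  : (K, K, d) array, W_pair[i, j] = w_ij
    w0_pair : (K, K)    array, w0_pair[i, j] = w_ij0
    Returns the class i such that g_ij(x) > 0 for all j != i,
    or None if no class satisfies the condition (a "don't care" region).
    """
    K = W_pair.shape[0]
    for i in range(K):
        g = W_pair[i] @ x + w0_pair[i]          # g_ij(x) for all j
        if all(g[j] > 0 for j in range(K) if j != i):
            return i
    return None
```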

Revisiting Parametric Methods

When p(x | C_i) ~ N(μ_i, Σ) with a shared covariance matrix Σ, the discriminant g_i(x | w_i, w_i0) = w_i^T x + w_i0 is linear, with

    w_i = Σ^{-1} μ_i
    w_i0 = -(1/2) μ_i^T Σ^{-1} μ_i + log P(C_i)

Let y ≡ P(C_1 | x). Then P(C_2 | x) = 1 - y. We choose C_1 if y > 0.5, or alternatively if y / (1 - y) > 1, or equivalently if

    log [ y / (1 - y) ] > 0

The latter quantity is called the log odds of y, or the logit.

Log Odds

For two normal classes with a shared covariance matrix, the log odds is linear:

    logit(P(C_1 | x)) = log [ P(C_1 | x) / P(C_2 | x) ]
                      = log [ p(x | C_1) / p(x | C_2) ] + log [ P(C_1) / P(C_2) ]
                      = log p(x | C_1) - log p(x | C_2) + log [ P(C_1) / P(C_2) ]

The p(x | C_i) terms are exponential in x (Gaussian pdfs), and the quadratic terms cancel because Σ is shared, so the log odds is linear:

    logit(P(C_1 | x)) = w^T x + w_0

with

    w = Σ^{-1} (μ_1 - μ_2)
    w_0 = -(1/2) (μ_1 + μ_2)^T Σ^{-1} (μ_1 - μ_2) + log [ P(C_1) / P(C_2) ]

Logistic

The inverse of the logit function logit(P(C_1 | x)) = w^T x + w_0 is called the logistic, a.k.a. the sigmoid:

    P(C_1 | x) = sigmoid(w^T x + w_0) = 1 / (1 + exp[ -(w^T x + w_0) ])

Using the Sigmoid

During training, estimate m_1, m_2, and S (the sample means and shared sample covariance), then compute w and w_0. During testing, either

- calculate g(x | w, w_0) = w^T x + w_0 and choose C_1 if g(x) > 0, or
- calculate y = sigmoid(w^T x + w_0) and choose C_1 if y > 0.5.
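A brief sketch of the plug-in procedure just described, assuming the priors are estimated from class frequencies; the function and variable names are mine, for illustration only:

```python
import numpy as np

def fit_shared_cov_discriminant(X1, X2):
    """Estimate w, w_0 for two Gaussian classes with a shared covariance matrix.

    X1, X2 : (n1, d) and (n2, d) training samples for C_1 and C_2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled (shared) covariance estimate
    S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (n1 + n2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (m1 - m2)
    w0 = -0.5 * (m1 + m2) @ S_inv @ (m1 - m2) + np.log(n1 / n2)
    return w, w0

def posterior_c1(x, w, w0):
    """P(C_1 | x) via the sigmoid of the linear discriminant."""
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```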

Logistic Discrimination

For two classes, assume the log likelihood ratio is linear in x:

    log [ p(x | C_1) / p(x | C_2) ] = w^T x + w_0

so the logit is linear as well (the log prior ratio only shifts the constant term):

    logit(P(C_1 | x)) = w^T x + w_0

Estimating w

    y = P̂(C_1 | x) = 1 / (1 + exp[ -(w^T x + w_0) ])

Likelihood, for labels r^t ∈ {0, 1}:

    l(w, w_0 | X) = ∏_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}

Error ("cross-entropy"):

    E(w, w_0 | X) = -∑_t [ r^t log y^t + (1 - r^t) log (1 - y^t) ]

Train by numerical optimization to minimize E.

Multiple Classes

For K classes, take C_K as a reference class:

    log [ p(x | C_i) / p(x | C_K) ] = w_i^T x + w_i0

    P(C_i | x) / P(C_K | x) = exp [ w_i^T x + w_i0 ]

    y_i = P̂(C_i | x) = exp [ w_i^T x + w_i0 ] / ( 1 + ∑_{j=1}^{K-1} exp [ w_j^T x + w_j0 ] )

This is called the softmax function because exponentiation combined with normalization tends to exaggerate the weight of the maximum term.

Likelihood:

    l({w_i, w_i0} | X) = ∏_t ∏_i (y_i^t)^{r_i^t}

Multiple Classes (cont.)

Error (cross-entropy):

    E({w_i, w_i0} | X) = -∑_t ∑_i r_i^t log y_i^t

Train by numerical optimization to minimize E.
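The slides say only "train by numerical optimization"; one concrete possibility is plain batch gradient descent on the two-class cross-entropy. In this sketch the learning rate, iteration count, and names are my choices, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, r, lr=0.1, n_iters=1000):
    """Minimize the cross-entropy E(w, w_0 | X) by batch gradient descent.

    X : (n, d) training inputs
    r : (n,)   labels in {0, 1}
    """
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        y = sigmoid(X @ w + w0)    # y^t = P-hat(C_1 | x^t)
        grad = y - r               # dE/d(w^T x^t + w_0) for each sample
        w  -= lr * (X.T @ grad) / n
        w0 -= lr * grad.mean()
    return w, w0
```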

Softmax Classification (figure)

Softmax Discriminants (figure)
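To make the softmax discriminants concrete, here is a small sketch that evaluates the K-class posteriors with C_K as the reference class, following the formulas above; the array shapes and names are my own assumptions:

```python
import numpy as np

def softmax_posteriors(x, W, w0):
    """Posteriors for K classes with C_K as the reference class.

    W  : (K-1, d) weight vectors w_i for i = 1..K-1
    w0 : (K-1,)   biases w_i0
    Returns a length-K array of P-hat(C_i | x); the last entry is C_K.
    """
    scores = np.exp(W @ x + w0)                 # exp[w_i^T x + w_i0]
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom       # y_1..y_{K-1}, y_K

# Choosing the class with the largest posterior:
# k = np.argmax(softmax_posteriors(x, W, w0))
```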