L4: Bayesian Decision Theory




L4: Bayesian Decision Theory
Likelihood ratio test
Probability of error
Bayes risk
Bayes, MAP and ML criteria
Multi-class problems
Discriminant functions
CSCE 666 Pattern Analysis, Ricardo Gutierrez-Osuna, CSE@TAMU

Likelihood ratio test (LRT)
Assume we are to classify an object based on the evidence provided by a feature vector x. Would the following decision rule be reasonable? "Choose the class that is most probable given observation x." More formally: evaluate the posterior probability of each class $P(\omega_i|x)$ and choose the class with the largest $P(\omega_i|x)$.
Let's examine this rule for a 2-class problem. In this case the decision rule becomes: if $P(\omega_1|x) > P(\omega_2|x)$ choose $\omega_1$, else choose $\omega_2$.
Or, in a more compact form:
$P(\omega_1|x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|x)$
Applying Bayes rule:
$\frac{p(x|\omega_1)P(\omega_1)}{p(x)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{p(x|\omega_2)P(\omega_2)}{p(x)}$

Since p(x) does not affect the decision rule, it can be eliminated*. Rearranging the previous expression:
$\Lambda(x) = \frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)}$
The term $\Lambda(x)$ is called the likelihood ratio, and the decision rule is known as the likelihood ratio test.
*p(x) can be disregarded in the decision rule since it is constant regardless of class $\omega_i$. However, p(x) will be needed if we want to estimate the posterior $P(\omega_i|x)$ which, unlike $p(x|\omega_1)P(\omega_1)$, is a true probability value and, therefore, gives us an estimate of the "goodness" of our decision.

Likelihood ratio test: an example
Problem: given the likelihoods below, derive a decision rule based on the LRT (assume equal priors)
$p(x|\omega_1) = N(4,1)$ ; $p(x|\omega_2) = N(10,1)$
Solution: substituting into the LRT expression
$\Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-10)^2}} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$
Simplifying the LRT expression
$\Lambda(x) = e^{-\frac{1}{2}(x-4)^2 + \frac{1}{2}(x-10)^2} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$
Changing signs and taking logs
$(x-4)^2 - (x-10)^2 \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 0$
Which yields
$x \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 7$
This LRT result is intuitive since the likelihoods differ only in their mean. How would the LRT decision rule change if the priors were such that $P(\omega_1) = 2P(\omega_2)$?
[Figure: the two likelihoods $p(x|\omega_1)$ and $p(x|\omega_2)$, centered at x = 4 and x = 10, with decision regions $R_1$ (say $\omega_1$) and $R_2$ (say $\omega_2$) separated at the threshold x = 7]
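As a quick numerical check of this result, here is a minimal Python sketch (my own illustration, not part of the original slides; the function name and the scipy.stats dependency are assumptions) that evaluates the log-likelihood ratio for the two Gaussians and confirms that the decision flips at x = 7.

```python
# Minimal sketch of the LRT for p(x|w1) = N(4,1), p(x|w2) = N(10,1), equal priors.
# The function name and the use of scipy.stats are illustrative choices, not from the slides.
import numpy as np
from scipy.stats import norm

def lrt_decision(x, mu1=4.0, mu2=10.0, sigma=1.0, threshold=1.0):
    """Return 1 (choose w1) if the likelihood ratio exceeds the threshold, else 2."""
    log_lambda = norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu2, sigma)
    return 1 if log_lambda > np.log(threshold) else 2

for x in (3.0, 6.9, 7.1, 11.0):
    print(x, "->", lrt_decision(x))   # expect w1 below x = 7, w2 above
```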

Probability of error
The performance of any decision rule can be measured by P[error]. Making use of the Theorem of total probability (L2):
$P[error] = \sum_{i=1}^{C} P[error|\omega_i]\, P[\omega_i]$
The class-conditional probability $P[error|\omega_i]$ can be expressed as
$P[error|\omega_i] = P[\text{choose } \omega_j|\omega_i] = \int_{R_j} p(x|\omega_i)\,dx = \varepsilon_i$
So, for our 2-class problem, P[error] becomes
$P[error] = P[\omega_1] \underbrace{\int_{R_2} p(x|\omega_1)\,dx}_{\varepsilon_1} + P[\omega_2] \underbrace{\int_{R_1} p(x|\omega_2)\,dx}_{\varepsilon_2}$
where $\varepsilon_i$ is the integral of $p(x|\omega_i)$ over the region $R_j$ where we choose $\omega_j$.
For the previous example, since we assumed equal priors, $P[error] = (\varepsilon_1 + \varepsilon_2)/2$.
How would you compute P[error] numerically?
[Figure: the two likelihoods of the previous example with decision regions $R_1$ (say $\omega_1$) and $R_2$ (say $\omega_2$); $\varepsilon_1$ and $\varepsilon_2$ are the tail areas that fall on the wrong side of the threshold]
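One way to answer that question: a short sketch (my own, under the assumptions of the previous example, i.e. N(4,1) vs. N(10,1), equal priors, threshold at x = 7) that evaluates $\varepsilon_1$ and $\varepsilon_2$ with Gaussian CDFs.

```python
# Sketch: numerical P[error] for the previous example (assumed N(4,1) vs. N(10,1),
# equal priors, decision threshold at x = 7). Illustrative, not from the slides.
from scipy.stats import norm

threshold = 7.0
eps1 = 1.0 - norm.cdf(threshold, loc=4, scale=1)   # mass of p(x|w1) falling in R2 (x > 7)
eps2 = norm.cdf(threshold, loc=10, scale=1)        # mass of p(x|w2) falling in R1 (x < 7)
p_error = 0.5 * (eps1 + eps2)                      # equal priors
print(eps1, eps2, p_error)                         # each tail is about 0.00135
```

For arbitrary (non-Gaussian) likelihoods, the same quantities can be obtained by numerical integration of $p(x|\omega_i)$ over the opposite decision region.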

How good is the LRT decision rule?
To answer this question, it is convenient to express P[error] in terms of the posterior P[error|x]:
$P[error] = \int_{-\infty}^{\infty} P[error|x]\, p(x)\,dx$
The optimal decision rule will minimize P[error|x] at every value of x in feature space, so that the integral above is minimized.

At each x, P[error|x] is equal to $P(\omega_i|x)$ when we choose $\omega_j$ (with $j \neq i$). This is illustrated in the figure below.
[Figure: the posteriors $P(\omega_1|x)$ and $P(\omega_2|x)$, with the decision regions $R_{1,ALT}$, $R_{2,ALT}$ of an alternative (ALT) decision rule and $R_{1,LRT}$, $R_{2,LRT}$ of the LRT; at a sample point x', P[error|x'] for the ALT rule is larger than P[error|x'] for the LRT rule]
From the figure it becomes clear that, for any value of x, the LRT will always have a lower P[error|x]. Therefore, when we integrate over the real line, the LRT decision rule will yield a lower P[error].
For any given problem, the minimum probability of error is achieved by the LRT decision rule; this probability of error is called the Bayes Error Rate and is the best any classifier can do.

Bayes risk
So far we have assumed that the penalty of misclassifying x ∈ $\omega_1$ as $\omega_2$ is the same as that of the reciprocal error (misclassifying x ∈ $\omega_2$ as $\omega_1$). In general, this is not the case. For example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around.
This concept can be formalized in terms of a cost function $C_{ij}$, where $C_{ij}$ represents the cost of choosing class $\omega_i$ when $\omega_j$ is the true class.
We define the Bayes risk as the expected value of the cost:
$\mathcal{R} = E[C] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\, P[\text{choose } \omega_i \text{ and } x \in \omega_j] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\, P[x \in R_i|\omega_j]\, P[\omega_j]$

What is the decision rule that minimizes the Bayes risk?
First notice that
$P[x \in R_i|\omega_j] = \int_{R_i} p(x|\omega_j)\,dx$
We can express the Bayes risk as
$\mathcal{R} = \int_{R_1} \big[C_{11} P(\omega_1)\, p(x|\omega_1) + C_{12} P(\omega_2)\, p(x|\omega_2)\big]\,dx + \int_{R_2} \big[C_{21} P(\omega_1)\, p(x|\omega_1) + C_{22} P(\omega_2)\, p(x|\omega_2)\big]\,dx$
Then we note that, for either likelihood, one can write
$\int_{R_1} p(x|\omega_i)\,dx + \int_{R_2} p(x|\omega_i)\,dx = \int_{R_1 \cup R_2} p(x|\omega_i)\,dx = 1$

Merging the last equation into the Bayes risk expression (using $\int_{R_2} p(x|\omega_i)\,dx = 1 - \int_{R_1} p(x|\omega_i)\,dx$, so that all the integrals over $R_2$ cancel out) yields
$\mathcal{R} = C_{21} P_1 + C_{22} P_2 + \int_{R_1} \Big[\underbrace{(C_{12} - C_{22})}_{\ge 0} P_2\, p(x|\omega_2) - \underbrace{(C_{21} - C_{11})}_{\ge 0} P_1\, p(x|\omega_1)\Big]\,dx$
The first two terms are constant with respect to $R_1$, so they can be ignored. Thus, we seek a decision region $R_1$ that minimizes the remaining integral:
$R_1 = \arg\min_{R_1} \int_{R_1} \big[(C_{12} - C_{22}) P_2\, p(x|\omega_2) - (C_{21} - C_{11}) P_1\, p(x|\omega_1)\big]\,dx = \arg\min_{R_1} \int_{R_1} g(x)\,dx$

Let's forget about the actual expression of g(x) to develop some intuition for what kind of decision region we are looking for. Intuitively, we will select for $R_1$ those regions that minimize the integral of g(x): in other words, those regions where $g(x) < 0$.
[Figure: a curve g(x) that crosses zero several times; the intervals A, B and C where g(x) < 0 are the ones assigned to $R_1$]
So we will choose $R_1$ such that
$(C_{21} - C_{11})\, P_1\, p(x|\omega_1) > (C_{12} - C_{22})\, P_2\, p(x|\omega_2)$
And rearranging:
$\frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12} - C_{22})\, P(\omega_2)}{(C_{21} - C_{11})\, P(\omega_1)}$
Therefore, minimization of the Bayes risk also leads to an LRT.

The Bayes risk: an example
Consider a problem with likelihoods $p(x|\omega_1) = N(0,3)$ and $p(x|\omega_2) = N(2,1)$ (means 0 and 2, variances 3 and 1).
Sketch the two densities. What is the likelihood ratio? Assume $P(\omega_1) = P(\omega_2)$, $C_{ii} = 0$, $C_{12} = 1$ and $C_{21} = 3^{1/2}$. Determine a decision rule that minimizes the Bayes risk.
Solution: the likelihood ratio and the Bayes threshold are
$\Lambda(x) = \frac{N(0,3)}{N(2,1)} = \frac{1}{\sqrt{3}}\, e^{-\frac{x^2}{6} + \frac{1}{2}(x-2)^2} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})\, P(\omega_2)}{(C_{21}-C_{11})\, P(\omega_1)} = \frac{1}{\sqrt{3}}$
Taking logs and simplifying:
$-\frac{x^2}{6} + \frac{1}{2}(x-2)^2 \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 0 \;\Rightarrow\; 2x^2 - 12x + 12 \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 0 \;\Rightarrow\; x = 4.73,\; 1.27$
So we choose $\omega_2$ for $1.27 < x < 4.73$ (region $R_2$) and $\omega_1$ otherwise.
[Figure: the two likelihoods N(0,3) and N(2,1) plotted over x ∈ [-6, 6], with the central region $R_2$ between x = 1.27 and x = 4.73]
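A small numerical sketch of this example (my own illustration; the grid-based search is an assumption, not from the slides): it evaluates $g(x) = (C_{12}-C_{22}) P_2\, p(x|\omega_2) - (C_{21}-C_{11}) P_1\, p(x|\omega_1)$ on a grid and recovers the decision boundaries near x = 1.27 and x = 4.73.

```python
# Sketch: recover the decision boundaries of the Bayes-risk example numerically.
# Assumes p(x|w1) = N(mean 0, variance 3), p(x|w2) = N(mean 2, variance 1),
# equal priors, C11 = C22 = 0, C12 = 1, C21 = sqrt(3). Illustrative only.
import numpy as np
from scipy.stats import norm

P1 = P2 = 0.5
C12, C21 = 1.0, np.sqrt(3.0)

x = np.linspace(-6, 6, 120001)
g = C12 * P2 * norm.pdf(x, loc=2, scale=1) - C21 * P1 * norm.pdf(x, loc=0, scale=np.sqrt(3))

# R1 is the set where g(x) < 0; its boundaries are where g(x) changes sign
boundaries = x[:-1][np.diff(np.sign(g)) != 0]
print(boundaries)   # approximately [1.27, 4.73]
```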

LRT variations
Bayes criterion: the LRT that minimizes the Bayes risk
$\Lambda_{Bayes}(x) = \frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})\, P(\omega_2)}{(C_{21}-C_{11})\, P(\omega_1)}$
Maximum A Posteriori criterion: sometimes we may be interested in minimizing P[error]. This is a special case of $\Lambda_{Bayes}(x)$ that uses the zero-one cost $C_{ij} = 0$ for $i = j$ and $C_{ij} = 1$ for $i \neq j$. It is known as the MAP criterion, since it seeks to maximize the posterior $P(\omega_i|x)$:
$\Lambda_{MAP}(x) = \frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)} \;\Leftrightarrow\; P(\omega_1|x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|x)$
Maximum Likelihood criterion: for equal priors $P[\omega_i] = 1/2$ and a 0/1 loss function, the LRT is known as the ML criterion, since it seeks to maximize the likelihood $p(x|\omega_i)$:
$\Lambda_{ML}(x) = \frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$
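To make the three criteria concrete, here is a brief sketch (my own; the priors and costs are made-up values, not from the slides) showing that they differ only in the threshold applied to the same likelihood ratio.

```python
# Sketch: Bayes, MAP and ML are the same LRT with different thresholds.
# The priors and costs below are illustrative values, not from the slides.
P1, P2 = 2/3, 1/3
C11, C12, C21, C22 = 0.0, 1.0, 5.0, 0.0   # made-up asymmetric costs

eta_bayes = (C12 - C22) * P2 / ((C21 - C11) * P1)   # general Bayes-risk threshold
eta_map = P2 / P1                                   # zero-one cost
eta_ml = 1.0                                        # zero-one cost and equal priors

def decide(likelihood_ratio, eta):
    """Choose w1 if Lambda(x) exceeds the threshold eta, otherwise w2."""
    return 1 if likelihood_ratio > eta else 2

print(eta_bayes, eta_map, eta_ml)                                         # 0.1, 0.5, 1.0
print(decide(0.3, eta_bayes), decide(0.3, eta_map), decide(0.3, eta_ml))  # 1, 2, 2
```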

Two more decision rules are commonly cited in the literature.
The Neyman-Pearson criterion, used in Detection and Estimation Theory, also leads to an LRT: it fixes one class's error probability, say $\varepsilon_1 \leq \alpha$, and seeks to minimize the other. For instance, for the sea-bass/salmon classification problem of L1, there may be some kind of government regulation stating that we must not misclassify more than 1% of salmon as sea bass. The Neyman-Pearson criterion is very attractive since it does not require knowledge of priors or cost functions.
The Minimax criterion, used in Game Theory, is derived from the Bayes criterion and seeks to minimize the maximum Bayes risk. The Minimax criterion does not require knowledge of the priors, but it needs a cost function.
For more information on these methods, refer to "Detection, Estimation and Modulation Theory" by H.L. van Trees.

Minimum P[error] for multi-class problems
Minimizing P[error] generalizes well to multiple classes. For clarity in the derivation, we express P[error] in terms of the probability of making a correct assignment:
$P[error] = 1 - P[correct]$
The probability of making a correct assignment is
$P[correct] = \sum_{i=1}^{C} P(\omega_i) \int_{R_i} p(x|\omega_i)\,dx$
Minimizing P[error] is equivalent to maximizing P[correct], so expressing the latter in terms of posteriors:
$P[correct] = \sum_{i=1}^{C} \int_{R_i} p(x)\, P(\omega_i|x)\,dx$
To maximize P[correct], we must maximize each integral over $R_i$, which we achieve by choosing the class with the largest posterior. So each $R_i$ is the region where $P(\omega_i|x)$ is maximum, and the decision rule that minimizes P[error] is the MAP criterion.
[Figure: three posteriors $P(\omega_1|x)$, $P(\omega_2|x)$, $P(\omega_3|x)$ over x, with each decision region labeled by the class whose posterior is largest]
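A minimal sketch of the multi-class MAP rule (my own; the three Gaussian class-conditionals and priors are made-up): for each x, pick the class with the largest $P(\omega_i)\,p(x|\omega_i)$, which is proportional to the posterior.

```python
# Sketch: multi-class MAP decision rule. Class models and priors are illustrative only.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([1.0, 0.5, 1.5])

def map_decide(x):
    """Return the index i maximizing P(w_i) p(x|w_i), which is proportional to P(w_i|x)."""
    scores = priors * norm.pdf(x, loc=means, scale=stds)
    return int(np.argmax(scores))

print([map_decide(x) for x in (-2.0, 0.1, 4.0)])   # expect classes 0, 1, 2
```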

Minimum Bayes risk for multi-class problems
Minimizing the Bayes risk also generalizes well. As before, we use a slightly different formulation:
We denote by $\alpha_i$ the decision to choose class $\omega_i$
We denote by $\alpha(x)$ the overall decision rule that maps feature vectors x into classes $\omega_i$, $\alpha(x) \in \{\alpha_1, \alpha_2, \ldots, \alpha_C\}$
The (conditional) risk $\mathcal{R}(\alpha_i|x)$ of assigning x to class $\omega_i$ is
$\mathcal{R}(\alpha(x) = \alpha_i\,|\,x) = \mathcal{R}(\alpha_i|x) = \sum_{j=1}^{C} C_{ij}\, P(\omega_j|x)$
And the Bayes risk associated with decision rule $\alpha(x)$ is
$\mathcal{R}(\alpha(x)) = \int \mathcal{R}(\alpha(x)|x)\, p(x)\,dx$
To minimize this expression, we must minimize the conditional risk $\mathcal{R}(\alpha(x)|x)$ at each x, which is equivalent to choosing the $\omega_i$ for which $\mathcal{R}(\alpha_i|x)$ is minimum.
[Figure: three conditional risks $\mathcal{R}(\alpha_1|x)$, $\mathcal{R}(\alpha_2|x)$, $\mathcal{R}(\alpha_3|x)$ over x, with each decision region labeled by the class whose conditional risk is smallest]
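A companion sketch for the minimum-risk rule (my own; the cost matrix is a made-up example): compute the conditional risk $\mathcal{R}(\alpha_i|x) = \sum_j C_{ij} P(\omega_j|x)$ for every class and take the argmin. With a zero-one cost matrix this reduces to the MAP rule above.

```python
# Sketch: multi-class minimum-Bayes-risk rule. C[i, j] is the cost of choosing
# class w_i when w_j is the true class; the values below are illustrative only.
import numpy as np

C = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [4.0, 1.0, 0.0]])

def min_risk_decide(posteriors):
    """posteriors: array of P(w_j|x). Return argmin_i of R(a_i|x) = sum_j C[i,j] P(w_j|x)."""
    conditional_risk = C @ posteriors
    return int(np.argmin(conditional_risk))

print(min_risk_decide(np.array([0.2, 0.5, 0.3])))   # -> 1 for this cost matrix
```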

Discriminant functions
All the decision rules shown in L4 have the same structure: at each point x in feature space, choose the class $\omega_i$ that maximizes (or minimizes) some measure $g_i(x)$.
This structure can be formalized with a set of discriminant functions $g_i(x)$, $i = 1..C$, and the decision rule "assign x to class $\omega_i$ if $g_i(x) > g_j(x)\;\; \forall j \neq i$".
Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the class with the highest discriminant.
[Figure: a network where the features $x_1, x_2, x_3, \ldots, x_d$ (and, where needed, the costs) feed the discriminant functions $g_1(x), g_2(x), \ldots, g_C(x)$, followed by a select-max stage that produces the class assignment]
And the three decision rules can be summarized as:
Criterion | Discriminant function
Bayes     | $g_i(x) = -\mathcal{R}(\alpha_i|x)$
MAP       | $g_i(x) = P(\omega_i|x)$
ML        | $g_i(x) = p(x|\omega_i)$