Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma




Shinto Eguchi (a) and John Copas (b)

(a) Institute of Statistical Mathematics and Graduate University of Advanced Studies, Minami-azabu 4-6-7, Tokyo 106-8569, Japan
(b) Department of Statistics, University of Warwick, Coventry CV4 7AL, UK

Abstract

Kullback-Leibler divergence and the Neyman-Pearson lemma are two fundamental concepts in statistics. Both are about likelihood ratios: Kullback-Leibler divergence is the expected log-likelihood ratio, and the Neyman-Pearson lemma is about error rates of likelihood ratio tests. Exploring this connection gives another statistical interpretation of the Kullback-Leibler divergence in terms of the loss of power of the likelihood ratio test when the wrong distribution is used for one of the hypotheses. In this interpretation, the standard non-negativity property of the Kullback-Leibler divergence is essentially a restatement of the optimal property of likelihood ratios established by the Neyman-Pearson lemma.

Keywords: Kullback-Leibler divergence; Neyman-Pearson lemma; Information geometry; Testing hypotheses; Maximum likelihood.

Corresponding author. E-mail address: eguchi@ism.ac.jp (S. Eguchi)

1. The Kullback-Leibler Divergence, D

Let P and Q be two probability distributions of a random variable X. If x is continuous, with P and Q described by probability density functions p(x) and q(x) respectively, the Kullback-Leibler (KL) divergence from P to Q is defined by

    D(P, Q) = ∫ log{ p(x) / q(x) } p(x) dx.

See Kullback and Leibler (1951). More generally, D(P, Q) is defined by

    D(P, Q) = ∫ log{ (dP/dQ)(x) } dP(x),

where (dP/dQ)(x) is the density function of P relative to Q. Another way of writing D is

    D(P, Q) = L(P, P) − L(P, Q),                                  (1)

where

    L(P, Q) = E_P{ log q(x) },

and the symbol E_P means taking the expectation over the distribution P. Two fundamental properties of D are

    non-negativity: D(P, Q) ≥ 0, with equality if and only if P = Q;
    asymmetry: D(P, Q) ≠ D(Q, P).

KL divergence plays a central role in the theory of statistical inference. Properties of D, in particular its asymmetry, are exploited through the concept of dual affine connections in the information geometry of Amari (1985) and Amari and Nagaoka (2001). A well known application is Akaike's information criterion (AIC), used to control over-fitting in statistical modeling (Akaike, 1973).

In this note we suggest some statistical interpretations of KL divergence which can help our understanding of this important theoretical concept. Section 2 reviews the well known connections with maximum likelihood and expected log likelihood ratios. Section 3 shows how the Neyman-Pearson lemma gives a new interpretation of D and gives an alternative proof of the non-negativity property. An extension of the argument is mentioned in Section 4.
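
As a concrete illustration of the definition (not part of the argument that follows), the short Python sketch below evaluates D for two arbitrarily chosen univariate normal distributions, once by numerical quadrature and once by the closed-form normal expression, and exhibits both non-negativity and asymmetry; the helper names kl_numerical and kl_normal are ours.

# Minimal numerical sketch: D(P, Q) for two univariate normal densities,
# checked against the closed-form expression for normal distributions.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_numerical(p_pdf, q_pdf, lo=-20.0, hi=20.0):
    """D(P, Q) = integral of log{p(x)/q(x)} p(x) dx, by quadrature."""
    integrand = lambda x: p_pdf(x) * np.log(p_pdf(x) / q_pdf(x))
    value, _ = quad(integrand, lo, hi)
    return value

def kl_normal(m1, s1, m2, s2):
    """Closed form for D(N(m1, s1^2), N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

p = norm(0.0, 1.0).pdf
q = norm(1.0, 2.0).pdf
print(kl_numerical(p, q), kl_normal(0.0, 1.0, 1.0, 2.0))   # agree, and are >= 0
print(kl_numerical(q, p), kl_normal(1.0, 2.0, 0.0, 1.0))   # different value: asymmetry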

2. Well Known Statistical Interpretations of D

2.1 Maximum likelihood

Given a random sample x_1, x_2, ..., x_n from the underlying distribution P, let P_n be the empirical distribution, which puts probability 1/n on each sample value x_i. Now let Q_θ be a statistical model f(x, θ) with unknown parameter θ. Then the empirical version of L(P, Q_θ) is

    L(P_n, Q_θ) = (1/n) Σ_{i=1}^{n} log f(x_i, θ).

Apart from the factor 1/n, this is just the log likelihood function. Note that the empirical version L(P_n, Q_θ) reduces to the population version L(P, Q_θ) for any n by taking its expectation, i.e.

    E_P{ L(P_n, Q_θ) } = L(P, Q_θ).

Hence, from (1), maximizing the likelihood to find the maximum likelihood estimate (MLE) is analogous to finding θ which minimizes D(P, Q_θ). It is obvious that the best possible model is the one that fits the data exactly, i.e. when P = Q_θ, so for any general model Q_θ

    L(P, Q_θ) ≤ L(P, P),

which is just the same as the non-negativity property D(P, Q_θ) ≥ 0. Asymmetry is equally clear because of the very distinct roles of P (the data) and Q_θ (the model). Just as likelihood measures how well a model explains the data, so we can think of D as measuring the lack of fit between model and data relative to a perfect fit.

It follows from the law of large numbers that the MLE converges almost surely to

    θ(P) = arg min_θ D(P, Q_θ).

If P = Q_θ then θ(P) = θ. This outlines the elegant proof of the consistency of the MLE by Wald (1949). In this way, KL divergence helps us to understand the theoretical properties of the maximum likelihood method.
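
The equivalence between maximizing likelihood and minimizing D(P, Q_θ) can be seen in a small simulation. The following sketch (an illustration of ours, with an arbitrary one-parameter normal model and sample size) fits Q_θ = N(θ, 1) to data drawn from P = N(1.5, 1): up to Monte Carlo and grid error, the θ maximizing the average log likelihood coincides with the θ minimizing D(P, Q_θ) and with the sample mean.

# Sketch: maximizing the average log-likelihood L(P_n, Q_theta) over a grid
# is the same as minimizing D(P, Q_theta), since they differ only by the
# constant L(P, P).  Model, sample size and grid are illustrative choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=2000)       # sample from P = N(1.5, 1)

thetas = np.linspace(0.0, 3.0, 301)                  # model Q_theta = N(theta, 1)
avg_loglik = np.array([norm(t, 1.0).logpdf(x).mean() for t in thetas])

# D(N(1.5, 1), N(theta, 1)) has the closed form (1.5 - theta)^2 / 2
kl = 0.5 * (1.5 - thetas) ** 2

print("argmax of L(P_n, Q_theta):", thetas[np.argmax(avg_loglik)])
print("argmin of D(P, Q_theta):  ", thetas[np.argmin(kl)])
print("MLE (sample mean):        ", x.mean())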

2.2 Likelihood ratios

Now suppose that P and Q are possible models for data (vector) X, which we now think of as being a null hypothesis H and an alternative hypothesis A. Suppose that X has probability density or mass function f_H(x) under H and f_A(x) under A. Then the log likelihood ratio is

    λ = λ(x) = log{ f_A(x) / f_H(x) }.                             (2)

If these two hypotheses are reasonably well separated, intuition suggests that λ(X) will tend to be positive if A is true (the correct model f_A fits the data better than the wrong model f_H), and will tend to be negative if H is true (then we would expect f_H to fit the data better than f_A). The expected values of λ(X) are

    E_A{ λ(X) } = D(P_A, P_H),    E_H{ λ(X) } = −D(P_H, P_A).

Thus D(P_A, P_H) is the expected log likelihood ratio when the alternative hypothesis is true. The larger is the likelihood ratio, the more evidence we have for the alternative hypothesis. In this interpretation, D is reminiscent of the power function in hypothesis testing, measuring the degree to which the data will reveal that the null hypothesis is false when the alternative is in fact true. Both this, and its dual when H is true, are zero when the two hypotheses coincide, so that no statistical discrimination is possible. The asymmetry property of D corresponds to the asymmetric roles that the null and alternative hypotheses play in the theory of hypothesis testing.
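
These two identities are easy to check by simulation. The sketch below (our illustration, with two arbitrary normal hypotheses) estimates E_A{λ(X)} and E_H{λ(X)} by Monte Carlo and compares them with the closed-form values of D(P_A, P_H) and −D(P_H, P_A).

# Monte Carlo sketch: the expected log likelihood ratio equals D(P_A, P_H)
# under A and -D(P_H, P_A) under H.  The normal hypotheses are illustrative.
import numpy as np
from scipy.stats import norm

f_H = norm(0.0, 1.0)       # null hypothesis H
f_A = norm(1.0, 1.5)       # alternative hypothesis A

def lam(x):                # log likelihood ratio lambda(x) = log f_A(x)/f_H(x)
    return f_A.logpdf(x) - f_H.logpdf(x)

def kl_normal(m1, s1, m2, s2):
    """Closed form for D(N(m1, s1^2), N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

rng = np.random.default_rng(1)
xA = f_A.rvs(size=200_000, random_state=rng)
xH = f_H.rvs(size=200_000, random_state=rng)

print(lam(xA).mean(),  kl_normal(1.0, 1.5, 0.0, 1.0))   # E_A{lambda} ~  D(P_A, P_H), about +0.72
print(lam(xH).mean(), -kl_normal(0.0, 1.0, 1.0, 1.5))   # E_H{lambda} ~ -D(P_H, P_A), about -0.35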

3. Interpreting D through the Neyman-Pearson Lemma

In terms of the set-up of Section 2.2, the Neyman-Pearson lemma establishes the optimality of the likelihood ratio critical region

    W = { x : λ(x) ≥ u }

in terms of Type I and Type II error rates. If we choose u such that P_H(W) = α, the likelihood ratio test offers W as a critical region with size α. For any other critical region W̃ with the same size, i.e. with P_H(W̃) = P_H(W), the lemma claims that

    P_A(W̃) ≤ P_A(W).                                              (3)

The proof is easily shown by noting that

    λ(x) ≥ u on W ∖ W̃    and    λ(x) < u on W̃ ∖ W,

and then integrating separately over W ∖ W̃ and W̃ ∖ W leads to

    P_A(W) − P_A(W̃) ≥ e^u { P_H(W) − P_H(W̃) }.                    (4)

This yields (3) from the size α condition on W and W̃. The lemma and proof are covered in most text books on mathematical statistics, for example Cox and Hinkley (1974) and Lindsey (1996).

However, to use the optimal test we need to know both f_H(x) and f_A(x) so that we have the correct log likelihood ratio λ(x). Suppose we mis-specified the alternative hypothesis as f_Q(x) instead of f_A(x). We would then use the incorrect log likelihood ratio

    λ̃ = λ̃(x) = log{ f_Q(x) / f_H(x) }.

The same argument as that used in arriving at (4) now gives

    P_A(λ > u) − P_A(λ̃ > u) ≥ e^u { P_H(λ > u) − P_H(λ̃ > u) }.    (5)

The left hand side of this inequality is the loss of power when our test is based on λ̃(x) instead of λ(x).
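
Inequality (5) can also be checked numerically. The following sketch (our illustration; the three normal densities standing in for f_H, f_A and the mis-specified f_Q are arbitrary choices) estimates both sides of (5) by Monte Carlo over a grid of thresholds u.

# Monte Carlo sketch of inequality (5):
#   P_A(lambda > u) - P_A(lambda_tilde > u) >= e^u { P_H(lambda > u) - P_H(lambda_tilde > u) }
import numpy as np
from scipy.stats import norm

f_H = norm(0.0, 1.0)      # null hypothesis
f_A = norm(1.0, 1.0)      # true alternative
f_Q = norm(0.0, 2.0)      # mis-specified alternative

lam  = lambda x: f_A.logpdf(x) - f_H.logpdf(x)   # correct log likelihood ratio
lamt = lambda x: f_Q.logpdf(x) - f_H.logpdf(x)   # incorrect log likelihood ratio

rng = np.random.default_rng(2)
xA = f_A.rvs(size=500_000, random_state=rng)
xH = f_H.rvs(size=500_000, random_state=rng)
lamA, lamtA = lam(xA), lamt(xA)
lamH, lamtH = lam(xH), lamt(xH)

for u in np.linspace(-2.0, 2.0, 9):
    lhs = (lamA > u).mean() - (lamtA > u).mean()
    rhs = np.exp(u) * ((lamH > u).mean() - (lamtH > u).mean())
    print(f"u = {u:+.1f}   lhs = {lhs:+.4f}   rhs = {rhs:+.4f}   (5) holds: {lhs >= rhs}")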

To measure the overall loss of power for different thresholds u we can integrate this over all possible values of u from −∞ to +∞. Integration by parts gives

    Δpower = ∫_{−∞}^{+∞} { P_A(λ > u) − P_A(λ̃ > u) } du
           = [ u { P_A(λ > u) − P_A(λ̃ > u) } ]_{−∞}^{+∞} + ∫_{−∞}^{+∞} u (∂/∂u){ P_A(λ ≤ u) − P_A(λ̃ ≤ u) } du
           = ∫ (λ − λ̃) f_A(x) dx
           = ∫ log{ f_A(x) / f_Q(x) } f_A(x) dx,

by the definition of the expectations of λ and λ̃. Doing the same thing to the right hand side of (5) gives

    ∫_{−∞}^{+∞} e^u { P_H(λ > u) − P_H(λ̃ > u) } du
           = [ e^u { P_H(λ > u) − P_H(λ̃ > u) } ]_{−∞}^{+∞} + ∫_{−∞}^{+∞} e^u (∂/∂u){ P_H(λ ≤ u) − P_H(λ̃ ≤ u) } du
           = ∫ (e^{λ} − e^{λ̃}) f_H(x) dx = 1 − 1 = 0.

Hence we get

    Δpower = D(P_A, P_Q) ≥ 0.                                      (6)

We now have another interpretation of D: the KL divergence from P to Q measures how much power we lose with the likelihood ratio test if we mis-specify the alternative hypothesis P as Q. The non-negativity of D in (6) is essentially a restatement of the Neyman-Pearson lemma. Interestingly, this argument is independent of the choice of null hypothesis.

We get a dualistic version of this if we imagine that it is the null hypothesis that is mis-specified. If we mistakenly take the null as f_Q(x) instead of f_H(x), we would now use the log likelihood ratio

    λ̃ = λ̃(x) = log{ f_A(x) / f_Q(x) }.

Multiply both sides of the inequality (5) by e^{−u} to give

    e^{−u} { P_A(λ > u) − P_A(λ̃ > u) } ≥ P_H(λ > u) − P_H(λ̃ > u),

and integrate both sides with respect to u as before. This gives

    ∫ (e^{−λ̃} − e^{−λ}) f_A(x) dx ≥ ∫ (λ − λ̃) f_H(x) dx,

where the left hand side equals 1 − 1 = 0 as before, which leads to

    Δpower = D(P_H, P_Q) ≥ 0.

KL divergence now corresponds to loss of power if we mis-specify the null hypothesis. Note the essential asymmetry in these arguments: power is explicitly about the alternative hypothesis, and in arriving at the second version we have integrated the power difference using a different weight function.
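
The identity (6) can be verified in the same simulation setting as above, with λ̃ = log{ f_Q(x) / f_H(x) } as in (5). The sketch below (again with arbitrary normal choices for f_H, f_A and f_Q) integrates the estimated power difference over a grid of thresholds and compares the result with the closed-form D(P_A, P_Q), and with the direct Monte Carlo estimate of E_A(λ − λ̃) given by the integration by parts above.

# Monte Carlo sketch of (6): integrate the estimated power difference over u
# and compare with D(P_A, P_Q).  Densities and grid limits are illustrative.
import numpy as np
from scipy.stats import norm

f_H, f_A, f_Q = norm(0.0, 1.0), norm(1.0, 1.0), norm(0.0, 2.0)

rng = np.random.default_rng(3)
xA = f_A.rvs(size=1_000_000, random_state=rng)
lam  = f_A.logpdf(xA) - f_H.logpdf(xA)          # correct log likelihood ratio, under A
lamt = f_Q.logpdf(xA) - f_H.logpdf(xA)          # mis-specified log likelihood ratio, under A

u = np.linspace(-15.0, 15.0, 6001)              # wide enough to cover both statistics
surv_lam  = 1.0 - np.searchsorted(np.sort(lam),  u, side="right") / lam.size
surv_lamt = 1.0 - np.searchsorted(np.sort(lamt), u, side="right") / lamt.size

diff = surv_lam - surv_lamt                     # P_A(lambda > u) - P_A(lambda_tilde > u)
delta_power = np.sum(0.5 * (diff[1:] + diff[:-1]) * np.diff(u))   # trapezoid rule

kl_AQ = np.log(2.0 / 1.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5       # D(N(1,1), N(0,4)) in closed form
print(delta_power, (lam - lamt).mean(), kl_AQ)  # all approximately 0.44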

4. Comment

The receiver operating characteristic (ROC) curve of a test statistic is the graph of the power (one minus the Type II error) against the size (Type I error) as the threshold u takes all possible values. Thus the two sides of inequality (5) are about the differences in coordinates of two ROC curves, one for the optimum λ(x) and the other for the suboptimum statistic λ̃(x). Integrating each side of (5) over u is finding a weighted area between these two curves. The smaller is this area, the closer are the statistical properties of λ̃(x) to those of λ(x) in the context of discriminant analysis. Eguchi and Copas (2002) exploit this interpretation by minimizing this area over a parametric family of statistics λ̃(x). In this way we can find a discriminant function which best approximates the true, but generally unknown, log likelihood ratio statistic.
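
As a final illustration (ours, with the same arbitrary normal densities as before), the sketch below traces the two empirical ROC curves and reads off the power of each statistic at a few roughly matched sizes; the curve for λ dominates the curve for the mis-specified λ̃, as the Neyman-Pearson lemma requires.

# Sketch: empirical ROC coordinates (size, power) for the optimal statistic
# lambda and a sub-optimal lambda_tilde, traced over the threshold u.
import numpy as np
from scipy.stats import norm

f_H, f_A, f_Q = norm(0.0, 1.0), norm(1.0, 1.0), norm(0.0, 2.0)

lam  = lambda x: f_A.logpdf(x) - f_H.logpdf(x)   # optimal statistic
lamt = lambda x: f_Q.logpdf(x) - f_H.logpdf(x)   # sub-optimal (mis-specified) statistic

rng = np.random.default_rng(4)
xH = f_H.rvs(size=400_000, random_state=rng)
xA = f_A.rvs(size=400_000, random_state=rng)

def roc(scores_H, scores_A, u):
    """Empirical size P_H(score > u) and power P_A(score > u) along the threshold grid u."""
    sH, sA = np.sort(scores_H), np.sort(scores_A)
    size  = 1.0 - np.searchsorted(sH, u, side="right") / sH.size
    power = 1.0 - np.searchsorted(sA, u, side="right") / sA.size
    return size, power

u = np.linspace(-10.0, 10.0, 4001)
size1, power1 = roc(lam(xH),  lam(xA),  u)
size2, power2 = roc(lamt(xH), lamt(xA), u)

# At approximately matched sizes the likelihood ratio test is at least as powerful;
# e.g. at size 0.05 the powers are about 0.26 (lambda) and 0.17 (lambda_tilde).
for alpha in (0.01, 0.05, 0.10, 0.25):
    p1 = power1[np.argmin(np.abs(size1 - alpha))]
    p2 = power2[np.argmin(np.abs(size2 - alpha))]
    print(f"size {alpha:.2f}:  power of lambda = {p1:.3f},  power of lambda_tilde = {p2:.3f}")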

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. on Information Theory (eds B. N. Petrov and F. Csáki), 267-281. Budapest: Akadémiai Kiadó.

Amari, S. (1985). Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, vol. 28. New York: Springer.

Amari, S. and Nagaoka, H. (2001). Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. New York: American Mathematical Society.

Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

Eguchi, S. and Copas, J. B. (2002). A class of logistic-type discriminant functions. Biometrika, 89, 1-22.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79-86.

Lindsey, J. K. (1996). Parametric Statistical Inference. Oxford: Oxford University Press.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist., 20, 595-601.