# Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma

Save this PDF as:

Size: px
Start display at page:

Download "Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma"

## Transcription

1 Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma Shinto Eguchi a, and John Copas b a Institute of Statistical Mathematics and Graduate University of Advanced Studies, Minami-azabu 4-6-7, Tokyo , Japan b Department of Statistics, University of Warwick, Coventry CV4 7AL, UK Abstract Kullback-Leibler divergence and the Neyman-Pearson lemma are two fundamental concepts in statistics. Both are about likelihood ratios: Kullback-Leibler divergence is the expected log-likelihood ratio, and the Neyman-Pearson lemma is about error rates of likelihood ratio tests. Exploring this connection gives another statistical interpretation of the Kullback-Leibler divergence in terms of the loss of power of the likelihood ratio test when the wrong distribution is used for one of the hypotheses. In this interpretation, the standard non-negativity property of the Kullback-Leibler divergence is essentially a restatement of the optimal property of likelihood ratios established by the Neyman-Pearson lemma. Keywords: Kullback-Leibler divergence; Neyman-Pearson lemma; Information geometry; Testing hypotheses; Maximum likelihood. Corresponding author. address: (S. Eguchi) 1

2 1. The Kullback-Leibler Divergence, D Let P and Q be two probability distributions of a random variable X. Ifx is continuous, with P and Q described by probability density functions p(x) and q(x) respectively, the Kullback-Leibler (KL) divergence from P to Q is defined by { D(P, Q) = log p(x) } p(x)dx. q(x) See Kullback and Leibler (1951). More generally, D(P, Q) is defined by { D(P, Q) = log P } Q (x) dp (x), where P / Q(x) is the density function of P relative to Q. Another way of writing D is where D(P, Q) =L(P, P) L(P, Q) (1) L(P, Q) =E P {log q(x)}, and the symbol E P means taking the expectation over the distribution P. Two fundamental properties of D are non-negativity : D(P, Q) 0 with equality if and only if P = Q. asymmetry : D(P, Q) D(Q, P ). KL divergence plays a central role in the theory of statistical inference. Properties of D, in particular its asymmetry, are exploited through the concept of dual affine connections in the information geometry of Amari (1985) and Amari and Nagaoka (2001). A well known application is Akaike s information criterion (AIC) used to control over-fitting in statistical modeling (Akaike, 1973). In this note we suggest some statistical interpretations of KL divergence which can help our understanding of this important theoretical concept. Section 2 reviews the well known connections with maximum likelihood and expected log likelihood ratios. Section 3 shows how the Neyman-Pearson lemma gives a new interpretation of D and gives an alternative 2

3 proof of the non-negativity property. Section 4. An extension of the argument is mentioned in 2. Well Known Statistical Interpretations of D 2.1 Maximum likelihood Given a random sample x 1,x 2,,x n from the underlying distribution P, let P n be the empirical distribution, which puts probability 1/n on each sample value x i. Now let Q θ be a statistical model f(x, θ) with unknown parameter θ. Then the empirical version of L(P, Q θ )is L(P n,q θ )= 1 n log f(x i,θ). n Apart from the factor 1/n, this is just the log likelihood function. Note that the empirical version L(P n,q θ ) reduces to the population version L(P, Q θ ) for any n by taking its expectation, i.e., E P {L(P n,q θ )} = L(P, Q θ ). Hence, from (1), maximizing the likelihood to find the maximum likelihood estimate (MLE) is analogous to finding θ which minimizes D(P, Q θ ). It is obvious that the best possible model is the one that fits the data exactly, i.e. when P = Q θ, so for any general model Q θ L(P, Q θ ) L(P, P) which is just the same as the non-negativity property D(P, Q θ ) 0. Asymmetry is equally clear because of the very distinct roles of P (the data) and Q θ (the model). Just as likelihood measures how well a model explains the data, so we can think of D as measuring the lack of fit between model and data relative to a perfect fit. It follows from the law of large numbers that the MLE converges almost surely to θ(p ) = arg min θ D(P, Q θ ). If P = Q θ then θ(p )=θ. This outlines the elegant proof of the consistency of the MLE by Wald (1949). In this way, KL divergence helps us to understand the theoretical properties of the maximum likelihood method. 2.2 Likelihood ratios Now suppose that P and Q are possible models for data (vector) X, which we now think of as being a null hypothesis H and an alternative hypothesis A. Suppose that X 3 i=1

4 has probability density or mass function f H (x) under H and f A (x) under A. Then the log likelihood ratio is λ = λ(x) = log f A(x) f H (x). (2) If these two hypotheses are reasonably well separated, intuition suggests that λ(x) will tend to be positive if A is true (the correct model f A fits the data better than wrong model f H ), and will tend to be negative if H is true (then we would expect f H to fit the data better than f A ). The expected values of λ(x) are E A {λ(x)} = D(P A,P H ), E H {λ(x)} = D(P H,P A ). Thus D(P A,P H ) is the expected log likelihood ratio when the alternative hypothesis is true. The larger is the likelihood ratio, the more evidence we have for the alternative hypothesis. In this interpretation, D is reminiscent of the power function in hypothesis testing, measuring the degree to which the data will reveal that the null hypothesis is false when the alternative is in fact true. Both this, and its dual when H is true, are zero when the two hypotheses coincide so that no statistical discrimination is possible. The asymmetry property of D corresponds to the asymmetric roles that the null and alternative hypotheses play in the theory of hypothesis testing. 3. Interpreting D through the Neyman-Pearson Lemma In terms of the set-up of Section 2.2, the Neyman-Pearson lemma establishes the optimality of the likelihood ratio critical region W = {x : λ(x) u}, in terms of Type I and Type II error rates. If we choose u such that P H (W ) = α, the likelihood ratio test offers W as a critical region with size α. For any other critical region W with the same size, i.e. with P H (W )=P H (W ), the lemma claims that P A (W ) P A (W ). (3) 4

5 The proof is easily shown by noting that and λ(x) u on W W λ(x) <u on W W, and then integrating separately over W W and W W leads to P A (W ) P A (W ) e u {P H (W ) P H (W )}. (4) This yields (3) from the size α condition on W and W. The lemma and proof are covered in most text books on mathematical statistics, for example Cox and Hinkley (1974) and Lindsey (1996). However, to use the optimal test we need to know both f H (x) and f A (x) so that we have the correct log likelihood ratio λ(x). Suppose we mis-specified the alternative hypothesis as f Q (x) instead of f A (x). We would then use the incorrect log likelihood ratio λ = λ(x) = log f Q(x) f H (x). The same argument as that used in arriving at (4) now gives P A (λ>u) P A ( λ >u) e u {P H (λ>u) P H ( λ >u)}. (5) The left hand side of this inequality is the loss of power when our test is based on λ(x) instead of λ(x). To measure the overall loss of power for different thresholds u we can integrate this over all possible values of u from to +. Integration by parts gives power = {P A (λ>u) P A ( λ >u)}du = u{p A (λ>u) P A ( λ >u)} + = (λ λ) f A (x)dx = log f A(x) f Q (x) f A(x)dx u u {P A(λ u) P A ( λ u)}du 5

6 by the definition of the expectations of λ and λ. Doing the same thing to the right hand side of (5) gives e u {P H (λ>u) P H ( λ >u)}du = e u {P H (λ>u) P H ( λ >u)} + = (e λ e λ) f H (x)dx =1 1= 0. Hence we get e u u {P H(λ u) P H ( λ u)}du power = D(P A,P Q ) 0. (6) We now have another interpretation of D: the KL divergence from P to Q measures how much power we lose with the likelihood ratio test if we mis-specify the alternative hypothesis P as Q. The non-negativity of D in (6) is essentially a restatement of the Neyman-Pearson lemma. Interestingly, this argument is independent of the choice of null hypothesis. We get a dualistic version of this if we imagine that it is the null hypothesis that is mis-specified. If we mistakenly take the null as f Q (x) instead of f H (x), we would now use the log likelihood ratio λ = λ(x) = log f A(x) f Q (x). Multiply both sides of the inequality (5) by e u to give e u {P A (λ>u) P A ( λ >u)} P H (λ>u) P H ( λ >u) and integrate both sides with respect to u as before. This gives (e λ e λ)f A (x)dx (λ λ)f H (x)dx which leads to power = D(P H,P Q ) 0. KL divergence now corresponds to loss of power if we mis-specify the null hypothesis. Note the essential asymmetry in these arguments: power is explicitly about the alternative 6

7 hypothesis, and in arriving at the second version we have integrated the power difference using a different weight function. 4. Comment The receiver operating characteristic (ROC) curve of a test statistic is the graph of the power (one minus the Type II error) against the size (Type I error) as the threshold u takes all possible values. Thus the two sides of inequality (5) are about the differences in coordinates of two ROC curves, one for the optimum λ(x) and the other for the suboptimum statistic λ(x). Integrating each side of (5) over u is finding a weighted area between these two curves. The smaller is this area, the closer are the statistical properties of λ(x) to those of λ(x) in the context of discriminant analysis. Eguchi and Copas (2002) exploit this interpretation by minimizing this area over a parametric family of statistics λ(x). In this way we can find a discriminant function which best approximates the true, but generally unknown, log likelihood ratio statistic. References Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. on Inference Theory (eds B. N. Petrov and F. Csáki), Budapest: Académiai Kiadó. Amari, S. (1985). Statistics, vol. Differential-Geometrical Methods in Statistics. Lecture Notes in 28. New York: Springer. Amari, S. and Nagaoka, H. (2001). Methods of Information Geometry. Translations of Mathematical Monographs, vol New York: American Mathematical Society. Cox, D. R. and Hinkley, D. V. (1974). London. Theoretical Statistics. Chapman and Hall, Eguchi, S and Copas, J. B. (2002). Biometrika, 89, A class of logistic-type discriminant functions. Kullback, S and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist., 55,

8 Lindsey, J. K. (1996). Parametric Statistical Inference. Oxford University Press, Oxford. Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Math. Statist., 20, Ann. 8

### 171:290 Model Selection Lecture II: The Akaike Information Criterion

171:290 Model Selection Lecture II: The Akaike Information Criterion Department of Biostatistics Department of Statistics and Actuarial Science August 28, 2012 Introduction AIC, the Akaike Information

### Maximum Likelihood Estimation

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

### Hypothesis Testing. 1 Introduction. 2 Hypotheses. 2.1 Null and Alternative Hypotheses. 2.2 Simple vs. Composite. 2.3 One-Sided and Two-Sided Tests

Hypothesis Testing 1 Introduction This document is a simple tutorial on hypothesis testing. It presents the basic concepts and definitions as well as some frequently asked questions associated with hypothesis

### Introduction to Detection Theory

Introduction to Detection Theory Reading: Ch. 3 in Kay-II. Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf. Ch. 10 in Wasserman. EE 527, Detection

### The CUSUM algorithm a small review. Pierre Granjon

The CUSUM algorithm a small review Pierre Granjon June, 1 Contents 1 The CUSUM algorithm 1.1 Algorithm............................... 1.1.1 The problem......................... 1.1. The different steps......................

### Measuring the Power of a Test

Textbook Reference: Chapter 9.5 Measuring the Power of a Test An economic problem motivates the statement of a null and alternative hypothesis. For a numeric data set, a decision rule can lead to the rejection

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### . (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i.

Chapter 3 Kolmogorov-Smirnov Tests There are many situations where experimenters need to know what is the distribution of the population of their interest. For example, if they want to use a parametric

### Signal Detection. Outline. Detection Theory. Example Applications of Detection Theory

Outline Signal Detection M. Sami Fadali Professor of lectrical ngineering University of Nevada, Reno Hypothesis testing. Neyman-Pearson (NP) detector for a known signal in white Gaussian noise (WGN). Matched

### Testing theory an introduction

Testing theory an introduction P.J.G. Teunissen Series on Mathematical Geodesy and Positioning Testing theory an introduction Testing theory an introduction P.J.G. Teunissen Delft University of Technology

### How to Conduct a Hypothesis Test

How to Conduct a Hypothesis Test The idea of hypothesis testing is relatively straightforward. In various studies we observe certain events. We must ask, is the event due to chance alone, or is there some

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### Differentiating under an integral sign

CALIFORNIA INSTITUTE OF TECHNOLOGY Ma 2b KC Border Introduction to Probability and Statistics February 213 Differentiating under an integral sign In the derivation of Maximum Likelihood Estimators, or

### Betting rules and information theory

Betting rules and information theory Giulio Bottazzi LEM and CAFED Scuola Superiore Sant Anna September, 2013 Outline Simple betting in favorable games The Central Limit Theorem Optimal rules The Game

### Why is Insurance Good? An Example Jon Bakija, Williams College (Revised October 2013)

Why is Insurance Good? An Example Jon Bakija, Williams College (Revised October 2013) Introduction The United States government is, to a rough approximation, an insurance company with an army. 1 That is

### JUST THE MATHS UNIT NUMBER 1.8. ALGEBRA 8 (Polynomials) A.J.Hobson

JUST THE MATHS UNIT NUMBER 1.8 ALGEBRA 8 (Polynomials) by A.J.Hobson 1.8.1 The factor theorem 1.8.2 Application to quadratic and cubic expressions 1.8.3 Cubic equations 1.8.4 Long division of polynomials

### Time Series Analysis

Time Series Analysis hm@imm.dtu.dk Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Identification of univariate time series models, cont.:

### SECOND PART, LECTURE 3: HYPOTHESIS TESTING

Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 3: HYPOTHESIS TESTING Lecture 3: Hypothesis Testing Prof.

### Sufficient Statistics and Exponential Family. 1 Statistics and Sufficient Statistics. Math 541: Statistical Theory II. Lecturer: Songfeng Zheng

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Sufficient Statistics and Exponential Family 1 Statistics and Sufficient Statistics Suppose we have a random sample X 1,, X n taken from a distribution

### On using numerical algebraic geometry to find Lyapunov functions of polynomial dynamical systems

Dynamics at the Horsetooth Volume 2, 2010. On using numerical algebraic geometry to find Lyapunov functions of polynomial dynamical systems Eric Hanson Department of Mathematics Colorado State University

### FALSE ALARMS IN FAULT-TOLERANT DOMINATING SETS IN GRAPHS. Mateusz Nikodem

Opuscula Mathematica Vol. 32 No. 4 2012 http://dx.doi.org/10.7494/opmath.2012.32.4.751 FALSE ALARMS IN FAULT-TOLERANT DOMINATING SETS IN GRAPHS Mateusz Nikodem Abstract. We develop the problem of fault-tolerant

### Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance

### Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 2

CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 2 Proofs Intuitively, the concept of proof should already be familiar We all like to assert things, and few of us

### 6 EXTENDING ALGEBRA. 6.0 Introduction. 6.1 The cubic equation. Objectives

6 EXTENDING ALGEBRA Chapter 6 Extending Algebra Objectives After studying this chapter you should understand techniques whereby equations of cubic degree and higher can be solved; be able to factorise

### Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T

### UNDERSTANDING INFORMATION CRITERIA FOR SELECTION AMONG CAPTURE-RECAPTURE OR RING RECOVERY MODELS

UNDERSTANDING INFORMATION CRITERIA FOR SELECTION AMONG CAPTURE-RECAPTURE OR RING RECOVERY MODELS David R. Anderson and Kenneth P. Burnham Colorado Cooperative Fish and Wildlife Research Unit Room 201 Wagar

### Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

### Network quality control

Network quality control Network quality control P.J.G. Teunissen Delft Institute of Earth Observation and Space systems (DEOS) Delft University of Technology VSSD iv Series on Mathematical Geodesy and

### Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

### Model selection bias and Freedman s paradox

Ann Inst Stat Math (2010) 62:117 125 DOI 10.1007/s10463-009-0234-4 Model selection bias and Freedman s paradox Paul M. Lukacs Kenneth P. Burnham David R. Anderson Received: 16 October 2008 / Revised: 10

### Lecture 8: Signal Detection and Noise Assumption

ECE 83 Fall Statistical Signal Processing instructor: R. Nowak, scribe: Feng Ju Lecture 8: Signal Detection and Noise Assumption Signal Detection : X = W H : X = S + W where W N(, σ I n n and S = [s, s,...,

### Linear Programming Notes V Problem Transformations

Linear Programming Notes V Problem Transformations 1 Introduction Any linear programming problem can be rewritten in either of two standard forms. In the first form, the objective is to maximize, the material

### Generalization of Chisquare Goodness-of-Fit Test for Binned Data Using Saturated Models, with Application to Histograms

Generalization of Chisquare Goodness-of-Fit Test for Binned Data Using Saturated Models, with Application to Histograms Robert D. Cousins Dept. of Physics and Astronomy University of California, Los Angeles,

### On Adaboost and Optimal Betting Strategies

On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of

### PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.

PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Generalized BIC for Singular Models Factoring through Regular Models

Generalized BIC for Singular Models Factoring through Regular Models Shaowei Lin http://math.berkeley.edu/ shaowei/ Department of Mathematics, University of California, Berkeley PhD student (Advisor: Bernd

### arxiv:1112.0829v1 [math.pr] 5 Dec 2011

How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

### Linear Programming I

Linear Programming I November 30, 2003 1 Introduction In the VCR/guns/nuclear bombs/napkins/star wars/professors/butter/mice problem, the benevolent dictator, Bigus Piguinus, of south Antarctica penguins

### Likelihood Approaches for Trial Designs in Early Phase Oncology

Likelihood Approaches for Trial Designs in Early Phase Oncology Clinical Trials Elizabeth Garrett-Mayer, PhD Cody Chiuzan, PhD Hollings Cancer Center Department of Public Health Sciences Medical University

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

### Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

sheng@mail.ncyu.edu.tw 1 Content Introduction Expectation and variance of continuous random variables Normal random variables Exponential random variables Other continuous distributions The distribution

### INFORMATION THEORY AND LOG-LIKELIHOOD A BASIS FOR MODEL SELECTION AND INFERENCE

INFORMATION THEORY AND LOG-LIKELIHOOD MODELS: A BASIS FOR MODEL SELECTION AND INFERENCE Full reality cannot be included in a model; thus we seek a good model to approximate the effects or factors supported

### Inference of Probability Distributions for Trust and Security applications

Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations

### Solving Linear Diophantine Matrix Equations Using the Smith Normal Form (More or Less)

Solving Linear Diophantine Matrix Equations Using the Smith Normal Form (More or Less) Raymond N. Greenwell 1 and Stanley Kertzner 2 1 Department of Mathematics, Hofstra University, Hempstead, NY 11549

### Generating Gaussian Mixture Models by Model Selection For Speech Recognition

Generating Gaussian Mixture Models by Model Selection For Speech Recognition Kai Yu F06 10-701 Final Project Report kaiy@andrew.cmu.edu Abstract While all modern speech recognition systems use Gaussian

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### A linear algebraic method for pricing temporary life annuities

A linear algebraic method for pricing temporary life annuities P. Date (joint work with R. Mamon, L. Jalen and I.C. Wang) Department of Mathematical Sciences, Brunel University, London Outline Introduction

### Testing predictive performance of binary choice models 1

Testing predictive performance of binary choice models 1 Bas Donkers Econometric Institute and Department of Marketing Erasmus University Rotterdam Bertrand Melenberg Department of Econometrics Tilburg

### The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.

Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium

### The Kelly criterion for spread bets

IMA Journal of Applied Mathematics 2007 72,43 51 doi:10.1093/imamat/hxl027 Advance Access publication on December 5, 2006 The Kelly criterion for spread bets S. J. CHAPMAN Oxford Centre for Industrial

### Notes from February 11

Notes from February 11 Math 130 Course web site: www.courses.fas.harvard.edu/5811 Two lemmas Before proving the theorem which was stated at the end of class on February 8, we begin with two lemmas. The

### 18.06 Problem Set 4 Solution Due Wednesday, 11 March 2009 at 4 pm in 2-106. Total: 175 points.

806 Problem Set 4 Solution Due Wednesday, March 2009 at 4 pm in 2-06 Total: 75 points Problem : A is an m n matrix of rank r Suppose there are right-hand-sides b for which A x = b has no solution (a) What

### MATH 90 CHAPTER 1 Name:.

MATH 90 CHAPTER 1 Name:. 1.1 Introduction to Algebra Need To Know What are Algebraic Expressions? Translating Expressions Equations What is Algebra? They say the only thing that stays the same is change.

### Inferential Statistics

Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are

### Continuity of the Perron Root

Linear and Multilinear Algebra http://dx.doi.org/10.1080/03081087.2014.934233 ArXiv: 1407.7564 (http://arxiv.org/abs/1407.7564) Continuity of the Perron Root Carl D. Meyer Department of Mathematics, North

### Non Parametric Inference

Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable

### 9-3.4 Likelihood ratio test. Neyman-Pearson lemma

9-3.4 Likelihood ratio test Neyman-Pearson lemma 9-1 Hypothesis Testing 9-1.1 Statistical Hypotheses Statistical hypothesis testing and confidence interval estimation of parameters are the fundamental

### 1 3 4 = 8i + 20j 13k. x + w. y + w

) Find the point of intersection of the lines x = t +, y = 3t + 4, z = 4t + 5, and x = 6s + 3, y = 5s +, z = 4s + 9, and then find the plane containing these two lines. Solution. Solve the system of equations

### MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS. denote the family of probability density functions g on X satisfying

MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS I. Csiszár (Budapest) Given a σ-finite measure space (X, X, µ) and a d-tuple ϕ = (ϕ 1,..., ϕ d ) of measurable functions on X, for a = (a 1,...,

### Classifiers & Classification

Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin

### From the help desk: Bootstrapped standard errors

The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne

### Performance measures for Neyman-Pearson classification

Performance measures for Neyman-Pearson classification Clayton Scott Abstract In the Neyman-Pearson (NP) classification paradigm, the goal is to learn a classifier from labeled training data such that

### FIN-40008 FINANCIAL INSTRUMENTS SPRING 2008. Options

FIN-40008 FINANCIAL INSTRUMENTS SPRING 2008 Options These notes describe the payoffs to European and American put and call options the so-called plain vanilla options. We consider the payoffs to these

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### A Model of Optimum Tariff in Vehicle Fleet Insurance

A Model of Optimum Tariff in Vehicle Fleet Insurance. Bouhetala and F.Belhia and R.Salmi Statistics and Probability Department Bp, 3, El-Alia, USTHB, Bab-Ezzouar, Alger Algeria. Summary: An approach about

### Dongfeng Li. Autumn 2010

Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 2012-13 school year.

This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Algebra

### Functions and Equations

Centre for Education in Mathematics and Computing Euclid eworkshop # Functions and Equations c 014 UNIVERSITY OF WATERLOO Euclid eworkshop # TOOLKIT Parabolas The quadratic f(x) = ax + bx + c (with a,b,c

### Exact Nonparametric Tests for Comparing Means - A Personal Summary

Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola

### 4.6 Linear Programming duality

4.6 Linear Programming duality To any minimization (maximization) LP we can associate a closely related maximization (minimization) LP. Different spaces and objective functions but in general same optimal

### 1 Portfolio mean and variance

Copyright c 2005 by Karl Sigman Portfolio mean and variance Here we study the performance of a one-period investment X 0 > 0 (dollars) shared among several different assets. Our criterion for measuring

### MOP 2007 Black Group Integer Polynomials Yufei Zhao. Integer Polynomials. June 29, 2007 Yufei Zhao yufeiz@mit.edu

Integer Polynomials June 9, 007 Yufei Zhao yufeiz@mit.edu We will use Z[x] to denote the ring of polynomials with integer coefficients. We begin by summarizing some of the common approaches used in dealing

### Breaking The Code. Ryan Lowe. Ryan Lowe is currently a Ball State senior with a double major in Computer Science and Mathematics and

Breaking The Code Ryan Lowe Ryan Lowe is currently a Ball State senior with a double major in Computer Science and Mathematics and a minor in Applied Physics. As a sophomore, he took an independent study

### 3. INNER PRODUCT SPACES

. INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

### princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora Scribe: One of the running themes in this course is the notion of

### Separation Properties for Locally Convex Cones

Journal of Convex Analysis Volume 9 (2002), No. 1, 301 307 Separation Properties for Locally Convex Cones Walter Roth Department of Mathematics, Universiti Brunei Darussalam, Gadong BE1410, Brunei Darussalam

### OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

### Lecture 14: Section 3.3

Lecture 14: Section 3.3 Shuanglin Shao October 23, 2013 Definition. Two nonzero vectors u and v in R n are said to be orthogonal (or perpendicular) if u v = 0. We will also agree that the zero vector in

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania

Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically

### INTRODUCTION TO RATING MODELS

INTRODUCTION TO RATING MODELS Dr. Daniel Straumann Credit Suisse Credit Portfolio Analytics Zurich, May 26, 2005 May 26, 2005 / Daniel Straumann Slide 2 Motivation Due to the Basle II Accord every bank

### Permutation Tests for Comparing Two Populations

Permutation Tests for Comparing Two Populations Ferry Butar Butar, Ph.D. Jae-Wan Park Abstract Permutation tests for comparing two populations could be widely used in practice because of flexibility of

### Review Horse Race Gambling and Side Information Dependent horse races and the entropy rate. Gambling. Besma Smida. ES250: Lecture 9.

Gambling Besma Smida ES250: Lecture 9 Fall 2008-09 B. Smida (ES250) Gambling Fall 2008-09 1 / 23 Today s outline Review of Huffman Code and Arithmetic Coding Horse Race Gambling and Side Information Dependent

### CHAPTER SIX IRREDUCIBILITY AND FACTORIZATION 1. BASIC DIVISIBILITY THEORY

January 10, 2010 CHAPTER SIX IRREDUCIBILITY AND FACTORIZATION 1. BASIC DIVISIBILITY THEORY The set of polynomials over a field F is a ring, whose structure shares with the ring of integers many characteristics.

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not