Maximum Likelihood Estimation and the Bayesian Information Criterion


Donald Richards, Penn State University

The Method of Maximum Likelihood
R. A. Fisher (1912), "On an absolute criterion for fitting frequency curves," Messenger of Math. 41, 155-160
Fisher's first mathematical paper, written while he was a final-year undergraduate in mathematics and mathematical physics at Gonville and Caius College, Cambridge University
Fisher's paper started with a criticism of two methods of curve fitting: the method of least squares and the method of moments
It is not clear what motivated Fisher to study this subject; perhaps it was the influence of his tutor, F. J. M. Stratton, an astronomer

X: a random variable
θ: a parameter
f(x; θ): a statistical model for X
X_1, ..., X_n: a random sample from X
We want to construct good estimators for θ
The estimator, obviously, should depend on our choice of f

Protheroe et al., "Interpretation of cosmic ray composition: The path length distribution," ApJ 247 (1981)
X: length of paths
Parameter: θ > 0
Model: the exponential distribution, f(x; θ) = θ^(-1) exp(-x/θ), x > 0
Under this model, E(X) = θ
Intuition suggests using the sample mean X̄ to estimate θ; X̄ is unbiased and consistent
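
As an illustration of this intuition (not from the slides), here is a minimal Python simulation; the true value θ = 2.0 and the sample sizes are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0  # hypothetical true mean path length

# Estimate theta by the sample mean X-bar for increasing sample sizes
for n in (10, 100, 10_000):
    x = rng.exponential(scale=theta, size=n)  # exponential model with mean theta
    print(n, x.mean())

The sample mean fluctuates around θ for small n and settles near it as n grows, which is the consistency property mentioned above.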

LF for globular clusters in the Milky Way
X: the luminosity of a randomly chosen cluster
van den Bergh's Gaussian model,
f(x) = (1/(√(2π) σ)) exp(-(x - µ)²/(2σ²))
µ: mean visual absolute magnitude
σ: standard deviation of visual absolute magnitude
X̄ and S² are good estimators for µ and σ², respectively
We seek a method which produces good estimators automatically: no guessing allowed

Choose a globular cluster at random; what is the chance that the LF will be exactly -7.1 mag? Exactly -7.2 mag?
For any continuous random variable X, P(X = x) = 0
Suppose X ~ N(µ = -6.9, σ² = 1.21), i.e., X has probability density function
f(x) = (1/(√(2π) σ)) exp(-(x - µ)²/(2σ²))
Then P(X = -7.1) = 0
However, ...

f(-7.1) = (1/(1.1 √(2π))) exp(-(-7.1 + 6.9)²/(2(1.1)²)) = 0.37
Interpretation: in one simulation of the random variable X, the likelihood of observing the number -7.1 is 0.37
f(-7.2) = 0.28
In one simulation of X, the value x = -7.1 is 32% more likely to be observed than the value x = -7.2
x = -6.9 is the value with highest (or maximum) likelihood; the probability density function is maximized at that point
Fisher's brilliant idea: the method of maximum likelihood

Return to a general model f(x; θ)
Random sample: X_1, ..., X_n
Recall that the X_i are independent random variables
The joint probability density function of the sample is f(x_1; θ) f(x_2; θ) ⋯ f(x_n; θ)
Here the variables are the x's, while θ is fixed
Fisher's ingenious idea: reverse the roles of the x's and θ
Regard the X's as fixed and θ as the variable

The likelihood function is
L(θ; X_1, ..., X_n) = f(X_1; θ) f(X_2; θ) ⋯ f(X_n; θ)
Simpler notation: L(θ)
θ̂, the maximum likelihood estimator of θ, is the value of θ at which L is maximized
θ̂ is a function of the X's
Note: the MLE is not always unique

Example: cosmic ray composition, the path length distribution
X: length of paths
Parameter: θ > 0
Model: the exponential distribution, f(x; θ) = θ^(-1) exp(-x/θ), x > 0
Random sample: X_1, ..., X_n
Likelihood function:
L(θ) = f(X_1; θ) f(X_2; θ) ⋯ f(X_n; θ) = θ^(-n) exp(-n X̄/θ)

Maximize L using calculus
It is equivalent, and simpler, to maximize ln L: ln L(θ) = -n ln θ - n X̄/θ is maximized at θ = X̄
Conclusion: the MLE of θ is θ̂ = X̄
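
A quick numerical check of this conclusion (illustrative only; the simulated data and the value θ = 3.0 are hypothetical): maximize ln L(θ) directly and compare the maximizer with X̄.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=500)  # hypothetical path-length data

def neg_log_lik(theta):
    # -ln L(theta) = n ln(theta) + sum(x)/theta for the exponential model
    return len(x) * np.log(theta) + x.sum() / theta

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, x.mean())  # the numerical maximizer agrees with the sample mean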

LF for globular clusters: X ~ N(µ, σ²), with both µ and σ unknown
f(x; µ, σ²) = (1/√(2πσ²)) exp(-(x - µ)²/(2σ²))
A likelihood function of two variables,
L(µ, σ²) = f(X_1; µ, σ²) ⋯ f(X_n; µ, σ²) = (2πσ²)^(-n/2) exp(-(1/(2σ²)) Σ_{i=1}^n (X_i - µ)²)
Solve for µ and σ² the simultaneous equations ∂ ln L/∂µ = 0 and ∂ ln L/∂(σ²) = 0
Check that L is concave at the solution (Hessian matrix)

Conclusion: the MLEs are µ̂ = X̄ and σ̂² = (1/n) Σ_{i=1}^n (X_i - X̄)²
µ̂ is unbiased: E(µ̂) = µ
σ̂² is not unbiased: E(σ̂²) = ((n-1)/n) σ² ≠ σ²
For this reason, we use S² = (n/(n-1)) σ̂² instead of σ̂²
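
The bias factor (n-1)/n can be seen in a short simulation (the parameter values are made up for illustration):

import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = samples.var(axis=1, ddof=0)  # (1/n) * sum (X_i - X-bar)^2
s2 = samples.var(axis=1, ddof=1)          # (1/(n-1)) * sum (X_i - X-bar)^2

print(sigma2_mle.mean())  # close to ((n-1)/n) * sigma2 = 3.6
print(s2.mean())          # close to sigma2 = 4.0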

Calculus cannot always be used to find MLEs
Example: cosmic ray composition
Parameter: θ > 0
Model: f(x; θ) = exp(-(x - θ)) for x ≥ θ, and 0 for x < θ
Random sample: X_1, ..., X_n
L(θ) = f(X_1; θ) ⋯ f(X_n; θ) = exp(-Σ_{i=1}^n (X_i - θ)) if all X_i ≥ θ, and 0 otherwise
θ̂ = X_(1), the smallest observation in the sample
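
The reason calculus fails here is that L(θ) increases in θ right up to the smallest observation and then drops to zero. A grid evaluation on hypothetical data (the shift θ = 5.0 is made up) makes this visible:

import numpy as np

rng = np.random.default_rng(3)
theta_true = 5.0                             # hypothetical shift parameter
x = theta_true + rng.exponential(size=200)   # shifted-exponential sample

def log_lik(theta):
    # ln L(theta) = -sum(x - theta) if all x >= theta, else -infinity
    return -np.sum(x - theta) if np.all(x >= theta) else -np.inf

grid = np.linspace(4.0, 6.0, 2001)
values = np.array([log_lik(t) for t in grid])
print(grid[np.argmax(values)], x.min())  # the grid maximizer sits just below X_(1)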

General Properties of the MLE
θ̂ may not be unbiased; we often can remove this bias by multiplying θ̂ by a constant
For many models, θ̂ is consistent
The invariance property: for many nice functions g, if θ̂ is the MLE of θ then g(θ̂) is the MLE of g(θ)
The asymptotic property: for large n, θ̂ has an approximate normal distribution with mean θ and variance 1/B, where B = n E[(∂/∂θ ln f(X; θ))²]
The asymptotic property can be used to construct large-sample confidence intervals for θ
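
For instance, for the exponential model on p. 4 one can check that B = n/θ², so an approximate 95% interval is θ̂ ± 1.96 θ̂/√n. A minimal sketch with hypothetical data (true θ = 3.0 is made up):

import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=3.0, size=400)  # hypothetical data with true theta = 3.0

theta_hat = x.mean()                # MLE of theta under the exponential model
se = theta_hat / np.sqrt(len(x))    # sqrt(1/B), with B = n/theta^2 evaluated at theta-hat
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)  # approximate 95% confidence interval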

The method of maximum likelihood works well when intuition fails and no obvious estimator can be found
When an obvious estimator exists, the method of ML will often find it
The method can be applied to many statistical problems: regression analysis, analysis of variance, discriminant analysis, hypothesis testing, principal components, etc.

The ML Method for Linear Regression Analysis
Scatterplot data: (x_1, y_1), ..., (x_n, y_n)
Basic assumption: the x_i's are non-random measurements; the y_i are observations on Y, a random variable
Statistical model: Y_i = α + β x_i + ε_i, i = 1, ..., n
Errors ε_1, ..., ε_n: a random sample from N(0, σ²)
Parameters: α, β, σ²
Y_i ~ N(α + β x_i, σ²); the Y_i's are independent
The Y_i are not identically distributed; they have differing means

The likelihood function is the joint density of the observed data:
L(α, β, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp(-(Y_i - α - β x_i)²/(2σ²)) = (2πσ²)^(-n/2) exp(-Σ_{i=1}^n (Y_i - α - β x_i)²/(2σ²))
Use calculus to maximize ln L with respect to α, β, and σ²
The ML estimators are:
β̂ = Σ_{i=1}^n (x_i - x̄)(Y_i - Ȳ) / Σ_{i=1}^n (x_i - x̄)²,  α̂ = Ȳ - β̂ x̄,  σ̂² = (1/n) Σ_{i=1}^n (Y_i - α̂ - β̂ x_i)²
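
A small sketch (with made-up data; α = 1.5, β = 0.8, σ = 2.0 are hypothetical) evaluating these closed-form estimators and checking them against an ordinary least-squares fit, which maximizes the same likelihood:

import numpy as np

rng = np.random.default_rng(5)
n = 50
x = np.linspace(0.0, 10.0, n)                # fixed (non-random) x_i
y = 1.5 + 0.8 * x + rng.normal(0.0, 2.0, n)  # hypothetical alpha, beta, sigma

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
sigma2_hat = np.mean((y - alpha_hat - beta_hat * x) ** 2)  # ML estimate (divides by n)

print(alpha_hat, beta_hat, sigma2_hat)
print(np.polyfit(x, y, 1))  # least-squares slope and intercept agree with the MLEs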

The ML Method for Testing Hypotheses
X ~ N(µ, σ²)
Model: f(x; µ, σ²) = (1/√(2πσ²)) exp(-(x - µ)²/(2σ²))
Random sample: X_1, ..., X_n
We wish to test H_0: µ = 3 vs. H_a: µ ≠ 3
The space of all permissible values of the parameters is Ω = {(µ, σ): -∞ < µ < ∞, σ > 0}
H_0 and H_a represent restrictions on the parameters, so we are led to the parameter subspaces
ω_0 = {(µ, σ): µ = 3, σ > 0},  ω_a = {(µ, σ): µ ≠ 3, σ > 0}

Construct the likelihood function
L(µ, σ²) = (2πσ²)^(-n/2) exp(-(1/(2σ²)) Σ_{i=1}^n (X_i - µ)²)
Maximize L(µ, σ²) over ω_0 and then over ω_a
The likelihood ratio test statistic is
λ = max_{ω_0} L(µ, σ²) / max_{ω_a} L(µ, σ²) = max_{σ>0} L(3, σ²) / max_{µ≠3, σ>0} L(µ, σ²)
Fact: 0 ≤ λ ≤ 1

L(3, σ²) is maximized over ω_0 at σ² = (1/n) Σ_{i=1}^n (X_i - 3)²
max_{ω_0} L(3, σ²) = L(3, (1/n) Σ_{i=1}^n (X_i - 3)²) = (n / (2πe Σ_{i=1}^n (X_i - 3)²))^(n/2)

L(µ, σ²) is maximized over ω_a at µ = X̄, σ² = (1/n) Σ_{i=1}^n (X_i - X̄)²
max_{ω_a} L(µ, σ²) = L(X̄, (1/n) Σ_{i=1}^n (X_i - X̄)²) = (n / (2πe Σ_{i=1}^n (X_i - X̄)²))^(n/2)

The likelihood ratio test statistic:
λ^(2/n) = [n / (2πe Σ_{i=1}^n (X_i - 3)²)] / [n / (2πe Σ_{i=1}^n (X_i - X̄)²)] = Σ_{i=1}^n (X_i - X̄)² / Σ_{i=1}^n (X_i - 3)²
λ is close to 1 iff X̄ is close to 3
λ is close to 0 iff X̄ is far from 3
λ is equivalent to a t-statistic
In this case, the ML method discovers the obvious test statistic
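
To see the equivalence numerically: since Σ(X_i - 3)² = Σ(X_i - X̄)² + n(X̄ - 3)², one can check that λ^(2/n) = 1/(1 + t²/(n-1)), where t is the one-sample t statistic for H_0: µ = 3. A sketch on hypothetical data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(3.4, 1.0, size=30)  # hypothetical sample; H0: mu = 3

n = len(x)
lam_2n = np.sum((x - x.mean()) ** 2) / np.sum((x - 3.0) ** 2)  # lambda^(2/n)

t, p = stats.ttest_1samp(x, 3.0)
print(lam_2n, 1.0 / (1.0 + t ** 2 / (n - 1)))  # identical: lambda is a monotone function of |t|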

The Bayesian Information Criterion
Suppose that we have two competing statistical models
We can fit these models using the methods of least squares, moments, maximum likelihood, ...
The choice of model cannot be assessed entirely by these methods: by increasing the number of parameters, we can always reduce the residual sum of squares
Polynomial regression: by increasing the number of terms, we can reduce the residual sum of squares
More complicated models generally will have lower residual errors

BIC: a standard approach to model selection for large data sets
The BIC penalizes models with larger numbers of free parameters
Competing models: f_1(x; θ_1, ..., θ_{m_1}) and f_2(x; φ_1, ..., φ_{m_2})
Random sample: X_1, ..., X_n
Likelihood functions: L_1(θ_1, ..., θ_{m_1}) and L_2(φ_1, ..., φ_{m_2})
BIC = 2 ln[L_1(θ_1, ..., θ_{m_1}) / L_2(φ_1, ..., φ_{m_2})] - (m_1 - m_2) ln n
The BIC balances an increase in the likelihood against the number of parameters used to achieve that increase

Calculate all MLEs θ̂_i and φ̂_i and the estimated BIC:
BIC = 2 ln[L_1(θ̂_1, ..., θ̂_{m_1}) / L_2(φ̂_1, ..., φ̂_{m_2})] - (m_1 - m_2) ln n
General rules:
BIC < 2: weak evidence that Model 1 is superior to Model 2
2 ≤ BIC ≤ 6: moderate evidence that Model 1 is superior
6 < BIC ≤ 10: strong evidence that Model 1 is superior
BIC > 10: very strong evidence that Model 1 is superior

Competing models for the GCLF in the Galaxy
1. A Gaussian model (van den Bergh 1985, ApJ, 297):
f(x; µ, σ) = (1/(√(2π) σ)) exp(-(x - µ)²/(2σ²))
2. A t-distribution model (Secker 1992, AJ 104):
g(x; µ, σ, δ) = [Γ((δ+1)/2) / (√(πδ) σ Γ(δ/2))] (1 + (x - µ)²/(δσ²))^(-(δ+1)/2),  -∞ < µ < ∞, σ > 0, δ > 0
In each model, µ is the mean and σ² is the variance
In Model 2, δ is a shape parameter

We use the data of Secker (1992), Table 1
We assume that the data constitute a random sample
ML calculations suggest that Model 1 is inferior to Model 2
Question: is the increase in likelihood due to the larger number of parameters?
This question can be studied using the BIC
Test of hypothesis H_0: Gaussian model vs. H_a: t model

Model 1: write down the likelihood function,
L_1(µ, σ) = (2πσ²)^(-n/2) exp(-(1/(2σ²)) Σ_{i=1}^n (X_i - µ)²)
µ̂ = X̄, the ML estimator
σ̂² = S², a multiple of the ML estimator of σ²
L_1(X̄, S) = (2πS²)^(-n/2) exp(-(n-1)/2)
For the Milky Way data, x̄ = -7.14 and s = 1.41
Secker (1992, p. 1476): ln L_1(-7.14, 1.41) = -176.4

Model 2: write down the likelihood function
L_2(µ, σ, δ) = Π_{i=1}^n [Γ((δ+1)/2) / (√(πδ) σ Γ(δ/2))] (1 + (X_i - µ)²/(δσ²))^(-(δ+1)/2)
Are the MLEs of µ, σ², δ unique? No explicit formulas for them are known; we evaluate them numerically
Substitute the Milky Way data for the X_i's in the formula for L_2, and maximize L_2 numerically
Secker (1992): µ̂ = -7.31, σ̂ = 1.03, δ̂ = 3.55
Secker (1992, p. 1476): ln L_2(-7.31, 1.03, 3.55) = -173.0
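
Since no closed-form MLEs exist for the t model, they must be found by numerical optimization. A sketch of how this could be done, on simulated, hypothetical data rather than Secker's Table 1:

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(7)
x = rng.standard_t(df=4, size=100) - 7.3  # hypothetical magnitudes, not Secker's data

def neg_log_lik(params):
    mu, log_sigma, log_delta = params  # log-transforms keep sigma and delta positive
    sigma, delta = np.exp(log_sigma), np.exp(log_delta)
    z2 = (x - mu) ** 2 / (delta * sigma ** 2)
    log_pdf = (gammaln((delta + 1) / 2) - gammaln(delta / 2)
               - 0.5 * np.log(np.pi * delta) - np.log(sigma)
               - (delta + 1) / 2 * np.log1p(z2))
    return -np.sum(log_pdf)

res = minimize(neg_log_lik, x0=[np.median(x), 0.0, np.log(5.0)], method="Nelder-Mead")
mu_hat, sigma_hat, delta_hat = res.x[0], np.exp(res.x[1]), np.exp(res.x[2])
print(mu_hat, sigma_hat, delta_hat, -res.fun)  # MLEs and the maximized log-likelihood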

Finally, calculate the estimated BIC, with m_1 = 2, m_2 = 3, n = 100:
BIC = 2 ln[L_1(-7.14, 1.41) / L_2(-7.31, 1.03, 3.55)] - (m_1 - m_2) ln n = -2.2
Apply the general rules on p. 26 to assess the strength of the evidence that Model 1 may be superior to Model 2
Since the estimated BIC is negative but small in magnitude, there is only weak evidence that the t-distribution model is superior to the Gaussian model
We fail to reject the null hypothesis that the GCLF follows the Gaussian model rather than the t model
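
The arithmetic on this slide can be reproduced directly from the reported log-likelihoods and parameter counts; a minimal check:

import numpy as np

ln_L1 = -176.4  # Gaussian model: ln L_1(-7.14, 1.41), from Secker (1992)
ln_L2 = -173.0  # t model: ln L_2(-7.31, 1.03, 3.55), from Secker (1992)
m1, m2, n = 2, 3, 100

bic = 2 * (ln_L1 - ln_L2) - (m1 - m2) * np.log(n)
print(round(bic, 1))  # -2.2: the t model is favored, but only weakly by the rules on p. 26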

Concluding General Remarks on the BIC
The BIC procedure is consistent: if Model 1 is the true model then, as n → ∞, the BIC will determine (with probability 1) that it is
In typical significance tests, any null hypothesis is rejected if n is sufficiently large; the ln n factor in the BIC counteracts this by giving lower weight to the sample size
Not all information criteria are consistent; e.g., the AIC is not consistent (Azencott and Dacunha-Castelle, 1986)
The BIC is not a panacea; some authors recommend that it be used in conjunction with other information criteria

There are also difficulties with the BIC
Findley (1991, Ann. Inst. Statist. Math.) studied the performance of the BIC for comparing two models with different numbers of parameters:
Suppose that the log-likelihood-ratio sequence of two models with different numbers of estimated parameters is bounded in probability. Then the BIC will, with asymptotic probability 1, select the model having fewer parameters.