Maximum Likelihood Estimation

Quantitative Microeconomics
R. Mora, Department of Economics, Universidad Carlos III de Madrid

General Approaches to Parameter Estimation

There are several estimation criteria that produce estimators with good properties:

- Least Squares (OLS or GLS).
- Method of Moments (OLS, GLS, and IV are special cases): if $\theta_0 = g(E(Y))$, then $\hat\theta = g(E_N[y_i])$, where $E_N$ denotes the sample average.
- Maximum Likelihood (ML): it chooses the vector $\hat\theta$ under which the observed sample is most likely.

Basic Setup

Let $\{y_1, y_2, \dots, y_N\}$ be an iid sample from a population with density $f(y; \theta_0)$. We aim to estimate $\theta_0$.

Because of the iid assumption, the joint density of $\{y_1, y_2, \dots, y_N\}$ is simply the product of the individual densities:

$$f(y_1, y_2, \dots, y_N; \theta_0) = f(y_1; \theta_0)\, f(y_2; \theta_0) \cdots f(y_N; \theta_0)$$

The likelihood function is the function obtained for a given sample after replacing the true $\theta_0$ by an arbitrary $\theta$:

$$L(\theta) = f(y_1; \theta)\, f(y_2; \theta) \cdots f(y_N; \theta)$$

$L(\theta)$ is a random variable because it depends on the sample.

The Maximum Likelihood Estimator

The maximum likelihood estimator of $\theta_0$, $\hat\theta_{ML}$, is the value of $\theta$ that maximizes the likelihood function $L(\theta)$.

It is more convenient to work with the logarithm of the likelihood function,

$$l(\theta) = \sum_{i=1}^{N} \log f(y_i; \theta)$$

Since the logarithmic transform is monotonic, $\hat\theta_{ML}$ also maximizes $l(\theta)$.

Example: Bernoulli (1/3)

Assume that $Y$ is Bernoulli:

$$y_i = \begin{cases} 1 & \text{with probability } p_0 \\ 0 & \text{with probability } 1 - p_0 \end{cases}$$

The likelihood for observation $i$ is

$$f(y_i; p_0) = \begin{cases} p_0 & \text{if } y_i = 1 \\ 1 - p_0 & \text{if } y_i = 0 \end{cases}$$

Let $n_1$ be the number of observations equal to 1. Then, under iid sampling,

$$L(p) = p^{n_1}(1-p)^{n - n_1}$$

We obtain a different likelihood for each sample:
- With $\{0,1,0,0,0\}$: $L(p) = p\,(1-p)^4$
- With $\{1,0,0,1,1\}$: $L(p) = p^3(1-p)^2$

Example: Bernoulli (2/3)

[Plots of $L(p)$ against $p$ for the two samples $\{0,1,0,0,0\}$ and $\{1,0,0,1,1\}$.]

- With $\{0,1,0,0,0\}$: $\hat p_{ML} = 0.2$
- With $\{1,0,0,1,1\}$: $\hat p_{ML} = 0.6$

Example: Bernoulli (3/3)

The maximum likelihood estimator is the value that maximizes

$$L(p) = p^{n_1}(1-p)^{n - n_1}$$

The same $\hat p_{ML}$ maximizes the logarithm of the likelihood function,

$$l(p) = n_1 \log(p) + (n - n_1)\log(1-p)$$

Setting the first derivative to zero,

$$\frac{\partial l(p)}{\partial p} = 0 \;\Rightarrow\; \frac{n_1}{\hat p_{ML}} = \frac{n - n_1}{1 - \hat p_{ML}} \;\Rightarrow\; \hat p_{ML} = \frac{n_1}{n}$$

- With $\{0,1,0,0,0\}$: $\hat p_{ML} = 1/5 = 0.2$
- With $\{1,0,0,1,1\}$: $\hat p_{ML} = 3/5 = 0.6$
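As a quick numerical cross-check of this closed-form result, the following sketch (an illustration in Python with NumPy/SciPy, not part of the original slides) maximizes the Bernoulli log-likelihood for the sample $\{1,0,0,1,1\}$ and recovers $n_1/n$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sample {1,0,0,1,1} from the slides: n = 5 observations, n1 = 3 ones
y = np.array([1, 0, 0, 1, 1])
n, n1 = y.size, y.sum()

def neg_loglik(p):
    # l(p) = n1*log(p) + (n - n1)*log(1 - p); we minimize its negative
    return -(n1 * np.log(p) + (n - n1) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)    # numerical maximizer, approximately 0.6
print(n1 / n)   # closed-form MLE n1/n = 0.6
```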

Basic Setup: the Normal Linear Regression Model

Let $\{y_1, y_2, \dots, y_N\}$ be an iid sample from $y \mid x \sim N(\beta_0' x, \sigma_0^2)$. We aim to estimate $\theta_0 = (\beta_0, \sigma_0^2)$.

Because of the iid assumption, the joint distribution of $\{y_1, y_2, \dots, y_N\}$ is simply the product of the conditional densities:

$$f(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta_0) = f(y_1 \mid x_1; \theta_0)\, f(y_2 \mid x_2; \theta_0) \cdots f(y_N \mid x_N; \theta_0)$$

Note that $y \mid x \sim N(\beta_0' x, \sigma_0^2)$ means that $\varepsilon \equiv y - \beta_0' x \sim N(0, \sigma_0^2)$. This implies that

$$f_{y \mid x}(y_i \mid x_i; \theta_0) = f_\varepsilon(y_i - \beta_0' x_i; \theta_0)$$

Density of the Error Term

We have $\varepsilon \sim N(0, \sigma_0^2)$, so what is its density $f_\varepsilon(z; \theta_0)$?

1. $\varepsilon \sim N(0, \sigma_0^2) \;\Rightarrow\; \varepsilon / \sigma_0 \sim N(0, 1)$
2. $CDF_\varepsilon(z) \equiv \Pr(\varepsilon \le z) = \Pr\!\left(\frac{\varepsilon}{\sigma_0} \le \frac{z}{\sigma_0}\right)$
3. Hence, $CDF_\varepsilon(z) = \Phi\!\left(\frac{z}{\sigma_0}\right)$
4. The density of a continuous random variable is the first derivative of its CDF:

$$f_\varepsilon(z; \theta_0) = \frac{1}{\sigma_0}\,\phi\!\left(\frac{z}{\sigma_0}\right)$$

Density of the Sample

Since

$$f_\varepsilon(z; \theta_0) = \frac{1}{\sigma_0}\,\phi\!\left(\frac{z}{\sigma_0}\right) \qquad \text{and} \qquad f_{y \mid x}(y_i \mid x_i; \theta_0) = f_\varepsilon(y_i - \beta_0' x_i; \theta_0),$$

and

$$f(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta_0) = f(y_1 \mid x_1; \theta_0)\, f(y_2 \mid x_2; \theta_0) \cdots f(y_N \mid x_N; \theta_0),$$

then we have that

$$f(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta_0) = \prod_{i=1}^{N} \frac{1}{\sigma_0}\,\phi\!\left(\frac{y_i - \beta_0' x_i}{\sigma_0}\right)$$

The Log-Likelihood Function

The likelihood function replaces the true parameter values with free arguments:

$$L(\beta, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right)$$

Taking the log makes the problem easier:

$$\log L(\beta, \sigma) = \sum_{i=1}^{N} \left\{ \log\frac{1}{\sigma} + \log \phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) \right\}$$

and given that $\phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) = (2\pi)^{-\frac{1}{2}} \exp\!\left[-\frac{1}{2}\left(\frac{y_i - \beta' x_i}{\sigma}\right)^2\right]$, we have that

$$\log L(\beta, \sigma) = N \log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\sum_{i=1}^{N} \left(\frac{y_i - \beta' x_i}{\sigma}\right)^2$$

The ML Estimator: First-Order Conditions

With respect to $\beta$:

$$\frac{1}{\hat\sigma^2}\sum_{i=1}^{N} x_i\,(y_i - \hat\beta' x_i) = 0,$$

which implies

$$\sum_{i=1}^{N} x_i\,(y_i - \hat\beta' x_i) = 0$$

With respect to $\sigma$, the first-order condition implies

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat\beta' x_i)^2$$

The MLE of $\beta$ is exactly the same estimator as OLS; $\hat\sigma^2 = \frac{N-1}{N}\,s^2$ is biased, but the bias disappears as $N$ increases.
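To illustrate the equivalence with OLS, here is a minimal simulation sketch (hypothetical data, Python with NumPy/SciPy, not from the slides) that maximizes the normal log-likelihood numerically and compares the resulting $\hat\beta$ with the OLS coefficients:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
x = np.column_stack([np.ones(N), rng.normal(size=N)])      # intercept + one regressor
beta_true, sigma_true = np.array([1.0, 2.0]), 0.5
y = x @ beta_true + rng.normal(scale=sigma_true, size=N)

def neg_loglik(theta):
    beta, sigma = theta[:2], np.exp(theta[2])               # parameterize sigma > 0
    resid = y - x @ beta
    # negative of: -N*log(sqrt(2*pi*sigma^2)) - (1/2) * sum((resid/sigma)^2)
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
print(res.x[:2])                                            # ML estimate of beta
print(beta_ols)                                             # OLS estimate: numerically the same
print(np.exp(2 * res.x[2]), np.mean((y - x @ beta_ols)**2)) # ML sigma^2 equals (1/N)*SSR
```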

Computing the MLE

ML estimates are often easy to compute, as in the two previous examples. Sometimes, however, there is no algebraic solution to the maximization problem. It is then necessary to use some sort of numerical maximization procedure.

Numerical Maximization Procedures: Newton's Method

- Start with an initial value $\hat\theta_0$.
- At each iteration, update $\hat\theta_{j+1} = \hat\theta_j - H^{-1} g$, where $g$ is the first derivative of the log-likelihood (the gradient) and $H$ is the second derivative (the Hessian), both evaluated at $\hat\theta_j$.
- Check whether convergence has been reached.

The update answers the question: which step $\Delta\hat\theta$ increases the most the quadratic Taylor approximation of $L(\hat\theta + \Delta\hat\theta)$,

$$L(\hat\theta + \Delta\hat\theta) \approx L(\hat\theta) + g(\hat\theta)\,\Delta\hat\theta + \frac{1}{2} H(\hat\theta)\,\Delta\hat\theta^2\,?$$
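As a concrete sketch (hypothetical function name, not from the slides), the code below applies Newton's method to the Bernoulli log-likelihood, for which the gradient and Hessian are available in closed form:

```python
def newton_bernoulli(n1, n, p0=0.5, tol=1e-10, max_iter=50):
    """Newton iterations for l(p) = n1*log(p) + (n - n1)*log(1 - p)."""
    p = p0
    for _ in range(max_iter):
        g = n1 / p - (n - n1) / (1 - p)              # gradient (score)
        H = -n1 / p**2 - (n - n1) / (1 - p)**2       # Hessian (negative for 0 < p < 1)
        step = -g / H                                # theta_{j+1} = theta_j - H^{-1} g
        p += step
        if abs(step) < tol:                          # convergence check
            break
    return p

print(newton_bernoulli(3, 5))   # converges to n1/n = 0.6
```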

Quasi-Newton Methods

Newton's method will not work well when the Hessian is not negative definite. In such cases, one popular way to obtain the MLE is to replace the Hessian with a matrix that is always negative definite. These approaches are referred to as quasi-Newton algorithms. gretl uses one of them: the BFGS algorithm (Broyden, Fletcher, Goldfarb and Shanno).
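For illustration only (this is not the gretl implementation), SciPy's optimizer also offers BFGS. The sketch below reuses the Bernoulli example, supplying only the objective and its gradient and letting BFGS build its own curvature approximation; the logit reparameterization, added here to keep $p$ inside $(0,1)$, is my own assumption rather than part of the slides:

```python
import numpy as np
from scipy.optimize import minimize

n1, n = 3, 5   # Bernoulli sample {1,0,0,1,1} again

def neg_loglik(t):
    p = 1.0 / (1.0 + np.exp(-t[0]))        # logit reparameterization keeps 0 < p < 1
    return -(n1 * np.log(p) + (n - n1) * np.log(1 - p))

def neg_grad(t):
    p = 1.0 / (1.0 + np.exp(-t[0]))
    return np.array([-(n1 - n * p)])       # chain rule: dl/dt = n1 - n*p

res = minimize(neg_loglik, x0=np.array([0.0]), jac=neg_grad, method="BFGS")
print(1.0 / (1.0 + np.exp(-res.x[0])))     # p_hat, approximately 0.6
```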

Consistency

Assumptions:
1. Finite-sample identification: $l(\theta)$ takes different values for different $\theta$.
2. Sampling: a law of large numbers applies to $\frac{1}{n}\sum_i l_i(\theta)$.
3. Asymptotic identification: maximizing $l(\theta)$ provides a unique way to determine the parameter in the limit as the sample size tends to infinity.

Under these conditions, the ML estimator is consistent: $\operatorname{plim}\hat\theta_{ML} = \theta_0$.

Identification

These are the crucial assumptions needed to exploit the fact that the expected log-likelihood attains its maximum at the true value $\theta_0$.

If these conditions did not hold, there would be some value $\theta_1$ such that $\theta_0$ and $\theta_1$ generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data. We then say that these parameters are observationally equivalent and that the model is not identified.

Asymptotic Normality

Assumptions:
1. Consistency.
2. $l(\theta)$ is differentiable and attains an interior maximum.
3. A CLT can be applied to the gradient.

Under these conditions, the ML estimator is asymptotically normal:

$$n^{1/2}(\hat\theta - \theta_0) \to N(0, \Sigma) \quad \text{as } n \to \infty,$$

where $\Sigma = \left(-\operatorname{plim}\frac{1}{n}\sum_i H_i\right)^{-1}$.
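A small Monte Carlo sketch (hypothetical, again using the Bernoulli MLE $\hat p = \bar y$, not from the slides) illustrates the result: the standardized estimator $\sqrt{n}(\hat p - p_0)$ has mean roughly 0 and variance roughly $p_0(1-p_0)$, the inverse information:

```python
import numpy as np

rng = np.random.default_rng(0)
p0, n, reps = 0.3, 500, 2000
p_hat = rng.binomial(n, p0, size=reps) / n     # 2000 replications of the MLE p_hat = mean(y)
z = np.sqrt(n) * (p_hat - p0)                  # standardized estimator
print(z.mean(), z.var())                       # roughly 0 and p0*(1 - p0) = 0.21
```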

Asymptotic Efficiency and Variance Estimation

If $l(\theta)$ is differentiable and attains an interior maximum, the MLE must be at least as asymptotically efficient as any other consistent estimator that is asymptotically unbiased.

Consistent estimators of the variance-covariance matrix $\Sigma$:
- Empirical Hessian: $\widehat{\operatorname{var}}_H(\hat\theta) = \left[-\frac{1}{n}\sum_i H_i(\hat\theta)\right]^{-1}$
- BHHH (outer product of the gradients): $\widehat{\operatorname{var}}_{BHHH}(\hat\theta) = \left[\frac{1}{n}\sum_i g_i(\hat\theta)\,g_i(\hat\theta)^T\right]^{-1}$
- The sandwich estimator: valid even if the model is misspecified.
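A minimal sketch (hypothetical, Bernoulli example again) of how such estimators are built from per-observation scores and Hessians; note the code computes the variance of $\hat\theta$ itself (the $\Sigma$ estimates divided by $n$), and for this model the Hessian-based, BHHH, and sandwich versions all coincide with $\hat p(1-\hat p)/n$ at the MLE:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1])
n = y.size
p_hat = y.mean()                                          # MLE

g = y / p_hat - (1 - y) / (1 - p_hat)                     # per-observation scores g_i(p_hat)
H = -y / p_hat**2 - (1 - y) / (1 - p_hat)**2              # per-observation Hessians H_i(p_hat)

var_hessian = 1.0 / (-H.sum())                            # [-sum_i H_i]^{-1}
var_bhhh = 1.0 / np.sum(g**2)                             # [sum_i g_i g_i']^{-1}
var_sandwich = var_hessian * np.sum(g**2) * var_hessian   # H^{-1} (sum_i g_i g_i') H^{-1}

print(var_hessian, var_bhhh, var_sandwich)                # all equal p_hat*(1 - p_hat)/n = 0.048
```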

Summary

ML estimates are the values that maximize the likelihood function. Under general assumptions, ML is consistent, asymptotically normal, and asymptotically efficient.