EC 6310: Advanced Econometric Theory


EC 6310: Advanced Econometric Theory
July 2008
Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model
Gary Koop, University of Strathclyde

1 Summary

Readings: Chapter 5 of the textbook.

The nonlinear regression model is of interest in its own right, but it will also allow us to introduce some widely useful Bayesian computational tools:
- Metropolis-Hastings algorithms (a way of doing posterior simulation).
- Posterior predictive p-values (a way of comparing models which does not involve marginal likelihoods).
- The Gelfand-Dey method of marginal likelihood calculation.

2 The Nonlinear Regression Model

Researchers typically work with the linear regression model:

$$y_i = \beta_1 + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i.$$

In some cases nonlinear models can be made linear by transformation. For instance,

$$y_i = \beta_1 x_{i2}^{\beta_2} \cdots x_{ik}^{\beta_k}$$

can be logged to produce a linear functional form:

$$\ln(y_i) = \beta_1^* + \beta_2 \ln(x_{i2}) + \cdots + \beta_k \ln(x_{ik}) + \varepsilon_i,$$

where $\beta_1^* = \ln(\beta_1)$.

But some functional forms are intrinsically nonlinear, e.g. the constant elasticity of substitution (CES) production function:

$$y_i = \left( \sum_{j=1}^{k} \gamma_j x_{ij}^{\gamma_{k+1}} \right)^{\frac{1}{\gamma_{k+1}}}.$$

There is no way to transform the CES function to make it linear. The nonlinear regression model is:

$$y_i = \left( \sum_{j=1}^{k} \gamma_j x_{ij}^{\gamma_{k+1}} \right)^{\frac{1}{\gamma_{k+1}}} + \varepsilon_i.$$

General form: y = f (X; ) + "; where y; X and " are de ned as in linear regression model (i.e. " is N(0 N ; h 1 I N )) f (X; ) is an N-vector of functions Properties of Normal distribution gives us likelihood function: p(yj; h) = n h exp h N 2 (2) N 2 h 2 fy f (X; )g 0 fy f (X; )g io :

Prior: any can be used, so let us just call it $p(\gamma, h)$. The posterior is proportional to the likelihood times the prior:

$$p(\gamma, h | y) \propto p(\gamma, h) \frac{h^{\frac{N}{2}}}{(2\pi)^{\frac{N}{2}}} \exp\left[ -\frac{h}{2} \{ y - f(X,\gamma) \}' \{ y - f(X,\gamma) \} \right].$$

There is no way to simplify this expression or recognize it as having a familiar form for $\gamma$ (e.g. it is not a Normal or t-distribution, etc.). How to do posterior simulation? Importance sampling is one possibility, but here we introduce another: Metropolis-Hastings.
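This posterior kernel is all a posterior simulator needs to evaluate. As a minimal sketch (not code from the textbook; the mean function f(X, gamma) and the log prior are placeholders to be supplied by the user), the log posterior kernel might be coded as:

```python
import numpy as np

def log_posterior_kernel(gamma, h, y, X, f, log_prior):
    """Log of p(gamma, h | y) up to an additive constant.

    gamma     : parameter vector of the nonlinear mean function
    h         : error precision
    y, X      : data (y is an N-vector)
    f         : callable f(X, gamma) returning the N-vector of fitted values
    log_prior : callable log_prior(gamma, h); any prior can be used
    """
    if h <= 0:
        return -np.inf                          # the precision must be positive
    resid = y - f(X, gamma)                     # y - f(X, gamma)
    N = y.shape[0]
    log_lik = 0.5 * N * np.log(h) - 0.5 * h * resid @ resid   # Normal likelihood kernel
    return log_lik + log_prior(gamma, h)
```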

3 The Metropolis-Hastings Algorithm

Notation: $\theta$ is a vector of parameters and $p(y|\theta)$, $p(\theta)$ and $p(\theta|y)$ are the likelihood, prior and posterior, respectively. The Metropolis-Hastings algorithm takes draws from a convenient candidate generating density. Let $\theta^*$ indicate a draw taken from this density, which we denote as $q(\theta^{(s-1)}; \theta)$. Notation: $\theta^*$ is a draw of a random variable whose density depends on $\theta^{(s-1)}$. Note: as with the Gibbs sampler (but unlike importance sampling), the current draw depends on the previous draw. A "chain of draws" is produced; hence the name "Markov Chain Monte Carlo (MCMC)".

Importance sampling corrects for the fact that the importance function differs from the posterior by weighting the draws differently from one another. With Metropolis-Hastings, we weight all draws equally, but not all of the candidate draws are accepted.

The Metropolis-Hastings algorithm always takes the following form:

Step 1: Choose a starting value, $\theta^{(0)}$.
Step 2: Take a candidate draw, $\theta^*$, from the candidate generating density, $q(\theta^{(s-1)}; \theta)$.
Step 3: Calculate an acceptance probability, $\alpha(\theta^{(s-1)}, \theta^*)$.
Step 4: Set $\theta^{(s)} = \theta^*$ with probability $\alpha(\theta^{(s-1)}, \theta^*)$ and set $\theta^{(s)} = \theta^{(s-1)}$ with probability $1 - \alpha(\theta^{(s-1)}, \theta^*)$.
Step 5: Repeat Steps 2, 3 and 4 $S$ times.
Step 6: Take the average of the $S$ draws $g(\theta^{(1)}), \ldots, g(\theta^{(S)})$.

These steps will yield an estimate of $E[g(\theta)|y]$ for any function of interest $g(\cdot)$.

Note: As with Gibbs sampling, the Metropolis-Hastings algorithm usually requires the choice of a starting value, $\theta^{(0)}$. To make sure that the effect of this starting value has vanished, it is usually wise to discard $S_0$ initial burn-in draws. Intuition for the acceptance probability, $\alpha(\theta^{(s-1)}, \theta^*)$, is given in the textbook (pages 93-94):

$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* | y)\, q(\theta^*; \theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)} | y)\, q(\theta^{(s-1)}; \theta = \theta^*)},\; 1 \right].$$
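As a minimal sketch of Steps 1 to 6 (assuming the user supplies the log posterior kernel, a routine that draws a candidate given the current value, and the log of the candidate density; these names are illustrative, not from the textbook):

```python
import numpy as np

def metropolis_hastings(log_post, q_draw, log_q, theta0, S, S0=0, rng=None):
    """Generic Metropolis-Hastings sampler (illustrative sketch).

    log_post : callable, log posterior kernel log p(theta | y) up to a constant
    q_draw   : callable, q_draw(theta_old) returns a candidate draw theta*
    log_q    : callable, log_q(theta_from, theta_to) = log q(theta_from; theta = theta_to)
    theta0   : starting value theta^(0)
    S        : number of retained draws; S0 initial burn-in draws are discarded
    """
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    draws = []
    for s in range(S0 + S):
        theta_star = q_draw(theta)                        # Step 2: candidate draw
        log_alpha = (log_post(theta_star) + log_q(theta_star, theta)
                     - log_post(theta) - log_q(theta, theta_star))
        alpha = min(1.0, np.exp(log_alpha))               # Step 3: acceptance probability
        if rng.uniform() < alpha:                         # Step 4: accept or reject
            theta = np.asarray(theta_star, dtype=float)
        if s >= S0:
            draws.append(theta.copy())
    return np.array(draws)
```

Averaging $g(\theta^{(s)})$ over the returned draws then gives the Step 6 estimate of $E[g(\theta)|y]$.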

3.1 The Independence Chain Metropolis-Hastings Algorithm

The Independence Chain Metropolis-Hastings algorithm uses a candidate generating density which is independent across draws. That is, $q(\theta^{(s-1)}; \theta) = q^*(\theta)$ and the candidate generating density does not depend on $\theta^{(s-1)}$. This is useful in cases where a convenient approximation to the posterior exists: the convenient approximation can be used as the candidate generating density. The acceptance probability simplifies to:

$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* | y)\, q^*(\theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)} | y)\, q^*(\theta = \theta^*)},\; 1 \right].$$

The independence chain Metropolis-Hastings algorithm is closely related to importance sampling. This can be seen by noting that, if we define weights analogous to the importance sampling weights (see Chapter 4, equation 4.38):

$$w(\theta_A) = \frac{p(\theta = \theta_A | y)}{q^*(\theta = \theta_A)},$$

the acceptance probability in (5.9) can be written as:

$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{w(\theta^*)}{w(\theta^{(s-1)})},\; 1 \right].$$

In words, the acceptance probability is simply the ratio of importance sampling weights evaluated at the candidate and old draws.

Setting $q^*(\theta) = f_N\!\left(\theta \mid \hat{\theta}_{ML}, \widehat{\mathrm{var}}(\hat{\theta}_{ML})\right)$ can work well in some cases, where $\hat{\theta}_{ML}$ denotes the maximum likelihood estimate. See textbook pages 95-97 for more detail on choosing candidate generating densities.
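For this independence chain with a Normal candidate centred at the ML estimate, a sketch can work directly with the importance-sampling-style weights above (theta_ml and V_ml denote the hypothetical ML estimate and its estimated covariance; neither name comes from the textbook):

```python
import numpy as np
from scipy.stats import multivariate_normal

def independence_chain_mh(log_post, theta_ml, V_ml, S, S0=0, rng=None):
    """Independence chain M-H with candidate q*(theta) = f_N(theta | theta_ml, V_ml).

    The acceptance probability is min[w(theta*) / w(theta_old), 1], where
    w(theta) = p(theta | y) / q*(theta), as in the slides.
    """
    rng = rng or np.random.default_rng()
    q = multivariate_normal(mean=theta_ml, cov=V_ml)
    log_w = lambda th: log_post(th) - q.logpdf(th)        # log importance-style weight
    theta = np.asarray(theta_ml, dtype=float)             # start the chain at the ML estimate
    draws = []
    for s in range(S0 + S):
        theta_star = rng.multivariate_normal(theta_ml, V_ml)   # candidate ignores theta^(s-1)
        if rng.uniform() < min(1.0, np.exp(log_w(theta_star) - log_w(theta))):
            theta = theta_star
        if s >= S0:
            draws.append(theta.copy())
    return np.array(draws)
```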

3.2 The Random Walk Chain Metropolis-Hastings Algorithm

The Random Walk Chain Metropolis-Hastings algorithm is useful when you cannot find a good approximating density for the posterior. No attempt is made to approximate the posterior; rather, the candidate generating density is chosen to wander widely, taking draws proportionately in various regions of the posterior. It generates candidate draws according to:

$$\theta^* = \theta^{(s-1)} + z,$$

where $z$ is called the increment random variable.

The acceptance probability simplifies to:

$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* | y)}{p(\theta = \theta^{(s-1)} | y)},\; 1 \right].$$

The choice of density for $z$ determines the form of the candidate generating density. A common choice is the Normal: $\theta^{(s-1)}$ is the mean and the researcher must choose the covariance matrix ($\Sigma$), so that

$$q(\theta^{(s-1)}; \theta) = f_N(\theta \mid \theta^{(s-1)}, \Sigma).$$

The researcher must select $\Sigma$. It should be selected so that the acceptance probability tends to be neither too high nor too low.

There is no general rule which gives the optimal acceptance rate. A rule of thumb is that the acceptance probability should be roughly 0.5. A common approach is to set $\Sigma = c\hat{\Sigma}$, where $c$ is a scalar and $\hat{\Sigma}$ is an estimate of the posterior covariance matrix of $\theta$. You can experiment with different values of $c$ until you find one which yields a reasonable acceptance probability. This approach requires finding $\hat{\Sigma}$, an estimate of $\mathrm{var}(\theta|y)$ (e.g. $\widehat{\mathrm{var}}(\hat{\theta}_{ML})$).
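A sketch of the random walk chain with the $\Sigma = c\hat{\Sigma}$ tuning just described (log_post, Sigma_hat and c are user inputs; the returned acceptance rate can be used to adjust c):

```python
import numpy as np

def random_walk_mh(log_post, theta0, Sigma_hat, c=1.0, S=10000, S0=1000, rng=None):
    """Random walk chain M-H: theta* = theta^(s-1) + z with z ~ N(0, c * Sigma_hat).

    Returns the retained draws and the acceptance rate; experiment with c until
    the acceptance rate is neither too high nor too low (rule of thumb: ~0.5).
    """
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    cov = c * np.asarray(Sigma_hat)
    draws, accepts = [], 0
    for s in range(S0 + S):
        theta_star = theta + rng.multivariate_normal(np.zeros(theta.size), cov)
        # the candidate density is symmetric, so alpha is just the posterior ratio
        alpha = min(1.0, np.exp(log_post(theta_star) - log_post(theta)))
        if rng.uniform() < alpha:
            theta, accepts = theta_star, accepts + 1
        if s >= S0:
            draws.append(theta.copy())
    return np.array(draws), accepts / (S0 + S)
```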

3.3 Metropolis-within-Gibbs

Remember: the Gibbs sampler involves sequentially drawing from $p(\theta_{(1)} | y, \theta_{(2)})$ and $p(\theta_{(2)} | y, \theta_{(1)})$. Using a Metropolis-Hastings algorithm for either (or both) of the posterior conditionals used in the Gibbs sampler, $p(\theta_{(1)} | y, \theta_{(2)})$ and $p(\theta_{(2)} | y, \theta_{(1)})$, is perfectly acceptable. This statement is also true if the Gibbs sampler involves more than two blocks. Such Metropolis-within-Gibbs algorithms are common, since many models have posteriors where most of the conditionals are easy to draw from, but one or two conditionals do not have a convenient form.
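As an illustrative sketch of Metropolis-within-Gibbs for the nonlinear regression model: assume (purely for concreteness; this is not necessarily the textbook's prior) the improper prior $p(\gamma, h) \propto 1/h$, under which $h | \gamma, y$ is a Gamma distribution that can be drawn from directly, while $\gamma | h, y$ has no convenient form and is handled with a random walk M-H step:

```python
import numpy as np

def metropolis_within_gibbs(f, y, X, gamma0, Sigma_hat, c=1.0, S=10000, S0=1000, rng=None):
    """Metropolis-within-Gibbs for y = f(X, gamma) + eps, eps ~ N(0, h^{-1} I_N).

    Assumes the illustrative prior p(gamma, h) proportional to 1/h, so that
    h | gamma, y ~ Gamma(N/2, rate = SSE(gamma)/2) is a standard Gibbs step.
    """
    rng = rng or np.random.default_rng()
    N = y.shape[0]
    gamma = np.asarray(gamma0, dtype=float)
    cov = c * np.asarray(Sigma_hat)
    sse = lambda g: float((y - f(X, g)) @ (y - f(X, g)))
    draws = []
    for s in range(S0 + S):
        # Gibbs step: draw the precision from its Gamma conditional
        h = rng.gamma(shape=N / 2, scale=2.0 / sse(gamma))
        # M-H step for gamma | h, y with a random walk candidate
        gamma_star = gamma + rng.multivariate_normal(np.zeros(gamma.size), cov)
        log_alpha = -0.5 * h * (sse(gamma_star) - sse(gamma))
        if rng.uniform() < min(1.0, np.exp(log_alpha)):
            gamma = gamma_star
        if s >= S0:
            draws.append(np.concatenate([gamma, [h]]))
    return np.array(draws)
```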

4 A Measure of Model Fit: The Posterior Predictive P-Value

Bayesians usually use marginal likelihoods/Bayes factors/posterior odds to compare models. But these can be sensitive to the choice of prior and often cannot be used with noninformative priors. Also, they can only be used to compare models relative to each other (e.g. "Model 1 is better than Model 2"). They cannot be used as diagnostics of absolute model performance (e.g. we cannot say "Model 1 is fitting well"). The posterior predictive p-value is okay with noninformative priors and is an absolute measure of performance.

Notation: $y$ is the data actually observed, and $y^\dagger$ is observable data which could be generated from the model under study; $g(\cdot)$ is a function of interest. Its posterior, $p(g(y^\dagger)|y)$, summarizes everything our model says about $g(y^\dagger)$ after seeing the data. It tells us the types of data sets that our model can generate. We can calculate $g(y)$. If $g(y)$ is in the extreme tails of $p(g(y^\dagger)|y)$, then $g(y)$ is not the sort of data characteristic that can plausibly be generated by the model.

Formally, tail area probabilities similar to frequentist p-value calculations can be obtained. The posterior predictive p-value is the probability of the model yielding a data set with $g(y^\dagger)$ more extreme than the observed $g(y)$. To get $p(g(y^\dagger)|y)$, use simulation methods similar to predictive simulation: draw from the posterior, then simulate $y^\dagger$ at each draw.

5 Example: Posterior Predictive P-Values in the Nonlinear Regression Model

We need to choose a function of interest, $g(\cdot)$. Example:

$$y_i^\dagger = f(X_i, \gamma) + \varepsilon_i.$$

We have assumed Normal errors. Is this a good assumption? Normal errors imply that the skewness and kurtosis measures below are zero:

$$\text{Skew} = \frac{\sqrt{N} \sum_{i=1}^{N} \varepsilon_i^3}{\left[ \sum_{i=1}^{N} \varepsilon_i^2 \right]^{\frac{3}{2}}}$$

$$\text{Kurt} = \frac{N \sum_{i=1}^{N} \varepsilon_i^4}{\left[ \sum_{i=1}^{N} \varepsilon_i^2 \right]^2} - 3.$$

We use these as our functions of interest: $g(y) = E[\text{Skew}|y]$ or $E[\text{Kurt}|y]$, and $g(y^\dagger) = E[\text{Skew}|y^\dagger]$ or $E[\text{Kurt}|y^\dagger]$.

It can be shown (by integrating out $h$) that

$$p(y^\dagger | \gamma) = f_t\!\left(y^\dagger \mid f(X, \gamma),\, s^2 I_N,\, N\right), \qquad (*)$$

where

$$s^2 = \frac{[y - f(X,\gamma)]'[y - f(X,\gamma)]}{N}.$$

A program for doing this for Skew has the following form (Kurt is similar).

Step 1: Take a draw, $\gamma^{(s)}$, using the posterior simulator.
Step 2: Generate a representative data set, $y^{\dagger(s)}$, from $p(y^\dagger | \gamma^{(s)})$ using (*).
Step 3: Set $\varepsilon_i^{(s)} = y_i - f(X_i, \gamma^{(s)})$ for $i = 1, \ldots, N$ and evaluate $\text{Skew}^{(s)}$.
Step 4: Set $\varepsilon_i^{\dagger(s)} = y_i^{\dagger(s)} - f(X_i, \gamma^{(s)})$ for $i = 1, \ldots, N$ and evaluate $\text{Skew}^{\dagger(s)}$.
Step 5: Repeat Steps 1, 2, 3 and 4 $S$ times.
Step 6: Take the average of the $S$ draws $\text{Skew}^{(1)}, \ldots, \text{Skew}^{(S)}$ to get an estimate of $E[\text{Skew}|y]$.

Step 7: Calculate the proportion of the $S$ draws $\text{Skew}^{\dagger(1)}, \ldots, \text{Skew}^{\dagger(S)}$ which are smaller than your estimate of $E[\text{Skew}|y]$ from Step 6.

If the proportion from Step 7 is less than 0.5, this is the posterior predictive p-value; otherwise it is one minus this number. If the posterior predictive p-value is less than 0.05 (or 0.01), then this is evidence against the model (i.e. the model is unlikely to have generated data sets of the sort that was observed).
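A sketch of Steps 1 to 7 for the skewness measure, assuming posterior draws of $\gamma$ are already available (e.g. from one of the Metropolis-Hastings samplers above) and that f(X, gamma) is the user's mean function; the artificial data $y^\dagger$ are drawn from the multivariate t in (*) via its Normal/chi-squared mixture representation:

```python
import numpy as np

def skew(e):
    """Skewness measure from the slides: sqrt(N) * sum(e^3) / (sum(e^2))^(3/2)."""
    return np.sqrt(e.size) * np.sum(e**3) / np.sum(e**2)**1.5

def posterior_predictive_pvalue(gamma_draws, y, X, f, rng=None):
    """Posterior predictive p-value for the skewness of the regression errors."""
    rng = rng or np.random.default_rng()
    N = y.shape[0]
    skew_obs, skew_rep = [], []
    for gamma in gamma_draws:                        # Step 1: draws from the posterior
        mean = f(X, gamma)
        s2 = np.sum((y - mean) ** 2) / N
        # Step 2: y+ ~ f_t(y+ | f(X, gamma), s^2 I_N, N), drawn as Normal / sqrt(chi2_N / N)
        w = rng.chisquare(N) / N
        y_rep = mean + rng.standard_normal(N) * np.sqrt(s2 / w)
        skew_obs.append(skew(y - mean))              # Step 3: observed-data residuals
        skew_rep.append(skew(y_rep - mean))          # Step 4: artificial-data residuals
    e_skew = np.mean(skew_obs)                       # Step 6: estimate of E[Skew | y]
    prop = np.mean(np.array(skew_rep) < e_skew)      # Step 7: proportion below the estimate
    return min(prop, 1.0 - prop)                     # the posterior predictive p-value
```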

5.1 Example

Textbook pages 107-111 contain an empirical example with the nonlinear regression model (CES production function). Skewness yields a posterior predictive p-value of 0.37; kurtosis yields a posterior predictive p-value of 0.38. This is evidence that the model is fitting these features of the data well. See figures.

6 Calculating Marginal Likelihoods: The Gelfand-Dey Method

The other main method of model comparison (posterior odds/Bayes factors) is based on marginal likelihoods. Marginal likelihoods can be hard to calculate. Sometimes an analytical formula can be worked out (e.g. the Normal linear regression model with natural conjugate prior). If one model is nested inside another, the Savage-Dickey density ratio can be used. But with the nonlinear regression model, we may wish to compare different choices for $f(\cdot)$: non-nested.

There are a few methods which use posterior simulator output to calculate marginal likelihoods in general cases; Gelfand-Dey is one such method. Idea: the inverse of the marginal likelihood for a model, $M_i$, which depends on a parameter vector, $\theta$, can be written as $E[g(\theta)|y, M_i]$ for a particular choice of $g(\cdot)$. Posterior simulators such as the Gibbs sampler or Metropolis-Hastings are designed precisely to estimate such quantities.

Theorem 5.1: The Gelfand-Dey Method of Marginal Likelihood Calculation

Let $p(\theta|M_i)$, $p(y|\theta, M_i)$ and $p(\theta|y, M_i)$ denote the prior, likelihood and posterior, respectively, for model $M_i$ defined on the region $\Theta$. If $f(\theta)$ is any p.d.f. with support contained in $\Theta$, then

$$E\left[ \frac{f(\theta)}{p(\theta|M_i)\, p(y|\theta, M_i)} \,\middle|\, y, M_i \right] = \frac{1}{p(y|M_i)}.$$

Proof: see textbook page 105.

The theorem says that for any p.d.f. $f(\theta)$, we can simply set:

$$g(\theta) = \frac{f(\theta)}{p(\theta|M_i)\, p(y|\theta, M_i)}$$

and use posterior simulator output to estimate $E[g(\theta)|y, M_i]$. Even $f(\theta) = 1$ works (in theory). But, to work well in practice, $f(\theta)$ must be chosen very carefully. Theory says the estimator converges best if $\frac{f(\theta)}{p(\theta|M_i)\, p(y|\theta, M_i)}$ is bounded. In practice, $p(\theta|M_i)\, p(y|\theta, M_i)$ can be near zero in the tails of the posterior.

One strategy: let $f(\cdot)$ be a Normal density similar to the posterior, but with the tails chopped off. Let $\hat{\theta}$ and $\hat{\Sigma}$ be estimates of $E(\theta|y, M_i)$ and $\mathrm{var}(\theta|y, M_i)$ obtained from the posterior simulator. For some probability, $p \in (0, 1)$, let $\hat{\Theta}$ denote the support of $f(\theta)$, defined by

$$\hat{\Theta} = \left\{ \theta : (\theta - \hat{\theta})' \hat{\Sigma}^{-1} (\theta - \hat{\theta}) \le \chi^2_{1-p}(k) \right\},$$

where $\chi^2_{1-p}(k)$ is the $(1-p)$th quantile of the chi-squared distribution with $k$ degrees of freedom. In words: chop off the tails with $p$ probability in them. Let $f(\theta)$ be the Normal density $f_N(\theta|\hat{\theta}, \hat{\Sigma})$ truncated to the region $\hat{\Theta}$.
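A sketch of the resulting Gelfand-Dey estimator with this truncated Normal $f(\theta)$, assuming the posterior draws and their log prior and log likelihood values have been stored (the calculation is kept on the log scale to avoid overflow):

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

def gelfand_dey_log_ml(theta_draws, log_prior, log_lik, p=0.01):
    """Gelfand-Dey estimate of the log marginal likelihood, log p(y | M_i).

    theta_draws : (S, k) array of posterior draws
    log_prior   : (S,) array of log p(theta^(s) | M_i)
    log_lik     : (S,) array of log p(y | theta^(s), M_i)
    p           : tail probability chopped off the Normal f(theta)
    """
    S, k = theta_draws.shape
    theta_hat = theta_draws.mean(axis=0)                  # estimate of E(theta | y, M_i)
    Sigma_hat = np.cov(theta_draws, rowvar=False)         # estimate of var(theta | y, M_i)
    # indicator for the truncation region Theta_hat
    dev = theta_draws - theta_hat
    quad = np.einsum('ij,jk,ik->i', dev, np.linalg.inv(Sigma_hat), dev)
    inside = quad <= chi2.ppf(1 - p, df=k)
    # log of the Normal density truncated to Theta_hat (renormalized by 1 - p)
    log_f = multivariate_normal(theta_hat, Sigma_hat).logpdf(theta_draws) - np.log(1 - p)
    # average of f(theta) / [p(theta | M_i) p(y | theta, M_i)] over the posterior draws
    log_g = np.where(inside, log_f - log_prior - log_lik, -np.inf)
    log_inv_ml = np.logaddexp.reduce(log_g) - np.log(S)   # estimates log of 1 / p(y | M_i)
    return -log_inv_ml
```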