1 EC 6310: Advanced Econometric Theory July 2008 Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model Gary Koop, University of Strathclyde
2 1 Summary Readings: Chapter 5 of textbook. The nonlinear regression model is of interest in its own right, but it will also allow us to introduce some widely useful Bayesian computational tools: Metropolis-Hastings algorithms (a way of doing posterior simulation), posterior predictive p-values (a way of comparing models which does not involve marginal likelihoods), and the Gelfand-Dey method of marginal likelihood calculation.
3 2 The Nonlinear Regression Model Researchers typically work with the linear regression model: y_i = β_1 + β_2 x_{i2} + ... + β_k x_{ik} + ε_i. In some cases nonlinear models can be made linear by transformation. For instance: y_i = β_1 x_{i2}^{β_2} ... x_{ik}^{β_k} exp(ε_i) can be logged to produce a linear functional form: ln(y_i) = α_1 + β_2 ln(x_{i2}) + ... + β_k ln(x_{ik}) + ε_i, where α_1 = ln(β_1).
4 But some functional forms are intrinsically nonlinear. E.g. the constant elasticity of substitution (CES) production function: y_i = ( Σ_{j=1}^{k} β_j x_{ij}^{β_{k+1}} )^{1/β_{k+1}}. There is no way to transform the CES to make it linear. Nonlinear regression model: y_i = ( Σ_{j=1}^{k} β_j x_{ij}^{β_{k+1}} )^{1/β_{k+1}} + ε_i.
5 General form: y = f(X, β) + ε, where y, X and ε are defined as in the linear regression model (i.e. ε is N(0_N, h^{-1} I_N)) and f(X, β) is an N-vector of functions. Properties of the Normal distribution give us the likelihood function: p(y|β, h) = [ h^{N/2} / (2π)^{N/2} ] exp[ -(h/2) {y - f(X, β)}' {y - f(X, β)} ].
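As a minimal sketch, the log of this likelihood is easy to evaluate numerically. Everything below is an illustrative assumption, not the textbook's empirical example: the CES-type `ces` function, the simulated data, and the parameter values are all made up for demonstration.

```python
import numpy as np

# Log-likelihood of the nonlinear regression model y = f(X, beta) + eps,
# eps ~ N(0, h^{-1} I_N).  f is whatever nonlinear function the model
# specifies; the CES-type f used here is only an illustration.
def log_likelihood(y, X, beta, h, f):
    resid = y - f(X, beta)
    N = len(y)
    return (0.5 * N * np.log(h) - 0.5 * N * np.log(2 * np.pi)
            - 0.5 * h * resid @ resid)

# Illustrative CES-type function with k inputs; beta[-1] plays the role
# of the common exponent beta_{k+1}.
def ces(X, beta):
    return np.sum(beta[:-1] * X ** beta[-1], axis=1) ** (1.0 / beta[-1])

# Simulated data purely for demonstration.
rng = np.random.default_rng(4)
X = rng.uniform(0.5, 2.0, size=(50, 2))
beta = np.array([1.0, 1.0, 0.5])
y = ces(X, beta) + rng.normal(scale=0.1, size=50)
print(log_likelihood(y, X, beta, 1.0 / 0.1**2, ces))
```

The same `log_likelihood` (plus a log prior) is all a Metropolis-Hastings sampler needs, since the acceptance probability only involves posterior ratios.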
6 Prior: any can be used, so let us just call it p(β, h). Posterior is proportional to likelihood times prior: p(β, h|y) ∝ p(β, h) [ h^{N/2} / (2π)^{N/2} ] exp[ -(h/2) {y - f(X, β)}' {y - f(X, β)} ]. There is no way to simplify this expression or recognize it as having a familiar form for β (e.g. it is not a Normal or t-distribution, etc.). How to do posterior simulation? Importance sampling is one possibility, but here we introduce another: Metropolis-Hastings.
7 3 The Metropolis-Hastings Algorithm Notation: θ is a vector of parameters and p(y|θ), p(θ) and p(θ|y) are the likelihood, prior and posterior, respectively. The Metropolis-Hastings algorithm takes draws from a convenient candidate generating density. Let θ* indicate a draw taken from this density, which we denote as q(θ^{(s-1)}; θ). Notation: θ* is a draw taken of the random variable whose density depends on θ^{(s-1)}. Note: like the Gibbs sampler (but unlike importance sampling), the current draw depends on the previous draw. A "chain of draws" is produced. Thus, "Markov Chain Monte Carlo (MCMC)".
8 Importance sampling corrects for the fact that the importance function differs from the posterior by weighting the draws differently from one another. With Metropolis-Hastings, we weight all draws equally, but not all the candidate draws are accepted.
9 The Metropolis-Hastings algorithm always takes the following form: Step 1: Choose a starting value, θ^{(0)}. Step 2: Take a candidate draw, θ*, from the candidate generating density, q(θ^{(s-1)}; θ). Step 3: Calculate an acceptance probability, α(θ^{(s-1)}, θ*). Step 4: Set θ^{(s)} = θ* with probability α(θ^{(s-1)}, θ*) and set θ^{(s)} = θ^{(s-1)} with probability 1 - α(θ^{(s-1)}, θ*). Step 5: Repeat Steps 2, 3 and 4 S times. Step 6: Take the average of the S draws g(θ^{(1)}), ..., g(θ^{(S)}). These steps will yield an estimate of E[g(θ)|y] for any function of interest.
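The steps above can be sketched generically. Everything in this example is an illustrative assumption: the stand-in N(0,1) "posterior" and the Normal random walk candidate are not the nonlinear regression posterior, just the simplest target on which the algorithm can be demonstrated.

```python
import numpy as np

def metropolis_hastings(log_post, propose, log_q, theta0, S, rng):
    """Generic Metropolis-Hastings sampler (Steps 1-6 above).

    log_post : log of the (possibly unnormalised) posterior p(theta|y)
    propose  : draws a candidate theta* given the current draw
    log_q    : log_q(a, b) = log q(a; theta=b), the log candidate density
    """
    theta = theta0
    draws = np.empty(S)
    for s in range(S):
        cand = propose(theta)
        # log of alpha(theta^{(s-1)}, theta*) =
        # min[ p(cand|y) q(cand; old) / ( p(old|y) q(old; cand) ), 1 ]
        log_alpha = min(0.0, log_post(cand) + log_q(cand, theta)
                             - log_post(theta) - log_q(theta, cand))
        if np.log(rng.uniform()) < log_alpha:
            theta = cand                     # accept with probability alpha
        draws[s] = theta                     # otherwise keep the old draw
    return draws

# Stand-in posterior: theta|y ~ N(0,1).  Random walk candidate, so the
# symmetric q terms cancel and log_q can return a constant.
rng = np.random.default_rng(0)
draws = metropolis_hastings(
    log_post=lambda t: -0.5 * t**2,
    propose=lambda t: t + rng.normal(scale=1.0),
    log_q=lambda a, b: 0.0,
    theta0=0.0, S=20000, rng=rng)
burned = draws[2000:]                        # discard S0 initial draws
print(burned.mean(), burned.var())           # should approximate 0 and 1
```

Note that only posterior *ratios* enter the acceptance probability, so the integrating constant of p(θ|y) never needs to be known.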
10 Note: As with Gibbs sampling, the Metropolis-Hastings algorithm usually requires the choice of a starting value, θ^{(0)}. To make sure that the effect of this starting value has vanished, it is usually wise to discard S_0 initial draws. Intuition for the acceptance probability, α(θ^{(s-1)}, θ*), is given in the textbook (pages 93-94): α(θ^{(s-1)}, θ*) = min[ p(θ = θ*|y) q(θ*; θ = θ^{(s-1)}) / ( p(θ = θ^{(s-1)}|y) q(θ^{(s-1)}; θ = θ*) ), 1 ].
11 3.1 The Independence Chain Metropolis-Hastings Algorithm The Independence Chain Metropolis-Hastings algorithm uses a candidate generating density which is independent across draws. That is, q(θ^{(s-1)}; θ) = q*(θ) and the candidate generating density does not depend on θ^{(s-1)}. Useful in cases where a convenient approximation exists to the posterior. This convenient approximation can be used as a candidate generating density. The acceptance probability simplifies to: α(θ^{(s-1)}, θ*) = min[ p(θ = θ*|y) q*(θ = θ^{(s-1)}) / ( p(θ = θ^{(s-1)}|y) q*(θ = θ*) ), 1 ].
12 The independence chain Metropolis-Hastings algorithm is closely related to importance sampling. This can be seen by noting that, if we define weights analogous to the importance sampling weights (see Chapter 4, equation 4.38): w(θ_A) = p(θ = θ_A|y) / q*(θ = θ_A), the acceptance probability in (5.9) can be written as: α(θ^{(s-1)}, θ*) = min[ w(θ*) / w(θ^{(s-1)}), 1 ]. In words, the acceptance probability is simply the ratio of importance sampling weights evaluated at the candidate and old draws.
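This weight-ratio form can be coded directly. The target and candidate below are illustrative assumptions: a N(0,1) "posterior" and a deliberately crude N(0, 2²) "approximation to the posterior" used as q*(θ).

```python
import numpy as np

# Independence-chain M-H written in terms of the importance-sampling
# weights w(theta) = p(theta|y) / q*(theta).
rng = np.random.default_rng(5)

log_p = lambda t: -0.5 * t**2            # N(0,1) target, up to a constant
log_q = lambda t: -0.5 * t**2 / 4.0      # N(0, 2^2) candidate, up to a constant
log_w = lambda t: log_p(t) - log_q(t)    # log importance weight

S = 20000
theta = 0.0
draws = np.empty(S)
for s in range(S):
    cand = rng.normal(scale=2.0)         # candidate ignores the current draw
    # alpha = min[ w(theta*) / w(theta^{(s-1)}), 1 ]
    if np.log(rng.uniform()) < log_w(cand) - log_w(theta):
        theta = cand
    draws[s] = theta
print(draws[2000:].mean(), draws[2000:].var())   # should approximate 0 and 1
```

Because q* here has fatter tails than the target, the weights w(θ) are bounded, which is exactly the situation in which the independence chain mixes well.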
13 Setting q*(θ) = f_N(θ | θ̂_ML, var(θ̂_ML)) can work well in some cases, where ML denotes maximum likelihood estimates. See textbook pages for more detail on choosing candidate generating densities.
14 3.2 The Random Walk Chain Metropolis-Hastings Algorithm The Random Walk Chain Metropolis-Hastings algorithm is useful when you cannot find a good approximating density for the posterior. No attempt is made to approximate the posterior; rather, the candidate generating density is chosen to wander widely, taking draws proportionately in various regions of the posterior. It generates candidate draws according to: θ* = θ^{(s-1)} + z, where z is called the increment random variable.
15 The acceptance probability simplifies to: α(θ^{(s-1)}, θ*) = min[ p(θ = θ*|y) / p(θ = θ^{(s-1)}|y), 1 ]. The choice of density for z determines the form of the candidate generating density. A common choice is the Normal: θ^{(s-1)} is the mean and the researcher must choose the covariance matrix (Σ): q(θ^{(s-1)}; θ) = f_N(θ | θ^{(s-1)}, Σ). Σ should be selected so that the acceptance probability tends to be neither too high nor too low.
16 There is no general rule which gives the optimal acceptance rate. A rule of thumb is that the acceptance probability should be roughly 0.5. A common approach is to set Σ = cΩ, where c is a scalar and Ω is an estimate of the posterior covariance matrix of θ. You can experiment with different values of c until you find one which yields a reasonable acceptance probability. This approach requires finding Ω, an estimate of var(θ|y) (e.g. var(θ̂_ML)).
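The tuning loop described above can be sketched as follows. The bivariate Normal target and the particular Ω matrix are illustrative assumptions; in the nonlinear regression model, `log_post` would be the log posterior of (β, h) and Ω an estimate of its covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
Omega = np.array([[1.0, 0.3], [0.3, 0.5]])   # assumed posterior cov estimate
L = np.linalg.cholesky(Omega)
prec = np.linalg.inv(Omega)
log_post = lambda t: -0.5 * t @ prec @ t     # stand-in N(0, Omega) posterior

def rw_mh(c, S=5000):
    """Run a random walk chain with Sigma = c * Omega; return acceptance rate."""
    theta = np.zeros(2)
    accept = 0
    for _ in range(S):
        cand = theta + np.sqrt(c) * (L @ rng.normal(size=2))  # theta* = theta + z
        # alpha = min[ p(cand|y) / p(theta|y), 1 ]
        if np.log(rng.uniform()) < log_post(cand) - log_post(theta):
            theta, accept = cand, accept + 1
    return accept / S

# Experiment with c until the acceptance rate is moderate:
for c in (0.1, 1.0, 10.0):
    print(f"c = {c:5.1f}  acceptance rate = {rw_mh(c):.2f}")
```

Small c gives tiny steps that are almost always accepted (but explore slowly); large c gives wild steps that are rarely accepted. Both extremes mix poorly, which is why a middling acceptance rate is the usual target.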
17 3.3 Metropolis-within-Gibbs Remember: the Gibbs sampler involved sequentially drawing from p(θ_(1)|y, θ_(2)) and p(θ_(2)|y, θ_(1)). Using a Metropolis-Hastings algorithm for either (or both) of these posterior conditionals is perfectly acceptable. This statement is also true if the Gibbs sampler involves more than two blocks. Such Metropolis-within-Gibbs algorithms are common since many models have posteriors where most of the conditionals are easy to draw from, but one or two conditionals do not have a convenient form.
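A minimal sketch of this hybrid scheme, under illustrative assumptions: the posterior is a bivariate Normal with correlation ρ (so both conditionals actually have convenient forms), but we pretend the second conditional is inconvenient and update it with a single random-walk M-H step.

```python
import numpy as np

rng = np.random.default_rng(6)
rho, S = 0.8, 20000

# Log of the conditional p(theta_2 | y, theta_1), up to a constant:
# theta_2 | theta_1 ~ N(rho * theta_1, 1 - rho^2).
log_cond2 = lambda t2, t1: -0.5 * (t2 - rho * t1) ** 2 / (1 - rho**2)

t1, t2 = 0.0, 0.0
draws = np.empty((S, 2))
for s in range(S):
    # Gibbs step for theta_1 | y, theta_2 (drawn directly)
    t1 = rng.normal(rho * t2, np.sqrt(1 - rho**2))
    # M-H step for theta_2 | y, theta_1 (random walk candidate)
    cand = t2 + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_cond2(cand, t1) - log_cond2(t2, t1):
        t2 = cand
    draws[s] = t1, t2

burned = draws[2000:]
print(burned.mean(axis=0), np.corrcoef(burned.T)[0, 1])  # corr should be near rho
```

The key point is that the M-H step only needs the conditional up to a constant, and one accept/reject step per sweep is enough to preserve the correct stationary distribution.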
18 4 A Measure of Model Fit: The Posterior Predictive P-Value Bayesians usually use marginal likelihoods/Bayes factors/posterior odds to compare models. But these can be sensitive to the choice of prior and often cannot be used with noninformative priors. Also, they can only be used to compare models relative to each other (e.g. "Model 1 is better than Model 2"). They cannot be used as diagnostics of absolute model performance (e.g. we cannot say "Model 1 is fitting well"). The posterior predictive p-value is okay with noninformative priors and is an absolute measure of performance.
19 Notation: y is the data actually observed, and y† is observable data which could be generated from the model under study. g(·) is a function of interest. Its posterior, p(g(y†)|y), summarizes everything our model says about g(y†) after seeing the data. It tells us the types of data sets that our model can generate. We can calculate g(y). If g(y) is in the extreme tails of p(g(y†)|y), then g(y) is not the sort of data characteristic that can plausibly be generated by the model.
20 Formally, tail area probabilities similar to frequentist p-value calculations can be obtained. The posterior predictive p-value is the probability of the model yielding a data set more extreme than g(y). To get p(g(y†)|y), use simulation methods similar to predictive simulation: draw from the posterior, then simulate y† at each draw.
21 5 Example: Posterior Predictive P-values in the Nonlinear Regression Model Need to choose a function of interest, g(·). Example: y_i = f(X_i, β) + ε_i. We have assumed Normal errors. Is this a good assumption? Normal errors imply the skewness and kurtosis measures below are zero: Skew = √N Σ_{i=1}^{N} ε_i^3 / [ Σ_{i=1}^{N} ε_i^2 ]^{3/2}
22 Kurt = N Σ_{i=1}^{N} ε_i^4 / [ Σ_{i=1}^{N} ε_i^2 ]^2 - 3. Use these as our functions of interest: g(y) = E[Skew|y] or E[Kurt|y] and g(y†) = E[Skew|y†] or E[Kurt|y†].
23 Can show (by integrating out h) that p(y†|β) = f_t(y† | f(X, β), s^2 I_N, N), (*) where s^2 = [y - f(X, β)]'[y - f(X, β)] / N. A program for doing this for Skew has the following form (Kurt is similar).
24 Step 1: Take a draw, β^{(s)}, using the posterior simulator. Step 2: Generate a representative data set, y†^{(s)}, from p(y†|β^{(s)}) using (*). Step 3: Set ε_i^{(s)} = y_i - f(X_i, β^{(s)}) for i = 1, ..., N and evaluate Skew^{(s)}. Step 4: Set ε_i^{†(s)} = y_i^{†(s)} - f(X_i, β^{(s)}) for i = 1, ..., N and evaluate Skew^{†(s)}. Step 5: Repeat Steps 1, 2, 3 and 4 S times. Step 6: Take the average of the S draws Skew^{(1)}, ..., Skew^{(S)} to get an estimate of E[Skew|y].
25 Step 7: Calculate the proportion of the S draws Skew^{†(1)}, ..., Skew^{†(S)} which are smaller than your estimate of E[Skew|y] from Step 6. If the proportion from Step 7 is less than 0.5, this is the posterior predictive p-value. Otherwise the p-value is one minus this proportion. If the posterior predictive p-value is less than 0.05 (or 0.01), then this is evidence against a model (i.e. this model is unlikely to have generated data sets of the sort that was observed).
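Steps 1-7 can be sketched for the skewness diagnostic. The ingredients here are illustrative assumptions: a toy linear f, artificial Normal data, and a crude stand-in for the posterior simulator (a Normal perturbation of the true coefficient) in place of genuine MCMC output. Only the draw from the t density in (*) and the p-value logic follow the steps above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, S = 100, 2000
X = rng.normal(size=N)
beta_true = 2.0
y = beta_true * X + rng.normal(size=N)       # artificial data (truly Normal)

f = lambda X, b: b * X                       # assumed regression function

def skew(e):
    # Skew = sqrt(N) * sum(e^3) / [sum(e^2)]^{3/2}
    return np.sqrt(len(e)) * np.sum(e**3) / np.sum(e**2) ** 1.5

skew_obs, skew_rep = np.empty(S), np.empty(S)
for s in range(S):
    b = beta_true + rng.normal(scale=0.1)    # stand-in posterior draw of beta
    e = y - f(X, b)                          # Step 3: observed residuals
    s2 = e @ e / N
    # Step 2: draw y-dagger from the t density in (*) by Normal/chi-square mixing
    h = rng.chisquare(N) / (N * s2)
    y_rep = f(X, b) + rng.normal(size=N) / np.sqrt(h)
    skew_obs[s] = skew(e)
    skew_rep[s] = skew(y_rep - f(X, b))      # Step 4

# Steps 6-7: compare E[Skew|y] with the replicated distribution
prop = np.mean(skew_rep < skew_obs.mean())
p_value = min(prop, 1 - prop)
print("posterior predictive p-value for skewness:", p_value)
```

Since the artificial data really are Normal, the p-value should usually be comfortably above conventional rejection thresholds.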
26 5.1 Example The textbook has an empirical example with the nonlinear regression model (CES production function). Skewness yields a posterior predictive p-value of 0.37. Kurtosis yields a posterior predictive p-value of 0.38. Evidence that this model is fitting these features of the data well. See figures.
29 6 Calculating Marginal Likelihoods: The Gelfand-Dey Method The other main method of model comparison (posterior odds/Bayes factors) is based on marginal likelihoods. Marginal likelihoods can be hard to calculate. Sometimes an analytical formula can be worked out (e.g. the Normal linear regression model with natural conjugate prior). If one model is nested inside another, the Savage-Dickey density ratio can be used. But with the nonlinear regression model, we may wish to compare different choices for f(·): non-nested.
30 There are a few methods which use posterior simulator output to calculate marginal likelihoods for general cases. Gelfand-Dey is one such method. Idea: the inverse of the marginal likelihood for a model, M_i, which depends on parameter vector θ, can be written as E[g(θ)|y, M_i] for a particular choice of g(·). Posterior simulators such as the Gibbs sampler or Metropolis-Hastings are designed precisely to estimate such quantities.
31 Theorem 5.1: The Gelfand-Dey Method of Marginal Likelihood Calculation. Let p(θ|M_i), p(y|θ, M_i) and p(θ|y, M_i) denote the prior, likelihood and posterior, respectively, for model M_i defined on the region Θ. If f(θ) is any p.d.f. with support contained in Θ, then E[ f(θ) / ( p(θ|M_i) p(y|θ, M_i) ) | y, M_i ] = 1 / p(y|M_i). Proof: see textbook page 105.
32 The theorem says that for any p.d.f. f(θ), we can simply set: g(θ) = f(θ) / ( p(θ|M_i) p(y|θ, M_i) ) and use posterior simulator output to estimate E[g(θ)|y, M_i]. Even f(θ) = 1 works (in theory). But, to work well in practice, f(θ) must be chosen very carefully. Theory says the estimator converges best if f(θ) / ( p(θ|M_i) p(y|θ, M_i) ) is bounded. In practice, p(θ|M_i) p(y|θ, M_i) can be near zero in the tails of the posterior.
33 One strategy: let f(·) be a Normal density similar to the posterior, but with the tails chopped off. Let θ̂ and Σ̂ be estimates of E(θ|y, M_i) and var(θ|y, M_i) obtained from the posterior simulator. For some probability p ∈ (0, 1), let Θ̂ denote the support of f(θ), defined by Θ̂ = { θ : (θ - θ̂)' Σ̂^{-1} (θ - θ̂) ≤ χ²_{1-p}(k) }, where χ²_{1-p}(k) is the (1-p)th quantile of the chi-squared distribution with k degrees of freedom. In words: chop off the tails with p probability in them. Let f(θ) be the f_N(θ | θ̂, Σ̂) density truncated to the region Θ̂.
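This truncated-Normal strategy can be sketched on a toy conjugate model where the marginal likelihood is known exactly, so the estimate can be checked. The model (one observation y ~ N(θ, 1) with prior θ ~ N(0, 1), so θ|y ~ N(y/2, 1/2) and p(y) = f_N(y | 0, 2)) and the direct posterior draws standing in for MCMC output are illustrative assumptions; only the Gelfand-Dey construction itself follows the slides. With one parameter the truncation region is a scalar Mahalanobis-distance cutoff, and as a simplification we use the empirical (1-p) quantile of the distances rather than the chi-square value.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
y, S, p = 1.3, 50000, 0.05

norm_pdf = lambda x, m, v: math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

theta = rng.normal(y / 2, math.sqrt(0.5), size=S)   # stand-in posterior draws

# Fit the Normal f and chop off the tails with p probability in them.
m_hat, v_hat = theta.mean(), theta.var()
d2 = (theta - m_hat) ** 2 / v_hat
inside = d2 <= np.quantile(d2, 1 - p)

prior = np.array([norm_pdf(t, 0.0, 1.0) for t in theta])
like = np.array([norm_pdf(y, t, 1.0) for t in theta])
# Truncated-Normal f: rescale by (1 - p) so that f integrates to (approx.) one.
f_val = np.array([norm_pdf(t, m_hat, v_hat) for t in theta]) / (1 - p)

# E[ f(theta) / (prior * likelihood) | y ]  ->  1 / p(y)
inv_ml = np.mean(np.where(inside, f_val / (prior * like), 0.0))
print("Gelfand-Dey estimate:", 1.0 / inv_ml)
print("true marginal likelihood:", norm_pdf(y, 0.0, 2.0))
```

Because f has thinner tails than the prior-times-likelihood here, the ratio inside the expectation is bounded on the truncated support, which is exactly the condition the slides identify for the estimator to behave well.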