Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem
Lecture 12
04/08/2008
Sven Zenker
Assignment no. 8
-Correct setup of the likelihood function:
 -One fixed set of observation data
 -The likelihood is a function of the parameter!
 -Under the assumption of independence, it becomes one product
 -Absolute values will usually decrease as the number of observations increases: likelihoods cannot be compared directly
 -Also: in production mode, ALWAYS use logs when possible, in particular in MCMC settings (see the sketch after this slide)
-Observations:
 -With enough observations & reasonable noise, the maximum of the LLH lies close to the true value
 -As the number of observations decreases and the noise increases, the LLH functions move away from being meaningfully described by a maximum likelihood estimate
 -Observing the system only partially worsens this situation considerably
 -This is a small, simple model! Real-life problems in biology tend to be a lot worse (with a few exceptions)
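A minimal MATLAB sketch of such a log-likelihood under i.i.d. Gaussian observation noise; the names simulateModel, yObs, sigma, and theta are illustrative placeholders, not part of the assignment:

% Log-likelihood of the parameter vector theta under i.i.d. Gaussian
% observation noise, computed entirely in the log domain.
% simulateModel is a hypothetical function returning the model
% predictions at the observation times.
function ll = logLikelihood(theta, yObs, sigma)
    yModel = simulateModel(theta);           % hypothetical model solver
    resid  = (yObs(:) - yModel(:)) / sigma;  % standardized residuals
    % The additive constant -n*log(sigma*sqrt(2*pi)) may be dropped for
    % MCMC, since only differences of log-likelihoods matter there.
    ll = -0.5 * sum(resid.^2) - numel(yObs) * log(sigma * sqrt(2*pi));
end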
20 observations of both states, noise \sigma = 0.5
20 observations of both states, noise \sigma = 2
5 observations of both states
20 observations of one state
5 observations of one state
Assignment no. 8
-If experiments can be repeated often enough, the favorable properties of the maximum likelihood estimator come into play
 -This is reflected in the shape of the likelihood functions
-In situations where the data is sparse and the model large, a maximum likelihood estimate may be relatively useless
-(Marginals of) posterior distributions or likelihoods may still convey useful information about
 -the ability to nail down individual parameter values (or lack thereof)
 -the relationship of parameters on submanifolds compatible with the observations
 and may therefore help to
 -guide model reduction & development
 -decide how to interpret parameter estimates and how much confidence to place in them
Review: Density estimation
-The histogram is an instance of the more general problem of density estimation: given a finite set of samples from a probability distribution, how can we approximately recover the PDF of the underlying distribution?
-The histogram is one (relatively crude) way to do this
Review: Histogram, bin width selection
-Various formulas based on asymptotic arguments exist. All of them (of course) have the bin width decrease with an increasing number of samples, and they also take into account some measure of the spread of the data...
-E.g.:
 w_{bin} = \frac{7s}{2n^{1/3}} (Scott's rule),
 where n is the number of samples and s the standard deviation of the sample, or
 w_{bin} = \frac{2\,\mathrm{IQR}}{n^{1/3}} (Freedman-Diaconis rule),
 where IQR is the interquartile range of the data
-These formulas give one bin width for the entire dataset. A variety of methods exist to further refine the bin widths by allowing them to change locally
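Both rules take only a few lines of MATLAB; a sketch, assuming x is an i.i.d. sample vector (iqr and range are from the Statistics Toolbox, so substitute quantile-based equivalents if they are unavailable):

% Bin widths for a sample vector x according to the two rules above.
n      = numel(x);
wScott = 7 * std(x) / (2 * n^(1/3));   % Scott's rule, ~3.5*s*n^(-1/3)
wFD    = 2 * iqr(x) / n^(1/3);         % Freedman-Diaconis rule
nBins  = ceil(range(x) / wFD);         % implied number of bins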
Review: Kernel density estimation (KDE)
-For an i.i.d. sample \{x_1, \ldots, x_N\} of the underlying distribution, the (fixed bandwidth) kernel density estimate for the PDF is given by
 \hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right),
 where K(x) is a kernel function satisfying \int K(x)\,dx = 1
-Usually, one will also require K(x) \geq 0 and K(x) = K(-x) for all x
-h is termed the bandwidth. Its selection is critical for the performance of the estimator and much more important than the specific shape of the kernel used.
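A direct MATLAB transcription of this formula with a Gaussian kernel; a sketch only, since the O(N*M) double loop is fine for small samples but not optimized:

% Fixed-bandwidth KDE: evaluates f_hat at the points xQuery, given the
% i.i.d. sample xSample and the bandwidth h, with a Gaussian kernel.
function f = kdeGauss(xQuery, xSample, h)
    N = numel(xSample);
    f = zeros(size(xQuery));
    for i = 1:N
        u = (xQuery - xSample(i)) / h;          % scaled distance to sample i
        f = f + exp(-0.5 * u.^2) / sqrt(2*pi);  % Gaussian kernel K(u)
    end
    f = f / (N * h);                            % the 1/(N h) normalization
end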
Review: Adaptive bandwidth selection
-The bandwidth can also be chosen for each individual sample separately, leading to much improved estimates (lower density -> wider kernels, higher density -> narrower kernels), as sketched below
-The optimal bandwidth selection problem in higher dimensions is still an active area of research
-Important idea: penalized MLE approaches (Why? The "Dirac catastrophe": an unpenalized maximum likelihood fit drives the bandwidth to zero, placing a delta spike on each sample)
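One simple way to realize the lower density -> wider kernels idea is a sample-point estimator whose per-sample bandwidth is the distance to the k-th nearest neighbor; this particular choice is an illustrative sketch, not a method prescribed in the lecture:

% Adaptive KDE: each sample x_i gets its own bandwidth h_i, set to the
% distance to its k-th nearest neighbor (sparse regions -> large h_i).
% Requires numel(xSample) > k.
function f = kdeAdaptive(xQuery, xSample, k)
    N = numel(xSample);
    f = zeros(size(xQuery));
    for i = 1:N
        d  = sort(abs(xSample - xSample(i)));  % distances to all samples
        hi = d(k + 1);                         % k-th neighbor (d(1) is self)
        u  = (xQuery - xSample(i)) / hi;
        f  = f + exp(-0.5 * u.^2) / (sqrt(2*pi) * hi);
    end
    f = f / N;
end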
KDE and MCMC: Caveat
-Optimal bandwidth selection algorithms are typically designed assuming i.i.d. samples
-This is NOT true for MCMC output
-Transition kernels that admit a density: not very problematic
-General Metropolis-Hastings (point mass at the current point!): more problematic; adaptations of standard procedures have been devised, and some specialized literature exists
Review: Metropolis-Hastings MCMC
-Suppose the transition kernel takes the form
 P(x, dy) = p(x, y)\,dy + r(x)\,\delta_x(dy),
 where p(x, x) = 0, \delta_x(dy) = 1 if x \in dy and 0 otherwise, and
 r(x) = 1 - \int_{\mathbb{R}^n} p(x, y)\,dy
 is the probability that the chain remains at x.
Review: Metropolis-Hastings
-Now, if p(x, y) satisfies the so-called "detailed balance" or "reversibility" condition
 \pi(x)\,p(x, y) = \pi(y)\,p(y, x),
 then \pi(\cdot) is the invariant density of P(x, \cdot).
Review: Metropolis-Hastings
How to achieve detailed balance
-The specific form for p(x, y) in the Metropolis-Hastings algorithm becomes
 p_{MH}(x, y) = q(x, y)\,\alpha(x, y), \quad x \neq y
-For the case where \pi(x)\,q(x, y) > \pi(y)\,q(y, x), we may wish to set \alpha(y, x) = 1, the largest possible value for a probability, and can then compute \alpha(x, y) from the detailed balance condition
 \pi(x)\,q(x, y)\,\alpha(x, y) = \pi(y)\,q(y, x)\,\alpha(y, x)
 to obtain
 \alpha(x, y) = \frac{\pi(y)\,q(y, x)}{\pi(x)\,q(x, y)},
 and similarly for the inequality in the other direction by setting \alpha(x, y) = 1.
Review: Metropolis-Hastings
How to achieve detailed balance
-So to obtain detailed balance/reversibility, we choose
 \alpha(x, y) = \begin{cases} \min\left(\frac{\pi(y)\,q(y, x)}{\pi(x)\,q(x, y)}, 1\right) & \text{if } \pi(x)\,q(x, y) > 0 \\ 1 & \text{otherwise} \end{cases}
 which, together with the probability of staying at the current position,
 r(x) = 1 - \int_{\mathbb{R}^n} q(x, y)\,\alpha(x, y)\,dy,
 yields an overall transition kernel
 P_{MH}(x, dy) = q(x, y)\,\alpha(x, y)\,dy + \left[1 - \int_{\mathbb{R}^n} q(x, y)\,\alpha(x, y)\,dy\right]\delta_x(dy),
 a special case of the version from the previous slide, for which we saw that it does have the desired invariant distribution, since we have detailed balance/reversibility by construction...
Review: Observations
-For a symmetric proposal distribution, that is q(x, y) = q(y, x),
 \alpha(x, y) = \min\left(\frac{\pi(y)}{\pi(x)}, 1\right),
 so that "uphill" moves will always be accepted (simulated annealing!)
-For q(x, y) = \pi(y), \alpha(x, y) = 1: if the proposal distribution is the true distribution we wish to sample from, we will always accept the move
-The PDF of the distribution of interest \pi need only be known up to a constant scalar factor, since it appears in both the numerator and the denominator
Review: Algorithm
Initialization
 -Specify:
  -the family of proposal distributions q(x, y)
  -the desired number of samples N
  -an initial value x_0
Main loop
 -Repeat for j = 0, ..., N-1:
  -Generate (sample) y from q(x_j, \cdot) and u from the uniform distribution on [0, 1]
  -If u <= \alpha(x_j, y), set x_{j+1} = y
  -Else, set x_{j+1} = x_j
Termination
 -Return the set of samples {x_1, ..., x_N}
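A minimal MATLAB realization of this loop for a symmetric (random-walk Gaussian) proposal, so that the q-terms cancel in \alpha; a sketch, where piFun, propStd, and the assumption piFun(x0) > 0 are illustrative choices (assignment no. 10 replaces the density with its logarithm):

% Random-walk Metropolis: target density piFun (known up to a constant),
% starting point x0, Gaussian proposal width propStd, N samples.
function chain = metropolisSample(piFun, x0, propStd, N)
    d     = numel(x0);
    chain = zeros(N, d);
    x     = x0(:)';
    px    = piFun(x);                     % assumed > 0 at the start
    for j = 1:N
        y  = x + propStd .* randn(1, d);  % symmetric proposal q(x, .)
        py = piFun(y);
        if rand <= py / px                % alpha = min(pi(y)/pi(x), 1)
            x  = y;                       % accept the move
            px = py;
        end                               % otherwise remain at x
        chain(j, :) = x;                  % store the current position
    end
end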
MCMC: Theoretical convergence
-We've seen that detailed balance/reversibility ensures that the desired distribution is a stationary distribution of the Markov chain
-This, in and of itself, does not ensure that the chain will actually converge to that distribution, that is, eventually sample from it when started at an arbitrary point
MCMC: Theoretical convergence
-Convergence will occur under mild regularity conditions, namely (very roughly):
 -Irreducibility: from any state of the Markov chain, there is a positive probability of reaching any small subset dy in the support of the distribution in finite time
 -Aperiodicity: we will not get trapped in cycles
-This can be made more detailed and precise. For practical purposes, it is usually not an issue if reasonable proposal distributions are chosen.
MCMC: Rate of convergence
-Much more important from a practical perspective: how fast does the chain converge, i.e., when can we start to believe that the sample we have obtained represents the target distribution reasonably well?
-This will critically depend on the choice of the proposal distribution
MCMC: burn-in
-To eliminate the initial influence of the choice of starting point for the Markov chain, one usually discards a number of initial samples, the so-called burn-in period
-Although some argue that this is theoretically not required, it will avoid undue influence of unlikely starting points for finite sample sizes
Thinning
-The practice of storing only every k-th sample produced by the MCMC algorithm
-Cannot improve the description of the target distribution, but may save memory without actually worsening the description, since immediately subsequent samples may carry little independent information if the autocorrelation is high (see the sketch below)
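With the samples stored as an N-by-d array, burn-in removal and thinning reduce to a single indexing operation in MATLAB; a sketch, where chain, burnIn, and k are placeholder names:

% Drop the burn-in period, then keep every k-th remaining sample.
kept = chain(burnIn+1 : k : end, :);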
MCMC: tuning
-In standard algorithms, the proposal bandwidth needs to be tuned
-For special cases, theoretical optima can be determined
-A theoretically justifiable tuning target for practical problems is the acceptance rate, that is, the ratio of the number of accepted steps to the total number of steps
-Recommendations vary somewhat; an acceptance rate between 0.2 and 0.6 covers most of them (see the sketch below)
-There is a strong relationship between proposal bandwidth, autocorrelation of the sample, and sampling efficiency, that is (roughly), how many dependent samples we expect to need to obtain one independent sample
-The relationship between proposal bandwidth and autocorrelation is NOT monotonic!
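A crude way to steer the bandwidth into this window is a sequence of short pilot runs; a sketch, where mhSampler, logPiFun, and x0 are hypothetical placeholders and mhSampler is assumed to also return the number of accepted proposals:

% Halve/double the proposal bandwidth until the pilot acceptance rate
% falls into the recommended 0.2-0.6 window.
propStd = 1;                             % initial guess for the bandwidth
for trial = 1:20
    [chain, nAcc] = mhSampler(logPiFun, x0, propStd, 1000); % hypothetical
    accRate = nAcc / 1000;
    if accRate < 0.2
        propStd = propStd / 2;           % too many rejections: smaller steps
    elseif accRate > 0.6
        propStd = propStd * 2;           % accepting too easily: larger steps
    else
        break;                           % acceptance rate is in the window
    end
end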
MCMC: importance of diagnosing convergence
Key risks when using (deceptively simple) MCMC methods are:
-Inappropriate modeling: the model may be unable to fit the data
 => Perform sanity checks in regions of high likelihood/posterior density
-Programming errors: may be impossible to detect in realistic problems
 => Always write generically applicable code and test it on problems with known answers
-Slow convergence: the simulation may remain in a region heavily influenced by the starting condition for many iterations ("mixing" problem)
 -This is a fundamental issue: a finite run will never explore the distribution in all detail
MCMC: diagnosing convergence
-No fully satisfactory answer exists (and perhaps none can)
-Key problem: we are trying to infer something from the sample itself, about which it may carry no information (sketch)
-Nevertheless, many methods, more or less heuristic in nature, have been proposed
-We will look at a few (by no means exhaustively)
Graphical diagnosis
-Plot positions as a function of time, that is, iteration number
-Attempt to determine visually when the process has settled down (reached the stationary distribution?)
-Can detect continued drift from a far-out starting value
-Allows one to get a feeling for acceptance behavior and jump sizes
-Cannot detect complete failure to mix
-Rather subjective
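In MATLAB, this amounts to plotting the stored chain against the iteration index; a sketch, assuming chain is an N-by-d sample array as above:

% Trace plot: one line per parameter. Visual "settling down" is only
% suggestive and cannot prove convergence (see the caveats above).
figure;
plot(chain);
xlabel('iteration'); ylabel('parameter value');
title('MCMC trace plot');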
Comparison of multiple chains (Gelman)
-Run several independent chains from randomly selected, overdispersed starting points
-Compare the resulting distributions
-Stop only when they are indistinguishable by some meaningful criterion
-This can be formalized, for example, by requiring for convergence that the variance between chains be no larger than the variance within individual chains (Gelman, 1995); see the sketch below
-In practice, it may be hard to ascertain the overdispersedness of the starting points
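For a single scalar parameter, the between- versus within-chain comparison can be sketched as follows, using the common Gelman-Rubin potential scale reduction factor; assumes chains is an n-by-m array holding m chains of length n with burn-in already removed:

% Gelman-Rubin diagnostic: Rhat close to 1 indicates that between-chain
% variability is no larger than within-chain variability.
[n, m] = size(chains);
mu   = mean(chains);               % per-chain means (1-by-m)
B    = n * var(mu);                % between-chain variance estimate
W    = mean(var(chains));          % average within-chain variance
Vhat = (n - 1)/n * W + B/n;        % pooled variance estimate
Rhat = sqrt(Vhat / W);             % potential scale reduction factor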
Raftery and Lewis
-Based on 2-state Markov chain theory
-Tries to bound the errors on estimated quantiles of the true distribution
-Will give a recommended number of burn-in iterations to be discarded and a recommended run length
-Available as an R package and in a FORTRAN implementation from Statlib (http://lib.stat.cmu.edu/general/gibbsit)
-Rather computationally expensive, so the FORTRAN version is recommended for large sample sizes
And many others
Bottom line:
-Graphical convergence diagnostics can serve as monitoring aids
-Formal convergence diagnostics can help to assess output
-Repeated (or parallel) runs can help to gain confidence in the accurate representation of the distribution, possibly aided by formal comparison of the statistical properties of the chains
-To my knowledge, no practical method can rigorously ensure convergence from MCMC output
Convergence diagnostics: Resources
-Cowles & Carlin, "Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review", Journal of the American Statistical Association, Vol. 91, No. 434: 883-904, provides a critical review of approx. 10 different convergence diagnostic approaches (including the famous classics) and gives recommendations for application
-The CODA package for R implements a few established diagnostics (http://cran.r-project.org/web/packages/coda)
-A fast FORTRAN implementation of the Raftery & Lewis convergence diagnostic is available at http://lib.stat.cmu.edu/general/gibbsit. You may have to adjust the preallocated array size in the source code for larger samples.
Assignment no. 10
-Modify your Metropolis-Hastings algorithm from assignment no. 9 to accept a function that provides the logarithm of the target density as input and to perform its acceptance computation using these values directly (that is, the exponentiation is never actually performed; for this to work, you will have to transform the random sample from the uniform distribution on [0, 1] before comparison); see the sketch below
-Combine the modified algorithm with a modification of the likelihood function for the van der Pol system from assignment no. 8 that computes the logarithm of the likelihood up to an additive constant (corresponding to the familiar constant factor in the non-logarithmic likelihood) as a function of \mu and the initial condition for state 1
-For the sampling to work reliably, you will have to catch integration errors using the lastwarn function. We will assume that failure to integrate corresponds to 0 likelihood (-Inf in the log domain)
-Generate an artificial observation set of 20 measurements of both states with additive noise with standard deviation 0.5 in the usual fashion and store it so that you can reuse it for each of the following steps to achieve reproducibility. Alternatively, you can reset the random seeds appropriately.
-Using runs of 1000 samples each, tune the proposal bandwidth such that you obtain an acceptance rate of approx. 0.25-0.35 from a starting point of \mu = 1.5, y0_1 = 1.5
-Plot histograms of the marginal distributions obtained by running the sampler for 10000 iterations with the above starting point, as well as when using [3, 3] as a starting point. Also compare results when discarding burn-in periods of 0, 2000, and 5000 samples. What do you observe? How do you interpret your findings?
-Note: please provide plots & documentation of acceptance rates etc. so I don't have to rerun your simulations.
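The log-domain acceptance test the first bullet asks for can be sketched as follows, for a symmetric proposal; logPiY and logPiX are placeholder names for the log-density at the proposed and current points:

% Compare in the log domain: log(u) <= log(pi(y)) - log(pi(x)) is
% equivalent to u <= pi(y)/pi(x), but the ratio is never exponentiated.
% A failed integration (logPiY = -Inf) is rejected automatically,
% since log(rand) <= -Inf is never true.
if log(rand) <= logPiY - logPiX
    x = y;  logPiX = logPiY;           % accept the proposed point
end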