Monte Carlo simulation models: Sampling from the joint distribution of State of Nature -parameters


Erik Jørgensen
Biometry Research Unit, Danish Institute of Agricultural Sciences, P.O. Box 50, DK-8830 Tjele, Denmark
E-mail: Erik.Jorgensen@agrsci.dk

Abstract

When using Monte Carlo simulation models for decision support, it is important to represent the full uncertainty that faces the decision maker. This paper focuses on approaches towards specifying the uncertainty in input parameters, also called "State-of-nature". Until recently, such specification has only been possible in practice under special conditions, e.g., independent parameters or parameters following specific multivariate distributions, such as the normal distribution. As a result of advances in Bayesian statistical methodology, it is now possible to specify much more complicated distributions, and the distributions can even be found conditional on observations made prior to the simulations. The paper presents cases to illustrate the potential.

1 Introduction

Models of animal production systems are widely used for investigating different production strategies. Many of these models use Monte Carlo simulation techniques to calculate output variables of interest. Because such models are highly complex, even the correct specification of model input parameters is difficult, and the uncertainty in the input parameters is therefore usually ignored. When the model is used for studying the behaviour of the system, this may not be important. However, when the model is used for decision support, the full uncertainty facing the decision maker needs to be considered. In the so-called Dina pig model (Jørgensen & Kristensen, 1995) the intention is to include the full uncertainty in the model, i.e., the uncertainty in model parameters is included as well as the usual uncertainty in system development with known parameters.

Taking the Dina pig model as a starting point, this paper illustrates approaches toward correct specification of input parameters.

1.1 Elements of Simulation models

Initially, we present some elements of the Monte Carlo simulation method, mainly to introduce the notation followed in the paper. Details of the methods can be found in textbooks such as Fishmann (1996). Essentially, the Monte Carlo simulation method is a method for evaluating an integral

$$\Psi = E_\pi\{U(X)\} = \int U(x)\,\pi(x)\,dx \tag{1}$$

where $E_\pi()$ is the expectation with respect to the probability density $\pi$ and $U()$ is some response function, e.g., a utility function. It involves generating random draws $X = x^{(j)}$ from the target distribution $\pi$ and then estimating $\Psi$ by

$$\hat{\Psi} = \frac{1}{k}\left\{U\left(x^{(1)}\right) + \cdots + U\left(x^{(k)}\right)\right\} \tag{2}$$

In our context, $X = \{\Theta, \Phi\}$ is a vector consisting of decision parameters, $\Theta$, and system parameters and state variables, $\Phi$. The Monte Carlo method is thus a numerical method for evaluating the integral in Eq. (1). In addition, if the random draws $x^{(j)}$ are independent, we can easily obtain an estimate of the error of the approximation using the Central Limit Theorem (see e.g., Fishmann, 1996, section 2.2).

Often, it is an advantage to reformulate the integral in Eq. (1) by splitting $\Phi$ into the so-called state of nature, $\Phi_0$, and the parameters and state variables $\Phi_s = \{\Phi_{1s}, \Phi_{2s}, \ldots, \Phi_{Ts}\}$ that are calculated by the model. (The additional index denotes model step, e.g., model time.) A subset of $\Phi_s$, $\Omega$, is called the output of the model. This splitting of the parameter vector leads to a reformulation of Eq. (1):

$$\Psi = E_{\pi_0}\left\{E_{\pi_{s \mid 0}}\{U(X)\}\right\} = \int \left\{ \int U(x)\,\frac{\pi(x)}{\pi_0(\Phi_0)}\,d\{\Theta, \Phi_s\} \right\} \pi_0(\Phi_0)\,d\Phi_0 \tag{3}$$

where $E_{\pi_{s \mid 0}}\{U(X)\}$ denotes the conditional expectation of $U(X)$ for a given state of nature $\Phi_0$. The dimension of $\Phi_0$ will in general be fixed by the model structure, while the number of elements in $\Phi_s$ will vary with different decisions and different combinations of the other elements in $\Phi$.
Disregarding the problem of dimensionality, the integration with respect to $\Phi_0$ is well behaved and lends itself to techniques other than simple Monte Carlo simulation. In contrast, the integration with respect to $\Phi_s$ is so complex that it can only feasibly be solved using the Monte Carlo method. (Note that the dimension of $\Phi_0$ in such models often exceeds one hundred, so even though the integration is well behaved, the evaluation of the integral is complicated.)
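As a minimal sketch of Eqs. (2) and (3), the following Python fragment draws a state of nature from an assumed prior, runs a toy inner simulation conditional on that draw, and reports the Monte Carlo estimate with its standard error from the Central Limit Theorem. The function names, the normal prior and the toy inner model are illustrative assumptions and are not taken from the Dina pig model.

```python
import numpy as np

def simulate_utility(phi0, rng, n_steps=50):
    """Hypothetical inner model: utility of one simulated run given state of nature phi0."""
    # The state variables Phi_s are represented by random shocks around the level phi0.
    shocks = rng.normal(0.0, 1.0, size=n_steps)
    return phi0 + shocks.mean()

def monte_carlo_estimate(k, seed=0):
    """Estimate Psi = E[U(X)] from k independent runs, as in Eq. (2)."""
    rng = np.random.default_rng(seed)
    # Outer draw from pi_0; a N(10, 2^2) prior is assumed purely for illustration.
    phi0_draws = rng.normal(10.0, 2.0, size=k)
    # Inner draw: one simulation run per state of nature.
    u = np.array([simulate_utility(p, rng) for p in phi0_draws])
    psi_hat = u.mean()
    # Valid because the k runs are independent.
    std_err = u.std(ddof=1) / np.sqrt(k)
    return psi_hat, std_err

psi_hat, std_err = monte_carlo_estimate(k=2000)
print(f"estimate = {psi_hat:.3f}, standard error = {std_err:.3f}")
```

Using one inner run per outer draw keeps the runs independent, so the simple standard error applies; more inner runs per state of nature would instead estimate the conditional expectation for that state.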

1.2 Additional information concerning State-of-Nature

Often, we want to use the model in a specific context, e.g., to predict the effect of different production strategies within a specific herd. In this case, we have additional information concerning the model parameters, i.e., registrations $y$ related to the model parameters, and we want to base our inference on the conditional distribution of the parameters given the observations, $\pi_0(\Phi_0 \mid y)$. Note that this implies that we additionally specify a model of the joint distribution of the parameters $\Phi_0$ and the observations $y$. But this is exactly the purpose of our simulation model, i.e., the output parameters $\Omega$ are usually observable and the observations $y$ are a subset of $\Omega$. Jørgensen (2000a) describes calibration of model parameters with observations of model output parameters. However, due to complexity issues this approach has limitations. In the present context, we will therefore concentrate on the situation where we are able to specify an alternative model of the relation between the observations and the model parameters,

$$\pi(\Omega \mid Y = y, \Phi_0) = \pi(\Omega \mid \Phi_0)\,\pi(\Phi_0 \mid Y = y),$$

though there is an inherent inconsistency in the approach, as $y \subseteq \Omega$. Of course, we may argue that $y$ is independent of the features we explore in the model, i.e., decisions and capacity restrictions.

The problem handled in the present paper is how to specify the joint probability distribution of the parameters, $\pi_0(\Phi_0)$, in order to make it possible to draw pseudo-random instantiations $\Phi_0^{(i)}$ from the distribution. The presentation is structured as a description of three cases.
The approach described is used in the Dina pig simulation model (Jørgensen & Kristensen, 1995), and may be combined with recent advances within Bayesian statistics.

1.3 The framework for specification

The specification of the prior distribution is similar to the specification needed within Bayesian approaches to statistical analysis and learning in expert systems (Spiegelhalter et al., 1993, 1996). One widely used program is the so-called WinBUGS program (Spiegelhalter et al., 1999). The WinBUGS program is intended for inference in graphical models using the Markov Chain Monte Carlo approach. The original intention in the Dina pig model was to use the WinBUGS language for the specification. However, in most cases the use of WinBUGS would be too inconvenient. Under assumptions of independence between parameters, the graphical model is simply a set of disconnected nodes. Therefore, the model specification in the Dina pig model follows the specification language in WinBUGS, but is integrated into the general model specification.

2 Case 1: Independent parameters

When specifying a probability distribution for the parameters in the state of nature, a simplifying assumption is that the parameters are independent. In the Dina pig model this is the standard assumption. The independence assumption implies that the joint density of the parameters is simply the product of the densities of the individual parameters, i.e.,

$$\pi(\Phi_0) = \pi(\Phi_{0,1})\,\pi(\Phi_{0,2}) \cdots \pi(\Phi_{0,n})$$

That is, for each state parameter $\Phi_{0,i}$ in the model, instead of only specifying the expected value, we have to select a probability distribution and the parameters describing this distribution. We will follow standard practice and call the latter hyperparameters. Very often it is most natural to specify the distribution of the parameter on a different scale than the actual parameter. Parameters describing proportions may be specified on a logit scale, and a log-normal distribution may be natural for some parameters. As an example, parameter values describing time until an event (i.e., positive values) may often be described as following a lognormal distribution. In that case, the normal distribution is selected as the distribution, with corresponding hyperparameters, and the transformation is the exponential function exp(). The available distributions and transformations closely follow the notation in the WinBUGS manual.

2.1 Specification of growth related parameters

One of the available growth models in the Dina pig model is an extension of a simple Gompertz growth model, as described in Jørgensen (1998). We will use this model as an example of the specification of the prior distribution. The Gompertz growth model in its standard format is

$$\frac{dW}{dt} = k\,\{K - \ln(W_t)\}\,W_t \tag{4}$$

where $W_t$ is the weight at time $t$, $k$ is the growth rate and $K$ is the logarithm of the asymptotic maximum weight. This produces a sigmoid curve that closely corresponds to the growth of the pig. Note that the description of $K$ should not be taken literally.
Extrapolation from measurements during the slaughter pig growth phase to the age when maximum weight is approached is not reliable. The basic formula in Eq. (4) is modified in the simulation model, but it may still be recognised. A growth parameter called the current herd level at time $t$, $k_{ht}$, follows a first-order stationary autoregressive process with

$$k_{h(t+\Delta t)} = \mu_{kh} + \alpha(\Delta t)\,(k_{ht} - \mu_{kh}) + \beta(\Delta t)\,\varepsilon_h \tag{5}$$

where $\mu_{kh}$ is the expected level (e.g., the population expectation) and $\varepsilon_h$ is random noise with $\varepsilon_h \sim N(0, \sigma_h^2)$. Here $\alpha(\Delta t) = \exp(-\alpha_0 \Delta t)$ and $\beta(\Delta t) = \sqrt{1 - \alpha(\Delta t)^2}$ are the autoregression parameters with the varying length of the time steps taken into account. $1 - \alpha_0$ corresponds approximately to the usual autoregression parameter with time step 1.

The individual growth parameter for each pig is $k_{pig}$. It is drawn from a normal distribution with expectation equal to the herd level at the time of the pig's introduction into the herd, i.e., $k_{pig} \sim N(k_{ht}, \sigma_k^2)$, with $t$ the time of introduction of the pig into the herd.

The specification of the herd level of growth rate, $k_h$, is based on estimates of daily gain from production databases. Typically, the herd level in daily gain varies between 700 and 1000 g, which roughly corresponds to a standard deviation of 300/4 = 75 g. As the daily gain is a function of the $k$ parameter as well as the $K$ parameter, we use a first-order Taylor approximation, i.e.,

$$dg \approx f(k_0, K_0) + f'_k(k_0, K_0)\,(k - k_0) + f'_K(k_0, K_0)\,(K - K_0)$$

as the basis for an approximate variance

$$V(dg) \approx \{f'_k(k_0, K_0)\}^2\,V(k) + \{f'_K(k_0, K_0)\}^2\,V(K)$$

Furthermore, we assume that 90% of the variation is due to variation in $k_h$. From these assumptions we find that $K \sim N(5.40, 1/\ )$ and $k_h \sim N(0.0116, 1/\ )$. (The normal distribution is parameterised with mean and precision ($= 1/\sigma^2$), following WinBUGS.) With the mean parameter values, the average daily gain is 885 g, based on the growth of a single animal from 77 to 175 days. $\alpha_0$ is selected to obtain a correlation between herd levels 3 months apart of between 0.95 and 0.99, i.e., $N(0.0003, 1/\ )$.

The variation on herd level, $\sigma_h$, is specified to reflect that the variance within the herd is assumed to be between 0.25 and 0.5 of the total variance between herds. A lognormal scale is assumed, i.e., $\log(\sigma_h^2) \sim N(\ 6.6,\ )$. Variation between pigs consists of a genetic part and a random walk part.
The specification is based on the assumption that, after 90 days of growth, the width of the confidence interval for live weight is 30 kg, corresponding to a standard deviation of 30/4 = 7.5 kg. Between 1/3 and 2/3 of the variance is permanent, corresponding to a $\sigma_k$ between [ , ]. Therefore, we select the following distribution: $\sigma_k \sim N(\ ,\ )$. The random walk part corresponds to an additional standard deviation in daily gain uniformly distributed between [0.25, 0.75].

Similar considerations are made for the specification of the other model parameters, i.e., parameters describing start weight, feed intake, feed waste, slaughter waste (killing-out percentage), and the relationship between live weight and meat percentage. The available space does not allow us to present the data. Using the Dina pig model, the kernel density plots shown in Fig. 1 are produced for each input parameter.

Figure 1: Prior distribution of variables (panels: k, Std.k, Std.dailygain). The rug indicates the values used in actual simulation runs.
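The specification steps of Case 1 can be sketched in Python as follows: each parameter is drawn independently from a normal distribution given by mean and precision, back-transformed from its own scale, and the herd level is then advanced with the autoregressive update of Eq. (5). All hyperparameter values, the parameter name `p_waste` and the choice of a log scale for `alpha0` and `sigma_h` are placeholders assumed for illustration; they are not the values used in the Dina pig model.

```python
import numpy as np

def draw_normal(mean, precision, rng):
    """Normal draw parameterised by mean and precision (= 1/variance)."""
    return rng.normal(mean, 1.0 / np.sqrt(precision))

def draw_state_of_nature(rng):
    """Draw independent parameters, each on its own scale, then back-transform."""
    # NOTE: every hyperparameter value below is an illustrative placeholder.
    return {
        "mu_kh": draw_normal(0.0116, 1.0e6, rng),                        # identity scale
        "alpha0": np.exp(draw_normal(np.log(0.0003), 25.0, rng)),        # log scale, positive
        "sigma_h": np.exp(draw_normal(np.log(5.0e-4), 4.0, rng)),        # log scale, positive
        "p_waste": 1.0 / (1.0 + np.exp(-draw_normal(-3.0, 4.0, rng))),   # logit scale, in (0, 1)
    }

def simulate_herd_level(phi, step_lengths, rng):
    """Simulate the herd level k_h over irregular time steps with the update of Eq. (5)."""
    k_h = rng.normal(phi["mu_kh"], phi["sigma_h"])  # start in the stationary distribution
    path = [k_h]
    for dt in step_lengths:
        alpha = np.exp(-phi["alpha0"] * dt)
        beta = np.sqrt(1.0 - alpha**2)
        eps = rng.normal(0.0, phi["sigma_h"])
        k_h = phi["mu_kh"] + alpha * (k_h - phi["mu_kh"]) + beta * eps
        path.append(k_h)
    return np.array(path)

rng = np.random.default_rng(1)
phi = draw_state_of_nature(rng)
path = simulate_herd_level(phi, step_lengths=[7, 7, 14, 30, 1], rng=rng)
print(phi)
print(path)
```

With $\alpha_0 = 0.0003$ and a step of 90 days, $\exp(-\alpha_0 \Delta t) \approx 0.973$, which lies inside the interval of 0.95 to 0.99 stated for herd levels 3 months apart.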

3 Case 2: Using samples from Markov Chain Monte Carlo

The second case is taken from a study concerning the precision of clinical diagnosis (Bådsgaard & Jørgensen, 2000). The results will be used in section 4 as well. The case has been selected because it illustrates a situation that is almost standard when using simulation models for decision support. A hierarchical model is used to describe a population of subjects based on empirical data. When using the simulation model, we want to refer to a subject (e.g., a herd) from this population, either with no further information on the subject or with some additional information (e.g., previous performance) on the subject. In this case, the prior distribution is estimated using the Markov Chain Monte Carlo approach via WinBUGS.

For clinical diseases, estimation of herd prevalence depends on how precise the veterinarian is. The precision is usually expressed as sensitivity, SE, the probability of correct identification of a diseased animal, and specificity, SP, the probability of correct identification of a healthy animal. SE and SP influence the observed prevalence in the herd. Consider the case where a veterinarian inspects 10 animals. We want to estimate the probability of observing $n_{obs}$ diseased animals, conditional on the true prevalence in the herd, $p_{dis}$, i.e.,

$$\Pr(n_{obs} = i \mid p_{dis} = p) = \iint \Pr(n_{obs} = i \mid p_{dis} = p, SE = u, SP = v)\,\pi(u, v)\,du\,dv \tag{6}$$

where $\pi(u, v)$ is the joint probability density of the sensitivity and specificity of the veterinarian. In this context, the simulation model is very simple, i.e., with known parameters the observed number of diseased animals simply follows the binomial distribution. However, the parameters are not known. Our knowledge concerning the specificity and sensitivity of the veterinarian may arise either from specific knowledge of the vet based on previous observations, or from our general knowledge concerning the population of veterinarians. This is illustrated in Fig. 2.
In the study (Bådsgaard & Jørgensen, 2000) the distribution (Vet$_{pop}$) of the precision parameters (Vet$_i$ = {SE, SP}) was quantified using an experimental setup where 4 veterinarians (only two in the figure) simultaneously assessed clinical symptoms of a total of 155 animals. In the present context, we want to use the information from this study for estimation of $\pi(u, v)$ in Eq. (6). Two situations will be addressed: the veterinarian is either a specific veterinarian who participated in the quantification study (Vet$_2$) or a different veterinarian selected at random from the population (Vet$_3$).

Figure 2: Schematic illustration of the clinical setup and our study (two parts: the quantification study and the "simulation" model; nodes: Vet pop, Vet 1 to Vet 3, Symp, State, Herd 0 to Herd 2, Herd pop).

The WinBUGS analysis produces a sample $\{(SE^{(1)}, SP^{(1)}), (SE^{(2)}, SP^{(2)}), \ldots, (SE^{(n)}, SP^{(n)})\}$ from the relevant distributions, as illustrated by the kernel densities in Fig. 3 ($n$ is the sample size). Note that we may be able to approximate the joint distribution using some standard probability distribution such as the multivariate normal distribution. However, it is not obvious to what extent the parameters will follow such a distribution. Furthermore, the effort would be wasted: for our purpose, we need exactly what the MCMC approach produces, a sample from the correct joint distribution. In the present case a sample of were produced.

To estimate the probability in Eq. (6) we proceed as follows for a given true prevalence $p_h$. First, the probability of observing disease symptoms is calculated,

$$p_o^{(i)} = SE^{(i)}\,p_h + (1 - p_h)\,(1 - SP^{(i)})$$

Then the number of animals with disease symptoms, $n_o^{(i)}$, is drawn from the binomial distribution, i.e., $n_o^{(i)} \sim \mathrm{Binomial}(10, p_o^{(i)})$. Finally, the distribution of $n_o^{(i)}$ over all $i$ is used for finding the desired probabilities.

In the present case the "simulation" model is so simple that calculation time and sample size are of (almost) no concern. However, with more complicated simulation models this issue becomes important. In contrast to the previous case, the samples produced by the MCMC approach are not independent. Therefore, the precision of the output of the simulation model is not simply $\hat{\sigma}/\sqrt{n}$.
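The three steps above can be sketched in Python. Since the WinBUGS output of the study is not available here, the sketch substitutes synthetic draws on the logit scale; the means, standard deviations, sample size and the function name `observed_count_distribution` are all assumptions made for illustration.

```python
import numpy as np

def expit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + np.exp(-x))

def observed_count_distribution(se, sp, p_true, n_animals=10, seed=0):
    """Estimate Pr(n_obs = i | p_dis = p_true) from posterior draws of (SE, SP)."""
    rng = np.random.default_rng(seed)
    # Step 1: probability of observing disease symptoms, one value per draw.
    p_obs = se * p_true + (1.0 - p_true) * (1.0 - sp)
    # Step 2: one binomial draw per posterior draw.
    n_obs = rng.binomial(n_animals, p_obs)
    # Step 3: relative frequencies of 0..n_animals estimate the probabilities.
    counts = np.bincount(n_obs, minlength=n_animals + 1)
    return counts / counts.sum()

# Stand-in for the MCMC sample: synthetic draws, NOT the study's results.
rng = np.random.default_rng(42)
n = 5000
se = expit(rng.normal(1.0, 0.5, size=n))
sp = expit(rng.normal(2.5, 0.5, size=n))

probs = observed_count_distribution(se, sp, p_true=0.2)
print(np.round(probs, 3))
```

Repeating the call for each herd health state of interest would fill one row of a table like Tables 1 and 2 per state.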

Figure 3: Kernel density estimates on logit scale of sensitivity (a) and specificity (b) for a random veterinarian and veterinarian no. 1 (horizontal axes: logit(SE) and logit(SP); vertical axes: density).

Table 1: Distribution of number of diseased from clinical inspection with different herd health state. Random veterinarian. (Rows: herd health; columns: number of diseased (clinical).)

Table 2: Distribution of number of diseased from clinical inspection with different herd health state. Vet. no. 1 from experiment. (Rows: herd health; columns: number of diseased (clinical).)
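The dependence between successive MCMC draws noted above can be illustrated with a synthetic autocorrelated chain. The sketch below is not WinBUGS output: it generates a first-order autoregressive chain with an assumed lag-1 correlation of 0.9, estimates an effective sample size under that first-order assumption, and shows how thinning reduces the autocorrelation. The function names and the thinning interval of 20 are hypothetical choices.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def effective_sample_size_ar1(x):
    """Effective sample size, assuming the chain behaves like an AR(1) process."""
    rho = lag1_autocorr(x)
    return len(x) * (1.0 - rho) / (1.0 + rho)

# Synthetic chain standing in for MCMC output (assumed lag-1 correlation 0.9).
rng = np.random.default_rng(7)
n, rho = 20000, 0.9
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + np.sqrt(1.0 - rho**2) * rng.normal()

thinned = chain[::20]  # keep every 20th draw

naive_se = chain.std(ddof=1) / np.sqrt(n)
adjusted_se = chain.std(ddof=1) / np.sqrt(effective_sample_size_ar1(chain))

print("lag-1 autocorrelation, full chain:", round(lag1_autocorr(chain), 3))
print("lag-1 autocorrelation, thinned:   ", round(lag1_autocorr(thinned), 3))
print("naive vs adjusted standard error: ", naive_se, adjusted_se)
```

Thinning discards information, so it mainly pays off when each draw is expensive to push through the simulation model, which is the situation described for more complicated models.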

3.1 Conclusion

The Markov Chain Monte Carlo methods seem ideally suited to the specification of prior distributions for use in simulation modelling. Even if standard statistical models such as generalized linear models may be more expedient for experimental analysis, MCMC may still be relevant because a random sample from the population is automatically produced. The only problem is that the samples are not drawn independently, but to a large extent this can be remedied by thinning the sample.

4 Case 3: Sampling from Expert system (Bayesian network)

The third case is taken from a project concerning intervention strategies for respiratory diseases, as presented in Otto (2000). The system uses the HUGIN™ program to formulate a probabilistic expert system for diagnosis and error detection concerning Mycoplasma. The present example is a slight modification of an example described in detail in Jørgensen (2000b). In addition to the diagnostic network, the final system will include a module for Monte Carlo assessment of the cost-benefit of different control strategies.

The prevalence of the disease is expected to depend on management level and two risk factors. The prevalence may be observed either by the farmer or by a veterinarian. The precision of the farmer's observation depends on his ability as a manager. Disease prevalence and quality of management influence growth rate.

Figure 4: Hugin expert system (nodes: Manage, Risk 1, Risk 2, Gain, Preval, Farm obs, VetFind).

The quantification of the dependencies in Fig. 4 is based upon Stärk et al. (1998). Two of the risk factors in her table 10 have been selected with corresponding parameter estimates. Two additions have been made: the overdispersion has been modelled by a random herd effect, and an additional management factor not included in her study has been added for illustration. The parameters from the logistic regression in Stärk et al. (1998) have thus been supplemented with the effect of management and between-herd variation.

Three factors influence herd prevalence: the management quality (Manage), manure removal in nursery (Risk1) and number of pigs in room (nursery) (Risk2). Each factor is categorized into discrete levels. The detailed model parameters are described in Jørgensen (2000b). The parameters are used to specify the necessary probability distribution tables in the Bayesian network in Fig. 4.

The Prevalence node is defined from a continuous variable, the prevalence of serologically positive animals. For the purpose of the model, the prevalence node is divided into 5 categories: no disease, from 1 to 10 percent disease, from 10 to 40 percent disease, from 40 to 60 percent disease, and above 60 percent disease. Based on the model, we can calculate the probability of being in each of these categories of disease level for each combination of the parent nodes (risk factors). To illustrate, the probability distribution is shown for a selected part of the combinations of risk factors in Table 3.

Table 3: Distribution of herd health level for average management and selected risk factors. (Columns: no. of pigs, 1st and 2nd quartile, each subdivided by manure removal: < daily, daily, > daily. Rows: herd health.)

The next step in the modelling is the specification of the problem detection by the farmer. In the present example, Farm_obs is defined with two levels, No problem observed and Problem observed. In his daily work, the farmer assesses the disease level continuously, but the measurement is not necessarily very precise. Furthermore, the observations may not lead to a problem detection, because the farmer may suppose that he is looking at a normal disease level, i.e., his threshold for problem detection is high.
A natural model of the farmer's observation is that good farmers are more precise in their observations, and that they tend to react to lower levels of disease. Based on these assumptions, the probability table of Farm_obs conditioned on management quality and health problem is specified.

The next node is the veterinarian's diagnosis, i.e., he visits the farm, samples 10 animals at random and makes a clinical inspection of each animal. The outcome of the clinical inspection is the number of diseased animals, i.e., the states of the Vet_Find1 node are {0, ..., 10}. If we know the herd prevalence and the precision of the veterinarian, we can calculate the probability distribution of the number of diseased animals based on the assumptions above. This is exactly the probability table that was specified in section 3 and Tables 1 and 2. Of course, the table needs to reflect our knowledge concerning the veterinarian.
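To make the structure of such a network concrete, the following Python sketch samples a reduced version with three nodes (management, prevalence category, veterinarian's finding), drawing each node given its parent. It does not use the Hugin software; it relies only on NumPy. Every probability, the representative prevalence per category, and the fixed sensitivity and specificity are made-up placeholders. Conditioning on evidence is done here by simple rejection, which is a stand-in technique chosen for brevity.

```python
import numpy as np
from math import comb

# All numbers below are illustrative placeholders, not the paper's quantification.
p_manage = np.array([0.3, 0.4, 0.3])                # poor, average, good management
p_preval = np.array([                               # P(prevalence category | management)
    [0.10, 0.15, 0.30, 0.25, 0.20],
    [0.25, 0.25, 0.25, 0.15, 0.10],
    [0.45, 0.25, 0.15, 0.10, 0.05],
])
preval_mid = np.array([0.0, 0.05, 0.25, 0.50, 0.80])  # assumed prevalence per category
se, sp = 0.7, 0.9                                      # assumed sensitivity and specificity

def binom_pmf(n, p):
    """Binomial probabilities for 0..n successes."""
    return np.array([comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(n + 1)])

# Probability table for the veterinarian's finding: one row per prevalence category.
p_obs = se * preval_mid + (1.0 - preval_mid) * (1.0 - sp)
p_vetfind = np.vstack([binom_pmf(10, p) for p in p_obs])

def sample_network(n_samples, rng):
    """Sample each node given its parent, from the root node downwards."""
    manage = rng.choice(3, size=n_samples, p=p_manage)
    preval = np.array([rng.choice(5, p=p_preval[m]) for m in manage])
    vetfind = np.array([rng.choice(11, p=p_vetfind[c]) for c in preval])
    return manage, preval, vetfind

rng = np.random.default_rng(3)
manage, preval, vetfind = sample_network(10000, rng)

# Evidence: the veterinarian found at least 3 diseased animals among 10.
keep = vetfind >= 3
posterior_manage = np.bincount(manage[keep], minlength=3) / keep.sum()
print("P(management | evidence):", np.round(posterior_manage, 3))
```

The retained draws are independent samples from the joint distribution given the evidence, so they could be passed directly to a simulation model as states of nature.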

In the typical use of the expert system, we need to base our inference on evidence on a minimum of two nodes. The farmer will have detected a problem in the herd, and we will have the result of the veterinarian's inspection of the sample. Conditional on this evidence, we need to advise the farmer whether he should change his production strategy and how it should be changed. Our expectation towards future production strategies will of course depend on the risk factors actually causing the problem. A high stocking rate might suggest an increase of herd size combined with sectioned production. But if there is poor management as well, the full benefit of sectioned production might not be obtained. The cost-benefit of control strategies thus depends on the combination of risk factors present. The Bayesian network contains the full joint probability of these combinations, and in the program Hugin a random sample from this joint distribution may readily be found, using existing procedures in the application programming interface (simulate). In contrast to the output from the MCMC method, the subsequent samples from Hugin are independent samples from the distribution.

5 Conclusion

In the present paper, different approaches towards specification of the prior distribution of "state-of-nature" parameters have been presented. The conclusion is that such a specification can readily be made using off-the-shelf methods, and that the possibilities for handling prior evidence concerning these parameters are good. The only word of caution is that the sample produced by MCMC does not consist of independent instantiations from the distribution, but this can easily be remedied by discarding instantiations. However, it should be noted that the techniques are restricted to relationships between model parameters and evidence where it is not important to use the full simulation model. This relates especially to capacity restrictions and interactions between animals.
The ideas in Jørgensen (2000a) may be used in such cases. Another important aspect not covered is the state of the individual animals currently present in the herd. It is not possible to estimate the current state of an individual in the herd without taking observations and decisions into account. An individual remains in the herd because it has been decided not to cull it. We need techniques to calculate the probability distribution of the current state given the evidence that the animal is alive. If the use of simulation models is restricted to steady-state results of production strategies, we can avoid this problem.

References

Bådsgaard, N.P. & E. Jørgensen (2000). A Bayesian approach to estimating the reliability of clinical observations With an application to herd prevalence estimation. Preventive Veterinary Medicine, in prep.

Fishmann, G.S. (1996). Monte Carlo. Concepts, Algorithms, and Applications. Springer-Verlag New York, Inc.

Jørgensen, E. (1998). Stochastic modelling of pig production. Working Paper: Growth Models. Dina Notat, 73 pp URL: eps.

Jørgensen, E. (2000a). Calibration of a Monte Carlo Simulation Model of Disease Spread in Slaughter Pig Units. Computers and Electronics in Agriculture, 25 pp URL:

Jørgensen, E. (2000b). Elements of Bayesian network specification in an animal health research project. Internal report, Biometry Research Unit, Danish Institute of Agricultural Sciences, pp URL: diag1504a.pdf.

Jørgensen, E. & A.R. Kristensen (1995). An object oriented simulation model of a pig herd with emphasis on information flow. In FACTs 95 March 7, 8, 9, 1995, Orlando Florida, Farm Animal Computer Technologies Conference, pp

Otto, L. (2000). Mycoplasma for pigs in a Bayesian Network: A decision support system. In Proc. "Economic modelling of Animal Health and Farm Management". November 23-24, 2000 Wageningen.

Spiegelhalter, D.J., A.P. Dawid, S.L. Lauritzen, & R.G. Cowell (1993). Bayesian Analysis in Expert Systems. Statistical Science, 8(3) pp

Spiegelhalter, D.J., A. Thomas, & N. Best (1996). Computation on Bayesian Graphical Models. Bayesian Statistics, 5 pp

Spiegelhalter, D.J., A. Thomas, N. Best, & W. Gilks (1999). WinBUGS. Version 1.2 User Manual. MRC Biostatistics Unit. URL: uk/bugs/welcome.html.

Stärk, K.D.C., D.U. Pfeiffer, & R.S. Morris (1998). Risk factors for respiratory disease in New Zealand pig herds. New Zealand Veterinary Journal, pp