Lecture 11: Further Topics in Bayesian Statistical Modeling: Graphical Modelling and Model Selection with DIC
Graphical Models

Statistical modeling of complex systems usually involves many interconnected random variables.
Question: How do we build these connections? Answer: Think locally, act globally!
Directed Acyclic Graphs (DAGs):
- Every quantity (random variable) in the model is represented by a node.
- Relationships between nodes are represented by arrows.
- The graph represents a set of conditional independence statements.
- It expresses the joint relationship between all known quantities (data) and unknown quantities (parameters, predictions, missing data, etc.) in a model through a series of simple local relationships.
- It provides the basis for computation.
Conditional independence

Two variables X and Y are statistically independent if p(x, y) = p(x) p(y). Equivalently, X and Y are statistically independent if p(y | x) = p(y).
Conditional independence: given three variables X, Y and Z, we say that X and Y are conditionally independent given Z, denoted X ⊥ Y | Z, if
p(x, y | z) = p(x | z) p(y | z).
Example: A Toy Model (Spiegelhalter, 1998)

From a DAG we can read off some conditional independence statements (the local Markov property) that use the natural ordering of the graph, e.g. B ⊥ (C, E, F) | A.
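The DAG encodes the joint distribution as a product of one term per node given its parents. For the toy model, assuming the parent sets implied by the conditional terms quoted on the Gibbs sampling slide below, the factorisation reads

p(a, b, c, d, e, f) = p(a) p(f) p(b | a) p(c | a) p(d | b, c) p(e | a, f).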
How do we read further conditional independence statements from a DAG? We define a moral graph by
- marrying the parents (joining parents that share a child), and
- dropping the arrows.
From this graph different properties can be deduced, in particular the global Markov property: any two subsets of nodes separated by a third subset are conditionally independent given that third subset. By separated we mean that there is no path between the two subsets that does not go through the third one. In particular,
p(v | rest) = p(v | neighbours of v),
where by the neighbours of v we mean its parents, children and spouses (the other parents of its children).
Moral graph

D ⊥ (A, E, F) | (B, C), i.e. p(d | rest) = p(d | B, C).
Link between Gibbs sampling and DAG

If we want to sample from p(A, B, C, D, E, F) with a Gibbs sampler, we define each full conditional distribution using the conditional independence pattern of the DAG: each node depends on the rest only through its neighbours in the moral graph. We then sample by iteratively drawing
A ~ p(a | rest), B ~ p(b | rest), C ~ p(c | rest), D ~ p(d | rest), E ~ p(e | rest), F ~ p(f | rest),
where the DAG specifies the joint distribution through the local terms p(a), p(b | A), p(c | A), p(d | B, C), p(e | A, F) and p(f), so the full conditionals simplify, e.g. p(d | rest) = p(d | B, C) and p(e | rest) = p(e | A, F).
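As a worked step, using the factorisation above, the full conditional for D is obtained by keeping only the factors of the joint that contain d:

p(d | a, b, c, e, f) ∝ p(a) p(f) p(b | a) p(c | a) p(d | b, c) p(e | a, f) ∝ p(d | b, c),

so p(d | rest) = p(d | B, C), exactly the statement read off the moral graph.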
Summary

- A DAG gives a non-algebraic description of the model.
- Using a DAG is an interpretable way of specifying joint distributions through simple local terms.
- It can be used to build hierarchical models.
- It is used to find, locally, all the full conditional distributions in a Bayesian model.
- The DAG is used to program the kernel of the Gibbs sampler.
WinBUGS and Graphical Models

The WinBUGS User Manual recommends that the first step in any analysis should be the construction of a directed graphical model. In Bayesian analysis both observable variables (data) and parameters are random variables, so a Bayesian graphical model consists of nodes representing both data and parameters. These graphical representations can add clarity to complex patterns of dependency.
WinBUGS implementation

DoodleBUGS is a tool for drawing graphical models; BUGS code for a model can be generated from the graph.
Types of nodes:
- Constants: fixed values, assigned in the data; cannot have parent nodes.
- Stochastic nodes: random variables assigned a probability distribution in the model; can be observed (data) or unobserved (parameters).
- Deterministic nodes: derived from other nodes as mathematical or logical functions of them.
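A minimal BUGS fragment illustrating the three node types (the variable names and priors here are purely illustrative, not taken from the lecture examples):

model {
  y ~ dnorm(mu, tau)          # stochastic node: observed if y appears in the data list
  mu ~ dnorm(0, 0.0001)       # stochastic node: unobserved parameter
  tau <- 1 / (sigma * sigma)  # deterministic node: a logical function of sigma
  sigma ~ dunif(0, 100)       # stochastic node
}
# any quantity supplied only in the data list (e.g. a sample size N) is a constant node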
Arrays of nodes, e.g. data values y[i], are represented compactly by a plate, indexed by i = 1, ..., N.
Types of links between nodes:
- single arrows represent stochastic dependence;
- double arrows represent logical (mathematical) dependence.
Example: regression model

A DAG representation of a linear regression model:
y_i ~ N(µ_i, τ), i = 1, ..., N,
with µ_i = θ_1 x_{1,i} + θ_2 x_{2,i} and τ = 1/σ² (τ is the precision, following the WinBUGS parameterisation).
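A BUGS sketch of this model (the vague priors shown are illustrative placeholders, since the slide does not specify them):

model {
  for (i in 1:N) {
    mu[i] <- theta1 * x1[i] + theta2 * x2[i]   # deterministic node
    y[i] ~ dnorm(mu[i], tau)                   # observed stochastic node
  }
  theta1 ~ dnorm(0, 0.0001)
  theta2 ~ dnorm(0, 0.0001)
  tau ~ dgamma(0.001, 0.001)
  sigma2 <- 1 / tau                            # deterministic node
}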
Multiple indexing

Very useful to represent complex model structures:
- Each level of indexing of a variable requires its own plate in the graphical model. An array variable like y_{ij} therefore requires two plates, one for each index, and the y_{ij} node sits in the intersection of the two plates. See the example Dyes from WinBUGS Examples Vol. I (complete nesting).
- Any variable indexed by only j, for example, would be in the j plate but not in the i plate. See the example Rats from WinBUGS Examples Vol. I (repeated measures): x_j (time) is the same for each i (rat), and so is in the j plate only.
[Figure: DAG for the Dyes example from WinBUGS Examples Vol. I - complete nesting.]
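A BUGS sketch of a completely nested two-plate structure of this kind (a generic random-effects layout in the spirit of Dyes, not the exact example code; names and priors are illustrative):

model {
  for (i in 1:batches) {
    for (j in 1:samples) {
      y[i, j] ~ dnorm(mu[i], tau.within)   # y[i,j] lies in both the i and j plates
    }
    mu[i] ~ dnorm(theta, tau.between)      # mu[i] lies in the i plate only
  }
  theta ~ dnorm(0, 1.0E-6)
  tau.within ~ dgamma(0.001, 0.001)
  tau.between ~ dgamma(0.001, 0.001)
}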
[Figure: DAG for the Rats example from WinBUGS Examples Vol. I - repeated measures.]
More about model building

Model criticism and sensitivity analysis. Standard checks based on the fitted model also apply to Bayesian modeling:
- residuals: plot against covariates, check for auto-correlation, and so on;
- prediction: check accuracy on an external validation set, or by cross-validation.
In addition we should check for conflict between the prior and the data, and for unintended sensitivity to the prior. Using MCMC, we can replicate parameters and data, as in the sketch below.
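One way to replicate data within a BUGS model is to add posterior predictive replicates alongside the observations; a hedged sketch, reusing the (illustrative) regression model above:

# add inside the for (i in 1:N) loop of the regression model
y.rep[i] ~ dnorm(mu[i], tau)          # replicate drawn from the fitted sampling distribution
p.small[i] <- step(y[i] - y.rep[i])   # simple tail-area indicator comparing data and replicate

Monitoring y.rep and comparing it with the observed y gives an informal posterior predictive check.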
Bayesian Model Selection

Classical model selection criteria like C_p, AIC and BIC assume that the number of parameters in the model is a well-defined concept, taken to be equivalent to the degrees of freedom or the number of free parameters. In Bayesian analysis the prior effectively acts to restrict the freedom of these parameters to some extent, and thus the appropriate model degrees of freedom is less clear. Another issue in complex models (e.g. hierarchical models) is that the likelihood is not a well-defined concept. Moreover, the models to be compared are often not nested.
Using DIC for model selection

Spiegelhalter et al. (2002) proposed a Bayesian model comparison criterion based on trading off goodness of fit against model complexity: the Deviance Information Criterion,
DIC = goodness of fit + complexity.
They measure goodness of fit via the deviance
D(θ) = -2 log L(data | θ)
and the complexity of the model via
p_D = E_{θ|y}[D(θ)] - D(E_{θ|y}[θ]) = D̄ - D(θ̄),
i.e. the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters. The DIC is then defined, in analogy with AIC, as
DIC = D(θ̄) + 2 p_D = D̄ + p_D.
Models with smaller DIC are better supported by the data. DIC can be monitored in WinBUGS from the Inference/DIC menu.
Example (Gelman et al., p. 182). Suppose that the data model is y | µ ~ N(µ, 1) with prior µ ~ Unif(0, 1000). Now suppose that we observe y_1 = 0.5 and y_2 = 100. What is the effective number of parameters p_D in each case?

model {
  y1 ~ dnorm(mu1, 1)
  y2 ~ dnorm(mu2, 1)
  mu1 ~ dunif(0, 1000)
  mu2 ~ dunif(0, 1000)
}
# data
list(y1 = 0.5, y2 = 100)
Then we have (WinBUGS DIC output):

     Dbar    Dhat    pD      DIC
y1   2.585   2.094   0.490   3.075
y2   2.858   1.838   1.020   3.877

If we observe y_1 = 0.5, the effective number of parameters p_D is approximately 0.5, since roughly half the information in the posterior distribution comes from the data and half from the prior constraint of positivity. If we observe y_2 = 100, the constraint is essentially irrelevant and the effective number of parameters is approximately 1.
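As a quick arithmetic check of the definitions above, using the y1 row of the table:
p_D = D̄ - D(θ̄) = 2.585 - 2.094 ≈ 0.49, and DIC = D̄ + p_D = 2.585 + 0.490 = 3.075,
matching the reported values up to rounding.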
Some comments

- p_D is not invariant to reparametrization, i.e. to which estimate of θ is used in D(θ̄).
- p_D can be negative if there is a strong prior-data conflict.
- DIC and p_D are particularly useful in hierarchical models.
- p_D depends on the model and on the data; this is fundamentally different from AIC or BIC.