Latent Variable Models for Binary Data

Suppose that for a given vector of explanatory variables x, the latent variable U has a continuous cumulative distribution function F(u; x), and that the binary response Y = 1 is recorded if and only if U > 0:

θ = Pr(Y = 1 | x) = 1 − F(0; x).

Since U is not directly observed, there is no loss of generality in taking the critical value (i.e., cutoff point) to be 0. In addition, we can take the standard deviation of U (or some other measure of dispersion) to be 1, again without loss of generality.
Probit Models

For example, if U ~ N(x_i'β, 1), it follows that

θ_i = Pr(Y = 1 | x_i) = Φ(x_i'β),

where Φ(·) is the standard normal cumulative distribution function

Φ(t) = (2π)^{−1/2} ∫_{−∞}^{t} exp(−z²/2) dz.

The relation is linearized by the inverse normal transformation

Φ^{−1}(θ_i) = x_i'β = Σ_{j=1}^{p} x_{ij} β_j.
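As a quick numerical check, the probit relation and its linearizing inverse can be sketched in Python; the design vector and coefficients below are hypothetical, and SciPy is assumed available:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical design vector (with intercept) and coefficients.
x = np.array([1.0, 0.5, -0.2])
beta = np.array([-1.0, 0.3, 0.8])

eta = x @ beta           # linear predictor x'beta
theta = norm.cdf(eta)    # Pr(Y = 1 | x) = Phi(x'beta)

# The inverse normal (probit) transform recovers the linear predictor.
assert np.isclose(norm.ppf(theta), eta)
```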
We have regarded the cutoff value of U as fixed and the mean of U as changing with x. Alternatively, one could assume that the distribution of U is fixed and allow the critical value to vary with x (e.g., dose). In toxicology studies where dose is the explanatory variable, it makes sense to let V denote the minimum level of dose needed to produce a response (i.e., the tolerance).
Under the second formulation, y_i = 1 if x_i'β > v_i. It follows that Pr(Y = 1 | x_i) = Pr(V ≤ x_i'β). Note that the shape of the dose-response curve is determined by the distribution function of V. If V ~ N(0, 1), then Pr(Y = 1 | x_i) = Φ(x_i'β), and it follows that the U and V formulations are equivalent. The U formulation is more common.
Latent Utilities & Choice Models

Suppose that Fred is choosing between two brands of a product (say, Ben & Jerry's or Häagen-Dazs). Fred has a utility for Ben & Jerry's (denoted by Z_{i1}) and a utility for Häagen-Dazs (denoted by Z_{i2}). Letting the difference in utilities be represented by the normal linear model, we have

U_i = Z_{i1} − Z_{i2} = x_i'β + ε_i, where ε_i ~ N(0, 1).

If Fred has a higher utility for Ben & Jerry's, then Z_{i1} > Z_{i2}, U_i > 0, and Fred will choose Ben & Jerry's (Y_i = 1).
This latent utility formulation is again equivalent to a probit model for the binary response. The generalization to a multinomial response is straightforward: introduce k latent utilities instead of 2, and let the individual's response (i.e., choice) correspond to the category with maximum utility. This is referred to as a discrete choice model.
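A small simulation illustrates the equivalence: generating utility differences U_i = x_i'β + ε_i and recording a choice whenever U_i > 0 reproduces the probit probability Φ(x_i'β). The linear-predictor value below is hypothetical:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000
eta = 0.4  # hypothetical linear predictor x_i'beta

# Utility differences U_i = x_i'beta + eps_i with eps_i ~ N(0, 1);
# the choice Y_i = 1 is made whenever U_i > 0.
u = eta + rng.standard_normal(n)
empirical = np.mean(u > 0)

# The Monte Carlo choice frequency matches the probit probability Phi(eta).
assert abs(empirical - norm.cdf(eta)) < 0.01
```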
Comments:

- Certain types of models are preferred in certain applications; the probit is common in bioassay and the social sciences.
- Under a Bayesian approach, the ability to calculate posterior densities for any functional makes parameter interpretation less important.
- Ideally, the model will be motivated by prior information and fit to the current data.
- Practically, computational convenience also plays a role, particularly in complex settings.
Logistic Regression

The normal form is only one possibility for the distribution of U. Another is the logistic distribution with location x_i'β and unit scale, which has cumulative distribution function

F(u) = exp(u − x_i'β) / {1 + exp(u − x_i'β)},

so that

F(0; x_i) = 1/{1 + exp(x_i'β)}.
It follows that

Pr(Y = 1 | x_i) = Pr(U > 0 | x_i) = 1 − F(0; x_i) = 1/{1 + exp(−x_i'β)}.

To linearize this relation, we take the logit transformation of both sides:

log{θ_i/(1 − θ_i)} = x_i'β.
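The logit and its inverse (the expit, or logistic function) can be sketched directly from these formulas; the linear-predictor value below is arbitrary:

```python
import numpy as np

def logit(p):
    """Logit transform log{p / (1 - p)}."""
    return np.log(p / (1 - p))

def expit(eta):
    """Inverse logit: Pr(Y = 1) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

eta = 1.3              # hypothetical x_i'beta
theta = expit(eta)     # Pr(Y = 1 | x_i) under the logistic model

# The logit transform linearizes the relation.
assert np.isclose(logit(theta), eta)
```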
Some Generalizations of the Logistic Model

Logistic regression assumes a restricted dose-response shape; it is possible to relax this restriction. Aranda-Ordaz (1981) proposed two families of linearizing transformations, which can easily be inverted and which span a range of forms.
The first, which is restricted to symmetric cases (i.e., invariant to interchanging success and failure), is

(2/ν) · {θ^ν − (1 − θ)^ν} / {θ^ν + (1 − θ)^ν}.

In the limit as ν → 0 this is the logistic (logit) transformation, and for ν = 1 it is linear in θ. The second family is

log[{(1 − θ)^{−ν} − 1}/ν],

which reduces to the extreme value model as ν → 0 and to the logistic when ν = 1.
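These limiting cases can be verified numerically. The sketch below codes both families directly from the formulas above, using a ν near 0 in place of the exact limit:

```python
import numpy as np

def ao_symmetric(theta, nu):
    """Aranda-Ordaz symmetric family:
    (2/nu) * (theta^nu - (1-theta)^nu) / (theta^nu + (1-theta)^nu)."""
    t, c = theta**nu, (1 - theta)**nu
    return (2.0 / nu) * (t - c) / (t + c)

def ao_asymmetric(theta, nu):
    """Aranda-Ordaz asymmetric family: log[((1-theta)^(-nu) - 1) / nu]."""
    return np.log(((1 - theta)**(-nu) - 1) / nu)

theta = 0.7
# As nu -> 0 the symmetric family approaches the logit ...
assert np.isclose(ao_symmetric(theta, 1e-8), np.log(theta / (1 - theta)), atol=1e-5)
# ... and nu = 1 gives a transform linear in theta: 2(2*theta - 1).
assert np.isclose(ao_symmetric(theta, 1.0), 2 * (2 * theta - 1))
# The asymmetric family at nu = 1 reduces to the logit ...
assert np.isclose(ao_asymmetric(theta, 1.0), np.log(theta / (1 - theta)))
# ... and as nu -> 0 approaches the complementary log-log (extreme value) form.
assert np.isclose(ao_asymmetric(theta, 1e-8), np.log(-np.log(1 - theta)), atol=1e-5)
```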
When there is doubt about the transformation, a formal approach is to use one or the other of the above families and fit the resulting model for a range of possible values of ν. A profile likelihood can be obtained for ν by plotting the maximized likelihood against ν (a frequentist approach).
Potentially, one could choose a standard form, such as the logistic, if the corresponding value of ν falls within the 95% profile likelihood confidence region. Alternatively, we could choose a prior density for ν and implement a Bayesian approach.
DDE & Preterm Birth Example

Scientific interest: the association between DDE exposure and preterm birth, adjusting for possible confounding variables. Data are from the US Collaborative Perinatal Project (CPP): n = 2380 children, of which 361 were born preterm. Analysis: Bayesian analysis using a probit model.
Model: y_i = 1 if preterm birth and y_i = 0 if full-term birth, with

Pr(y_i = 1 | x_i, β) = Φ(x_i'β),

where x_i = (1, dde_i, x_{i3}, ..., x_{i7})', the covariates x_{i3}, ..., x_{i7} represent possible confounders, β_1 is the intercept, and β_2 is the dde slope.
Prior, Likelihood & Posterior

Prior: π(β) = N(0, Σ_β).
Likelihood: π(y | β, X) = ∏_{i=1}^{n} Φ(x_i'β)^{y_i} {1 − Φ(x_i'β)}^{1−y_i}.
Posterior: π(β | y, X) ∝ π(β) π(y | β, X).

No closed form is available for the normalizing constant.
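Although the normalizing constant is unavailable, the unnormalized log posterior is straightforward to evaluate, which is all that MCMC requires. A minimal sketch, assuming a mean-zero normal prior and toy data invented for illustration:

```python
import numpy as np
from scipy.stats import norm

def log_posterior(beta, y, X, Sigma_beta):
    """Unnormalized log posterior for the probit model, prior beta ~ N(0, Sigma_beta)."""
    eta = X @ beta
    # Bernoulli log likelihood with success probability Phi(x_i'beta);
    # norm.logcdf(-eta) evaluates log{1 - Phi(eta)} stably.
    loglik = np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))
    logprior = -0.5 * beta @ np.linalg.solve(Sigma_beta, beta)
    return loglik + logprior

# Tiny illustration: at beta = 0 every Phi term is 0.5 and the prior term vanishes.
y = np.array([0.0, 1.0])
X = np.array([[1.0, 0.0], [1.0, 1.0]])
lp = log_posterior(np.zeros(2), y, X, 4.0 * np.eye(2))
```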
Maximum Likelihood Results

Parameter   MLE        SE        Z stat    p-value
β_1         -1.08068   0.04355   -24.816   < 2e-16
β_2          0.17536   0.02909     6.028   1.67e-09
β_3         -0.12817   0.03528    -3.633   0.000280
β_4          0.11097   0.03366     3.297   0.000978
β_5         -0.01705   0.03405    -0.501   0.616659
β_6         -0.08216   0.03576    -2.298   0.021571
β_7          0.05462   0.06473     0.844   0.398721

β_2 = dde slope (highly significant increasing trend).
Augmented Data Posterior

Augment the observed data {y_i, x_i} with latent z_i. The probit model can then be expressed in hierarchical form as:

y_i = 1(z_i > 0),
z_i ~ N(x_i'β, 1).
The augmented data posterior is

π(z, β | y, X) ∝ { ∏_{i=1}^{n} π(y_i | z_i, x_i, β) } { ∏_{i=1}^{n} π(z_i | x_i, β) } π(β),

where the conditional distribution of y_i given z_i, x_i, β is

π(y_i | z_i, x_i, β) = 1(z_i > 0) y_i + 1(z_i ≤ 0)(1 − y_i).

For Gibbs sampling, we need the full conditional posterior distribution of each unknown given all other unknowns in the model.
The conditional posterior of z_i given y_i = 1, x_i, β is

π(z_i | y_i = 1, x_i, β) ∝ 1(z_i > 0) (2π)^{−1/2} exp{−(z_i − x_i'β)²/2}
= 1(z_i > 0) N(z_i; x_i'β, 1) / {1 − F(0; x_i'β, 1)},

i.e., N(z_i; x_i'β, 1) truncated below by 0. Similarly, the conditional posterior of z_i given y_i = 0, x_i, β is

π(z_i | y_i = 0, x_i, β) = N(z_i; x_i'β, 1) truncated above by 0.

The conditional posterior of β given z, y, X is

π(β | z, y, X) = N(β̂, V_β), where V_β = (Σ_β^{−1} + X'X)^{−1} and β̂ = V_β (Σ_β^{−1} β_0 + X'z).
To implement Gibbs sampling, we simply iteratively sample from these full conditional posterior distributions a large number of times. An initial burn-in is discarded to allow convergence to the stationary distribution. Inferences can then be based on posterior summaries calculated from the draws from the joint posterior distribution.
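The steps above can be sketched as a short data-augmentation Gibbs sampler. This is an illustrative implementation, not the code used for the analysis; it assumes a mean-zero prior (so the β_0 term drops out of β̂) and uses SciPy's truncnorm for the truncated normal draws:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, Sigma_beta, n_iter=1000, seed=0):
    """Illustrative data-augmentation Gibbs sampler for the Bayesian probit
    model with prior beta ~ N(0, Sigma_beta)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # V_beta = (Sigma_beta^{-1} + X'X)^{-1} is fixed across iterations.
    V = np.linalg.inv(np.linalg.inv(Sigma_beta) + X.T @ X)
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # z_i | y_i = 1 is N(x_i'beta, 1) truncated below 0;
        # z_i | y_i = 0 is N(x_i'beta, 1) truncated above 0.
        a = np.where(y == 1, -mu, -np.inf)  # standardized lower bounds
        b = np.where(y == 1, np.inf, -mu)   # standardized upper bounds
        z = truncnorm.rvs(a, b, loc=mu, scale=1.0, size=n, random_state=rng)
        # beta | z ~ N(V X'z, V) under the mean-zero prior.
        beta = V @ (X.T @ z) + L @ rng.standard_normal(p)
        draws[t] = beta
    return draws

# Quick check on simulated data with true beta = (0, 1).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(400), np.linspace(-2, 2, 400)])
y = (X @ np.array([0.0, 1.0]) + rng.standard_normal(400) > 0).astype(float)
draws = probit_gibbs(y, X, 4.0 * np.eye(2), n_iter=600, seed=2)
post_mean = draws[100:].mean(axis=0)  # discard the first 100 draws as burn-in
```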
DDE Application

We chose a ridge-regression-type conjugate prior: π(β) = N(β; β_0, Σ_β), with β_0 = 0 and Σ_β = 4 I_{7×7}. An alternative default-type conjugate prior is obtained by letting Σ_β = g (X'X)^{−1}, where g is a small positive constant. This prior, which is related to Zellner's g-prior, has some nice properties.
Another possibility would be the uniform improper prior: π(β) ∝ 1. Ideally, one would choose an informative prior to formally incorporate outside information into the analysis. For this example the sample size is large, so there tend to be minimal differences between the posteriors obtained using various diffuse to moderately informative priors with support on R.
Gibbs Sampling Results

We chose β = 0 as the starting value. The MLE may provide a better choice, but results should not depend on the starting values. The algorithm was run for 1,000 iterations, with the first 100 iterations discarded as burn-in. For standard GLMs convergence tends to occur rapidly, and 100 iterations appeared to be sufficient in this case based on diagnostic plots. For more complex models, many more iterations may be needed.
[Figure: trace plots of the Gibbs draws over 1,000 iterations for the intercept (beta_1) and the slope (beta_2).]
Inferences on a Regression Coefficient

Bayesian inferences are often based on examination of marginal posterior densities. These densities can be estimated using kernel smoothing applied to the draws collected after burn-in. One can also summarize the posterior using the mean, median, and a 95% credible interval.
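For instance, a kernel estimate of a marginal density and the usual summaries can be computed from the retained draws. The draws below are simulated stand-ins for illustration, not the actual Gibbs output:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated stand-in for post-burn-in draws of a single coefficient.
rng = np.random.default_rng(0)
coef_draws = rng.normal(0.17, 0.03, size=900)

kde = gaussian_kde(coef_draws)           # kernel-smoothed marginal posterior
grid = np.linspace(0.05, 0.30, 200)
density = kde(grid)                      # density estimate over a grid

post_mean = coef_draws.mean()
post_median = np.median(coef_draws)
ci_95 = np.quantile(coef_draws, [0.025, 0.975])  # 95% credible interval
```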
Posterior Summaries

Parameter   Mean    Median   SD     95% credible interval
β_1         -1.08   -1.08    0.04   (-1.16, -1.01)
β_2          0.17    0.17    0.03   (0.12, 0.23)
β_3         -0.13   -0.13    0.04   (-0.20, -0.05)
β_4          0.11    0.11    0.03   (0.05, 0.18)
β_5         -0.02   -0.02    0.03   (-0.08, 0.05)
β_6         -0.08   -0.08    0.04   (-0.15, -0.02)
β_7          0.05    0.06    0.06   (-0.07, 0.18)
[Figure: marginal posterior density of the DDE slope.]
Inferences on Functionals

Often it is not the regression parameter itself that is of primary interest. One may instead want to estimate functionals of the parameters, such as the mean response at different values of a predictor. By applying the functional to the draws from every iteration of the MCMC algorithm after burn-in, one obtains samples from the marginal posterior density of the unknown of interest.
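A sketch of this idea for the dose-response functional Pr(preterm | dde) = Φ(β_1 + β_2 · dde). The posterior draws below are simulated stand-ins, and the confounders are suppressed for illustration:

```python
import numpy as np
from scipy.stats import norm

# Simulated stand-in posterior draws of (beta_1, beta_2).
rng = np.random.default_rng(0)
draws = rng.multivariate_normal([-1.08, 0.17], 0.0009 * np.eye(2), size=2000)

# Evaluate the functional Phi(beta_1 + beta_2 * dde) at every retained draw
# to get its posterior, pointwise over a grid of dde values.
dde_grid = np.linspace(-2.0, 2.0, 41)
curves = norm.cdf(draws[:, [0]] + draws[:, [1]] * dde_grid)   # 2000 x 41
mean_curve = curves.mean(axis=0)                              # posterior mean curve
lower, upper = np.quantile(curves, [0.025, 0.975], axis=0)    # pointwise 95% band
```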
[Figure: estimated Pr(preterm birth) as a function of serum DDE (mg/L).]