
A tutorial on Bayesian model selection and on the BMSL Laplace approximation

Jean-Luc Schwartz (schwartz@icp.inpg.fr)
Institut de la Communication Parlée, CNRS UMR 5009, INPG-Université Stendhal, INPG, 46 Av. Félix Viallet, 38031 Grenoble Cedex 1, France

A. The Bayesian framework for model assessment

In most papers comparing models in the field of speech perception, the tool used to compare models is the fit, estimated by the root mean square error (RMSE). It is computed by taking the squared distances between observed and predicted probabilities of responses, averaging them over all categories C_i (total number n_C) and all experimental conditions E_j (total number n_E), and taking the square root of the result:

RMSE = [ Σ_{E_j, C_i} (P_{E_j}(C_i) - p_{E_j}(C_i))^2 / (n_E n_C) ]^{1/2}   (1)

The fit may be derived from the logarithm of the maximum likelihood of a model, considering a data set. If D is a set of k data d_i, and M a model with parameters Θ, the estimation of the best set of parameter values θ (see Footnote 1) is provided by:

θ = argmax_Θ p(Θ | D, M)   (2)

or, through the Bayes formula and assuming that all values of Θ are a priori equiprobable:

θ = argmax_Θ p(D | Θ, M)   (3)

If the model predicts that the d_i values come from Gaussian distributions with means δ_i and standard deviations σ_i, we have:

log p(D | Θ, M) = constant - (1/2) Σ_i (d_i - δ_i)^2 / σ_i^2   (4)

log p(D | Θ, M) = constant - (k/2) RMSE^2 / σ^2   if σ_i^2 = σ^2 for all i   (5)

Hence the θ parameters maximizing the likelihood of M are those providing the best fit by minimizing RMSE.
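As an illustration of Eqs. (1) and (5), the following MATLAB/Octave-style sketch computes the RMSE from matrices of observed and predicted response probabilities; the variable names (pObs, pPred) and the toy numbers are ours, not part of the original tutorial.

    % Minimal sketch of Eq. (1). pObs and pPred are hypothetical nE-by-nC matrices
    % of observed and predicted response probabilities (rows: conditions Ej,
    % columns: categories Ci).
    pObs  = [0.99 0.01; 0.01 0.99; 0.95 0.05];   % toy data (A, V, AV conditions)
    pPred = [0.99 0.01; 0.01 0.99; 0.50 0.50];   % toy model predictions
    [nE, nC] = size(pObs);
    RMSE = sqrt(sum((pPred(:) - pObs(:)).^2) / (nE * nC));
    % Under the equal-variance Gaussian assumption of Eq. (5), maximizing the
    % log-likelihood amounts to minimizing this RMSE.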

Notice that if the d_i values do not come from Gaussian distributions, maximum likelihood is no longer equivalent to best fit; this is typically the case with models of audiovisual categorization data, which involve multinomial laws.

More importantly, in the Bayesian theory, the comparison of two models is more complex than the comparison of their best fits (Jaynes, 1995). Indeed, comparing a model M_1 with a model M_2 by comparing their best fits means that there is a first step of estimation of these best fits, and it must be acknowledged that the estimation process is not error-free. Therefore, the comparison must account for this error-prone process, which is done by computing the total likelihood of the model knowing the data. This results in integrating the likelihood over all model parameter values:

p(D | M) = ∫ p(D, Θ | M) dΘ = ∫ p(D | Θ, M) p(Θ | M) dΘ = ∫ L(Θ | M) p(Θ | M) dΘ   (6)

where L(Θ | M) is the likelihood of the parameters Θ for the model, considering the data:

L(Θ | M) = p(D | Θ, M)   (7)

This means that the a priori distribution of the data D knowing model M integrates the distribution for all values Θ of the parameters of the model. Taking the negative of the logarithm of the total likelihood leads to the so-called Bayesian Model Selection (BMS) criterion for model evaluation (MacKay, 1992; Pitt & Myung, 2002):

BMS = - log ∫ L(Θ | M) p(Θ | M) dΘ   (8)
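To make Eq. (8) concrete, here is a minimal sketch evaluating the BMS criterion for a hypothetical one-parameter model with a uniform prior on [0, 1], using a simple trapezoidal approximation of the integral (a stand-in for the Monte-Carlo methods mentioned in Section B); the likelihood handle is purely illustrative.

    % Sketch of Eq. (8) for a one-dimensional parameter Theta, uniform on [0, 1].
    % 'likelihood' is a hypothetical handle standing for p(D | Theta, M).
    likelihood = @(theta) theta.^9 .* (1 - theta);   % e.g. 9 "C1" and 1 "C2" responses
    thetaGrid  = linspace(0, 1, 10001);
    prior      = ones(size(thetaGrid));              % uniform prior density p(Theta | M)
    pDgivenM   = trapz(thetaGrid, likelihood(thetaGrid) .* prior);
    BMS        = -log(pDgivenM);                     % Eq. (8)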

Let us consider two models M_1 and M_2 that have to be compared in relation to a data set D. The best fit θ_1 for model M_1 provides a maximum likelihood Λ_1 = p(D | θ_1, M_1), and the best fit θ_2 for model M_2 provides a maximum likelihood Λ_2 = p(D | θ_2, M_2). From Eq. (6) it follows that the model comparison criterion is not provided by Λ_1 / Λ_2 (or by comparing RMSE_1 and RMSE_2, as classically done) but, assuming equal prior probabilities for the two models, by:

p(M_1 | D) / p(M_2 | D) = Λ_1 W_1 / (Λ_2 W_2)   (9)

with:

W_i = ∫ [ p(D | Θ_i, M_i) / p(D | θ_i, M_i) ] p(Θ_i | M_i) dΘ_i   (10)

The ratio in Eq. (9) is called the Bayes factor (Kass & Raftery, 1995). The term p(D | Θ_i, M_i) / p(D | θ_i, M_i) in Eq. (10) evaluates the likelihood of the Θ_i values relative to the likelihood of the θ_i set providing the highest likelihood Λ_i for model M_i. Hence W_i evaluates the volume of Θ_i values providing an acceptable fit (not too far from the best one) relative to the whole volume of possible Θ_i values. This relative volume decreases when the total Θ_i volume increases, for example with the dimension of the Θ_i space (see Footnote 2). But it also decreases if the function p(D | Θ_i, M_i) / p(D | θ_i, M_i) decreases too quickly: this is what happens if the model is too sensitive.
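The "relative volume" reading of W_i in Eq. (10) can be sketched numerically as follows, under the assumption of a uniform prior on [0, 1]^m; the toy two-parameter likelihood and the variable names below are ours and only serve to illustrate the idea.

    % Hedged sketch of Eq. (10): crude Monte-Carlo estimate of W for a toy
    % two-parameter model whose parameters are a priori uniform on [0, 1]^2.
    likelihood = @(t) (t(1)^9 * (1 - t(1))) * (t(2) * (1 - t(2))^9);
    thetaBest  = [0.9 0.1];                        % maximizer of this toy likelihood
    m          = 2;
    nSamples   = 100000;
    Lbest      = likelihood(thetaBest);            % highest likelihood (Lambda)
    ratios     = zeros(nSamples, 1);
    for k = 1:nSamples
        Theta     = rand(1, m);                    % uniform draw over the Theta volume
        ratios(k) = likelihood(Theta) / Lbest;     % p(D | Theta, M) / p(D | theta, M)
    end
    W = mean(ratios);   % relative volume of parameter values giving an acceptable fit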

B. BMSL, a simple and intuitive approximation of BMS

The computation of BMS through Eq. (8), or of the Bayes factor through Eqs. (9)-(10), is complex. It involves the estimation of an integral, which generally requires the use of numerical integration techniques, typically Monte-Carlo methods (e.g. Gilks et al., 1996). However, Jaynes (1995, ch. 24) proposes an approximation of the total likelihood in Eq. (6), based on a second-order expansion of log(L) around the maximum likelihood point θ:

log L(Θ) ≈ log L(θ) + (1/2) (Θ - θ)^T [∂^2 log L / ∂Θ^2]_θ (Θ - θ)   (11)

where [∂^2 log L / ∂Θ^2]_θ is the Hessian matrix of the function log(L) computed at the position of the parameter set θ providing the maximal likelihood L_max of the considered model. Then, near this position, a good approximation of the likelihood is provided by:

L(Θ) ≈ L_max exp[ -(1/2) (Θ - θ)^T Σ^{-1} (Θ - θ) ]   (12)

that is, a multivariate Gaussian function with inverse covariance matrix:

Σ^{-1} = - [∂^2 log L / ∂Θ^2]_θ   (13)

Coming back to Eq. (6), and assuming that there is no a priori assumption on the distribution of the parameters Θ, that is, that their distribution is uniform, we obtain:

p(D | M) = ∫ L(Θ | M) p(Θ | M) dΘ ≈ L_max ∫ exp[ -(1/2) (Θ - θ)^T Σ^{-1} (Θ - θ) ] p(Θ | M) dΘ   (14)

Since p(Θ | M) is constant (equal to 1/V), the integral is now simply the volume of a Gaussian distribution:

p(D | M) ≈ L_max (2π)^{m/2} det(Σ)^{1/2} / V   (15)

where V is the total volume of the space occupied by the parameters Θ and m is its dimension, that is, the number of free parameters in the considered model. This leads to the so-called Laplace approximation of the BMS criterion (Kass & Raftery, 1995):

BMSL = - log(L_max) - (m/2) log(2π) + log(V) - (1/2) log(det(Σ))   (16)

The preferred model, considering the data D, should minimize the BMSL criterion. There are in fact three kinds of terms in Eq. (16). Firstly, the term -log(L_max) is directly linked to the maximum likelihood of the model, more or less accurately estimated by RMSE in Eq. (5): the larger the maximum likelihood, the smaller the BMSL criterion. Then, the two following terms are linked to the dimensionality and volume of the considered model. Altogether, they result in handicapping models that are too large (that is, models with too many free parameters) by increasing BMSL (see Footnote 3). Finally, the fourth term provides exactly what we were looking for: a term favoring models with a large value of det(Σ). Indeed, if det(Σ) is large, the determinant of (minus) the Hessian matrix of log(L) is small, which expresses that the likelihood L does not vary too quickly around its maximum value L_max. This is the precise mathematical way in which the BMSL criterion integrates fit (provided by the first term in Eq. (16)) and stability (provided by the fourth term), the second and third terms just being there to account for possible differences in the global size of the tested models.
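Once L_max, Σ, V and m are available, Eq. (16) is a one-liner; the sketch below simply transcribes it (the function handle and variable names are ours).

    % Sketch of Eq. (16). logLmax is log(Lmax), Sigma the covariance matrix of
    % Eq. (13), V the total parameter volume, and m the number of free parameters.
    bmsl = @(logLmax, Sigma, V, m) ...
        -logLmax - (m/2)*log(2*pi) + log(V) - 0.5*log(det(Sigma));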

Notice that if two models with the same number of free parameters and occupying the same parameter volume are compared on a given data set D, BMSL depends only on the first and fourth terms, which is the (fit + stability) compromise we were looking for.

Bayesian Model Selection has already been applied to the comparison of audiovisual (AV) speech perception models, including the FLMP (see Myung & Pitt, 1997; Massaro et al., 2001; Pitt et al., 2003). However, this involved heavy computations of the integrals in Eq. (10) through Monte-Carlo techniques, which would be difficult to apply in all the model comparison works in the domain. BMSL has the double advantage of being easy to compute and easy to interpret in terms of the (fit + stability) compromise. Furthermore, if the amount of available data is much larger than the number of parameters involved in the models to compare (that is, the dimension m of the Θ space), the probability distributions become highly peaked around their maxima, and the central limit theorem shows that the approximation in Eqs. (11)-(12) becomes quite reasonable (Walker, 1967). Kass & Raftery (1995) suggest that the approximation should work well for a sample size greater than 20 times the number of parameters m (see Slate, 1999, for further discussion of assessing non-normality).

C. Implementing BMSL for audiovisual speech perception experiments

An audiovisual speech perception experiment typically involves various experimental conditions E_i (e.g. various A, V and AV stimuli, conflicting or not), with categorization data described by observed frequencies p_ij for each category C_j in each condition E_i (Σ_j p_ij = 1 for all values of i). A model M, depending on m free parameters Θ, predicts probabilities P_ij(Θ) for each category C_j in each condition E_i. The distribution of probabilities in each experimental condition follows a multinomial law; hence the logarithm of the likelihood of the Θ parameter set can be approximated by:

log L(Θ) = Σ_{ij} n_i p_ij log( P_ij(Θ) / p_ij )   (17)

where n_i is the total number of responses provided by the subjects in condition E_i. Therefore, the computation of BMSL can easily be done in four steps: (i) select the value of the Θ parameter set maximizing log L(Θ), that is, the θ providing log L(θ) = log(L_max); (ii) compute the Hessian matrix of log(L) at θ, and Σ, the inverse of its opposite; (iii) estimate the volume V of the Θ parameter set; (iv) compute BMSL according to Eq. (16).

Let us take as an example a simulation of a test case by the Fuzzy-Logical Model of Perception (FLMP; Massaro, 1987, 1998), with two categories C1 and C2; one A, one V and one AV condition; and the following pattern of data: p_A(C1) = 0.99, p_V(C1) = 0.01 and p_AV(C1) = 0.95, obtained on 10 repetitions of each condition (n = 10). The basic FLMP equation is:

P_AV(C_i) = P_A(C_i) P_V(C_i) / Σ_j P_A(C_j) P_V(C_j)   (18)

C_i and C_j being the phonetic categories involved in the experiment, and P_A, P_V and P_AV the model probabilities of responses in the A, V and AV conditions respectively (observed probabilities are in lower case and simulated probabilities in upper case throughout this paper). The FLMP depends on two parameters Θ_A and Θ_V, each varying between 0 and 1; hence in Eq. (16) we take m = 2 and V = 1. Θ_A and Θ_V respectively predict the audio and video responses:

P_A(C1) = Θ_A
P_V(C1) = Θ_V

while the AV response is predicted by Eq. (18):

P_AV(C1) = Θ_A Θ_V / (Θ_A Θ_V + (1 - Θ_A)(1 - Θ_V))

The probabilities of category C2 are of course the complements to 1 of all the values for C1:

P_A(C2) = 1 - P_A(C1)
P_V(C2) = 1 - P_V(C1)
P_AV(C2) = 1 - P_AV(C1)

In what follows, all observed and predicted probabilities for C1 are called p and P respectively, and all observed and predicted probabilities for C2 are called q and Q respectively. This enables us to compute the model log-likelihood function from Eq. (17):

log L(Θ) = n [ p_A log(P_A / p_A) + q_A log(Q_A / q_A) + p_V log(P_V / p_V) + q_V log(Q_V / q_V) + p_AV log(P_AV / p_AV) + q_AV log(Q_AV / q_AV) ]

The next step consists in maximizing log L(Θ) (that is, minimizing -log L(Θ)) over the range Θ_A, Θ_V ∈ [0, 1]. This can be done by any optimization algorithm available in various libraries. In the present case, the maximum should be obtained around:

θ_A = 0.9994

θ_V = 0.0105

which provide:

log L(θ_A, θ_V) = log(L_max) = -0.1574

This is the end of step (i). Step (ii) consists in the computation of the Hessian matrix H of log(L) at θ. This can be done by classical numerical approximations of the derivatives through Taylor expansions (finite differences). The core program, which can be directly implemented by users of the BMSL algorithm, is provided hereunder (theta is the best-fit parameter vector from step (i), m its dimension, and logL the log-likelihood function):

    epsStep = 0.00001;
    z = zeros(1, m);
    H = zeros(m, m);
    for i = 1:m
        e = z;
        e(i) = epsStep;
        % second derivative of logL along dimension i (central difference)
        H(i, i) = (logL(theta + e) + logL(theta - e) - 2*logL(theta)) / epsStep^2;
    end
    for i = 1:m
        for j = (i+1):m
            e = z;
            e(i) = epsStep;
            e(j) = epsStep;
            % second derivative along the diagonal direction e_i + e_j
            b = (logL(theta + e) + logL(theta - e) - 2*logL(theta)) / epsStep^2;
            % mixed partial derivative, recovered from b and the two diagonal terms
            H(i, j) = (b - H(i, i) - H(j, j)) / 2;
            H(j, i) = H(i, j);
        end
    end
    Sigma = -inv(H);   % Eq. (13): Sigma is the inverse of the opposite of the Hessian

The computation of BMSL can then be done from Eq. (16), in which all terms are now available. This provides a BMSL value of 7.94 in the present example.
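For readers who want to reproduce the test case end to end, here is a hedged MATLAB/Octave-style sketch of steps (i) and (iv): it codes the log-likelihood of Eq. (17) for the FLMP example, maximizes it with the generic optimizer fminsearch (any optimizer would do; fminsearch and the clamping trick are our choices, not part of the original algorithm), and then combines the result with the Σ obtained from the Hessian routine above to evaluate Eq. (16).

    % Step (i): FLMP log-likelihood for the test case (two categories,
    % A, V and AV conditions, n = 10 repetitions per condition).
    n    = 10;
    pA   = 0.99; pV = 0.01; pAV = 0.95;                % observed C1 frequencies
    flmp = @(t) t(1)*t(2) / (t(1)*t(2) + (1 - t(1))*(1 - t(2)));   % Eq. (18) for C1
    logL = @(t) n * ( pA *log(t(1)/pA)     + (1 - pA) *log((1 - t(1))/(1 - pA)) ...
                    + pV *log(t(2)/pV)     + (1 - pV) *log((1 - t(2))/(1 - pV)) ...
                    + pAV*log(flmp(t)/pAV) + (1 - pAV)*log((1 - flmp(t))/(1 - pAV)) );
    % maximize logL by minimizing -logL; the clamp keeps the search inside (0, 1)
    negLogL = @(t) -logL(min(max(t, 1e-6), 1 - 1e-6));
    theta   = fminsearch(negLogL, [0.9 0.1]);
    logLmax = logL(theta);                             % about -0.157 for these data
    % Step (ii): run the Hessian routine above (with m = 2) to obtain Sigma.
    % Step (iv): Eq. (16) with m = 2 free parameters and volume V = 1:
    % BMSL = -logLmax - (2/2)*log(2*pi) + log(1) - 0.5*log(det(Sigma));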

Footnotes

1. In the following, bold symbols denote vectors or matrices, and all maximizations are computed over the model parameter set Θ.

2. Massaro (1998, p. 301) proposes to apply a correction factor k/(k - f) to RMSE, with k the number of data points and f the number of degrees of freedom of the model.

3. The interpretation of the term log(V) is straightforward: it handicaps large models by increasing BMSL. The term -(m/2) log(2π) comes more indirectly from the analysis, and could seem to favor large models. In fact, it can only reduce the tendency to favor small models over large ones.

References

Gilks, W.R., Richardson, S., & Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. New York: Chapman & Hall.

Jaynes, E.T. (1995). Probability Theory: The Logic of Science. Cambridge University Press (in press). http://bayes.wustl.edu

Kass, R.E., & Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

MacKay, D.J.C. (1992). Bayesian interpolation. Neural Computation, 4, 415-447.

Massaro, D.W. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. London: Lawrence Erlbaum Associates.

Massaro, D.W. (1998). Perceiving Talking Faces. Cambridge: MIT Press.

Massaro, D.W., Cohen, M.M., Campbell, C.S., & Rodriguez, T. (2001). Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8, 1-17.

Myung, I.J., & Pitt, M.A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79-95.

Pitt, M.A., & Myung, I.J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6, 421-425.

Pitt, M.A., Kim, W., & Myung, I.J. (2003). Flexibility versus generalizability in model selection. Psychonomic Bulletin & Review, 10, 29-44.

Slate, E.H. (1999). Assessing multivariate nonnormality using univariate distributions. Biometrika, 86, 191-202.

Walker, A.M. (1967). On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society B, 31, 80-88.