
A tutorial on Bayesian model selection and on the BMSL Laplace approximation

Jean-Luc Schwartz (schwartz@icp.inpg.fr)
Institut de la Communication Parlée, CNRS UMR 5009, INPG-Université Stendhal, INPG, 46 Av. Félix Viallet, 38031 Grenoble Cedex 1, France

A. The Bayesian framework for model assessment

In most papers comparing models in the field of speech perception, the tool used to compare models is the fit, estimated by the root mean square error (RMSE). It is computed by taking the squared distances between observed and predicted probabilities of responses, averaging them over all categories C_i (total number n_C) and all experimental conditions E_j (total number n_E), and taking the square root of the result:

RMSE = [ Σ_{E_j, C_i} (P_{E_j}(C_i) - p_{E_j}(C_i))^2 / (n_E n_C) ]^{1/2}   (1)

The fit may be derived from the logarithm of the maximum likelihood of a model, considering a data set. If D is a set of k data d_i, and M a model with parameters Θ, the estimation of the best set of parameter values θ (see Footnote 1) is provided by:

θ = argmax_Θ p(Θ | D, M)   (2)

or, through the Bayes formula and assuming that all values of Θ are a priori equiprobable:

θ = argmax_Θ p(D | Θ, M)   (3)

If the model predicts that the d_i values come from Gaussian distributions with means δ_i and standard deviations σ_i, we have:

log p(D | Θ, M) = constant - (1/2) Σ_i (d_i - δ_i)^2 / σ_i^2   (4)

log p(D | Θ, M) = constant - (k/2) RMSE^2 / σ^2   if σ_i^2 = σ^2 for all i   (5)

Hence the θ parameters maximizing the likelihood of M are those providing the best fit by minimizing RMSE.
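As an illustration of Eqs. (1) and (5), the following MATLAB/Octave-style sketch computes the RMSE from matrices of observed and predicted response probabilities; the variable names (pObs, pPred) and the toy numbers are ours, not part of the original tutorial.

    % Minimal sketch of Eq. (1). pObs and pPred are hypothetical nE-by-nC matrices
    % of observed and predicted response probabilities (rows: conditions Ej,
    % columns: categories Ci).
    pObs  = [0.99 0.01; 0.01 0.99; 0.95 0.05];   % toy data (A, V, AV conditions)
    pPred = [0.99 0.01; 0.01 0.99; 0.50 0.50];   % toy model predictions
    [nE, nC] = size(pObs);
    RMSE = sqrt(sum((pPred(:) - pObs(:)).^2) / (nE * nC));
    % Under the equal-variance Gaussian assumption of Eq. (5), maximizing the
    % log-likelihood amounts to minimizing this RMSE.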

Notice that if the d_i values do not come from Gaussian distributions, maximum likelihood is no longer equivalent to best fit; this is typically the case with models of audiovisual categorization data, which involve multinomial laws.

More importantly, in the Bayesian theory, the comparison of two models is more complex than the comparison of their best fits (Jaynes, 1995). Indeed, comparing a model M_1 with a model M_2 by comparing their best fits means that there is a first step of estimation of these best fits, and it must be acknowledged that the estimation process is not error-free. Therefore, the comparison must account for this error-prone process, which is done by computing the total likelihood of the model knowing the data. This results in integrating the likelihood over all model parameter values:

p(D | M) = ∫ p(D, Θ | M) dΘ = ∫ p(D | Θ, M) p(Θ | M) dΘ = ∫ L(Θ | M) p(Θ | M) dΘ   (6)

where L(Θ | M) is the likelihood of the parameters Θ for the model, considering the data:

L(Θ | M) = p(D | Θ, M)   (7)

This means that the a priori distribution of the data D knowing model M integrates the distribution for all values Θ of the parameters of the model. Taking the negative of the logarithm of the total likelihood leads to the so-called Bayesian Model Selection (BMS) criterion for model evaluation (MacKay, 1992; Pitt & Myung, 2002):

BMS = - log ∫ L(Θ | M) p(Θ | M) dΘ   (8)
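To make Eq. (8) concrete, here is a minimal sketch evaluating the BMS criterion for a hypothetical one-parameter model with a uniform prior on [0, 1], using a simple trapezoidal approximation of the integral (a stand-in for the Monte-Carlo methods mentioned in Section B); the likelihood handle is purely illustrative.

    % Sketch of Eq. (8) for a one-dimensional parameter Theta, uniform on [0, 1].
    % 'likelihood' is a hypothetical handle standing for p(D | Theta, M).
    likelihood = @(theta) theta.^9 .* (1 - theta);   % e.g. 9 "C1" and 1 "C2" responses
    thetaGrid  = linspace(0, 1, 10001);
    prior      = ones(size(thetaGrid));              % uniform prior density p(Theta | M)
    pDgivenM   = trapz(thetaGrid, likelihood(thetaGrid) .* prior);
    BMS        = -log(pDgivenM);                     % Eq. (8)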

Let us consider two models M_1 and M_2 that have to be compared in relation to a data set D. The best fit θ_1 for model M_1 provides a maximum likelihood Λ_1 = p(D | θ_1, M_1), and the best fit θ_2 for model M_2 provides a maximum likelihood Λ_2 = p(D | θ_2, M_2). From Eq. (6) it follows that the model comparison criterion is not provided by Λ_1 / Λ_2 (or by comparing RMSE_1 and RMSE_2, as classically done) but, assuming equal prior probabilities for the two models, by:

p(M_1 | D) / p(M_2 | D) = Λ_1 W_1 / (Λ_2 W_2)   (9)

with:

W_i = ∫ [ p(D | Θ_i, M_i) / p(D | θ_i, M_i) ] p(Θ_i | M_i) dΘ_i   (10)

The ratio in Eq. (9) is called the Bayes factor (Kass & Raftery, 1995). The term p(D | Θ_i, M_i) / p(D | θ_i, M_i) in Eq. (10) evaluates the likelihood of the Θ_i values relative to the likelihood of the θ_i set providing the highest likelihood Λ_i for model M_i. Hence W_i evaluates the volume of Θ_i values providing an acceptable fit (not too far from the best one) relative to the whole volume of possible Θ_i values. This relative volume decreases when the total Θ_i volume increases, for example with the dimension of the Θ_i space (see Footnote 2). But it also decreases if the function p(D | Θ_i, M_i) / p(D | θ_i, M_i) decreases too quickly: this is what happens if the model is too sensitive.
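The "relative volume" reading of W_i in Eq. (10) can be sketched numerically as follows, under the assumption of a uniform prior on [0, 1]^m; the toy two-parameter likelihood and the variable names below are ours and only serve to illustrate the idea.

    % Hedged sketch of Eq. (10): crude Monte-Carlo estimate of W for a toy
    % two-parameter model whose parameters are a priori uniform on [0, 1]^2.
    likelihood = @(t) (t(1)^9 * (1 - t(1))) * (t(2) * (1 - t(2))^9);
    thetaBest  = [0.9 0.1];                        % maximizer of this toy likelihood
    m          = 2;
    nSamples   = 100000;
    Lbest      = likelihood(thetaBest);            % highest likelihood (Lambda)
    ratios     = zeros(nSamples, 1);
    for k = 1:nSamples
        Theta     = rand(1, m);                    % uniform draw over the Theta volume
        ratios(k) = likelihood(Theta) / Lbest;     % p(D | Theta, M) / p(D | theta, M)
    end
    W = mean(ratios);   % relative volume of parameter values giving an acceptable fit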

B. BMSL, a simple and intuitive approximation of BMS

The computation of BMS through Eq. (8), or of the Bayes factor through Eqs. (9)-(10), is complex. It involves the estimation of an integral, which generally requires the use of numerical integration techniques, typically Monte-Carlo methods (e.g. Gilks et al., 1996). However, Jaynes (1995, ch. 24) proposes an approximation of the total likelihood in Eq. (6), based on a second-order expansion of log(L) around the maximum likelihood point θ:

log L(Θ) ≈ log L(θ) + (1/2) (Θ - θ)^T [∂^2 log L / ∂Θ^2]_θ (Θ - θ)   (11)

where [∂^2 log L / ∂Θ^2]_θ is the Hessian matrix of the function log(L) computed at the position of the parameter set θ providing the maximal likelihood L_max of the considered model. Then, near this position, a good approximation of the likelihood is provided by:

L(Θ) ≈ L_max exp[ -(1/2) (Θ - θ)^T Σ^{-1} (Θ - θ) ]   (12)

that is, a multivariate Gaussian function with inverse covariance matrix:

Σ^{-1} = - [∂^2 log L / ∂Θ^2]_θ   (13)

Coming back to Eq. (6), and assuming that there is no a priori assumption on the distribution of the parameters Θ, that is, that their distribution is uniform, we obtain:

p(D | M) = ∫ L(Θ | M) p(Θ | M) dΘ ≈ L_max ∫ exp[ -(1/2) (Θ - θ)^T Σ^{-1} (Θ - θ) ] p(Θ | M) dΘ   (14)

Since p(Θ | M) is constant (equal to 1/V), the integral is now simply the volume of a Gaussian distribution:

p(D | M) ≈ L_max (2π)^{m/2} det(Σ)^{1/2} / V   (15)

where V is the total volume of the space occupied by the parameters Θ and m is its dimension, that is, the number of free parameters in the considered model. This leads to the so-called Laplace approximation of the BMS criterion (Kass & Raftery, 1995):

BMSL = - log(L_max) - (m/2) log(2π) + log(V) - (1/2) log(det(Σ))   (16)

The preferred model, considering the data D, should minimize the BMSL criterion. There are in fact three kinds of terms in Eq. (16). Firstly, the term -log(L_max) is directly linked to the maximum likelihood of the model, more or less accurately estimated by RMSE in Eq. (5): the larger the maximum likelihood, the smaller the BMSL criterion. Then, the two following terms are linked to the dimensionality and volume of the considered model. Altogether, they result in handicapping models that are too large (that is, models with too many free parameters) by increasing BMSL (see Footnote 3). Finally, the fourth term provides exactly what we were looking for: a term favoring models with a large value of det(Σ). Indeed, if det(Σ) is large, the determinant of (minus) the Hessian matrix of log(L) is small, which expresses that the likelihood L does not vary too quickly around its maximum value L_max. This is the precise mathematical way in which the BMSL criterion integrates fit (provided by the first term in Eq. (16)) and stability (provided by the fourth term), the second and third terms just being there to account for possible differences in the global size of the tested models.
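Once L_max, Σ, V and m are available, Eq. (16) is a one-liner; the sketch below simply transcribes it (the function handle and variable names are ours).

    % Sketch of Eq. (16). logLmax is log(Lmax), Sigma the covariance matrix of
    % Eq. (13), V the total parameter volume, and m the number of free parameters.
    bmsl = @(logLmax, Sigma, V, m) ...
        -logLmax - (m/2)*log(2*pi) + log(V) - 0.5*log(det(Sigma));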

Notice that if two models with the same number of free parameters and occupying the same parameter volume are compared on a given data set D, BMSL depends only on the first and fourth terms, which is the (fit + stability) compromise we were looking for.

Bayesian Model Selection has already been applied to the comparison of audiovisual (AV) speech perception models, including the FLMP (see Myung & Pitt, 1997; Massaro et al., 2001; Pitt et al., 2003). However, this involved heavy computations of the integrals in Eq. (10) through Monte-Carlo techniques, which would be difficult to apply in all the model comparison works in the domain. BMSL has the double advantage of being easy to compute and easy to interpret in terms of the (fit + stability) compromise. Furthermore, if the amount of available data is much larger than the number of parameters involved in the models to compare (that is, the dimension m of the Θ space), the probability distributions become highly peaked around their maxima, and the central limit theorem shows that the approximation in Eqs. (11)-(12) becomes quite reasonable (Walker, 1967). Kass & Raftery (1995) suggest that the approximation should work well for a sample size greater than 20 times the number of parameters m (see Slate, 1999, for further discussion of assessing non-normality).

C. Implementing BMSL for audiovisual speech perception experiments

An audiovisual speech perception experiment typically involves various experimental conditions E_i (e.g. various A, V and AV stimuli, conflicting or not), with categorization data described by observed frequencies p_ij for each category C_j in each condition E_i (Σ_j p_ij = 1 for all values of i). A model M, depending on m free parameters Θ, predicts probabilities P_ij(Θ) for each category C_j in each condition E_i. The distribution of probabilities in each experimental condition follows a multinomial law; hence the logarithm of the likelihood of the Θ parameter set can be approximated by:

log L(Θ) = Σ_{ij} n_i p_ij log( P_ij(Θ) / p_ij )   (17)

where n_i is the total number of responses provided by the subjects in condition E_i. Therefore, the computation of BMSL can easily be done in four steps: (i) select the value of the Θ parameter set maximizing log L(Θ), that is, the θ providing log L(θ) = log(L_max); (ii) compute the Hessian matrix of log(L) at θ, and Σ, the inverse of its opposite; (iii) estimate the volume V of the Θ parameter set; (iv) compute BMSL according to Eq. (16).

Let us take as an example a simulation of a test case by the Fuzzy-Logical Model of Perception (FLMP; Massaro, 1987, 1998), with two categories C1 and C2; one A, one V and one AV condition; and the following pattern of data: p_A(C1) = 0.99, p_V(C1) = 0.01 and p_AV(C1) = 0.95, obtained on 10 repetitions of each condition (n = 10). The basic FLMP equation is:

P_AV(C_i) = P_A(C_i) P_V(C_i) / Σ_j P_A(C_j) P_V(C_j)   (18)

C_i and C_j being the phonetic categories involved in the experiment, and P_A, P_V and P_AV the model probabilities of responses in the A, V and AV conditions respectively (observed probabilities are in lower case and simulated probabilities in upper case throughout this paper). The FLMP depends on two parameters Θ_A and Θ_V, each varying between 0 and 1; hence in Eq. (16) we take m = 2 and V = 1. Θ_A and Θ_V respectively predict the audio and video responses:

P_A(C1) = Θ_A
P_V(C1) = Θ_V

while the AV response is predicted by Eq. (18):

P_AV(C1) = Θ_A Θ_V / (Θ_A Θ_V + (1 - Θ_A)(1 - Θ_V))

The probabilities of category C2 are of course the complements to 1 of all the values for C1:

P_A(C2) = 1 - P_A(C1)
P_V(C2) = 1 - P_V(C1)
P_AV(C2) = 1 - P_AV(C1)

In what follows, all observed and predicted probabilities for C1 are called p and P respectively, and all observed and predicted probabilities for C2 are called q and Q respectively. This enables us to compute the model log-likelihood function from Eq. (17):

log L(Θ) = n [ p_A log(P_A / p_A) + q_A log(Q_A / q_A) + p_V log(P_V / p_V) + q_V log(Q_V / q_V) + p_AV log(P_AV / p_AV) + q_AV log(Q_AV / q_AV) ]

The next step consists in maximizing log L(Θ) (that is, minimizing -log L(Θ)) over the range Θ_A, Θ_V ∈ [0, 1]. This can be done by any optimization algorithm available in various libraries. In the present case, the maximum should be obtained around:

θ_A = 0.9994

θ_V = 0.0105

which provide:

log L(θ_A, θ_V) = log(L_max) = -0.1574

This is the end of step (i). Step (ii) consists in the computation of the Hessian matrix H of log(L) at θ. This can be done by classical numerical approximations of the derivatives through Taylor expansions (finite differences). The core program, which can be directly implemented by users of the BMSL algorithm, is provided hereunder (theta is the best-fit parameter vector from step (i), m its dimension, and logL the log-likelihood function):

    epsStep = 0.00001;
    z = zeros(1, m);
    H = zeros(m, m);
    for i = 1:m
        e = z;
        e(i) = epsStep;
        % second derivative of logL along dimension i (central difference)
        H(i, i) = (logL(theta + e) + logL(theta - e) - 2*logL(theta)) / epsStep^2;
    end
    for i = 1:m
        for j = (i+1):m
            e = z;
            e(i) = epsStep;
            e(j) = epsStep;
            % second derivative along the diagonal direction e_i + e_j
            b = (logL(theta + e) + logL(theta - e) - 2*logL(theta)) / epsStep^2;
            % mixed partial derivative, recovered from b and the two diagonal terms
            H(i, j) = (b - H(i, i) - H(j, j)) / 2;
            H(j, i) = H(i, j);
        end
    end
    Sigma = -inv(H);   % Eq. (13): Sigma is the inverse of the opposite of the Hessian

The computation of BMSL can then be done from Eq. (16), in which all terms are now available. This provides a BMSL value of 7.94 in the present example.
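For readers who want to reproduce the test case end to end, here is a hedged MATLAB/Octave-style sketch of steps (i) and (iv): it codes the log-likelihood of Eq. (17) for the FLMP example, maximizes it with the generic optimizer fminsearch (any optimizer would do; fminsearch and the clamping trick are our choices, not part of the original algorithm), and then combines the result with the Σ obtained from the Hessian routine above to evaluate Eq. (16).

    % Step (i): FLMP log-likelihood for the test case (two categories,
    % A, V and AV conditions, n = 10 repetitions per condition).
    n    = 10;
    pA   = 0.99; pV = 0.01; pAV = 0.95;                % observed C1 frequencies
    flmp = @(t) t(1)*t(2) / (t(1)*t(2) + (1 - t(1))*(1 - t(2)));   % Eq. (18) for C1
    logL = @(t) n * ( pA *log(t(1)/pA)     + (1 - pA) *log((1 - t(1))/(1 - pA)) ...
                    + pV *log(t(2)/pV)     + (1 - pV) *log((1 - t(2))/(1 - pV)) ...
                    + pAV*log(flmp(t)/pAV) + (1 - pAV)*log((1 - flmp(t))/(1 - pAV)) );
    % maximize logL by minimizing -logL; the clamp keeps the search inside (0, 1)
    negLogL = @(t) -logL(min(max(t, 1e-6), 1 - 1e-6));
    theta   = fminsearch(negLogL, [0.9 0.1]);
    logLmax = logL(theta);                             % about -0.157 for these data
    % Step (ii): run the Hessian routine above (with m = 2) to obtain Sigma.
    % Step (iv): Eq. (16) with m = 2 free parameters and volume V = 1:
    % BMSL = -logLmax - (2/2)*log(2*pi) + log(1) - 0.5*log(det(Sigma));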

Footnotes

1. In the following, bold symbols denote vectors or matrices, and all maximizations are computed over the model parameter set Θ.

2. Massaro (1998, p. 301) proposes to apply a correction factor k/(k - f) to RMSE, with k the number of data points and f the number of degrees of freedom of the model.

3. The interpretation of the term log(V) is straightforward: it handicaps large models by increasing BMSL. The term -(m/2) log(2π) comes more indirectly from the analysis, and could seem to favor large models. In fact, it can only reduce the tendency to favor small models over large ones.

References

Gilks, W.R., Richardson, S., & Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. New York: Chapman & Hall.

Jaynes, E.T. (1995). Probability Theory: The Logic of Science. Cambridge University Press (in press). http://bayes.wustl.edu

Kass, R.E., & Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

MacKay, D.J.C. (1992). Bayesian interpolation. Neural Computation, 4, 415-447.

Massaro, D.W. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. London: Lawrence Erlbaum Associates.

Massaro, D.W. (1998). Perceiving Talking Faces. Cambridge: MIT Press.

Massaro, D.W., Cohen, M.M., Campbell, C.S., & Rodriguez, T. (2001). Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8, 1-17.

Myung, I.J., & Pitt, M.A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79-95.

Pitt, M.A., & Myung, I.J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6, 421-425.

Pitt, M.A., Kim, W., & Myung, I.J. (2003). Flexibility versus generalizability in model selection. Psychonomic Bulletin & Review, 10, 29-44.

Slate, E.H. (1999). Assessing multivariate nonnormality using univariate distributions. Biometrika, 86, 191-202.

Walker, A.M. (1967). On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society B, 31, 80-88.