# A tutorial on Bayesian model selection and on the BMSL Laplace approximation


Jean-Luc
Institut de la Communication Parlée, CNRS UMR 5009, INPG-Université Stendhal
INPG, 46 Av. Félix Viallet, Grenoble Cedex 1, France

## A. The Bayesian framework for model assessment

In most papers comparing models in the field of speech perception, the tool used to compare models is the fit, estimated by the root mean square error (RMSE). It is computed by taking the squared distances between observed and predicted probabilities of responses, averaging them over all categories C_i (total number n_C) and all experimental conditions E_j (total number n_E), and taking the square root of the result:

RMSE = [ Σ_{Ej,Ci} (P_Ej(C_i) − p_Ej(C_i))² / (n_E n_C) ]^{1/2}    (1)

The fit may be derived from the logarithm of the maximum likelihood of a model, considering a data set. If D is a set of k data d_i, and M a model with parameters Θ, the estimation of the best set of parameter values θ is provided by¹:

θ = argmax_Θ p(Θ | D, M)    (2)

or, through the Bayes formula and assuming that all values of Θ are a priori equiprobable:

θ = argmax_Θ p(D | Θ, M)    (3)

If the model predicts that the d_i values come from Gaussian distributions with means δ_i and standard deviations σ_i, we have:

log p(D | Θ, M) = constant − (1/2) Σ_i (d_i − δ_i)² / σ_i²    (4)

log p(D | θ, M) = constant − (k/2) RMSE² / σ²  if σ_i² = σ² for all i    (5)
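As a quick sanity check of Eqs. (4)–(5), the sketch below (with made-up observed and predicted probabilities, not data from the paper) verifies that under equal-variance Gaussian errors, ranking parameter sets by log-likelihood is the same as ranking them by RMSE:

```python
import numpy as np

def rmse(p_obs, p_pred):
    # Eq. (1): root mean square error over all conditions and categories
    return np.sqrt(np.mean((p_pred - p_obs) ** 2))

def gaussian_log_lik(p_obs, p_pred, sigma):
    # Eq. (4) with sigma_i = sigma, dropping the additive constant;
    # equals -(k/2) RMSE^2 / sigma^2, as in Eq. (5)
    return -0.5 * np.sum((p_obs - p_pred) ** 2) / sigma ** 2

p_obs = np.array([0.99, 0.01, 0.95])      # illustrative observed frequencies
pred_1 = np.array([0.90, 0.10, 0.80])     # candidate parameter set 1
pred_2 = np.array([0.98, 0.02, 0.93])     # candidate parameter set 2

k, sigma = p_obs.size, 0.05
# Eq. (5): the two criteria are affinely related, so the orderings agree
assert np.isclose(gaussian_log_lik(p_obs, pred_2, sigma),
                  -0.5 * k * rmse(p_obs, pred_2) ** 2 / sigma ** 2)
assert rmse(p_obs, pred_2) < rmse(p_obs, pred_1)
assert gaussian_log_lik(p_obs, pred_2, sigma) > gaussian_log_lik(p_obs, pred_1, sigma)
```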

Hence the θ parameters maximizing the likelihood of M are those providing the best fit by minimizing RMSE. Notice that if the d_i values do not come from Gaussian distributions, maximum likelihood is no longer equivalent to best fit; this is typically the case with models of audiovisual categorization data, which involve multinomial laws. More importantly, in the Bayesian theory, the comparison of two models is more complex than the comparison of their best fits (Jaynes, 1995). Indeed, comparing a model M_1 with a model M_2 by comparing their best fits means that there is a first step of estimation of these best fits, and it must be acknowledged that the estimation process is not error-free. Therefore, the comparison must account for this error-prone process, which is done by computing the total likelihood of the model given the data. This amounts to integrating the likelihood over all model parameter values:

p(D | M) = ∫ p(D, Θ | M) dΘ = ∫ p(D | Θ, M) p(Θ | M) dΘ = ∫ L(Θ | M) p(Θ | M) dΘ    (6)

where L(Θ | M) is the likelihood of the parameter set Θ for the model, considering the data:

L(Θ | M) = p(D | Θ, M)    (7)

This means that the a priori distribution of the data D knowing the model M integrates the distribution for all values Θ of the parameters of the model. Taking the opposite of the logarithm of the total likelihood leads to the so-called Bayesian Model Selection (BMS) criterion for model evaluation (MacKay, 1992; Pitt & Myung, 2002):

BMS = −log ∫ L(Θ | M) p(Θ | M) dΘ    (8)
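To make Eq. (6) and the BMS criterion of Eq. (8) concrete, here is a minimal numerical sketch for a toy one-parameter model: a binomial with a uniform prior, fitted to 7 "C1" responses out of 10 (illustrative data, not from the paper). The integral is approximated by a midpoint sum and checked against its exact Beta-function value:

```python
import math
import numpy as np

n, k = 10, 7                                # toy data: 7 "C1" responses of 10
N = 100_000
theta = (np.arange(N) + 0.5) / N            # midpoint grid on [0, 1]
lik = theta ** k * (1 - theta) ** (n - k)   # L(Theta | M), Eq. (7)
prior = np.ones_like(theta)                 # uniform p(Theta | M)

p_D_given_M = np.sum(lik * prior) / N       # Eq. (6) by midpoint integration
BMS = -math.log(p_D_given_M)                # Eq. (8)

# Exact value: integral of t^7 (1-t)^3 dt = B(8, 4) = 7! 3! / 11!
exact = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)
assert abs(p_D_given_M - exact) < 1e-8
assert BMS > 0
```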

Let us consider two models M_1 and M_2 that have to be compared in relation to a data set D. The best fit θ_1 for model M_1 provides a maximal likelihood Λ_1 = max_Θ p(D | Θ_1, M_1), and the best fit θ_2 for model M_2 provides a maximal likelihood Λ_2 = max_Θ p(D | Θ_2, M_2). From Eq. (6) it follows that the model comparison criterion is not provided by Λ_1 / Λ_2 (or by comparing RMSE_1 and RMSE_2, as classically done), but by:

p(M_1 | D) / p(M_2 | D) = Λ_1 W_1 / (Λ_2 W_2)    (9)

with:

W_i = ∫ [ p(D | Θ_i, M_i) / p(D | θ_i, M_i) ] p(Θ_i | M_i) dΘ_i    (10)

The ratio in Eq. (9) is called the Bayes factor (Kass & Raftery, 1995). The term p(D | Θ_i, M_i) / p(D | θ_i, M_i) in Eq. (10) evaluates the likelihood of the Θ_i values relative to the likelihood of the θ_i set providing the highest likelihood Λ_i for model M_i. Hence W_i evaluates the volume of Θ_i values providing an acceptable fit (not too far from the best one) relative to the whole volume of possible Θ_i values. This relative volume decreases as the total Θ_i volume increases, for example with the dimension of the Θ_i space². But it also decreases if the function p(D | Θ_i, M_i) / p(D | θ_i, M_i) decreases too quickly: this is what happens if the model is too sensitive.
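The following toy comparison (binomial data, illustrative, not from the paper) shows why Eq. (9) differs from the simple best-fit ratio Λ_1 / Λ_2: a one-parameter model M_1 always fits at least as well as a zero-parameter model M_2, but its Occam factor W_1 can reverse the verdict:

```python
import numpy as np

n, k = 10, 7                                 # toy data: 7 "C1" responses of 10
N = 100_000
theta = (np.arange(N) + 0.5) / N             # midpoint grid on [0, 1]
lik1 = theta ** k * (1 - theta) ** (n - k)   # M1: one free parameter theta

lambda1 = lik1.max()                         # best-fit likelihood of M1
lambda2 = 0.5 ** n                           # M2: theta fixed at 0.5, no fit
W1 = np.sum(lik1 / lambda1) / N              # Eq. (10) with a uniform prior
p_D_M1 = lambda1 * W1                        # = the integral in Eq. (6)
p_D_M2 = lambda2                             # W2 = 1: nothing to integrate

assert lambda1 / lambda2 > 1                 # best fit alone prefers M1 ...
assert p_D_M1 / p_D_M2 < 1                   # ... the Bayes factor prefers M2
assert 0 < W1 < 1                            # Occam factor shrinks p(D|M1)
```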

## B. BMSL, a simple and intuitive approximation of BMS

The computation of BMS through Eq. (8), or of the Bayes factor through Eqs. (9)–(10), is complex. It involves the estimation of an integral, which generally requires numerical integration techniques, typically Monte Carlo methods (e.g. Gilks et al., 1996). However, Jaynes (1995, ch. 24) proposes an approximation of the total likelihood in Eq. (6), based on a second-order expansion of log(L) around the maximum-likelihood point θ:

log L(Θ) ≈ log L(θ) + (1/2) (Θ − θ)ᵀ [∂² log L / ∂Θ²]_θ (Θ − θ)    (11)

where [∂² log L / ∂Θ²]_θ is the Hessian matrix of the function log(L) computed at the position of the parameter set θ providing the maximal likelihood L_max of the considered model. Then, near this position, a good approximation of the likelihood is provided by:

L(Θ) ≈ L_max exp[ −(1/2) (Θ − θ)ᵀ Σ⁻¹ (Θ − θ) ]    (12)

that is, a multivariate Gaussian function with inverse covariance matrix:

Σ⁻¹ = −[∂² log L / ∂Θ²]_θ    (13)

Coming back to Eq. (6), and assuming that there is no a priori assumption on the distribution of the parameters Θ, that is, that their distribution is uniform, we obtain:

p(D | M) = ∫ L(Θ | M) p(Θ | M) dΘ ≈ L_max ∫ exp[ −(1/2) (Θ − θ)ᵀ Σ⁻¹ (Θ − θ) ] p(Θ | M) dΘ    (14)
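A one-dimensional illustration of Eqs. (11)–(13), using a toy binomial likelihood (7 responses out of 10; illustrative, not the paper's data): near the maximum-likelihood point, L(Θ) is close to a Gaussian whose inverse variance is minus the second derivative of log L:

```python
import math

n, k = 10, 7
t_hat = k / n                                 # maximum-likelihood point

def log_L(t):
    return k * math.log(t) + (n - k) * math.log(1 - t)

# Eq. (13) in one dimension: Sigma^-1 = -d^2 log L / dt^2 at t_hat,
# estimated here by a central finite difference
eps = 1e-5
inv_sigma = -(log_L(t_hat + eps) + log_L(t_hat - eps) - 2 * log_L(t_hat)) / eps ** 2

L_max = math.exp(log_L(t_hat))

def laplace_L(t):                             # Eq. (12)
    return L_max * math.exp(-0.5 * (t - t_hat) ** 2 * inv_sigma)

# The exact curvature is k/t^2 + (n-k)/(1-t)^2; the finite difference matches
assert abs(inv_sigma - (k / t_hat**2 + (n - k) / (1 - t_hat)**2)) < 1e-2
# Near t_hat the Gaussian approximation of the likelihood is tight
t = t_hat + 0.02
assert abs(laplace_L(t) - math.exp(log_L(t))) / math.exp(log_L(t)) < 0.01
```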

Since p(Θ | M) is constant, the integral is now simply the volume of a Gaussian distribution:

p(D | M) ≈ L_max (2π)^{m/2} √det(Σ) / V    (15)

where V is the total volume of the space occupied by the parameters Θ and m is its dimension, that is, the number of free parameters in the considered model. This leads to the so-called Laplace approximation of the BMS criterion (Kass & Raftery, 1995):

BMSL = −log(L_max) − (m/2) log(2π) + log(V) − (1/2) log(det(Σ))    (16)

The preferred model considering the data D should minimize the BMSL criterion. There are in fact three kinds of terms in Eq. (16). Firstly, the term −log(L_max) is directly linked to the maximum likelihood of the model, more or less accurately estimated by RMSE in Eq. (5): the larger the maximum likelihood, the smaller the BMSL criterion. Then, the two following terms are linked to the dimensionality and volume of the considered model. Altogether, they result in handicapping models that are too large (that is, models with too many free parameters) by increasing BMSL³. Finally, the fourth term provides exactly what we were looking for: a term favoring models with a large value of det(Σ). Indeed, if det(Σ) is large, the determinant of the Hessian matrix of log(L) is small, which expresses that the likelihood L does not vary too quickly around its maximum value L_max. This is the precise mathematical way in which the BMSL criterion integrates fit (provided by the first term in Eq. (16)) and stability (provided by the fourth term), the second and third terms just being there to account for possible differences in the global size of the tested models. Notice that if two models with the same number of free parameters and occupying

the same size are compared on a given data set D, BMSL just depends on the first and fourth terms, which is the (fit + stability) compromise we were looking for.

Bayesian Model Selection has already been applied to the comparison of AV speech perception models, including the FLMP (see Myung & Pitt, 1997; Massaro et al., 2001; Pitt et al., 2003). However, this involved heavy computation of the integrals in Eq. (10) through Monte Carlo techniques, which would be difficult to apply in all the model comparison works in the domain. BMSL has the double interest of being easy to compute and easy to interpret in terms of the (fit + stability) compromise. Furthermore, if the amount of available data is much higher than the number of parameters involved in the models to compare (that is, the dimension m of the Θ space), the probability distributions become highly peaked around their maxima, and the central limit theorem shows that the approximation in Eqs. (11)–(12) becomes quite reasonable (Walker, 1967). Kass & Raftery (1995) suggest that the approximation should work well for a sample size greater than 20 times the parameter dimension m (see Slate, 1999, for further discussion of assessing non-normality).
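Eq. (16) itself is straightforward to implement. The sketch below codes it as a function and checks it on a toy one-parameter binomial model (7 responses out of 10, uniform prior on [0, 1], so m = 1 and V = 1; illustrative data, not the paper's), where the exact value of −log p(D | M) is known in closed form:

```python
import math

def bmsl(log_L_max, m, V, det_sigma):
    # Eq. (16): BMSL = -log L_max - (m/2) log(2 pi) + log V - (1/2) log det Sigma
    return (-log_L_max - 0.5 * m * math.log(2 * math.pi)
            + math.log(V) - 0.5 * math.log(det_sigma))

# Toy check: binomial data (7 of 10), one parameter on [0, 1] (m = 1, V = 1)
n, k = 10, 7
t_hat = k / n
log_L_max = k * math.log(t_hat) + (n - k) * math.log(1 - t_hat)
det_sigma = 1.0 / (k / t_hat**2 + (n - k) / (1 - t_hat)**2)  # Eq. (13), 1-D

approx = bmsl(log_L_max, m=1, V=1.0, det_sigma=det_sigma)
exact = -math.log(math.factorial(k) * math.factorial(n - k)
                  / math.factorial(n + 1))   # -log of the exact integral
assert abs(approx - exact) < 0.1             # Laplace is already close at n = 10
```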

## C. Implementing BMSL for audiovisual speech perception experiments

An audiovisual speech perception experiment typically involves various experimental conditions E_i (e.g. various A, V and AV stimuli, conflicting or not), with categorization data described by observed frequencies p_ij for each category C_j in each condition E_i (Σ_j p_ij = 1 for all values of i). A model M, depending on m free parameters Θ, predicts probabilities P_ij(Θ) for each category C_j in each condition E_i. The distribution of responses in each experimental condition follows a multinomial law, hence the logarithm of the likelihood of the Θ parameter set can be approximated by:

log L(Θ) = Σ_ij n_i p_ij log(P_ij(Θ) / p_ij)    (17)

where n_i is the total number of responses provided by the subjects in condition E_i. Therefore, the computation of BMSL can easily be done in four steps: (i) select the value of the Θ parameter set maximizing log L(Θ), that is, θ providing log L(θ) = log(L_max); (ii) compute the Hessian matrix of log(L) around θ, and its opposite inverse Σ; (iii) estimate the volume V of the Θ parameter set; (iv) compute BMSL according to Eq. (16).

Let us take as an example a Fuzzy-Logical Model of Perception (FLMP) (Massaro, 1987, 1998) simulation of a test-case with two categories C1 and C2; one A, one V and one AV condition; and the following pattern of data: p_A(C1) = 0.99, p_V(C1) = 0.01, and p_AV(C1) = 0.95, obtained on 10 repetitions of each condition (n = 10). The basic FLMP equation is:

P_AV(C_i) = P_A(C_i) P_V(C_i) / Σ_j P_A(C_j) P_V(C_j)    (18)

C_i and C_j being the phonetic categories involved in the experiment, and P_A, P_V and P_AV the model probabilities of responses in the A, V and AV conditions respectively (observed probabilities are in lower case and simulated probabilities in upper case throughout this paper). The FLMP depends on two parameters Θ_A and Θ_V, each varying between 0 and 1, hence in Eq. (16) we take m = 2 and V = 1. Θ_A and Θ_V respectively predict the audio and video responses:

P_A(C1) = Θ_A
P_V(C1) = Θ_V

while the AV response is predicted by Eq. (18):

P_AV(C1) = Θ_A Θ_V / (Θ_A Θ_V + (1 − Θ_A)(1 − Θ_V))

The probabilities of category C2 are of course the complements to 1 of the values for C1:

P_A(C2) = 1 − P_A(C1)
P_V(C2) = 1 − P_V(C1)
P_AV(C2) = 1 − P_AV(C1)

In the following, all observed and predicted probabilities for C1 are respectively called p and P, and all observed and predicted probabilities for C2 are respectively called q and Q. This enables computation of the model log-likelihood function from Eq. (17):

log L(Θ) = n (p_A log(P_A/p_A) + q_A log(Q_A/q_A) + p_V log(P_V/p_V) + q_V log(Q_V/q_V) + p_AV log(P_AV/p_AV) + q_AV log(Q_AV/q_AV))

The next step consists in minimizing −log L(Θ) over the range Θ_A, Θ_V ∈ [0, 1]. This can be done by any optimization algorithm available in various libraries. In the present case, the minimum should be obtained around:

θ_A =

θ_V =

which provide:

log L(θ_A, θ_V) = log(L_max) =

This is the end of step (i). Step (ii) consists in the computation of the Hessian matrix H of log(L) around θ. This can be done by classical numerical approximation of derivatives through Taylor expansions. The core program, which can be directly implemented by users of the BMSL algorithm, is provided hereunder:

```
ε = ;  z = zeros(1, m);
for i = 1:m
    e = z;  e(i) = ε;
    H(i, i) = (log(L(θ+e)) + log(L(θ-e)) - 2*log(L(θ))) / ε^2;
end
for i = 1:m
    for j = (i+1):m
        e = z;  e(i) = ε;  e(j) = ε;
        b = (log(L(θ+e)) + log(L(θ-e)) - 2*log(L(θ))) / ε^2;
        H(i, j) = (b - H(i, i) - H(j, j)) / 2;
        H(j, i) = H(i, j);
    end
end
Σ = -inv(H);
```

Computation of BMSL can then be done from Eq. (16), in which all terms are now computed. This provides a BMSL value of 7.94 in the present example.
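For completeness, here is a Python transcription of the whole four-step recipe applied to the test-case above. The brute-force grid search in step (i) and the step size ε = 10⁻⁴ are implementation choices of this sketch, not prescriptions from the paper; the exact θ and BMSL values depend on them, so the checks below are deliberately loose:

```python
import numpy as np

n = 10
p_A, p_V, p_AV = 0.99, 0.01, 0.95            # observed C1 frequencies

def flmp(tA, tV):
    # Eq. (18): FLMP audiovisual prediction (works on scalars or arrays)
    return tA * tV / (tA * tV + (1 - tA) * (1 - tV))

def cond(p, P):
    # one condition's contribution to Eq. (17): n (p log(P/p) + q log(Q/q))
    return n * (p * np.log(P / p) + (1 - p) * np.log((1 - P) / (1 - p)))

def log_L(theta):
    tA, tV = theta
    return cond(p_A, tA) + cond(p_V, tV) + cond(p_AV, flmp(tA, tV))

# step (i): maximize log L over (Theta_A, Theta_V) on a fine grid in (0,1)^2
g = np.linspace(0.0005, 0.9995, 1000)
TA, TV = np.meshgrid(g, g, indexing="ij")
LL = cond(p_A, TA) + cond(p_V, TV) + cond(p_AV, flmp(TA, TV))
i, j = np.unravel_index(np.argmax(LL), LL.shape)
theta_hat = np.array([g[i], g[j]])
f0 = log_L(theta_hat)                        # = log(L_max)

# step (ii): finite-difference Hessian of log L at theta_hat, Sigma = -inv(H)
m, eps = 2, 1e-4
H = np.zeros((m, m))
for a in range(m):
    e = np.zeros(m); e[a] = eps
    H[a, a] = (log_L(theta_hat + e) + log_L(theta_hat - e) - 2 * f0) / eps**2
e = np.array([eps, eps])
b = (log_L(theta_hat + e) + log_L(theta_hat - e) - 2 * f0) / eps**2
H[0, 1] = H[1, 0] = (b - H[0, 0] - H[1, 1]) / 2
Sigma = -np.linalg.inv(H)

# steps (iii)-(iv): V = 1 for (Theta_A, Theta_V) in [0,1]^2, then Eq. (16)
V = 1.0
BMSL = (-f0 - 0.5 * m * np.log(2 * np.pi) + np.log(V)
        - 0.5 * np.log(np.linalg.det(Sigma)))

assert -1.0 < f0 <= 0                        # Eq. (17) is never positive
assert H[0, 0] < 0 and H[1, 1] < 0           # concave around the maximum
assert np.linalg.det(Sigma) > 0
assert 4.0 < BMSL < 12.0                     # same ballpark as the paper's 7.94
```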

## Footnotes

1. In the following, bold symbols denote vectors or matrices, and all maximizations are computed over the model parameter set Θ.
2. Massaro (1998) proposes to apply a correction factor k/(k − f) to RMSE, with k the number of data and f the number of degrees of freedom of the model (p. 301).
3. The interpretation of the term log(V) is straightforward: it handicaps large models by increasing BMSL. The term −(m/2) log(2π) emerges more indirectly from the analysis, and could seem to favor large models. In fact, it can only decrease the trend to favor small models over large ones.

## References

Gilks, W.R., Richardson, S., & Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. New York: Chapman & Hall.

Jaynes, E.T. (1995). Probability Theory: The Logic of Science. Cambridge University Press (in press).

Kass, R.E., & Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90.

MacKay, D.J.C. (1992). Bayesian interpolation. Neural Computation, 4.

Massaro, D.W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. London: Lawrence Erlbaum Associates.

Massaro, D.W. (1998). Perceiving Talking Faces. Cambridge: MIT Press.

Massaro, D.W., Cohen, M.M., Campbell, C.S., & Rodriguez, T. (2001). Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8.

Myung, I.J., & Pitt, M.A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4.

Pitt, M.A., & Myung, I.J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6.

Pitt, M.A., Kim, W., & Myung, I.J. (2003). Flexibility versus generalizability in model selection. Psychonomic Bulletin & Review, 10.

Slate, E.H. (1999). Assessing multivariate nonnormality using univariate distributions. Biometrika, 86.

Walker, A.M. (1967). On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society, Series B, 31.


### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Multilevel Modeling of Complex Survey Data

Multilevel Modeling of Complex Survey Data Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 University of California, Los Angeles 2 Abstract We describe a multivariate, multilevel, pseudo maximum

### The Delta Method and Applications

Chapter 5 The Delta Method and Applications 5.1 Linear approximations of functions In the simplest form of the central limit theorem, Theorem 4.18, we consider a sequence X 1, X,... of independent and

### Probabilistic Methods for Time-Series Analysis

Probabilistic Methods for Time-Series Analysis 2 Contents 1 Analysis of Changepoint Models 1 1.1 Introduction................................ 1 1.1.1 Model and Notation....................... 2 1.1.2 Example:

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Theta Functions. Lukas Lewark. Seminar on Modular Forms, 31. Januar 2007

Theta Functions Lukas Lewark Seminar on Modular Forms, 31. Januar 007 Abstract Theta functions are introduced, associated to lattices or quadratic forms. Their transformation property is proven and the

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### R f. V i. ET 438a Automatic Control Systems Technology Laboratory 4 Practical Differentiator Response

ET 438a Automatic Control Systems Technology Laboratory 4 Practical Differentiator Response Objective: Design a practical differentiator circuit using common OP AMP circuits. Test the frequency response

### Examination 110 Probability and Statistics Examination

Examination 0 Probability and Statistics Examination Sample Examination Questions The Probability and Statistics Examination consists of 5 multiple-choice test questions. The test is a three-hour examination

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Using SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models

Using SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models Clement A Stone Abstract Interest in estimating item response theory (IRT) models using Bayesian methods has grown tremendously

### Generalized BIC for Singular Models Factoring through Regular Models

Generalized BIC for Singular Models Factoring through Regular Models Shaowei Lin http://math.berkeley.edu/ shaowei/ Department of Mathematics, University of California, Berkeley PhD student (Advisor: Bernd

### S-Parameters and Related Quantities Sam Wetterlin 10/20/09

S-Parameters and Related Quantities Sam Wetterlin 10/20/09 Basic Concept of S-Parameters S-Parameters are a type of network parameter, based on the concept of scattering. The more familiar network parameters

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Measuring the tracking error of exchange traded funds: an unobserved components approach

Measuring the tracking error of exchange traded funds: an unobserved components approach Giuliano De Rossi Quantitative analyst +44 20 7568 3072 UBS Investment Research June 2012 Analyst Certification

### Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

### A BAYESIAN MODEL COMMITTEE APPROACH TO FORECASTING GLOBAL SOLAR RADIATION

A BAYESIAN MODEL COMMITTEE APPROACH TO FORECASTING GLOBAL SOLAR RADIATION Philippe Lauret Hadja Maïmouna Diagne Mathieu David PIMENT University of La Reunion 97715 Saint Denis Cedex 9 hadja.diagne@univ-reunion.fr

### P (A) = lim P (A) = N(A)/N,

1.1 Probability, Relative Frequency and Classical Definition. Probability is the study of random or non-deterministic experiments. Suppose an experiment can be repeated any number of times, so that we

### Language Modeling. Chapter 1. 1.1 Introduction

Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

### CS229 Lecture notes. Andrew Ng

CS229 Lecture notes Andrew Ng Part X Factor analysis Whenwehavedatax (i) R n thatcomesfromamixtureofseveral Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting, we usually

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written