A tutorial on Bayesian model selection and on the BMSL Laplace approximation


Jean-Luc, Institut de la Communication Parlée, CNRS UMR 5009, INPG-Université Stendhal, INPG, 46 Av. Félix Viallet, Grenoble Cedex 1, France
A. The Bayesian framework for model assessment

In most papers comparing models in the field of speech perception, the tool used to compare models is the fit estimated by the root mean square error (RMSE), computed by taking the squared distances between observed and predicted probabilities of responses, averaging them over all categories Ci (total number n_C) and all experimental conditions Ej (total number n_E), and taking the square root of the result:

RMSE = [ Σ_{Ej,Ci} (P_Ej(Ci) - p_Ej(Ci))² / (n_E n_C) ]^(1/2)    (1)

The fit may be derived from the logarithm of the maximum likelihood of a model, considering a data set. If D is a set of k data d_i, and M a model with parameters Θ, the estimation of the best set of parameter values θ is provided by (1):

θ = argmax_Θ p(Θ|D,M)    (2)

or, through the Bayes formula, and assuming that all values of Θ are a priori equiprobable:

θ = argmax_Θ p(D|Θ,M)    (3)

If the model predicts that the d_i values come from Gaussian distributions (δ_i, σ_i), we have:

log p(D|Θ,M) = constant - (1/2) Σ_i (d_i - δ_i)² / σ_i²    (4)

log p(D|θ,M) = constant - (k/2) RMSE² / σ²    if σ_i² = σ² for all i    (5)
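As a minimal numerical sketch of Eqs. (1) and (5), the following fragment computes the RMSE and the corresponding Gaussian log-likelihood term; all observed and predicted probabilities and the value of σ are made up for illustration.

```python
import math

def rmse(observed, predicted):
    # Eq. (1): root mean square error between observed (p) and predicted (P)
    # response probabilities, pooled over all conditions and categories.
    diffs = [(P - p) ** 2 for P, p in zip(predicted, observed)]
    return math.sqrt(sum(diffs) / len(diffs))

# Hypothetical data: 2 categories x 2 conditions, flattened into k = 4 values.
p_obs = [0.90, 0.10, 0.20, 0.80]
P_pred = [0.85, 0.15, 0.30, 0.70]
fit = rmse(p_obs, P_pred)

# Eq. (5): with Gaussian errors of common variance sigma^2, the log-likelihood
# equals a constant minus (k/2) * RMSE^2 / sigma^2, so maximizing the
# likelihood is the same as minimizing the RMSE.
sigma = 0.05
k = len(p_obs)
loglik_term = -(k / 2) * fit ** 2 / sigma ** 2
```

Under these assumptions the fit and the (non-constant part of the) log-likelihood carry exactly the same information.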
Hence the θ parameters maximizing the likelihood of M are those providing the best fit by minimizing RMSE. Notice that if the d_i values do not come from Gaussian distributions, maximum likelihood is no longer equivalent to best fit, which is typically the case with models of audiovisual categorization data, involving multinomial laws.

More importantly, in the Bayesian theory, the comparison of two models is more complex than the comparison of their best fits (Jaynes, 1995). Indeed, comparing a model M1 with a model M2 by comparing their best fits means that there is a first step of estimation of these best fits, and it must be acknowledged that the estimation process is not error-free. Therefore, the comparison must account for this error-prone process, which is done by computing the total likelihood of the model knowing the data. This results in integrating the likelihood over all model parameter values:

p(D|M) = ∫ p(D,Θ|M) dΘ = ∫ p(D|Θ,M) p(Θ|M) dΘ = ∫ L(Θ|M) p(Θ|M) dΘ    (6)

where L(Θ|M) is the likelihood of parameters Θ for the model, considering the data:

L(Θ|M) = p(D|Θ,M)    (7)

This means that the a priori distribution of data D knowing model M integrates the distribution for all values Θ of the parameters of the model. Taking the negative of the logarithm of the total likelihood leads to the so-called Bayesian Model Selection (BMS) criterion for model evaluation (MacKay, 1992; Pitt & Myung, 2002):

BMS = -log ∫ L(Θ|M) p(Θ|M) dΘ    (8)
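For a toy one-parameter model, the integral in Eq. (6) can be approximated by a simple midpoint sum, without any Monte Carlo machinery. This sketch uses an invented binomial example (all numbers hypothetical), but it makes the BMS criterion of Eq. (8) concrete.

```python
import math

def total_likelihood(loglik, grid):
    # Eq. (6): p(D|M) = integral of L(theta|M) p(theta|M) dtheta, here with a
    # flat prior p(theta|M) = 1 on [0, 1], approximated by a midpoint sum.
    step = grid[1] - grid[0]
    return sum(math.exp(loglik(t)) for t in grid) * step

# Toy model: 10 Bernoulli trials, 7 successes, parameter theta in (0, 1).
n, s = 10, 7

def loglik(theta):
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

grid = [(i + 0.5) / 1000 for i in range(1000)]
pD = total_likelihood(loglik, grid)   # exact value is B(8, 4) = 1/1320
BMS = -math.log(pD)                   # Eq. (8)
```

Here the integral happens to have the closed form B(s+1, n-s+1), which the midpoint sum reproduces to high accuracy.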
Let us consider two models M1 and M2 that have to be compared in relation to a data set D. The best fit θ1 for model M1 provides a posterior likelihood Λ1 = max p(D|Θ1,M1), and the best fit θ2 for model M2 provides a posterior likelihood Λ2 = max p(D|Θ2,M2). From Eq. (6) it follows that the model comparison criterion is not provided by Λ1/Λ2 (or by comparing RMSE1 and RMSE2, as classically done), but by:

p(M1|D) / p(M2|D) = Λ1 W1 / (Λ2 W2)    (9)

with:

W_i = ∫ [p(D|Θ_i,M_i) / p(D|θ_i,M_i)] p(Θ_i|M_i) dΘ_i    (10)

The ratio in Eq. (9) is called the Bayes factor (Kass & Raftery, 1995). The term p(D|Θ_i,M_i)/p(D|θ_i,M_i) in Eq. (10) evaluates the likelihood of the Θ_i values relative to the likelihood of the θ_i set providing the highest likelihood Λ_i for model M_i. Hence W_i evaluates the volume of Θ_i values providing an acceptable fit (not too far from the best one) relative to the whole volume of possible Θ_i values. This relative volume decreases with the increase of the total Θ_i volume: for example, with the dimension of the Θ_i space (2). But it also decreases if the function p(D|Θ_i,M_i)/p(D|θ_i,M_i) decreases too quickly: this is what happens if the model is too sensitive.
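The role of W_i in Eq. (10) can be illustrated with two invented one-parameter models that reach exactly the same maximum likelihood but differ in how sharply the likelihood falls off around the maximum; the log-likelihood shapes below are hypothetical, chosen only to make the point.

```python
import math

def occam_factor(loglik, theta_best, grid):
    # Eq. (10): W = integral of [L(theta)/L(theta_best)] p(theta) dtheta with
    # a flat prior on [0, 1] -- the relative volume of acceptable parameters.
    step = grid[1] - grid[0]
    lmax = loglik(theta_best)
    return sum(math.exp(loglik(t) - lmax) for t in grid) * step

# Two hypothetical models with the same best fit (log L = 0 at theta = 0.5)
# but different sensitivity to their single parameter.
def sharp(t):
    return -500.0 * (t - 0.5) ** 2   # likelihood collapses quickly

def flat(t):
    return -5.0 * (t - 0.5) ** 2     # likelihood decays slowly

grid = [(i + 0.5) / 2000 for i in range(2000)]
W_sharp = occam_factor(sharp, 0.5, grid)
W_flat = occam_factor(flat, 0.5, grid)

# With equal maximum likelihoods, the Bayes factor of Eq. (9) reduces to
# W_flat / W_sharp and favors the stabler (less sensitive) model.
bayes_factor = W_flat / W_sharp
```

Comparing best fits alone could not separate these two models; the W terms do.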
B. BMSL, a simple and intuitive approximation of BMS

The computation of BMS through Eq. (8), or of the Bayes factor through Eqs. (9-10), is complex. It involves the estimation of an integral, which generally requires the use of numerical integration techniques, typically Monte Carlo methods (e.g. Gilks et al., 1996). However, Jaynes (1995, ch. 24) proposes an approximation of the total likelihood in Eq. (6), based on an expansion of log(L) around the maximum likelihood point θ:

log(L(Θ)) ≈ log(L(θ)) + (1/2) (Θ - θ)ᵀ [∂² log(L) / ∂Θ²]_θ (Θ - θ)    (11)

where [∂² log(L) / ∂Θ²]_θ is the Hessian matrix of the function log(L) computed at the position of the parameter set θ providing the maximal likelihood L_max of the considered model (the first-order term vanishes at the maximum). Then, near this position, a good approximation of the likelihood is provided by:

L(Θ) ≈ L_max exp[ -(1/2) (Θ - θ)ᵀ Σ⁻¹ (Θ - θ) ]    (12)

that is, a multivariate Gaussian function with the inverse covariance matrix:

Σ⁻¹ = -[∂² log(L) / ∂Θ²]_θ    (13)

Coming back to Eq. (6), and assuming that there is no a priori assumption on the distribution of parameters Θ, that is, their distribution is uniform, we obtain:

p(D|M) = ∫ L(Θ|M) p(Θ|M) dΘ ≈ L_max ∫ exp[ -(1/2) (Θ - θ)ᵀ Σ⁻¹ (Θ - θ) ] p(Θ|M) dΘ    (14)
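The quadratic expansion of Eqs. (11)-(13) is easy to check numerically in one dimension. The sketch below uses a hypothetical binomial log-likelihood (not one of the paper's models), computes the scalar Hessian by the same central-difference scheme used later in the text, and compares the Laplace approximation with the exact log-likelihood near the maximum.

```python
import math

# Toy binomial log-likelihood: 50 trials, 30 successes; MLE at theta = 0.6.
n, s = 50, 30
theta_star = s / n

def loglik(t):
    return s * math.log(t) + (n - s) * math.log(1 - t)

# Second derivative of log L at theta_star by central differences
# (the scalar Hessian of Eq. 11); negative at a maximum.
eps = 1e-4
h = (loglik(theta_star + eps) + loglik(theta_star - eps)
     - 2 * loglik(theta_star)) / eps ** 2

def laplace(t):
    # Eq. (11): quadratic expansion around the maximum-likelihood point;
    # the first-order term vanishes there.
    return loglik(theta_star) + 0.5 * h * (t - theta_star) ** 2

# Close to theta_star the expansion is tight (Eq. 12); the exact curvature
# here is -s/theta^2 - (n-s)/(1-theta)^2 = -208.33.
near_gap = abs(loglik(0.62) - laplace(0.62))
```

The gap grows as one moves away from the maximum, which is exactly why the approximation is trusted only in the peaked-likelihood regime discussed below.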
Since p(Θ|M) is constant, equal to 1/V, the integral is now simply the volume of a Gaussian distribution:

p(D|M) ≈ L_max (2π)^(m/2) [det(Σ)]^(1/2) / V    (15)

where V is the total volume of the space occupied by parameters Θ and m is its dimension, that is, the number of free parameters in the considered model. This leads to the so-called Laplace approximation of the BMS criterion (Kass & Raftery, 1995):

BMSL = -log(L_max) - (m/2) log(2π) + log(V) - (1/2) log(det(Σ))    (16)

The preferred model considering the data D should minimize the BMSL criterion. There are in fact three kinds of terms in Eq. (16). Firstly, the term -log(L_max) is directly linked to the maximum likelihood of the model, more or less accurately estimated by RMSE in Eq. (5): the larger the maximum likelihood, the smaller the BMSL criterion. Then, the two following terms are linked to the dimensionality and volume of the considered model. Altogether, they result in the handicapping of models that are too large (that is, models with too high a number of free parameters) by increasing BMSL (3). Finally, the fourth term provides exactly what we were looking for: that is, a term favoring models with a large value of det(Σ). Indeed, if det(Σ) is large, this means that the determinant of the Hessian matrix of log(L) is small, which expresses that the likelihood L does not vary too quickly around its maximum value L_max. This is the precise mathematical way the BMSL criterion integrates fit (provided by the first term in Eq. (16)) and stability (provided by the fourth term), the second and third terms just being there to account for possible differences in the global size of the tested models. Notice that if two models with the same number of free parameters and occupying
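To see the four terms of Eq. (16) at work, the sketch below evaluates BMSL for a hypothetical one-parameter binomial model (m = 1, parameter on [0, 1], so V = 1), where the exact BMS is available in closed form for comparison; the data counts are invented for illustration.

```python
import math

n, s = 10, 7
theta_star = s / n                # maximum-likelihood point
m, V = 1, 1.0                     # one free parameter ranging over [0, 1]

def loglik(t):
    return s * math.log(t) + (n - s) * math.log(1 - t)

log_lmax = loglik(theta_star)

# Exact curvature of log L at theta_star; in one dimension Eq. (13) gives
# Sigma = -1 / Hessian.
hess = -s / theta_star ** 2 - (n - s) / (1 - theta_star) ** 2
sigma2 = -1.0 / hess

# Eq. (16): fit term, two size terms, and the stability term.
BMSL = (-log_lmax - (m / 2) * math.log(2 * math.pi)
        + math.log(V) - 0.5 * math.log(sigma2))

# Exact BMS for this model: the integral of theta^7 (1-theta)^3 over [0, 1]
# is the Beta function B(8, 4) = 1/1320.
BMS_exact = math.log(1320)
```

Even with only ten data points, the Laplace value (about 7.12) sits close to the exact BMS (about 7.19) in this toy case.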
the same size are compared on a given data set D, BMSL just depends on the first and fourth terms, which is the (fit + stability) compromise we were looking for.

Bayesian Model Selection has already been applied to the comparison of AV speech perception models, including FLMP (see Myung & Pitt, 1997; Massaro et al., 2001; Pitt et al., 2003). However, this involved heavy computations of the integrals in Eq. (10) through Monte Carlo techniques, which would be difficult to apply in all the model comparison works in the domain. BMSL has the double advantage of being easy to compute, and easy to interpret in terms of a (fit + stability) compromise. Furthermore, if the amount of available data is much higher than the number of parameters involved in the models to compare (that is, the dimension m of the Θ space), the probability distributions become highly peaked around their maxima, and the central limit theorem shows that the approximation in Eqs. (11-12) becomes quite reasonable (Walker, 1967). Kass & Raftery (1995) suggest that the approximation should work well for a sample size greater than 20 times the parameter dimension m (see Slate, 1999, for further discussion about assessing non-normality).
C. Implementing BMSL for audiovisual speech perception experiments

An audiovisual speech perception experiment typically involves various experimental conditions E_i (e.g. various A, V, AV stimuli, conflicting or not), with categorization data described by observed frequencies p_ij for each category C_j in each condition E_i (Σ_j p_ij = 1 for all values of i). A model M, depending on m free parameters Θ, predicts probabilities P_ij(Θ) for each category C_j in each condition E_i. The distribution of probabilities in each experimental condition follows a multinomial law, hence the logarithm of the likelihood of the Θ parameter set can be approximated by:

log(L(Θ)) = Σ_ij n_i p_ij log(P_ij(Θ) / p_ij)    (17)

where n_i is the total number of responses provided by the subjects in condition E_i. Therefore, the computation of BMSL can easily be done in four steps: (i) select the value θ of the Θ parameter set maximizing log(L(Θ)), which provides log(L(θ)) = log(L_max); (ii) compute the Hessian matrix of log(L) around θ, and its opposite inverse Σ; (iii) estimate the volume V of the Θ parameter set; (iv) compute BMSL according to Eq. (16).

Let us take as an example the Fuzzy-Logical Model of Perception (FLMP) (Massaro, 1987, 1998) simulation of a test case with two categories C1 and C2; one A, one V and one AV condition; and the following pattern of data: p_A(C1) = 0.99, p_V(C1) = 0.01, and p_AV(C1) = 0.95, obtained on 10 repetitions of each condition (n = 10). The basic FLMP equation is:
P_AV(C_i) = P_A(C_i) P_V(C_i) / Σ_j P_A(C_j) P_V(C_j)    (18)

C_i and C_j being the phonetic categories involved in the experiment, and P_A, P_V and P_AV the model probabilities of responses respectively in the A, V and AV conditions (observed probabilities are in lower case and simulated probabilities in upper case throughout this paper). The FLMP depends on two parameters Θ_A and Θ_V, each varying between 0 and 1, hence in Eq. (16) we take m = 2 and V = 1. Θ_A and Θ_V respectively predict the audio and video responses:

P_A(C1) = Θ_A
P_V(C1) = Θ_V

while the AV response is predicted by Eq. (18):

P_AV(C1) = Θ_A Θ_V / (Θ_A Θ_V + (1 - Θ_A)(1 - Θ_V))

The probabilities of category C2 are of course the complement to 1 of all values for C1:

P_A(C2) = 1 - P_A(C1)
P_V(C2) = 1 - P_V(C1)
P_AV(C2) = 1 - P_AV(C1)

In the continuation, all observed and predicted probabilities for C1 are respectively called p and P, and all observed and predicted probabilities for C2 are respectively called q and Q. This enables computing the model log-likelihood function from Eq. (17):

log(L(Θ)) = n (p_A log(P_A/p_A) + q_A log(Q_A/q_A) + p_V log(P_V/p_V) + q_V log(Q_V/q_V) + p_AV log(P_AV/p_AV) + q_AV log(Q_AV/q_AV))

The next step consists in minimizing -log(L(Θ)) over the range Θ_A, Θ_V ∈ [0, 1]. This can be done by any optimization algorithm available in various libraries. In the present case, the minimum should be obtained around:

θ_A =
θ_V =

which provide:

log(L(θ_A, θ_V)) = log(L_max) =

This is the end of step (i). Step (ii) consists in the computation of the Hessian matrix H of log(L) around θ. This can be done by classical numerical approximations of differential functions through Taylor expansions. The core program, which can be directly implemented by users of the BMSL algorithm, is provided hereunder:

ε = ;
z = zeros(1, m);
% diagonal terms: second-order central differences
for i = 1:m
    e = z;
    e(i) = ε;
    H(i, i) = (log(L(θ+e)) + log(L(θ-e)) - 2*log(L(θ))) / ε^2;
end
% off-diagonal terms, derived from the diagonal ones
for i = 1:m
    for j = (i+1):m
        e = z;
        e(i) = ε;
        e(j) = ε;
        b = (log(L(θ+e)) + log(L(θ-e)) - 2*log(L(θ))) / ε^2;
        H(i, j) = (b - H(i, i) - H(j, j)) / 2;
        H(j, i) = H(i, j);
    end
end
Σ = -inv(H);   % Σ is the opposite inverse of the Hessian (Eq. 13)

The computation of BMSL can then be done from Eq. (16), in which all terms are now computed. This provides a BMSL value of 7.94 in the present example.
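The four steps can be reproduced end to end in a short dependency-free script. The sketch below is not the author's original program: the optimizer is a naive zooming grid search (a stand-in for any standard optimization routine), and the Hessian follows the central-difference recipe given in the text, with Σ = -H⁻¹. On the test case p_A(C1) = 0.99, p_V(C1) = 0.01, p_AV(C1) = 0.95 with n = 10, it lands near the BMSL value of 7.94 quoted above.

```python
import math

n = 10
p_obs = {'A': 0.99, 'V': 0.01, 'AV': 0.95}

def predict(tA, tV):
    # FLMP predictions for category C1 in the A, V and AV conditions (Eq. 18).
    pav = tA * tV / (tA * tV + (1 - tA) * (1 - tV))
    return {'A': tA, 'V': tV, 'AV': pav}

def loglik(theta):
    # Multinomial log-likelihood relative to the data (Eq. 17), two categories.
    pred = predict(theta[0], theta[1])
    total = 0.0
    for cond, p in p_obs.items():
        for obs, mod in ((p, pred[cond]), (1 - p, 1 - pred[cond])):
            total += n * obs * math.log(mod / obs)
    return total

def maximize(f, rounds=30):
    # Step (i): naive zooming grid search on [0, 1]^2; the search box shrinks
    # only when the best point is interior to the current grid.
    lo, hi = 1e-6, 1 - 1e-6
    ca = cv = 0.5
    w = 0.5
    for _ in range(rounds):
        avals = [min(max(ca + w * (i - 10) / 10, lo), hi) for i in range(21)]
        vvals = [min(max(cv + w * (j - 10) / 10, lo), hi) for j in range(21)]
        bi, bj = max(((i, j) for i in range(21) for j in range(21)),
                     key=lambda ij: f((avals[ij[0]], vvals[ij[1]])))
        ca, cv = avals[bi], vvals[bj]
        if 0 < bi < 20 and 0 < bj < 20:
            w /= 5
    return [ca, cv]

theta = maximize(loglik)
log_lmax = loglik(theta)

# Step (ii): central-difference Hessian of log L at the optimum, as in the
# core program of the text.
eps, m = 1e-4, 2
H = [[0.0] * m for _ in range(m)]
for i in range(m):
    e = [eps if d == i else 0.0 for d in range(m)]
    plus = [t + x for t, x in zip(theta, e)]
    minus = [t - x for t, x in zip(theta, e)]
    H[i][i] = (loglik(plus) + loglik(minus) - 2 * log_lmax) / eps ** 2
for i in range(m):
    for j in range(i + 1, m):
        e = [eps if d in (i, j) else 0.0 for d in range(m)]
        plus = [t + x for t, x in zip(theta, e)]
        minus = [t - x for t, x in zip(theta, e)]
        b = (loglik(plus) + loglik(minus) - 2 * log_lmax) / eps ** 2
        H[i][j] = H[j][i] = (b - H[i][i] - H[j][j]) / 2

# Sigma = -inv(H); only det(Sigma) enters Eq. (16), and for m = 2 even,
# det(-H) = det(H).
detH = H[0][0] * H[1][1] - H[0][1] * H[1][0]
det_sigma = 1.0 / detH

# Steps (iii)-(iv): V = 1 for two parameters on [0, 1], then Eq. (16).
V = 1.0
BMSL = (-log_lmax - (m / 2) * math.log(2 * math.pi)
        + math.log(V) - 0.5 * math.log(det_sigma))
```

In this sketch the optimum lands around θ_A ≈ 0.999 and θ_V ≈ 0.019, and BMSL comes out near 7.94: the near-perfect fit of the A and V data cannot rescue the very unstable fit of the conflicting AV pattern.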
Footnotes

1. In the following, bold symbols denote vectors or matrices, and all maximizations are computed over the model parameter set Θ.

2. Massaro (1998) proposes to apply a correction factor k/(k-f) to RMSE, with k the number of data and f the number of degrees of freedom of the model (p. 301).

3. The interpretation of the term log(V) is straightforward: it results in handicapping large models by increasing BMSL. The term -(m/2) log(2π) comes more indirectly from the analysis, and could seem to favor large models. In fact, it can only decrease the trend to favor small models over large ones.
References

Gilks, W.R., Richardson, S., & Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. New York: Chapman & Hall.

Jaynes, E.T. (1995). Probability Theory: The Logic of Science. Cambridge University Press (in press).

Kass, R.E., & Raftery, A.E. (1995). Bayes factors, Journal of the American Statistical Association 90.

MacKay, D.J.C. (1992). Bayesian interpolation, Neural Computation 4.

Massaro, D.W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. London: Lawrence Erlbaum Associates.

Massaro, D.W. (1998). Perceiving Talking Faces. Cambridge: MIT Press.

Massaro, D.W., Cohen, M.M., Campbell, C.S., & Rodriguez, T. (2001). Bayes factor of model selection validates FLMP, Psychonomic Bulletin & Review 8.

Myung, I.J., & Pitt, M.A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach, Psychonomic Bulletin & Review 4.

Pitt, M.A., & Myung, I.J. (2002). When a good fit can be bad, Trends in Cognitive Sciences 6.

Pitt, M.A., Kim, W., & Myung, I.J. (2003). Flexibility versus generalizability in model selection, Psychonomic Bulletin & Review 10.

Slate, E.H. (1999). Assessing multivariate nonnormality using univariate distributions, Biometrika 86.

Walker, A.M. (1967). On the asymptotic behaviour of posterior distributions, Journal of the Royal Statistical Society B 31.
An introduction to ValueatRisk Learning Curve September 2003 ValueatRisk The introduction of ValueatRisk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk
More informationMonotonicity Hints. Abstract
Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. AbuMostafa EE and CS Deptartments California Institute of Technology
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationMultilevel Modeling of Complex Survey Data
Multilevel Modeling of Complex Survey Data Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 University of California, Los Angeles 2 Abstract We describe a multivariate, multilevel, pseudo maximum
More informationThe Delta Method and Applications
Chapter 5 The Delta Method and Applications 5.1 Linear approximations of functions In the simplest form of the central limit theorem, Theorem 4.18, we consider a sequence X 1, X,... of independent and
More informationProbabilistic Methods for TimeSeries Analysis
Probabilistic Methods for TimeSeries Analysis 2 Contents 1 Analysis of Changepoint Models 1 1.1 Introduction................................ 1 1.1.1 Model and Notation....................... 2 1.1.2 Example:
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationTheta Functions. Lukas Lewark. Seminar on Modular Forms, 31. Januar 2007
Theta Functions Lukas Lewark Seminar on Modular Forms, 31. Januar 007 Abstract Theta functions are introduced, associated to lattices or quadratic forms. Their transformation property is proven and the
More information15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
More informationMINITAB ASSISTANT WHITE PAPER
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. OneWay
More informationR f. V i. ET 438a Automatic Control Systems Technology Laboratory 4 Practical Differentiator Response
ET 438a Automatic Control Systems Technology Laboratory 4 Practical Differentiator Response Objective: Design a practical differentiator circuit using common OP AMP circuits. Test the frequency response
More informationExamination 110 Probability and Statistics Examination
Examination 0 Probability and Statistics Examination Sample Examination Questions The Probability and Statistics Examination consists of 5 multiplechoice test questions. The test is a threehour examination
More informationThe equivalence of logistic regression and maximum entropy models
The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.winvector.com/blog/20/09/thesimplerderivationoflogisticregression/
More informationUsing SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models
Using SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models Clement A Stone Abstract Interest in estimating item response theory (IRT) models using Bayesian methods has grown tremendously
More informationGeneralized BIC for Singular Models Factoring through Regular Models
Generalized BIC for Singular Models Factoring through Regular Models Shaowei Lin http://math.berkeley.edu/ shaowei/ Department of Mathematics, University of California, Berkeley PhD student (Advisor: Bernd
More informationSParameters and Related Quantities Sam Wetterlin 10/20/09
SParameters and Related Quantities Sam Wetterlin 10/20/09 Basic Concept of SParameters SParameters are a type of network parameter, based on the concept of scattering. The more familiar network parameters
More informationMarkov Chain Monte Carlo Simulation Made Simple
Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical
More informationBayes and Naïve Bayes. cs534machine Learning
Bayes and aïve Bayes cs534machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More informationMeasuring the tracking error of exchange traded funds: an unobserved components approach
Measuring the tracking error of exchange traded funds: an unobserved components approach Giuliano De Rossi Quantitative analyst +44 20 7568 3072 UBS Investment Research June 2012 Analyst Certification
More informationLecture 11: Graphical Models for Inference
Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference  the Bayesian network and the Join tree. These two both represent the same joint probability
More informationA BAYESIAN MODEL COMMITTEE APPROACH TO FORECASTING GLOBAL SOLAR RADIATION
A BAYESIAN MODEL COMMITTEE APPROACH TO FORECASTING GLOBAL SOLAR RADIATION Philippe Lauret Hadja Maïmouna Diagne Mathieu David PIMENT University of La Reunion 97715 Saint Denis Cedex 9 hadja.diagne@univreunion.fr
More informationP (A) = lim P (A) = N(A)/N,
1.1 Probability, Relative Frequency and Classical Definition. Probability is the study of random or nondeterministic experiments. Suppose an experiment can be repeated any number of times, so that we
More informationLanguage Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
More informationCS229 Lecture notes. Andrew Ng
CS229 Lecture notes Andrew Ng Part X Factor analysis Whenwehavedatax (i) R n thatcomesfromamixtureofseveral Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting, we usually
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationPackage fastghquad. R topics documented: February 19, 2015
Package fastghquad February 19, 2015 Type Package Title Fast Rcpp implementation of GaussHermite quadrature Version 0.2 Date 20140813 Author Alexander W Blocker Maintainer Fast, numericallystable GaussHermite
More informationMultivariable Calculus and Optimization
Multivariable Calculus and Optimization Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Multivariable Calculus and Optimization 1 / 51 EC2040 Topic 3  Multivariable Calculus
More information