# The Exponential Family


David M. Blei, Columbia University. November 3, 2015.

## Definition

A probability density in the exponential family has the form

$$
p(x \mid \eta) = h(x) \exp\{\eta^\top t(x) - a(\eta)\},
$$

where

- $\eta$ is the natural parameter,
- $t(x)$ are the sufficient statistics,
- $h(x)$ is the underlying measure, which ensures $x$ is in the right space,
- $a(\eta)$ is the log normalizer.

Examples of exponential family distributions include the Gaussian, gamma, Poisson, Bernoulli, multinomial, and Markov models. Examples of distributions that are not in this family include the Student's $t$, mixtures, and hidden Markov models. (We are considering these families as distributions of data; the latent variables are implicitly marginalized out.)

The statistic $t(x)$ is called sufficient because the probability of $x$ under $\eta$ depends on $x$ only through $t(x)$.

The exponential family has fundamental connections to the world of graphical models. For our purposes, we'll use exponential families as components in directed graphical models, e.g., in mixtures of Gaussians.

The log normalizer ensures that the density integrates to 1,

$$
a(\eta) = \log \int h(x) \exp\{\eta^\top t(x)\}\, dx.
$$

That is, $a(\eta)$ is the logarithm of the normalizing constant.

**Example: Bernoulli.** As an example, let's put the Bernoulli (in its usual form) into its exponential family form. The Bernoulli you are used to seeing is

$$
p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}, \qquad x \in \{0, 1\}.
$$

In exponential family form,

$$
\begin{aligned}
p(x \mid \pi) &= \exp\left\{ \log\left( \pi^x (1 - \pi)^{1 - x} \right) \right\} \\
&= \exp\{ x \log \pi + (1 - x) \log(1 - \pi) \} \\
&= \exp\{ x \log \pi - x \log(1 - \pi) + \log(1 - \pi) \} \\
&= \exp\{ x \log(\pi / (1 - \pi)) + \log(1 - \pi) \}.
\end{aligned}
$$

This reveals the exponential family, where

$$
\begin{aligned}
\eta &= \log(\pi / (1 - \pi)) \\
t(x) &= x \\
a(\eta) &= -\log(1 - \pi) = \log(1 + e^\eta) \\
h(x) &= 1.
\end{aligned}
$$

Note that the relationship between $\eta$ and $\pi$ is invertible,

$$
\pi = 1 / (1 + e^{-\eta}).
$$

This is the logistic function.

**Example: Gaussian.** The familiar form of the univariate Gaussian is

$$
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}.
$$

We put it in exponential family form by expanding the square,

$$
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log \sigma \right\}.
$$

We see that

$$
\begin{aligned}
\eta &= [\mu / \sigma^2,\; -1/(2\sigma^2)] \\
t(x) &= [x,\; x^2] \\
a(\eta) &= \frac{\mu^2}{2\sigma^2} + \log \sigma = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2) \\
h(x) &= 1 / \sqrt{2\pi}.
\end{aligned}
$$
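As a quick sanity check, the following short Python sketch (the function names are our own, for illustration) verifies numerically that the exponential family form of the Bernoulli agrees with the usual form, and that the logistic function inverts the natural parameter:

```python
import math

def bernoulli_usual(x, pi):
    """Bernoulli density in its usual (mean) parameterization."""
    return pi ** x * (1 - pi) ** (1 - x)

def bernoulli_expfam(x, eta):
    """Bernoulli as exp{eta * t(x) - a(eta)} with t(x) = x, h(x) = 1."""
    a = math.log(1 + math.exp(eta))   # log normalizer
    return math.exp(eta * x - a)

pi = 0.3
eta = math.log(pi / (1 - pi))         # natural parameter: the log odds
for x in (0, 1):
    assert abs(bernoulli_usual(x, pi) - bernoulli_expfam(x, eta)) < 1e-12

# The logistic function inverts the map from pi back to eta.
assert abs(1 / (1 + math.exp(-eta)) - pi) < 1e-12
```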

## Moments of an exponential family

Let's go back to the general family. We are going to take derivatives of the log normalizer; this gives us moments of the sufficient statistics,

$$
\begin{aligned}
\nabla_\eta a(\eta) &= \nabla_\eta \left( \log \int \exp\{\eta^\top t(x)\} h(x)\, dx \right) \\
&= \frac{\nabla_\eta \int \exp\{\eta^\top t(x)\} h(x)\, dx}{\int \exp\{\eta^\top t(x)\} h(x)\, dx} \\
&= \int t(x)\, \frac{\exp\{\eta^\top t(x)\} h(x)}{\int \exp\{\eta^\top t(x)\} h(x)\, dx}\, dx \\
&= \mathbb{E}_\eta[t(X)].
\end{aligned}
$$

Higher-order derivatives give higher-order moments (check this). For example,

$$
\mathrm{Cov}(t_i(X), t_j(X)) = \frac{\partial^2 a(\eta)}{\partial \eta_i\, \partial \eta_j}.
$$

There is a 1-1 relationship between the mean of the sufficient statistics $\mathbb{E}_\eta[t(X)]$ and the natural parameter $\eta$. In other words, the mean is an alternative parameterization of the distribution. (Many of the forms of exponential families that you know are the mean parameterization.)

Consider again the Bernoulli. We saw this with the logistic function, where note that $\pi = \mathbb{E}[X]$ (because $X$ is an indicator).

Now consider the Gaussian. The derivative with respect to $\eta_1$ is

$$
\frac{\partial a(\eta)}{\partial \eta_1} = -\frac{\eta_1}{2\eta_2} = \mu = \mathbb{E}[X].
$$

The derivative with respect to $\eta_2$ is

$$
\frac{\partial a(\eta)}{\partial \eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} = \mu^2 + \sigma^2 = \mathbb{E}[X^2].
$$
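We can check the moment identity numerically for the Bernoulli: a finite-difference derivative of $a(\eta) = \log(1 + e^\eta)$ should recover $\mathbb{E}[t(X)] = \pi$, and the second derivative should recover $\mathrm{Var}(X) = \pi(1 - \pi)$. A small Python sketch (the step size and tolerances are illustrative):

```python
import math

def a(eta):
    """Bernoulli log normalizer."""
    return math.log(1 + math.exp(eta))

eta, eps = 0.7, 1e-4
grad = (a(eta + eps) - a(eta - eps)) / (2 * eps)            # first derivative
hess = (a(eta + eps) - 2 * a(eta) + a(eta - eps)) / eps**2  # second derivative

pi = 1 / (1 + math.exp(-eta))            # mean parameter E[X]
assert abs(grad - pi) < 1e-6             # grad a(eta) = E[t(X)]
assert abs(hess - pi * (1 - pi)) < 1e-4  # second derivative = Var(t(X))
```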

This means that the variance is

$$
\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = (\mu^2 + \sigma^2) - \mu^2 = \sigma^2.
$$

To see the 1-1 relationship between $\mathbb{E}_\eta[t(X)]$ and $\eta$, note that

$$
\mathrm{Var}(t(X)) = \nabla^2 a(\eta) \text{ is positive} \;\Rightarrow\; a(\eta) \text{ is convex} \;\Rightarrow\; \text{1-1 relationship between its argument and first derivative.}
$$

Here is some notation for later (when we discuss generalized linear models). Denote the mean parameter as $\mu = \mathbb{E}_\eta[t(X)]$; denote the inverse map as $\eta(\mu)$, which gives the $\eta$ such that $\mathbb{E}_\eta[t(X)] = \mu$.

## Maximum likelihood estimation of an exponential family

The data are $x_{1:N}$. We seek the value of $\eta$ that maximizes the likelihood. The log likelihood is

$$
\begin{aligned}
\mathcal{L} &= \sum_{n=1}^N \log p(x_n \mid \eta) \\
&= \sum_{n=1}^N \left( \log h(x_n) + \eta^\top t(x_n) - a(\eta) \right) \\
&= \sum_{n=1}^N \log h(x_n) + \eta^\top \sum_{n=1}^N t(x_n) - N a(\eta).
\end{aligned}
$$

As a function of $\eta$, the log likelihood depends on the data only through $\sum_{n=1}^N t(x_n)$. This sum has fixed dimension; there is no need to store the data. (And note that it is sufficient for estimating $\eta$.)

Take the gradient of the log likelihood and set it to zero,

$$
\nabla_\eta \mathcal{L} = \sum_{n=1}^N t(x_n) - N \nabla_\eta a(\eta) = 0.
$$

Since $\nabla_\eta a(\eta)$ is the mean parameter, it's now easy to solve for the mean parameter's MLE:

$$
\hat{\mu}_{\mathrm{ML}} = \frac{\sum_{n=1}^N t(x_n)}{N}.
$$
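For the Gaussian, the mean-parameter MLE is the empirical mean of $t(x) = [x, x^2]$. The following Python sketch (with simulated data and illustrative tolerances) checks that it recovers the true mean and variance:

```python
import random
random.seed(0)

# Simulated data from a Gaussian with known parameters (illustrative).
mu, sigma = 2.0, 1.5
data = [random.gauss(mu, sigma) for _ in range(100_000)]

# MLE of the mean parameter: the empirical mean of t(x) = [x, x^2].
N = len(data)
m1 = sum(data) / N                  # estimate of E[X]
m2 = sum(x * x for x in data) / N   # estimate of E[X^2]

mu_hat = m1
var_hat = m2 - m1 ** 2              # Var(X) = E[X^2] - E[X]^2
assert abs(mu_hat - mu) < 0.05
assert abs(var_hat - sigma ** 2) < 0.1
```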

This is, as you might guess, the empirical mean of the sufficient statistics. The inverse map gives us the natural parameter,

$$
\hat{\eta}_{\mathrm{ML}} = \eta(\hat{\mu}_{\mathrm{ML}}).
$$

Consider the Bernoulli. The MLE of the mean parameter $\hat{\mu}_{\mathrm{ML}}$ is the sample mean; the MLE of the natural parameter is the corresponding log odds.

Consider the Gaussian. The MLEs of the mean parameters are the sample mean and the sample second moment, which together give the sample variance.

## Conjugacy

Consider the following setup:

$$
\begin{aligned}
\eta &\sim F(\cdot \mid \lambda) \\
x_i &\sim G(\cdot \mid \eta) \quad \text{for } i \in \{1, \ldots, n\}.
\end{aligned}
$$

This is a classical Bayesian data analysis setting. It is also used as a component in more complicated models, e.g., in hierarchical models.

The posterior distribution of $\eta$ given the data $x_{1:n}$ is

$$
p(\eta \mid x_{1:n}, \lambda) \propto F(\eta \mid \lambda) \prod_{i=1}^n G(x_i \mid \eta).
$$

Suppose this distribution is in the same family as $F$, i.e., its parameters are in the space indexed by $\lambda$. Then $F$ and $G$ are a conjugate pair. For example:

- Gaussian likelihood with fixed variance; Gaussian prior on the mean.
- Multinomial likelihood; Dirichlet prior on the probabilities.
- Bernoulli likelihood; beta prior on the bias.
- Poisson likelihood; gamma prior on the rate.

In all these settings, the conditional distribution of the parameter given the data is in the same family as the prior.

Suppose the data come from an exponential family. Every exponential family has a conjugate prior,

$$
\begin{aligned}
p(x_i \mid \eta) &= h_\ell(x_i) \exp\{\eta^\top t(x_i) - a_\ell(\eta)\} \\
p(\eta \mid \lambda) &= h_c(\eta) \exp\{\lambda_1^\top \eta + \lambda_2 (-a_\ell(\eta)) - a_c(\lambda)\}.
\end{aligned}
$$
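For the beta–Bernoulli pair, conjugacy can be checked pointwise: the product of the prior and the likelihood should be proportional to the claimed beta posterior, so their log-ratio should be constant in $\pi$. A Python sketch (the helper function is our own):

```python
from math import lgamma, log

def log_beta_pdf(p, a, b):
    """Log density of Beta(a, b) at p."""
    return ((a - 1) * log(p) + (b - 1) * log(1 - p)
            + lgamma(a + b) - lgamma(a) - lgamma(b))

# Bernoulli likelihood with a Beta(2, 2) prior on the bias: the posterior
# after data x_{1:n} should be Beta(2 + sum(x), 2 + n - sum(x)).
a0, b0 = 2.0, 2.0
data = [1, 0, 1, 1]
n, s = len(data), sum(data)

# The log-ratio of (prior * likelihood) to the claimed posterior density
# should not depend on p -- i.e., the two are proportional.
ratios = []
for p in (0.2, 0.5, 0.8):
    log_unnorm = log_beta_pdf(p, a0, b0) + s * log(p) + (n - s) * log(1 - p)
    ratios.append(log_unnorm - log_beta_pdf(p, a0 + s, b0 + n - s))
assert max(ratios) - min(ratios) < 1e-9
```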

The natural parameter $\lambda = [\lambda_1, \lambda_2]$ has dimension $\dim(\eta) + 1$. The sufficient statistics are $[\eta, -a_\ell(\eta)]$. The other terms $h_c(\eta)$ and $a_c(\lambda)$ depend on the form of the exponential family. For example, when $\eta$ are multinomial parameters, the other terms help define a Dirichlet.

Let's compute the posterior in the general case,

$$
\begin{aligned}
p(\eta \mid x_{1:n}, \lambda) &\propto p(\eta \mid \lambda) \prod_{i=1}^n p(x_i \mid \eta) \\
&= h_c(\eta) \exp\{\lambda_1^\top \eta + \lambda_2(-a_\ell(\eta)) - a_c(\lambda)\} \cdot \prod_{i=1}^n h_\ell(x_i) \cdot \exp\Big\{\eta^\top \sum_{i=1}^n t(x_i) - n\, a_\ell(\eta)\Big\} \\
&\propto h_c(\eta) \exp\Big\{ \Big(\lambda_1 + \sum_{i=1}^n t(x_i)\Big)^\top \eta + (\lambda_2 + n)(-a_\ell(\eta)) \Big\}.
\end{aligned}
$$

This is in the same exponential family as the prior, with parameters

$$
\begin{aligned}
\hat{\lambda}_1 &= \lambda_1 + \sum_{i=1}^n t(x_i) \\
\hat{\lambda}_2 &= \lambda_2 + n.
\end{aligned}
$$

## Posterior predictive distribution

Let's compute the posterior predictive distribution. This comes up frequently in applications of models (i.e., when doing prediction) and in inference (i.e., when doing collapsed Gibbs sampling).

Consider a new data point $x_{\mathrm{new}}$. Its posterior predictive density is

$$
\begin{aligned}
p(x_{\mathrm{new}} \mid x_{1:n}) &= \int p(x_{\mathrm{new}} \mid \eta)\, p(\eta \mid x_{1:n})\, d\eta \\
&\propto \int \exp\{\eta^\top t(x_{\mathrm{new}}) - a_\ell(\eta)\} \exp\{\hat{\lambda}_1^\top \eta + \hat{\lambda}_2(-a_\ell(\eta)) - a_c(\hat{\lambda})\}\, d\eta \\
&= \frac{\int \exp\{(\hat{\lambda}_1 + t(x_{\mathrm{new}}))^\top \eta + (\hat{\lambda}_2 + 1)(-a_\ell(\eta))\}\, d\eta}{\exp\{a_c(\hat{\lambda}_1, \hat{\lambda}_2)\}} \\
&= \exp\{a_c(\hat{\lambda}_1 + t(x_{\mathrm{new}}), \hat{\lambda}_2 + 1) - a_c(\hat{\lambda}_1, \hat{\lambda}_2)\}.
\end{aligned}
$$

In other words, the posterior predictive density is a ratio of normalizing constants. In the numerator is the normalizer of the posterior with the new data point added; in the denominator is the normalizer of the posterior without the new data point. This is how we can derive collapsed Gibbs samplers.
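For example, with a beta posterior over a Bernoulli bias, $a_c$ is the log beta function, and the ratio of normalizers for $x_{\mathrm{new}} = 1$ reduces to the familiar $\alpha / (\alpha + \beta)$. A Python sketch of the ratio-of-normalizers computation:

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    """Log normalizer of the beta distribution: log B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Posterior Beta(alpha, beta) over a Bernoulli bias. The predictive
# probability of x_new = 1 is a ratio of normalizing constants:
# B(alpha + 1, beta) / B(alpha, beta) = alpha / (alpha + beta).
alpha, beta = 2.0, 3.0
pred_one = exp(log_beta_fn(alpha + 1, beta) - log_beta_fn(alpha, beta))
assert abs(pred_one - alpha / (alpha + beta)) < 1e-12
```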

## Canonical prior

There is a variation on the form of the simple conjugate prior that is useful for understanding its properties. We set $\lambda_1 = n_0 x_0$ and $\lambda_2 = n_0$. (And, for convenience, we drop the sufficient statistic so that $t(x) = x$.) In this form the conjugate prior is

$$
p(\eta \mid x_0, n_0) \propto \exp\{n_0 x_0^\top \eta - n_0 a(\eta)\}.
$$

Here we can interpret $x_0$ as our prior idea of the expectation of $x$, and $n_0$ as the number of prior data points.

Consider the predictive expectation of $X$, $\mathbb{E}[\mathbb{E}[X \mid \eta]]$. The first expectation is with respect to the prior; the second is with respect to the data distribution. As an exercise, show that

$$
\mathbb{E}[\mathbb{E}[X \mid \eta]] = x_0.
$$

Assume data $x_{1:n}$. The sample mean is

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.
$$

Given the data and a prior, the posterior parameters are

$$
\begin{aligned}
\hat{x}_0 &= \frac{n_0 x_0 + n \bar{x}}{n_0 + n} \\
\hat{n}_0 &= n_0 + n.
\end{aligned}
$$

This is easy to see from our previous derivation of $\hat{\lambda}_1, \hat{\lambda}_2$. Further, we see that the posterior expectation of $X$ is a convex combination of $x_0$ (our prior idea) and $\bar{x}$ (the data mean).

## Example: Data from a unit variance Gaussian

Suppose the data $x_i$ come from a unit variance Gaussian,

$$
p(x \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\{-(x - \mu)^2 / 2\}.
$$

This is a simpler exponential family than the previous Gaussian,

$$
p(x \mid \mu) = \frac{\exp\{-x^2/2\}}{\sqrt{2\pi}} \exp\{\mu x - \mu^2 / 2\}.
$$
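For intuition about the exercise, take the unit variance Gaussian: its canonical prior on the mean is $\mathcal{N}(x_0, 1/n_0)$, so $\mathbb{E}[\mathbb{E}[X \mid \mu]] = \mathbb{E}[\mu] = x_0$. A Monte Carlo sketch of this check (the sample size and tolerance are our own choices):

```python
import random
random.seed(2)

# The canonical prior on the unit-variance Gaussian mean is N(x0, 1/n0),
# so the predictive expectation E[E[X | mu]] = E[mu] should equal x0.
x0, n0 = 1.5, 4.0
draws = [random.gauss(x0, (1 / n0) ** 0.5) for _ in range(200_000)]
mc = sum(draws) / len(draws)   # Monte Carlo estimate of the predictive expectation
assert abs(mc - x0) < 0.01
```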

In this case

$$
\begin{aligned}
\eta &= \mu \\
t(x) &= x \\
h(x) &= \frac{\exp\{-x^2/2\}}{\sqrt{2\pi}} \\
a(\eta) &= \mu^2 / 2 = \eta^2 / 2.
\end{aligned}
$$

What is the conjugate prior? It is

$$
p(\mu \mid \lambda) = h_c(\mu) \exp\{\lambda_1 \mu + \lambda_2 (-\mu^2/2) - a_c(\lambda)\}.
$$

This has sufficient statistics $[\mu, -\mu^2/2]$, which means it is a Gaussian distribution. Put it in canonical form,

$$
p(\mu \mid n_0, x_0) = h_c(\mu) \exp\{n_0 x_0 \mu - n_0 (\mu^2/2) - a_c(n_0, x_0)\}.
$$

The posterior parameters are

$$
\begin{aligned}
\hat{x}_0 &= \frac{n_0 x_0 + n \bar{x}}{n_0 + n} \\
\hat{n}_0 &= n_0 + n.
\end{aligned}
$$

This implies that the posterior mean and variance are

$$
\begin{aligned}
\hat{\mu} &= \hat{x}_0 \\
\hat{\sigma}^2 &= 1 / (n_0 + n).
\end{aligned}
$$

These are results that we asserted earlier. Intuitively, when we haven't seen any data, our estimate of the mean is the prior mean. As we see more data, our estimate of the mean moves toward the sample mean. Before seeing data, our uncertainty about the estimate is the prior variance; as we see more data, the posterior variance decreases.

## Example: Data from a Bernoulli

The Bernoulli likelihood is

$$
p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}.
$$
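We can confirm the posterior mean and variance numerically: the unnormalized log posterior is a quadratic in $\mu$ whose maximizer is $\hat{x}_0$ and whose curvature is $-(n_0 + n)$. A Python sketch with toy data (the dataset and prior values are illustrative):

```python
def log_post_unnorm(mu, data, x0, n0):
    """Unnormalized log posterior for the unit-variance Gaussian mean."""
    lp = n0 * x0 * mu - n0 * mu**2 / 2            # canonical prior term
    lp += sum(x * mu - mu**2 / 2 for x in data)   # likelihood terms
    return lp

data = [1.0, 2.0, 0.5, 1.5]
x0, n0 = 0.0, 2.0
n, xbar = len(data), sum(data) / len(data)

post_mean = (n0 * x0 + n * xbar) / (n0 + n)   # claimed posterior mean
post_var = 1.0 / (n0 + n)                     # claimed posterior variance

# The log posterior is quadratic: its gradient vanishes at post_mean and
# its curvature equals -1/post_var.
eps = 1e-5
g = (log_post_unnorm(post_mean + eps, data, x0, n0)
     - log_post_unnorm(post_mean - eps, data, x0, n0)) / (2 * eps)
h = (log_post_unnorm(post_mean + eps, data, x0, n0)
     - 2 * log_post_unnorm(post_mean, data, x0, n0)
     + log_post_unnorm(post_mean - eps, data, x0, n0)) / eps**2
assert abs(g) < 1e-6
assert abs(h + 1 / post_var) < 1e-3
```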

Above we have seen its form as a minimal exponential family. Let's rewrite it, keeping the mean parameter $\pi$ in the picture,

$$
p(x \mid \pi) = \exp\{x \log(\pi/(1-\pi)) + \log(1 - \pi)\}.
$$

The canonical conjugate prior therefore looks like this,

$$
p(\pi \mid x_0, n_0) = \exp\{n_0 x_0 \log(\pi/(1-\pi)) + n_0 \log(1 - \pi) - a(n_0, x_0)\}.
$$

This simplifies to

$$
p(\pi \mid x_0, n_0) = \exp\{n_0 x_0 \log \pi + n_0 (1 - x_0) \log(1 - \pi) - a(n_0, x_0)\}.
$$

Putting this in non-exponential-family form,

$$
p(\pi \mid x_0, n_0) \propto \pi^{n_0 x_0} (1 - \pi)^{n_0 (1 - x_0)},
$$

which nearly looks like the familiar beta distribution. To get the beta, we set $\alpha = n_0 x_0 + 1$ and $\beta = n_0 (1 - x_0) + 1$. Bringing in the resulting normalizer, we have

$$
p(\pi \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \pi^{\alpha - 1} (1 - \pi)^{\beta - 1}.
$$

The posterior distribution follows the usual recipe, from which we can derive the corresponding updates

$$
\begin{aligned}
\hat{\alpha} &= \alpha + \sum_{i=1}^n x_i \\
\hat{\beta} &= \beta + \sum_{i=1}^n (1 - x_i).
\end{aligned}
$$

To see this, update $\hat{x}_0$ and $\hat{n}_0$ as above and compute $\hat{\alpha}$ and $\hat{\beta}$ from the definitions.

## The big picture

Exponential families and conjugate priors can be used in many graphical models. Gibbs sampling is straightforward when each complete conditional involves a conjugate prior and likelihood pair. For example, we can now define an exponential family mixture model: the mixture components are drawn from a conjugate prior, and the data are drawn from the corresponding exponential family.
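A quick Python check with toy data (the helper function is our own) that the canonical updates $(\hat{x}_0, \hat{n}_0)$ and the beta updates $(\hat{\alpha}, \hat{\beta})$ describe the same posterior:

```python
def beta_from_canonical(x0, n0):
    """Map the canonical parameterization (x0, n0) to beta parameters."""
    return n0 * x0 + 1, n0 * (1 - x0) + 1

x0, n0 = 0.5, 4.0
data = [1, 0, 1, 1, 0, 1]
n, xbar = len(data), sum(data) / len(data)

# Canonical updates.
x0_hat = (n0 * x0 + n * xbar) / (n0 + n)
n0_hat = n0 + n

# Beta updates.
alpha, beta = beta_from_canonical(x0, n0)
alpha_hat = alpha + sum(data)
beta_hat = beta + sum(1 - x for x in data)

# Both routes give the same posterior parameters.
a2, b2 = beta_from_canonical(x0_hat, n0_hat)
assert abs(a2 - alpha_hat) < 1e-9 and abs(b2 - beta_hat) < 1e-9
```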

Imagine a generic model $p(x_{1:n}, z_{1:n} \mid \eta)$. Suppose each complete conditional is in the exponential family,

$$
p(z_i \mid z_{-i}, x, \eta) = h(z_i) \exp\{\eta_i(z_{-i}, x)^\top z_i - a(\eta_i(z_{-i}, x))\}.
$$

Notice that the natural parameter is a function of the variables we condition on. This is called a conditionally conjugate model. Provided we can compute the natural parameter, Gibbs sampling is immediate.

Many models from the machine learning research literature are conditionally conjugate. Examples include HMMs, mixture models, hierarchical linear regression, probit regression, factorial models, and others.

Consider the HMM, for example, but with its parameters unknown.

[ graphical model ]

Notice that this is like a mixture model, but where the mixture components are embedded in a Markov chain. Each row of the transition matrix is a point on the $(K-1)$-simplex; each set of emission probabilities is a point on the $(V-1)$-simplex. We can place Dirichlet priors on these components. (We will talk about the Dirichlet later. It is a multivariate generalization of the beta distribution, and is the conjugate prior to the categorical/multinomial distribution.)

We can easily obtain the complete conditional of $z_i$. The complete conditionals of the transitions and emissions follow from our discussion of conjugacy.

Collapsed Gibbs sampling is possible, but it is quite expensive: one needs to run forward-backward at each sampling of a $z_i$.
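To illustrate conditional conjugacy, here is a minimal Gibbs sweep for a two-component Bernoulli mixture with beta priors — a hypothetical toy sketch with uniform mixing weights, not a model derived in these notes. Each complete conditional is available in closed form: the assignment $z_i$ given everything else is a coin flip whose weights depend on the component biases, and each bias $\theta_k$ given the assignments is a beta draw (by conjugacy).

```python
import random
random.seed(1)

# Toy data and hyperparameters; everything here is illustrative.
data = [1, 1, 1, 0, 1, 0, 0, 0, 0, 1]
K, alpha0, beta0 = 2, 1.0, 1.0
z = [random.randrange(K) for _ in data]   # component assignments
theta = [0.8, 0.2]                        # component biases

for sweep in range(100):
    # Complete conditional of z_i: proportional to the likelihood of x_i
    # under each component (uniform mixing weights, for simplicity).
    for i, x in enumerate(data):
        w = [theta[k] ** x * (1 - theta[k]) ** (1 - x) for k in range(K)]
        z[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
    # Complete conditional of theta_k: a beta distribution, by conjugacy.
    for k in range(K):
        ones = sum(x for i, x in enumerate(data) if z[i] == k)
        n_k = sum(1 for zi in z if zi == k)
        theta[k] = random.betavariate(alpha0 + ones, beta0 + n_k - ones)

assert set(z) <= {0, 1} and all(0.0 <= t <= 1.0 for t in theta)
```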


### Advanced Spatial Statistics Fall 2012 NCSU. Fuentes Lecture notes

Advanced Spatial Statistics Fall 2012 NCSU Fuentes Lecture notes Areal unit data 2 Areal Modelling Areal unit data Key Issues Is there spatial pattern? Spatial pattern implies that observations from units

### 2.3. Finding polynomial functions. An Introduction:

2.3. Finding polynomial functions. An Introduction: As is usually the case when learning a new concept in mathematics, the new concept is the reverse of the previous one. Remember how you first learned

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Parametric Models. dh(t) dt > 0 (1)

Parametric Models: The Intuition Parametric Models As we saw early, a central component of duration analysis is the hazard rate. The hazard rate is the probability of experiencing an event at time t i

### Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

### 1. The maximum likelihood principle 2. Properties of maximum-likelihood estimates

The maximum-likelihood method Volker Blobel University of Hamburg March 2005 1. The maximum likelihood principle 2. Properties of maximum-likelihood estimates Keys during display: enter = next page; =

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM tutorial 4. by Dr Philip Jackson

Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM tutorial 4 by Dr Philip Jackson Discrete & continuous HMMs - Gaussian & other distributions Gaussian mixtures -

### i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

Statistics 580 Maximum Likelihood Estimation Introduction Let y (y 1, y 2,..., y n be a vector of iid, random variables from one of a family of distributions on R n and indexed by a p-dimensional parameter

### Goodness of fit assessment of item response theory models

Goodness of fit assessment of item response theory models Alberto Maydeu Olivares University of Barcelona Madrid November 1, 014 Outline Introduction Overall goodness of fit testing Two examples Assessing

### THE MULTINOMIAL DISTRIBUTION. Throwing Dice and the Multinomial Distribution

THE MULTINOMIAL DISTRIBUTION Discrete distribution -- The Outcomes Are Discrete. A generalization of the binomial distribution from only 2 outcomes to k outcomes. Typical Multinomial Outcomes: red A area1

### Bayesian Methods. 1 The Joint Posterior Distribution

Bayesian Methods Every variable in a linear model is a random variable derived from a distribution function. A fixed factor becomes a random variable with possibly a uniform distribution going from a lower

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Stat260: Bayesian Modeling and Inference Lecture Date: February 1, Lecture 3

Stat26: Bayesian Modeling and Inference Lecture Date: February 1, 21 Lecture 3 Lecturer: Michael I. Jordan Scribe: Joshua G. Schraiber 1 Decision theory Recall that decision theory provides a quantification