# Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation


Selim Aksoy, Department of Computer Engineering, Bilkent University
CS 551, Fall 2015. © 2015 Selim Aksoy (Bilkent University)

## Introduction

Bayesian decision theory shows us how to design an optimal classifier if we know the prior probabilities $P(\omega_i)$ and the class-conditional densities $p(x \mid \omega_i)$. Unfortunately, we rarely have complete knowledge of the probabilistic structure. However, we can often find design samples or training data that include particular representatives of the patterns we want to classify.

To simplify the problem, we can assume some parametric form for the class-conditional densities and estimate the parameters using training data. We can then use the resulting estimates as if they were the true values and perform classification using the Bayesian decision rule. We will consider only the supervised learning case, where the true class label of each sample is known.

We will study two estimation procedures:

- **Maximum likelihood estimation** views the parameters as quantities whose values are fixed but unknown, and estimates these values by maximizing the probability of obtaining the samples observed.
- **Bayesian estimation** views the parameters as random variables having some known prior distribution; observing new samples converts the prior into a posterior density.

## Maximum Likelihood Estimation

Suppose we have a set $D = \{x_1, \ldots, x_n\}$ of independent and identically distributed (i.i.d.) samples drawn from the density $p(x \mid \theta)$. We would like to use the training samples in $D$ to estimate the unknown parameter vector $\theta$. Define the likelihood function of $\theta$ with respect to $D$ as

$$L(\theta \mid D) = p(D \mid \theta) = p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

The maximum likelihood estimate (MLE) of $\theta$ is, by definition, the value $\hat{\theta}$ that maximizes $L(\theta \mid D)$:

$$\hat{\theta} = \arg\max_{\theta}\, L(\theta \mid D).$$

It is often easier to work with the logarithm of the likelihood function (the log-likelihood), which gives

$$\hat{\theta} = \arg\max_{\theta}\, \log L(\theta \mid D) = \arg\max_{\theta}\, \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
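As a concrete illustration of maximizing the log-likelihood, here is a deliberately crude sketch assuming NumPy is available; the distribution, sample size, and search grid are assumptions for illustration, not from the lecture:

```python
import numpy as np

# Illustrative data: 500 draws from N(10, 2^2) (assumed values).
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)

def log_likelihood(mu, sigma, x):
    # sum_i log N(x_i | mu, sigma^2)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2.0 * sigma**2))

# Crude grid search over mu (sigma fixed at its sample value); the argmax
# should coincide with the closed-form MLE, the sample mean.
sigma = data.std()
grid = np.linspace(5.0, 15.0, 2001)
scores = [log_likelihood(m, sigma, data) for m in grid]
mu_hat = grid[int(np.argmax(scores))]
```

For a Gaussian the grid search is unnecessary (the sample mean is the exact maximizer), but the same pattern applies to models where no closed form exists.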

If the number of parameters is $p$, i.e., $\theta = (\theta_1, \ldots, \theta_p)^T$, define the gradient operator

$$\nabla_\theta = \left( \frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_p} \right)^T.$$

Then the MLE of $\theta$ must satisfy the necessary conditions

$$\nabla_\theta \log L(\theta \mid D) = \sum_{i=1}^{n} \nabla_\theta \log p(x_i \mid \theta) = 0.$$

Properties of MLEs:

- The MLE is the parameter point for which the observed sample is most likely.
- The procedure with partial derivatives may yield several local extrema; each solution must be checked individually to identify the global optimum.
- Boundary conditions must also be checked separately for extrema.
- Invariance property: if $\hat{\theta}$ is the MLE of $\theta$, then for any function $f(\theta)$, the MLE of $f(\theta)$ is $f(\hat{\theta})$.

## The Gaussian Case

Suppose that $p(x \mid \theta) = N(\mu, \Sigma)$.

When $\Sigma$ is known but $\mu$ is unknown:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

When both $\mu$ and $\Sigma$ are unknown:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T.$$
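These closed-form estimates are one line each in NumPy. A minimal sketch; the mean, covariance, and sample size are invented for illustration:

```python
import numpy as np

# Illustrative sample from a 2-D Gaussian with known ground truth.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 3.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

# MLE of the mean: (1/n) sum_i x_i
mu_hat = X.mean(axis=0)

# MLE of the covariance: (1/n) sum_i (x_i - mu_hat)(x_i - mu_hat)^T
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)
```

Note the division by $n$ rather than $n-1$, which is exactly what makes $\hat{\Sigma}$ a biased estimator, as discussed under bias of estimators below.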

## The Bernoulli Case

Suppose that $P(x \mid \theta) = \mathrm{Bernoulli}(\theta) = \theta^x (1-\theta)^{1-x}$, where $x \in \{0, 1\}$ and $0 \le \theta \le 1$. The MLE of $\theta$ is

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

## Bias of Estimators

The bias of an estimator $\hat{\theta}$ is the difference between the expected value of $\hat{\theta}$ and the true value $\theta$.

- The MLE of $\mu$ is an unbiased estimator of $\mu$ because $E[\hat{\mu}] = \mu$.
- The MLE of $\Sigma$ is not an unbiased estimator of $\Sigma$ because $E[\hat{\Sigma}] = \frac{n-1}{n} \Sigma \neq \Sigma$.
- The sample covariance

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$

is an unbiased estimator of $\Sigma$.
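The bias of the $1/n$ estimator can be checked empirically by averaging both estimators over many small samples. A sketch assuming NumPy; the sample size and trial count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, true_var = 5, 20000, 4.0

# Average the two variance estimators over many small samples.
mle_est, unbiased_est = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    mle_est.append(x.var(ddof=0))       # divides by n     -> biased low
    unbiased_est.append(x.var(ddof=1))  # divides by n - 1 -> unbiased

mean_mle = np.mean(mle_est)             # close to (n-1)/n * true_var = 3.2
mean_unbiased = np.mean(unbiased_est)   # close to true_var = 4.0
```

With $n = 5$, the MLE underestimates the variance by the factor $(n-1)/n = 0.8$, matching the expectation formula above.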

## Goodness-of-Fit

To measure how well a fitted distribution resembles the sample data (goodness-of-fit), we can use the Kolmogorov-Smirnov test statistic: the maximum absolute difference between the cumulative distribution function estimated from the sample and the one calculated from the fitted distribution. After estimating the parameters of several candidate distributions, we can compute the Kolmogorov-Smirnov statistic for each and choose the distribution with the smallest value as the best fit to our sample.
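The statistic is simple to compute by hand. A sketch assuming NumPy, fitting a Gaussian to Gaussian data; the sample size and parameters are illustrative:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(3)
data = np.sort(rng.normal(10.0, 2.0, size=400))

def gaussian_cdf(x, mu, sigma):
    # Phi((x - mu) / sigma) via the error function
    return np.array([0.5 * (1.0 + erf((v - mu) / (sigma * np.sqrt(2.0))))
                     for v in x])

# Fit by maximum likelihood, then compare the empirical CDF with the
# fitted CDF; the KS statistic is the largest absolute gap.
mu_hat, sigma_hat = data.mean(), data.std()
model = gaussian_cdf(data, mu_hat, sigma_hat)
ecdf_hi = np.arange(1, len(data) + 1) / len(data)  # ECDF just after each point
ecdf_lo = np.arange(0, len(data)) / len(data)      # ECDF just before each point
ks_stat = max(np.max(ecdf_hi - model), np.max(model - ecdf_lo))
```

Repeating this with several candidate families and keeping the smallest `ks_stat` implements the model-selection recipe described above.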

## Maximum Likelihood Estimation Examples

Figure 1: Histograms of samples and estimated densities for different distributions.

- (a) True pdf is $N(10, 4)$; estimated pdf is $N(9.98, 4.05)$.
- (b) True pdf is $0.5\,N(10, 0.16) + 0.5\,N(11, 0.25)$; estimated pdf is $N(10.50, 0.47)$.
- (c) True pdf is $\mathrm{Gamma}(4, 4)$; estimated pdfs are $N(16.1, 67.4)$ and $\mathrm{Gamma}(3.8, 4.2)$.
- (d) Cumulative distribution functions for the example in (c).

## Bayesian Estimation

Suppose the set $D = \{x_1, \ldots, x_n\}$ contains samples drawn independently from the density $p(x \mid \theta)$, whose form is assumed known but whose parameter $\theta$ is not known exactly. Assume that $\theta$ is a quantity whose variation can be described by the prior probability distribution $p(\theta)$.

Given $D$, the prior distribution can be updated to form the posterior distribution using Bayes' rule:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

where

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta \quad \text{and} \quad p(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

The posterior distribution $p(\theta \mid D)$ can be used to find estimates for $\theta$ (e.g., the expected value of $p(\theta \mid D)$ can be used as an estimate for $\theta$). Then the conditional density $p(x \mid D)$ can be computed as

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$

and used in the Bayesian classifier.

## MLEs vs. Bayes Estimates

Maximum likelihood estimation finds an estimate of $\theta$ based on the samples in $D$, but a different sample set would give rise to a different estimate. The Bayes estimate takes this sampling variability into account: we assume that we do not know the true value of $\theta$ and, instead of taking a single estimate, compute a weighted average of the densities $p(x \mid \theta)$, weighted by the distribution $p(\theta \mid D)$.

## The Gaussian Case

Consider the univariate case $p(x \mid \mu) = N(\mu, \sigma^2)$, where $\mu$ is the only unknown parameter, with prior distribution $p(\mu) = N(\mu_0, \sigma_0^2)$ ($\sigma^2$, $\mu_0$, and $\sigma_0^2$ are all known). This corresponds to drawing a value for $\mu$ from the population with density $p(\mu)$, treating it as the true value in the density $p(x \mid \mu)$, and drawing samples for $x$ from this density.

Given $D = \{x_1, \ldots, x_n\}$, we obtain

$$p(\mu \mid D) \propto \prod_{i=1}^{n} p(x_i \mid \mu)\, p(\mu) \propto \exp\left[ -\frac{1}{2} \left( \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i + \frac{\mu_0}{\sigma_0^2} \right) \mu \right) \right]$$

so that $p(\mu \mid D) = N(\mu_n, \sigma_n^2)$, where

$$\mu_n = \left( \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2} \right) \hat{\mu}_n + \left( \frac{\sigma^2}{n \sigma_0^2 + \sigma^2} \right) \mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n \sigma_0^2 + \sigma^2}, \qquad \hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

$\mu_0$ is our best prior guess and $\sigma_0^2$ is the uncertainty about this guess. $\mu_n$ is our best guess after observing $D$, and $\sigma_n^2$ is the uncertainty about that guess. $\mu_n$ always lies between $\hat{\mu}_n$ and $\mu_0$:

- If $\sigma_0 = 0$, then $\mu_n = \mu_0$ (no observation can change our prior opinion).
- If $\sigma_0 \gg \sigma$, then $\mu_n \approx \hat{\mu}_n$ (we are very uncertain about our prior guess).
- Otherwise, $\mu_n$ approaches $\hat{\mu}_n$ as $n$ approaches infinity.

Given the posterior density $p(\mu \mid D)$, the conditional density $p(x \mid D)$ can be computed as

$$p(x \mid D) = N(\mu_n, \sigma^2 + \sigma_n^2)$$

where the conditional mean $\mu_n$ is treated as if it were the true mean, and the known variance is increased to account for our lack of exact knowledge of the mean $\mu$.
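A sketch of the univariate update in NumPy; the prior and data parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 2.0              # known data standard deviation
mu0, sigma0 = 0.0, 10.0  # broad prior N(mu0, sigma0^2)
data = rng.normal(8.0, sigma, size=50)

n = len(data)
mu_hat = data.mean()

# Posterior N(mu_n, sigma_n^2): a convex combination of sample mean and prior
mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * mu_hat \
     + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)

# Predictive density p(x|D) = N(mu_n, sigma^2 + sigma_n^2)
pred_var = sigma**2 + sigma_n2
```

With such a broad prior, `mu_n` lands almost exactly on the sample mean, and the predictive variance is only slightly larger than the known `sigma**2`.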

Consider the multivariate case $p(x \mid \mu) = N(\mu, \Sigma)$, where $\mu$ is the only unknown parameter, with prior distribution $p(\mu) = N(\mu_0, \Sigma_0)$ ($\Sigma$, $\mu_0$, and $\Sigma_0$ are all known). Given $D = \{x_1, \ldots, x_n\}$, we obtain

$$p(\mu \mid D) \propto \exp\left[ -\frac{1}{2} \left( \mu^T \left( n \Sigma^{-1} + \Sigma_0^{-1} \right) \mu - 2 \mu^T \left( \Sigma^{-1} \sum_{i=1}^{n} x_i + \Sigma_0^{-1} \mu_0 \right) \right) \right].$$

It follows that $p(\mu \mid D) = N(\mu_n, \Sigma_n)$, where

$$\mu_n = \Sigma_0 \left( \Sigma_0 + \tfrac{1}{n} \Sigma \right)^{-1} \hat{\mu}_n + \tfrac{1}{n} \Sigma \left( \Sigma_0 + \tfrac{1}{n} \Sigma \right)^{-1} \mu_0,$$

$$\Sigma_n = \tfrac{1}{n} \Sigma_0 \left( \Sigma_0 + \tfrac{1}{n} \Sigma \right)^{-1} \Sigma.$$
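A sketch of the multivariate update in NumPy; all matrices and sizes here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # known covariance
mu0 = np.zeros(2)                           # prior mean
Sigma0 = 25.0 * np.eye(2)                   # broad prior covariance
X = rng.multivariate_normal([5.0, -2.0], Sigma, size=200)

n = len(X)
mu_hat = X.mean(axis=0)

# Shared inverse term (Sigma_0 + Sigma/n)^{-1} from the posterior formulas
A = np.linalg.inv(Sigma0 + Sigma / n)
mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
Sigma_n = Sigma0 @ A @ (Sigma / n)
```

As in the univariate case, a broad prior pulls `mu_n` almost onto the sample mean, and `Sigma_n` shrinks at rate $1/n$.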

Given the posterior density $p(\mu \mid D)$, the conditional density $p(x \mid D)$ can be computed as

$$p(x \mid D) = N(\mu_n, \Sigma + \Sigma_n)$$

which can be viewed as the distribution of the sum of a random vector $\mu$ with $p(\mu \mid D) = N(\mu_n, \Sigma_n)$ and an independent random vector $y$ with $p(y) = N(0, \Sigma)$.

## The Bernoulli Case

Consider $P(x \mid \theta) = \mathrm{Bernoulli}(\theta)$, where $\theta$ is the unknown parameter, with prior distribution $p(\theta) = \mathrm{Beta}(\alpha, \beta)$ ($\alpha$ and $\beta$ are both known). Given $D = \{x_1, \ldots, x_n\}$, we obtain

$$p(\theta \mid D) = \mathrm{Beta}\left( \alpha + \sum_{i=1}^{n} x_i,\; \beta + n - \sum_{i=1}^{n} x_i \right).$$

The Bayes estimate of $\theta$ can be computed as the expected value of $p(\theta \mid D)$, i.e.,

$$\hat{\theta} = \frac{\alpha + \sum_{i=1}^{n} x_i}{\alpha + \beta + n} = \left( \frac{n}{\alpha + \beta + n} \right) \frac{1}{n} \sum_{i=1}^{n} x_i + \left( \frac{\alpha + \beta}{\alpha + \beta + n} \right) \frac{\alpha}{\alpha + \beta}.$$
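A minimal NumPy sketch of this posterior mean; the prior hyperparameters and the true $\theta$ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 2.0                 # Beta(2, 2) prior, mean 0.5
data = rng.binomial(1, 0.7, size=100)  # Bernoulli(0.7) samples

n, m = len(data), int(data.sum())

# Posterior is Beta(alpha + m, beta + n - m); its mean is the Bayes estimate
theta_bayes = (alpha + m) / (alpha + beta + n)
theta_mle = m / n
```

The decomposition above shows the estimate as a convex combination: it shrinks the MLE toward the prior mean $\alpha/(\alpha+\beta)$, and the shrinkage vanishes as $n$ grows.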

## Conjugate Priors

A conjugate prior is one which, when multiplied by the likelihood of the observations, gives a posterior having the same functional form as the prior. This relationship allows the posterior to be used as a prior in further computations.

Table 1: Conjugate prior distributions.

| pdf generating the sample | corresponding conjugate prior |
|---------------------------|-------------------------------|
| Gaussian                  | Gaussian                      |
| Exponential               | Gamma                         |
| Poisson                   | Gamma                         |
| Binomial                  | Beta                          |
| Multinomial               | Dirichlet                     |

## Recursive Bayes Learning

What about the convergence of $p(x \mid D)$ to $p(x)$? Given $D^n = \{x_1, \ldots, x_n\}$, for $n > 1$,

$$p(D^n \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)$$

and

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}$$

where $p(\theta \mid D^0) = p(\theta)$. This recursion is quite useful when the distributions can be represented using only a few parameters (sufficient statistics).

Consider the Bernoulli case $P(x \mid \theta) = \mathrm{Bernoulli}(\theta)$ with $p(\theta) = \mathrm{Beta}(\alpha, \beta)$; under the prior alone, the Bayes estimate of $\theta$ is

$$\hat{\theta} = \frac{\alpha}{\alpha + \beta}.$$

Given the training set $D = \{x_1, \ldots, x_n\}$, we obtain

$$p(\theta \mid D) = \mathrm{Beta}(\alpha + m,\; \beta + n - m)$$

where $m = \sum_{i=1}^{n} x_i = \#\{x_i \mid x_i = 1,\; x_i \in D\}$.

The Bayes estimate of $\theta$ becomes

$$\hat{\theta} = \frac{\alpha + m}{\alpha + \beta + n}.$$

Then, given a new training set $D' = \{x'_1, \ldots, x'_{n'}\}$, we obtain

$$p(\theta \mid D, D') = \mathrm{Beta}(\alpha + m + m',\; \beta + n - m + n' - m')$$

where $m' = \sum_{i=1}^{n'} x'_i = \#\{x'_i \mid x'_i = 1,\; x'_i \in D'\}$.

The Bayes estimate of $\theta$ becomes

$$\hat{\theta} = \frac{\alpha + m + m'}{\alpha + \beta + n + n'}.$$

Thus, recursive Bayes learning requires keeping only the count $m$ (related to the sufficient statistics of the Beta distribution) and the number of training samples $n$.
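The count-keeping argument can be checked directly: updating batch by batch gives exactly the same posterior as one update on the pooled data. A sketch assuming NumPy; the batch sizes and true $\theta$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 1.0, 1.0                 # uniform Beta(1, 1) prior
batches = [rng.binomial(1, 0.3, size=40) for _ in range(5)]

# Recursive update: the posterior after each batch becomes the prior for
# the next; only the running counts are retained.
for batch in batches:
    m, n = int(batch.sum()), len(batch)
    alpha, beta = alpha + m, beta + n - m

theta_recursive = alpha / (alpha + beta)

# Identical to a single update on the pooled data under Beta(1, 1)
pooled = np.concatenate(batches)
theta_pooled = (1.0 + pooled.sum()) / (2.0 + len(pooled))
```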

## MLEs vs. Bayes Estimates

Table 2: Comparison of MLEs and Bayes estimates.

|                          | MLE                                             | Bayes                                                                                                                                     |
|--------------------------|-------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| computational complexity | differential calculus, gradient search          | multidimensional integration                                                                                                              |
| interpretability         | point estimate                                  | weighted average of models                                                                                                                |
| prior information        | assumes the parametric model $p(x \mid \theta)$ | assumes the models $p(\theta)$ and $p(x \mid \theta)$, but the resulting distribution $p(x \mid D)$ may not have the same form as $p(x \mid \theta)$ |

If there is much data (a strongly peaked $p(\theta \mid D)$) and the prior $p(\theta)$ is uniform, then the Bayes estimate and the MLE are equivalent.

## Classification Error

To apply these results to multiple classes, separate the training samples into $c$ subsets $D_1, \ldots, D_c$, with the samples in $D_i$ belonging to class $\omega_i$, and then estimate each density $p(x \mid \omega_i, D_i)$ separately. Different sources of error:

- **Bayes error:** due to overlapping class-conditional densities (related to the features used).
- **Model error:** due to an incorrect model.
- **Estimation error:** due to estimation from a finite sample (can be reduced by increasing the amount of training data).


Chapter 4 Statistical Inference in Quality Control and Improvement 許 湘 伶 Statistical Quality Control (D. C. Montgomery) Sampling distribution I a random sample of size n: if it is selected so that the

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Probability Theory. Elementary rules of probability Sum rule. Product rule. p. 23

Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction

### Normality Testing in Excel

Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

### Chapter 9: Hypothesis Testing Sections

Chapter 9: Hypothesis Testing Sections 9.1 Problems of Testing Hypotheses Skip: 9.2 Testing Simple Hypotheses Skip: 9.3 Uniformly Most Powerful Tests Skip: 9.4 Two-Sided Alternatives 9.6 Comparing the

### 1 The Pareto Distribution

Estimating the Parameters of a Pareto Distribution Introducing a Quantile Regression Method Joseph Lee Petersen Introduction. A broad approach to using correlation coefficients for parameter estimation

### SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation

SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu September 19, 2015 Outline

### Using pivots to construct confidence intervals. In Example 41 we used the fact that

Using pivots to construct confidence intervals In Example 41 we used the fact that Q( X, µ) = X µ σ/ n N(0, 1) for all µ. We then said Q( X, µ) z α/2 with probability 1 α, and converted this into a statement

### Statistics. Aims of this course... Schedules... Recommended books... Keywords... Notation...

Statistics Aims of this course............................. Schedules.................................. Recommended books............................ Keywords.................................. Notation...................................

### Statistics 100A Homework 8 Solutions

Part : Chapter 7 Statistics A Homework 8 Solutions Ryan Rosario. A player throws a fair die and simultaneously flips a fair coin. If the coin lands heads, then she wins twice, and if tails, the one-half

### Chapter 9: Hypothesis Testing Sections

Chapter 9: Hypothesis Testing Sections 9.1 Problems of Testing Hypotheses Skip: 9.2 Testing Simple Hypotheses Skip: 9.3 Uniformly Most Powerful Tests Skip: 9.4 Two-Sided Alternatives 9.5 The t Test 9.6

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Basic Bayesian Methods

6 Basic Bayesian Methods Mark E. Glickman and David A. van Dyk Summary In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Principled Hybrids of Generative and Discriminative Models

Principled Hybrids of Generative and Discriminative Models Julia A. Lasserre University of Cambridge Cambridge, UK jal62@cam.ac.uk Christopher M. Bishop Microsoft Research Cambridge, UK cmbishop@microsoft.com

α α λ α = = λ λ α ψ = = α α α λ λ ψ α = + β = > θ θ β > β β θ θ θ β θ β γ θ β = γ θ > β > γ θ β γ = θ β = θ β = θ β = β θ = β β θ = = = β β θ = + α α α α α = = λ λ λ λ λ λ λ = λ λ α α α α λ ψ + α =

### 171:290 Model Selection Lecture II: The Akaike Information Criterion

171:290 Model Selection Lecture II: The Akaike Information Criterion Department of Biostatistics Department of Statistics and Actuarial Science August 28, 2012 Introduction AIC, the Akaike Information

### Contents. TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics. Yuming Jiang. Basic Concepts Random Variables

TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics Yuming Jiang 1 Some figures taken from the web. Contents Basic Concepts Random Variables Discrete Random Variables Continuous Random

### Nonparametric Statistics

Nonparametric Statistics References Some good references for the topics in this course are 1. Higgins, James (2004), Introduction to Nonparametric Statistics 2. Hollander and Wolfe, (1999), Nonparametric

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Lecture 7: Continuous Random Variables

Lecture 7: Continuous Random Variables 21 September 2005 1 Our First Continuous Random Variable The back of the lecture hall is roughly 10 meters across. Suppose it were exactly 10 meters, and consider

### Chapter 3 RANDOM VARIATE GENERATION

Chapter 3 RANDOM VARIATE GENERATION In order to do a Monte Carlo simulation either by hand or by computer, techniques must be developed for generating values of random variables having known distributions.

### Why the Normal Distribution?

Why the Normal Distribution? Raul Rojas Freie Universität Berlin Februar 2010 Abstract This short note explains in simple terms why the normal distribution is so ubiquitous in pattern recognition applications.