Lecture 6: Discrete Distributions


Lecture 6: Discrete Distributions
4F3: Machine Learning
Joaquin Quiñonero-Candela and Carl Edward Rasmussen
Department of Engineering, University of Cambridge
http://mlg.eng.cam.ac.uk/teaching/4f3/

Coin tossing

You are presented with a coin: what is the probability of heads? What does this question even mean?
How much are you willing to bet that p(head) > 0.5? Do you expect this coin to come up heads more often than tails?
Wait... can you throw the coin a few times? I need data!
Ok, you observe the following sequence of outcomes (T: tail, H: head): H
This is not enough data! Now you observe the outcome of three additional throws: HHTH
How much are you now willing to bet that p(head) > 0.5?

The Bernoulli discrete distribution

The Bernoulli is a discrete probability distribution over binary random variables:
Binary random variable X: outcome x of a single coin throw.
The two values x can take are X = 0 for tails and X = 1 for heads.
Let the probability of heads be π = p(x = 1). π is the parameter of the Bernoulli distribution.
The probability of tails is p(x = 0) = 1 − π.
We can compactly write p(X = x | π) = p(x | π) = π^x (1 − π)^(1−x)

What do we think π is after observing a single heads outcome?
Maximum likelihood! Maximise p(H | π) with respect to π:
p(H | π) = p(x = 1 | π) = π,   argmax_{π ∈ [0, 1]} π = 1
Ok, so the answer is π = 1. This coin only generates heads. Is this reasonable? How much are you willing to bet p(heads) > 0.5?
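A minimal Python sketch (not part of the original slides, names are illustrative) of the Bernoulli likelihood and a grid-based maximum-likelihood estimate after a single observed head:

```python
import numpy as np

def bernoulli_pmf(x, pi):
    # p(x | pi) = pi^x * (1 - pi)^(1 - x) for x in {0, 1}
    return pi**x * (1.0 - pi)**(1 - x)

# The likelihood of a single observed head (x = 1) as a function of pi is just pi,
# so it is maximised at the boundary pi = 1.
grid = np.linspace(0.0, 1.0, 101)
pi_ml = grid[np.argmax(bernoulli_pmf(1, grid))]
print(pi_ml)  # 1.0: maximum likelihood says the coin only produces heads
```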

The Binomial distribution: counts of binary outcomes

We observe a sequence of throws rather than a single throw: HHTH
The probability of this particular sequence is: p(HHTH) = π^3 (1 − π).
But so is the probability of THHH, of HTHH and of HHHT.
We don't really care about the order of the outcomes, only about the counts.
In our example the probability of 3 heads out of 4 throws is: 4 π^3 (1 − π).
The Binomial distribution gives the probability of observing k heads out of n throws:
p(k | π, n) = (n choose k) π^k (1 − π)^(n−k)
This assumes independent throws from a Bernoulli distribution p(x | π).
(n choose k) = n! / (k! (n − k)!) is the Binomial coefficient, also known as "n choose k".
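A small sketch (illustrative, not from the slides) of the Binomial pmf applied to the HHTH example, using the standard-library math.comb:

```python
from math import comb

def binomial_pmf(k, n, pi):
    # p(k | pi, n) = C(n, k) * pi^k * (1 - pi)^(n - k)
    return comb(n, k) * pi**k * (1.0 - pi)**(n - k)

# Probability of 3 heads out of 4 throws for a fair coin (pi = 0.5):
print(binomial_pmf(3, 4, 0.5))  # 0.25 = 4 * (1/2)^3 * (1/2)
```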

Maximum likelihood under a Binomial distribution

If we observe k heads out of n throws, what do we think π is?
We can maximise the likelihood of the parameter π given the observed data:
p(k | π, n) ∝ π^k (1 − π)^(n−k)
It is convenient to take the logarithm and derivatives with respect to π:
log p(k | π, n) = k log π + (n − k) log(1 − π) + constant
∂ log p(k | π, n) / ∂π = k/π − (n − k)/(1 − π) = 0   ⟹   π = k/n
Is this reasonable? For HHTH we get π = 3/4.
How much would you bet now that p(heads) > 0.5? What do you think p(π > 0.5) is?
Wait! This is a probability over... a probability?
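A numerical check of the closed-form result π = k/n, maximising the log-likelihood on a grid (an illustrative sketch, not part of the slides):

```python
import numpy as np

k, n = 3, 4  # HHTH: 3 heads out of 4 throws

def log_likelihood(pi):
    # log p(k | pi, n) up to an additive constant
    return k * np.log(pi) + (n - k) * np.log(1.0 - pi)

grid = np.linspace(0.001, 0.999, 999)
pi_ml = grid[np.argmax(log_likelihood(grid))]
print(pi_ml)  # ~0.75, matching k / n
```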

Prior beliefs about coins before throwing the coin

So you have observed 3 heads out of 4 throws, but are unwilling to bet 100 that p(heads) > 0.5? (That is, for example, that out of 10,000,000 throws at least 5,000,001 will be heads.)
Why? You might believe that coins tend to be fair (π ≈ 1/2).
A finite set of observations updates your opinion about π. But how to express your opinion about π before you see any data?
Pseudo-counts: you think the coin is fair and... you are...
Not very sure. You act as if you had seen 2 heads and 2 tails before.
Pretty sure. It is as if you had observed 20 heads and 20 tails before.
Totally sure. As if you had seen 1000 heads and 1000 tails before.
Depending on the strength of your prior assumptions, it takes a different number of actual observations to change your mind.

The Beta distribution: distributions on probabilities

A continuous probability distribution defined on the interval (0, 1):
Beta(π | α, β) = Γ(α + β) / (Γ(α) Γ(β)) π^(α−1) (1 − π)^(β−1) = 1/B(α, β) π^(α−1) (1 − π)^(β−1)
α > 0 and β > 0 are the shape parameters; the parameters correspond to one plus the pseudo-counts.
Γ(α) is an extension of the factorial function: Γ(n) = (n − 1)! for integer n.
B(α, β) is the beta function; it normalises the Beta distribution.
The mean is given by E(π) = α / (α + β).
[Figure: Beta(1,1) on the left (α = β = 1) and Beta(3,3) on the right (α = β = 3), each plotted as the density p(π) against π on (0, 1).]
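The two densities in the figure can be checked numerically; a minimal sketch assuming scipy.stats.beta is available (not something the slides use):

```python
from scipy.stats import beta

# Beta(1, 1) is uniform on (0, 1); Beta(3, 3) is symmetric and peaked around 0.5.
for a, b in [(1, 1), (3, 3)]:
    dist = beta(a, b)
    print(a, b, dist.mean(), dist.pdf(0.5))
# Both have mean alpha / (alpha + beta) = 0.5;
# the densities at pi = 0.5 are 1.0 and 1.875 respectively.
```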

Posterior for coin tossing

Imagine we observe a single coin toss and it comes out heads. Our observed data is: D = {k = 1}, where n = 1.
The probability of the observed data given π is the likelihood: p(D | π) = π
We use our prior p(π | α, β) = Beta(π | α, β) to get the posterior probability:
p(π | D) = p(π | α, β) p(D | π) / p(D)
         ∝ π Beta(π | α, β) ∝ π π^(α−1) (1 − π)^(β−1) ∝ Beta(π | α + 1, β)
The Beta distribution is a conjugate prior to the Binomial distribution: the resulting posterior is also a Beta distribution.
The posterior parameters are given by: α_posterior = α_prior + k,  β_posterior = β_prior + (n − k)
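The posterior-parameter rule is a one-line update; a minimal sketch (function name is made up for illustration):

```python
def beta_binomial_update(alpha_prior, beta_prior, k, n):
    # Posterior after observing k heads in n throws with a Beta(alpha, beta) prior.
    return alpha_prior + k, beta_prior + (n - k)

# One toss coming up heads (k = 1, n = 1):
print(beta_binomial_update(1, 1, 1, 1))  # (2, 1): Beta(1,1) prior -> Beta(2,1) posterior
print(beta_binomial_update(3, 3, 1, 1))  # (4, 3): Beta(3,3) prior -> Beta(4,3) posterior
```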

Before and after observing one head

[Figure: the priors Beta(1,1) and Beta(3,3) (top row) and the corresponding posteriors Beta(2,1) and Beta(4,3) after observing a single head (bottom row), each plotted as the density p(π) against π.]

Making predictions - posterior mean

Under the maximum likelihood approach we report the value of π that maximises the likelihood of π given the observed data.
With the Bayesian approach, we average over all possible parameter settings:
p(x = 1 | D) = ∫ p(x = 1 | π) p(π | D) dπ
This corresponds to reporting the mean of the posterior distribution.
Learner A with prior Beta(1, 1) predicts p(x = 1 | D) = 2/3
Learner B with prior Beta(3, 3) predicts p(x = 1 | D) = 4/7
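Both predictions follow from the posterior mean (α + k) / (α + β + n); a quick check using exact fractions (illustrative, not from the slides):

```python
from fractions import Fraction

def posterior_predictive_heads(alpha_prior, beta_prior, k, n):
    # Mean of the Beta posterior: (alpha + k) / (alpha + beta + n)
    return Fraction(alpha_prior + k, alpha_prior + beta_prior + n)

print(posterior_predictive_heads(1, 1, 1, 1))  # 2/3 (Learner A)
print(posterior_predictive_heads(3, 3, 1, 1))  # 4/7 (Learner B)
```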

Making predictions - other statistics

Given the posterior distribution, we can also answer other questions, such as: what is the probability that π > 0.5 given the observed data?
p(π > 0.5 | D) = ∫_{0.5}^{1} p(π | D) dπ = ∫_{0.5}^{1} Beta(π | α_posterior, β_posterior) dπ
Learner A with prior Beta(1, 1) predicts p(π > 0.5 | D) = 0.75
Learner B with prior Beta(3, 3) predicts p(π > 0.5 | D) ≈ 0.66
Note that for any l > 1 and fixed α and β, the two posteriors Beta(π | α, β) and Beta(π | lα, lβ) have the same average π, but give different values for p(π > 0.5).
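These tail probabilities can be evaluated with the survival function of the Beta distribution; a sketch assuming scipy.stats.beta is available:

```python
from scipy.stats import beta

# Posterior tail probabilities p(pi > 0.5 | D) after one observed head:
print(beta(2, 1).sf(0.5))  # 0.75   (Learner A: posterior Beta(2, 1))
print(beta(4, 3).sf(0.5))  # ~0.656 (Learner B: posterior Beta(4, 3))
```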

Learning about a coin, multiple models (1)

Consider two alternative models of a coin, "fair" and "bent". A priori, we may think that fair is more probable, eg:
p(fair) = 0.8,   p(bent) = 0.2
For the bent coin, (a little unrealistically) all parameter values could be equally likely, whereas the fair coin has a fixed probability:
[Figure: p(q | bent) is flat over the parameter q ∈ (0, 1), while p(q | fair) places all its mass at q = 0.5.]
We make 10 tosses, and get: T H T H T T T T T T

Learning about a coin, multiple models (2)

The evidence for the fair model is: p(D | fair) = (1/2)^10 ≈ 0.001
and for the bent model: p(D | bent) = ∫ dπ p(D | π, bent) p(π | bent) = ∫ dπ π^2 (1 − π)^8 = B(3, 9) ≈ 0.002
The posterior for the models, by Bayes' rule:
p(fair | D) ∝ 0.8 × 0.001 ≈ 0.0008,   p(bent | D) ∝ 0.2 × 0.002 ≈ 0.0004,
ie, two-thirds probability that the coin is fair.
How do we make predictions? By weighting the predictions from each model by their probability.
The probability of a head at the next toss is: (2/3) × (1/2) + (1/3) × (3/12) = 5/12.
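A sketch of the same model-comparison computation, assuming scipy.special.beta is used for the Beta function B(a, b) (the slide rounds the numbers):

```python
from scipy.special import beta as beta_fn  # Euler Beta function B(a, b)

k, n = 2, 10  # T H T H T T T T T T: 2 heads out of 10 tosses

# Model evidences.
p_d_fair = 0.5**n                      # ~0.001
p_d_bent = beta_fn(k + 1, n - k + 1)   # integral of pi^2 (1 - pi)^8 dpi = B(3, 9) ~0.002

# Unnormalised posteriors with prior p(fair) = 0.8, p(bent) = 0.2.
joint_fair = 0.8 * p_d_fair            # ~0.0008
joint_bent = 0.2 * p_d_bent            # ~0.0004
p_fair = joint_fair / (joint_fair + joint_bent)   # ~2/3

# Model-averaged probability of heads at the next toss (~5/12).
p_head_next = p_fair * 0.5 + (1.0 - p_fair) * (k + 1) / (n + 2)
print(p_fair, p_head_next)
```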

The Multinomial distribution (1)

Generalisation of the Binomial distribution from 2 outcomes to m outcomes. Useful for random variables that take one of a finite set of possible outcomes.
Throw a die n = 60 times, and count the number of times each of the 6 possible outcomes is observed.

Outcome        Count
X = x_1 = 1    k_1 = 12
X = x_2 = 2    k_2 = 7
X = x_3 = 3    k_3 = 11
X = x_4 = 4    k_4 = 8
X = x_5 = 5    k_5 = 9
X = x_6 = 6    k_6 = 13

Note that we have one parameter too many. We don't need to know all of the k_i and n, because Σ_{i=1}^{6} k_i = n.

The Multinomial distribution (2)

Consider a discrete random variable X that can take one of m values x_1, ..., x_m.
Out of n independent trials, let k_i be the number of times X = x_i was observed. It follows that Σ_{i=1}^{m} k_i = n.
Denote by π_i the probability that X = x_i, with Σ_{i=1}^{m} π_i = 1.
The probability of observing a vector of occurrences k = [k_1, ..., k_m] is given by the Multinomial distribution parametrised by π = [π_1, ..., π_m]:
p(k | π, n) = p(k_1, ..., k_m | π_1, ..., π_m, n) = n! / (k_1! k_2! ... k_m!) Π_{i=1}^{m} π_i^{k_i}
Note that we can write p(k | π) since n is redundant.
The multinomial coefficient n! / (k_1! k_2! ... k_m!) is a generalisation of (n choose k).
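A quick evaluation of the Multinomial pmf for the die counts on the previous slide, assuming a fair die and that scipy.stats.multinomial is available:

```python
from scipy.stats import multinomial

counts = [12, 7, 11, 8, 9, 13]   # k_1, ..., k_6, summing to n = 60
probs = [1.0 / 6] * 6            # pi_1, ..., pi_6 for a fair die
print(multinomial.pmf(counts, n=60, p=probs))
```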

Example: word counts in text

Consider describing a text document by the frequency of occurrence of every distinct word.
The UCI Bag of Words dataset from the University of California, Irvine:
http://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/