# Sufficient Statistics and Exponential Family

Math 541: Statistical Theory II. Lecturer: Songfeng Zheng


## 1 Statistics and Sufficient Statistics

Suppose we have a random sample $X_1, \dots, X_n$ taken from a distribution $f(x \mid \theta)$ which depends on an unknown parameter $\theta$ in a parameter space $\Theta$. The purpose of parameter estimation is to estimate the parameter $\theta$ from the random sample.

We have already studied three parameter estimation methods: method of moments, maximum likelihood, and Bayes estimation. We can see from the previous examples that the estimators can be expressed as functions of the random sample $X_1, \dots, X_n$. Such a function is called a statistic. Formally, any real-valued function $T = r(X_1, \dots, X_n)$ of the observations in the sample is called a statistic; such a function must not involve any unknown parameter. For example, given a random sample $X_1, \dots, X_n$, the quantities $\bar{X}$, $\max(X_1, \dots, X_n)$, $\mathrm{median}(X_1, \dots, X_n)$, and $r(X_1, \dots, X_n) = 4$ are statistics; however, $X_1 + \mu$ is not a statistic if $\mu$ is unknown.

In the parameter estimation problem, we know nothing about the parameter except the observations from the distribution. The observations $X_1, \dots, X_n$ are therefore our first-hand source of information about the parameter; that is to say, all the available information about the parameter is contained in the observations. The estimators we obtain are always functions of the observations, i.e., the estimators are statistics, e.g. the sample mean, the sample standard deviation, etc. In some sense, this process can be thought of as compressing the original observation data: initially we have $n$ numbers, but after the compression we have only one number. This compression can make us lose information about the parameter; it can never give us more information. In the best case, the compressed result contains the same amount of information as the $n$ observations. We call such a statistic a sufficient statistic.
From the above intuitive analysis, we can see that a sufficient statistic absorbs all the available information about $\theta$ contained in the sample. This concept was introduced by R. A. Fisher in 1922.

If $T(X_1, \dots, X_n)$ is a statistic and $t$ is a particular value of $T$, then the conditional joint distribution of $X_1, \dots, X_n$ given $T = t$ can be calculated. In general, this conditional joint distribution will depend on the value of $\theta$. Therefore, for each value of $t$, there will be a family of possible conditional distributions corresponding to the different possible values of $\theta \in \Theta$. However, it may happen that for each possible value of $t$, the conditional joint distribution of $X_1, \dots, X_n$ given $T = t$ is the same for all values of $\theta \in \Theta$ and therefore does not actually depend on the value of $\theta$. In this case, we say that $T$ is a sufficient statistic for the parameter $\theta$.

Formally, a statistic $T(X_1, \dots, X_n)$ is said to be **sufficient** for $\theta$ if the conditional distribution of $X_1, \dots, X_n$, given $T = t$, does not depend on $\theta$ for any value of $t$. In other words, given the value of $T$, we can gain no more knowledge about $\theta$ from knowing more about the probability distribution of $X_1, \dots, X_n$. We could envision keeping only $T$ and throwing away all the $X_i$ without losing any information!

The concept of sufficiency arises as an attempt to answer the following question: is there a statistic, i.e. a function $T(X_1, \dots, X_n)$, that contains all the information in the sample about $\theta$? If so, a reduction or compression of the original data to this statistic without loss of information is possible. For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that is in the sample, and that the order in which the successes occurred, for example, gives no additional information about $\theta$.

**Example 1**: Let $X_1, \dots, X_n$ be a sequence of independent Bernoulli trials with $P(X_i = 1) = \theta$. We will verify that $T = \sum_{i=1}^n X_i$ is sufficient for $\theta$.

**Proof**: We have
$$P(X_1 = x_1, \dots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, \dots, X_n = x_n)}{P(T = t)}.$$
Bearing in mind that the $X_i$ can take on only the values 0 and 1, the probability in the numerator is the probability that some particular set of $t$ of the $X_i$ equal 1 and the other $n - t$ equal 0. Since the $X_i$ are independent, this probability is $\theta^t (1 - \theta)^{n - t}$.
To find the denominator, note that the distribution of $T$, the total number of ones, is binomial with $n$ trials and probability of success $\theta$. Therefore the ratio in the above equation is
$$\frac{\theta^t (1 - \theta)^{n - t}}{\binom{n}{t} \theta^t (1 - \theta)^{n - t}} = \frac{1}{\binom{n}{t}}.$$
The conditional distribution thus does not involve $\theta$ at all. Given the total number of ones, the probability that they occur on any particular set of $t$ trials is the same for any value of $\theta$, so that set of trials contains no additional information about $\theta$.
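Example 1 can be checked numerically. The sketch below (plain Python; the sample of $n = 5$ trials is illustrative) computes $P(X_1 = x_1, \dots, X_n = x_n \mid T = t)$ for several values of $\theta$ and confirms that it always equals $1/\binom{n}{t}$:

```python
from math import comb

def cond_prob(x, theta):
    """P(X1=x1,...,Xn=xn | T=t) for i.i.d. Bernoulli(theta), with t = sum(x)."""
    n, t = len(x), sum(x)
    joint = theta**t * (1 - theta)**(n - t)             # P(X = x)
    p_t = comb(n, t) * theta**t * (1 - theta)**(n - t)  # P(T = t): binomial
    return joint / p_t

x = (1, 0, 1, 1, 0)  # a particular sequence with n = 5, t = 3
for theta in (0.2, 0.5, 0.8):
    # the conditional probability is 1 / C(5, 3) = 0.1 for every theta
    assert abs(cond_prob(x, theta) - 1 / comb(5, 3)) < 1e-12
```

Whatever value of $\theta$ is used, the conditional probability is $1/\binom{5}{3} = 0.1$, matching the calculation above.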

## 2 Factorization Theorem

The preceding definition of sufficiency is hard to work with: it does not indicate how to go about finding a sufficient statistic, and, given a candidate statistic $T$, it would typically be very hard to decide whether it is sufficient because of the difficulty in evaluating the conditional distribution. We now present a simple method for finding a sufficient statistic which can be applied in many problems. This method is based on the following result, which was developed with increasing generality by R. A. Fisher in 1922, J. Neyman in 1935, and P. R. Halmos and L. J. Savage in 1949, and is known as the factorization theorem.

**Factorization Theorem**: Let $X_1, \dots, X_n$ form a random sample from either a continuous distribution or a discrete distribution for which the p.d.f. or point mass function is $f(x \mid \theta)$, where the value of $\theta$ is unknown and belongs to a given parameter space $\Theta$. A statistic $T(X_1, \dots, X_n)$ is a sufficient statistic for $\theta$ if and only if the joint p.d.f. or joint point mass function $f_n(x \mid \theta)$ of $X_1, \dots, X_n$ can be factorized, for all values of $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ and all values of $\theta \in \Theta$, as
$$f_n(x \mid \theta) = u(x) \, v[T(x), \theta].$$
Here the functions $u$ and $v$ are nonnegative; $u$ may depend on $x$ but does not depend on $\theta$, while $v$ depends on $\theta$ and depends on the observed value $x$ only through the value of the statistic $T(x)$.

Note: in this expression, the statistic $T(X_1, \dots, X_n)$ acts as an interface between the random sample $X_1, \dots, X_n$ and the function $v$.

**Proof**: We give a proof for the discrete case. First, suppose the frequency function can be factored in the given form. Let $X = (X_1, \dots, X_n)$ and $x = (x_1, \dots, x_n)$; then
$$P(T = t) = \sum_{x:\, T(x) = t} P(X = x) = \sum_{x:\, T(x) = t} u(x) v(T(x), \theta) = v(t, \theta) \sum_{x:\, T(x) = t} u(x).$$
Then we have
$$P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} = \frac{u(x) v(t, \theta)}{v(t, \theta) \sum_{x:\, T(x) = t} u(x)} = \frac{u(x)}{\sum_{x:\, T(x) = t} u(x)},$$
which does not depend on $\theta$; therefore $T$ is a sufficient statistic.
Conversely, suppose the conditional distribution of $X$ given $T$ does not depend on $\theta$; that is, $T$ is a sufficient statistic. Then we can let $u(x) = P(X = x \mid T = t)$, which by assumption does not involve $\theta$, and let $v(t, \theta) = P(T = t \mid \theta)$. It follows that
$$P(X = x \mid \theta) = P(X = x, T = t \mid \theta) \qquad \text{(since } T(x) = t \text{, the event } \{T = t\} \text{ is redundant)}$$
$$= P(X = x \mid T = t, \theta) \, P(T = t \mid \theta) = u(x) \, v(T(x_1, \dots, x_n), \theta),$$

which is of the desired form.

**Example 2**: Suppose that $X_1, \dots, X_n$ form a random sample from a Poisson distribution for which the value of the mean $\theta$ is unknown ($\theta > 0$). Show that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

**Proof**: For every set of nonnegative integers $x_1, \dots, x_n$, the joint probability mass function $f_n(x \mid \theta)$ of $X_1, \dots, X_n$ is
$$f_n(x \mid \theta) = \prod_{i=1}^n \frac{e^{-\theta} \theta^{x_i}}{x_i!} = \left( \prod_{i=1}^n \frac{1}{x_i!} \right) e^{-n\theta} \theta^{\sum_{i=1}^n x_i}.$$
It can be seen that $f_n(x \mid \theta)$ has been expressed as the product of a function that does not depend on $\theta$ and a function that depends on $\theta$ but depends on the observed vector $x$ only through the value of $\sum_{i=1}^n x_i$. By the factorization theorem, it follows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

**Example 3** (applying the factorization theorem to a continuous distribution): Suppose that $X_1, \dots, X_n$ form a random sample from a continuous distribution with p.d.f.
$$f(x \mid \theta) = \begin{cases} \theta x^{\theta - 1}, & \text{for } 0 < x < 1, \\ 0, & \text{otherwise,} \end{cases}$$
where the value of the parameter $\theta$ is unknown ($\theta > 0$). We shall show that $T = \prod_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

**Proof**: For $0 < x_i < 1$ ($i = 1, \dots, n$), the joint p.d.f. $f_n(x \mid \theta)$ of $X_1, \dots, X_n$ is
$$f_n(x \mid \theta) = \theta^n \left( \prod_{i=1}^n x_i \right)^{\theta - 1}.$$
Furthermore, if at least one value of $x_i$ is outside the interval $0 < x_i < 1$, then $f_n(x \mid \theta) = 0$ for every value of $\theta \in \Theta$. The right side of the above equation depends on $x$ only through the value of the product $\prod_{i=1}^n x_i$. Therefore, if we let $u(x) = 1$ and $r(x) = \prod_{i=1}^n x_i$, then $f_n(x \mid \theta)$ is factored in the form specified by the factorization theorem. It follows that the statistic $T = \prod_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

**Example 4**: Suppose that $X_1, \dots, X_n$ form a random sample from a normal distribution for which the mean $\mu$ is unknown but the variance $\sigma^2$ is known. Find a sufficient statistic for $\mu$.

**Solution**: For $x = (x_1, \dots, x_n)$, the joint p.d.f. of $X_1, \dots, X_n$ is
$$f_n(x \mid \mu) = \prod_{i=1}^n \frac{1}{(2\pi)^{1/2} \sigma} \exp\left[ -\frac{(x_i - \mu)^2}{2\sigma^2} \right].$$
This can be rewritten as
$$f_n(x \mid \mu) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left( -\sum_{i=1}^n \frac{x_i^2}{2\sigma^2} \right) \exp\left( \frac{\mu}{\sigma^2} \sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2} \right).$$

It can be seen that $f_n(x \mid \mu)$ has now been expressed as the product of a function that does not depend on $\mu$ and a function that depends on $x$ only through the value of $\sum_{i=1}^n x_i$. It follows from the factorization theorem that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\mu$.

Since $\sum_{i=1}^n x_i = n\bar{x}$, we can state equivalently that the final expression depends on $x$ only through the value of $\bar{x}$; therefore $\bar{X}$ is also a sufficient statistic for $\mu$. More generally, every one-to-one function of $\bar{X}$ is a sufficient statistic for $\mu$.

**Property of sufficient statistics**: Suppose that $X_1, \dots, X_n$ form a random sample from a distribution for which the p.d.f. is $f(x \mid \theta)$, where the value of the parameter $\theta$ belongs to a given parameter space $\Theta$. Suppose that $T(X_1, \dots, X_n)$ and $T'(X_1, \dots, X_n)$ are two statistics, and there is a one-to-one map between $T$ and $T'$; that is, the value of $T'$ can be determined from the value of $T$ without knowing the values of $X_1, \dots, X_n$, and vice versa. Then $T'$ is a sufficient statistic for $\theta$ if and only if $T$ is a sufficient statistic for $\theta$.

**Proof**: Suppose the one-to-one mapping between $T$ and $T'$ is $g$, i.e. $T' = g(T)$ and $T = g^{-1}(T')$, where $g^{-1}$ is also one-to-one. $T$ is a sufficient statistic if and only if the joint p.d.f. can be factorized as
$$f_n(x \mid \theta) = u(x) \, v[T(x), \theta],$$
and this can be written as
$$u(x) \, v[T(x), \theta] = u(x) \, v[g^{-1}(T'(x)), \theta] = u(x) \, v'[T'(x), \theta].$$
Therefore the joint p.d.f. can be factorized as $u(x) v'[T'(x), \theta]$, and by the factorization theorem $T'$ is a sufficient statistic.

For instance, in Example 4 we showed that for the normal distribution both $T_1 = \sum_{i=1}^n X_i$ and $T_2 = \bar{X}$ are sufficient statistics, and there is a one-to-one mapping between $T_1$ and $T_2$: $T_2 = T_1 / n$. Other statistics such as $\left( \sum_{i=1}^n X_i \right)^3$ and $\exp\left( \sum_{i=1}^n X_i \right)$ are also sufficient statistics.
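The conclusion of Example 4 can be illustrated numerically: for two different samples with the same value of $\sum_i x_i$, the log-likelihoods differ only by a constant in $\mu$ (coming from the $u(x)$ part of the factorization), so the likelihood's dependence on $\mu$ is identical. A minimal sketch, assuming $\sigma = 1$ and illustrative data:

```python
import math

def normal_loglik(x, mu, sigma=1.0):
    """Log joint density of an i.i.d. N(mu, sigma^2) sample."""
    n = len(x)
    return (-n / 2 * math.log(2 * math.pi * sigma**2)
            - sum((xi - mu)**2 for xi in x) / (2 * sigma**2))

# Two different samples with the same sufficient statistic sum(x) = 6.0
x1 = [1.0, 2.0, 3.0]
x2 = [0.5, 2.5, 3.0]
assert abs(sum(x1) - sum(x2)) < 1e-12

# log f(x1|mu) - log f(x2|mu) is constant in mu: only u(x) differs
diffs = [normal_loglik(x1, mu) - normal_loglik(x2, mu) for mu in (-1, 0, 2, 5)]
assert max(diffs) - min(diffs) < 1e-10
```

Since the difference does not involve $\mu$, any inference about $\mu$ (the MLE in particular) is the same for both samples.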
**Example 5**: Suppose that $X_1, \dots, X_n$ form a random sample from a beta distribution with parameters $\alpha$ and $\beta$, where the value of $\alpha$ is known and the value of $\beta$ is unknown ($\beta > 0$). Show that the following statistic $T$ is a sufficient statistic for the parameter $\beta$:
$$T = \left( \frac{1}{n} \sum_{i=1}^n \log \frac{1}{1 - X_i} \right)^3.$$
**Proof**: The p.d.f. $f(x \mid \beta)$ of each individual observation $X_i$ is
$$f(x \mid \beta) = \begin{cases} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, & \text{for } 0 < x < 1, \\ 0, & \text{otherwise.} \end{cases}$$
Therefore, the joint p.d.f. $f_n(x \mid \beta)$ of $X_1, \dots, X_n$ is
$$f_n(x \mid \beta) = \prod_{i=1}^n \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x_i^{\alpha - 1} (1 - x_i)^{\beta - 1} = \left( \frac{1}{\Gamma(\alpha)^n} \prod_{i=1}^n x_i^{\alpha - 1} \right) \frac{\Gamma(\alpha + \beta)^n}{\Gamma(\beta)^n} \left( \prod_{i=1}^n (1 - x_i) \right)^{\beta - 1}.$$

We define $T'(X_1, \dots, X_n) = \prod_{i=1}^n (1 - X_i)$, and because $\alpha$ is known, we can define
$$u(x) = \frac{1}{\Gamma(\alpha)^n} \prod_{i=1}^n x_i^{\alpha - 1}, \qquad v(t', \beta) = \frac{\Gamma(\alpha + \beta)^n}{\Gamma(\beta)^n} \, T'(x_1, \dots, x_n)^{\beta - 1}.$$
We can see that the function $v$ depends on $x$ only through $T'$; therefore $T'$ is a sufficient statistic. It is easy to see that
$$T = g(T') = \left( -\frac{\log T'}{n} \right)^3,$$
and the function $g$ is a one-to-one mapping. Therefore $T$ is a sufficient statistic.

**Example 6** (sampling from a uniform distribution): Suppose that $X_1, \dots, X_n$ form a random sample from a uniform distribution on the interval $[0, \theta]$, where the value of the parameter $\theta$ is unknown ($\theta > 0$). We shall show that $T = \max(X_1, \dots, X_n)$ is a sufficient statistic for $\theta$.

**Proof**: The p.d.f. $f(x \mid \theta)$ of each individual observation $X_i$ is
$$f(x \mid \theta) = \begin{cases} \frac{1}{\theta}, & \text{for } 0 \le x \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
Therefore, the joint p.d.f. $f_n(x \mid \theta)$ of $X_1, \dots, X_n$ is
$$f_n(x \mid \theta) = \begin{cases} \frac{1}{\theta^n}, & \text{for } 0 \le x_i \le \theta \ (i = 1, \dots, n), \\ 0, & \text{otherwise.} \end{cases}$$
It can be seen that if $x_i < 0$ for at least one value of $i$ ($i = 1, \dots, n$), then $f_n(x \mid \theta) = 0$ for every value of $\theta > 0$. Therefore it is only necessary to consider the factorization of $f_n(x \mid \theta)$ for values of $x_i \ge 0$ ($i = 1, \dots, n$). Define $h[\max(x_1, \dots, x_n), \theta]$ as
$$h[\max(x_1, \dots, x_n), \theta] = \begin{cases} 1, & \text{if } \max(x_1, \dots, x_n) \le \theta, \\ 0, & \text{if } \max(x_1, \dots, x_n) > \theta. \end{cases}$$
Also, $x_i \le \theta$ for $i = 1, \dots, n$ if and only if $\max(x_1, \dots, x_n) \le \theta$. Therefore, for $x_i \ge 0$ ($i = 1, \dots, n$), we can rewrite $f_n(x \mid \theta)$ as
$$f_n(x \mid \theta) = \frac{1}{\theta^n} \, h[\max(x_1, \dots, x_n), \theta].$$
Since the right side depends on $x$ only through the value of $\max(x_1, \dots, x_n)$, it follows that $T = \max(X_1, \dots, X_n)$ is a sufficient statistic for $\theta$. By the property of sufficient statistics, any one-to-one function of $T$ is a sufficient statistic as well.
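Example 6 can likewise be checked numerically: since $u(x) = 1$ in the factorization, any two samples with the same maximum have exactly the same likelihood for every $\theta$. A sketch with illustrative data:

```python
def uniform_lik(x, theta):
    """Joint density of an i.i.d. Uniform[0, theta] sample: theta^-n * h(max(x), theta)."""
    if min(x) < 0 or max(x) > theta:
        return 0.0
    return theta ** (-len(x))

# Two samples with the same sufficient statistic max(x) = 3.0
x1 = [0.2, 1.5, 3.0]
x2 = [2.9, 0.7, 3.0]
for theta in (2.0, 3.5, 10.0):
    # identical likelihoods for every theta (both are 0 when theta < 3.0)
    assert uniform_lik(x1, theta) == uniform_lik(x2, theta)
```

The likelihood is $\theta^{-n}$ for $\theta \ge \max(x)$ and $0$ otherwise, so it is maximized at $\theta = \max(x)$: the sufficient statistic is also the MLE here.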

**Example 7**: Suppose that $X_1, X_2, \dots, X_n$ are i.i.d. random variables on the interval $[0, 1]$ with the density function
$$f(x \mid \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1},$$
where $\alpha > 0$ is a parameter to be estimated from the sample. Find a sufficient statistic for $\alpha$.

**Solution**: The joint density function of $x_1, \dots, x_n$ is
$$f(x_1, \dots, x_n \mid \alpha) = \prod_{i=1}^n \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x_i(1 - x_i)]^{\alpha - 1} = \frac{\Gamma(2\alpha)^n}{\Gamma(\alpha)^{2n}} \left[ \prod_{i=1}^n x_i (1 - x_i) \right]^{\alpha - 1}.$$
Comparing with the form in the factorization theorem,
$$f(x_1, \dots, x_n \mid \theta) = u(x_1, \dots, x_n) \, v[T(x_1, \dots, x_n), \theta],$$
we see that $T = \prod_{i=1}^n X_i (1 - X_i)$, $v(t, \theta) = \frac{\Gamma(2\alpha)^n}{\Gamma(\alpha)^{2n}} t^{\alpha - 1}$, and $u(x_1, \dots, x_n) = 1$; i.e. $v$ depends on $x_1, \dots, x_n$ only through $t$. By the factorization theorem, $T = \prod_{i=1}^n X_i (1 - X_i)$ is a sufficient statistic. By the property of sufficient statistics, any one-to-one function of $T$ is a sufficient statistic as well.

## 3 Sufficient Statistics and Estimators

We know that estimators are statistics; in particular, we want the obtained estimator to be a sufficient statistic, since we want the estimator to absorb all the available information contained in the sample.

Suppose that $X_1, \dots, X_n$ form a random sample from a distribution for which the p.d.f. or point mass function is $f(x \mid \theta)$, where the value of the parameter $\theta$ is unknown, and assume there is a sufficient statistic $T(X_1, \dots, X_n)$ for $\theta$. We will show that the MLE of $\theta$, $\hat{\theta}$, depends on $X_1, \dots, X_n$ only through the statistic $T$.

It follows from the factorization theorem that the likelihood function $f_n(x \mid \theta)$ can be written as
$$f_n(x \mid \theta) = u(x) \, v[T(x), \theta].$$
We know that the MLE $\hat{\theta}$ is the value of $\theta$ for which $f_n(x \mid \theta)$ is maximized. Since both $u$ and $v$ are positive, it follows that $\hat{\theta}$ is the value of $\theta$ for which $v[T(x), \theta]$ is maximized. Since $v[T(x), \theta]$ depends on $x$ only through $T(x)$, it follows that $\hat{\theta}$ depends on $x$ only through $T(x)$. Thus the MLE $\hat{\theta}$ is a function of the sufficient statistic $T(X_1, \dots, X_n)$.

In many problems, the MLE $\hat{\theta}$ is actually a sufficient statistic. For instance, Example 1 shows that a sufficient statistic for the success probability in Bernoulli trials is $\sum_{i=1}^n X_i$, and we know the MLE for $\theta$ is $\bar{X}$; Example 4 shows that a sufficient statistic for $\mu$ in the normal distribution is $\bar{X}$, and this is the MLE for $\mu$; Example 6 shows that a sufficient statistic of $\theta$ for the uniform distribution on $[0, \theta]$ is $\max(X_1, \dots, X_n)$, and this is the MLE for $\theta$.

The above discussion for the MLE also holds for the Bayes estimator. Let $\theta$ be a parameter with parameter space $\Theta$ equal to an interval of real numbers (possibly unbounded), and assume that the prior distribution for $\theta$ is $p(\theta)$. Let $X$ have p.d.f. $f(x \mid \theta)$ conditional on $\theta$, and suppose we have a random sample $X_1, \dots, X_n$ from $f(x \mid \theta)$. Let $T(X_1, \dots, X_n)$ be a sufficient statistic. We first show that the posterior p.d.f. of $\theta$ given $X = x$ depends on $x$ only through $T(x)$. The likelihood term is $f_n(x \mid \theta)$, so by Bayes' formula the posterior distribution for $\theta$ is
$$f(\theta \mid x) = \frac{f_n(x \mid \theta) p(\theta)}{\int_\Theta f_n(x \mid \theta) p(\theta) \, d\theta} = \frac{u(x) v(T(x), \theta) p(\theta)}{\int_\Theta u(x) v(T(x), \theta) p(\theta) \, d\theta} = \frac{v(T(x), \theta) p(\theta)}{\int_\Theta v(T(x), \theta) p(\theta) \, d\theta},$$
where the second step uses the factorization theorem. We can see that the posterior p.d.f. of $\theta$ given $X = x$ depends on $x$ only through $T(x)$. Since the Bayes estimator of $\theta$ with respect to a specified loss function is calculated from this posterior p.d.f., the estimator also depends on the observed vector $x$ only through the value of $T(x)$. In other words, the Bayes estimator is a function of the sufficient statistic $T(X_1, \dots, X_n)$.

Summarizing the discussion above: both the MLE and the Bayes estimator are functions of a sufficient statistic, and therefore they absorb all the available information contained in the sample at hand.

## 4 Exponential Family of Probability Distributions

A study of the properties of probability distributions that have sufficient statistics of the same dimension as the parameter space regardless of the sample size led to the development of what is called the exponential family of probability distributions.
Many common distributions, including the normal, the binomial, the Poisson, and the gamma, are members of this family. One-parameter members of the exponential family have density or mass function of the form
$$f(x \mid \theta) = \exp[c(\theta) T(x) + d(\theta) + S(x)].$$
Suppose that $X_1, \dots, X_n$ are i.i.d. samples from a member of the exponential family. Then the joint probability function is
$$f(x \mid \theta) = \prod_{i=1}^n \exp[c(\theta) T(x_i) + d(\theta) + S(x_i)] = \exp\left[ c(\theta) \sum_{i=1}^n T(x_i) + n \, d(\theta) \right] \exp\left[ \sum_{i=1}^n S(x_i) \right].$$
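This one-parameter form can be sanity-checked numerically for a concrete case, the Bernoulli pmf (treated in Example 8 below), with $c(\theta) = \log\frac{\theta}{1-\theta}$, $T(x) = x$, $d(\theta) = \log(1 - \theta)$, and $S(x) = 0$. A sketch:

```python
from math import exp, log

def bern_pmf(x, theta):
    """Bernoulli pmf: theta^x * (1 - theta)^(1 - x), for x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

def bern_expfam(x, theta):
    """The same pmf in exponential-family form exp[c(theta) T(x) + d(theta) + S(x)]."""
    return exp(x * log(theta / (1 - theta)) + log(1 - theta))

# The two expressions agree for every x and theta
for theta in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bern_pmf(x, theta) - bern_expfam(x, theta)) < 1e-12
```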

From this result, it is apparent by the factorization theorem that $\sum_{i=1}^n T(x_i)$ is a sufficient statistic.

**Example 8**: The frequency function of the Bernoulli distribution is
$$P(X = x) = \theta^x (1 - \theta)^{1 - x}, \quad x = 0 \text{ or } x = 1,$$
$$= \exp\left[ x \log\left( \frac{\theta}{1 - \theta} \right) + \log(1 - \theta) \right]. \tag{1}$$
It can be seen that this is a member of the exponential family with $T(x) = x$, and we can also see that $\sum_{i=1}^n X_i$ is a sufficient statistic, the same as in Example 1.

**Example 9**: Suppose that $X_1, X_2, \dots, X_n$ are i.i.d. random variables on the interval $[0, 1]$ with the density function
$$f(x \mid \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1},$$
where $\alpha > 0$ is a parameter to be estimated from the sample. Find a sufficient statistic for $\alpha$ by verifying that this distribution belongs to the exponential family.

**Solution**: The density function is
$$f(x \mid \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1} = \exp\{ \log \Gamma(2\alpha) - 2 \log \Gamma(\alpha) + \alpha \log[x(1 - x)] - \log[x(1 - x)] \}.$$
Comparing to the form of the exponential family,
$$T(x) = \log[x(1 - x)]; \quad c(\alpha) = \alpha; \quad S(x) = -\log[x(1 - x)]; \quad d(\alpha) = \log \Gamma(2\alpha) - 2 \log \Gamma(\alpha).$$
Therefore $f(x \mid \alpha)$ belongs to the exponential family, and the sufficient statistic is
$$\sum_{i=1}^n T(X_i) = \sum_{i=1}^n \log[X_i (1 - X_i)] = \log\left[ \prod_{i=1}^n X_i (1 - X_i) \right].$$
In Example 7, we found the sufficient statistic $\prod_{i=1}^n X_i (1 - X_i)$, which is different from the result here. But both of them are sufficient statistics because of the one-to-one functional relationship between them.

A $k$-parameter member of the exponential family has a density or frequency function of the form
$$f(x \mid \theta) = \exp\left[ \sum_{i=1}^k c_i(\theta) T_i(x) + d(\theta) + S(x) \right].$$
For example, the normal distribution, the gamma distribution (Example 10 below), and the beta distribution are of this form.

**Example 10**: Show that the gamma distribution belongs to the exponential family.

**Proof**: The gamma distribution has density function
$$f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad 0 \le x < \infty,$$
which can be written as
$$f(x \mid \alpha, \beta) = \exp\{ -\beta x + (\alpha - 1) \log x + \alpha \log \beta - \log \Gamma(\alpha) \}.$$
Comparing with the form of the exponential family
$$\exp\left[ \sum_{i=1}^k c_i(\theta) T_i(x) + d(\theta) + S(x) \right],$$
we see that the gamma distribution has the exponential-family form with $c_1(\alpha, \beta) = -\beta$, $c_2(\alpha, \beta) = \alpha - 1$, $T_1(x) = x$, $T_2(x) = \log x$, $d(\alpha, \beta) = \alpha \log \beta - \log \Gamma(\alpha)$, and $S(x) = 0$. Therefore, the gamma distribution belongs to the exponential family.

## 5 Exercises

Instructions for Exercises 1 to 4: in each of these exercises, assume that the random variables $X_1, \dots, X_n$ form a random sample of size $n$ from the distribution specified in that exercise, and show that the statistic $T$ specified in the exercise is a sufficient statistic for the parameter.

**Exercise 1**: A normal distribution for which the mean $\mu$ is known and the variance $\sigma^2$ is unknown; $T = \sum_{i=1}^n (X_i - \mu)^2$.

**Exercise 2**: A gamma distribution with parameters $\alpha$ and $\beta$, where the value of $\beta$ is known and the value of $\alpha$ is unknown ($\alpha > 0$); $T = \prod_{i=1}^n X_i$.

**Exercise 3**: A uniform distribution on the interval $[a, b]$, where the value of $a$ is known and the value of $b$ is unknown ($b > a$); $T = \max(X_1, \dots, X_n)$.

**Exercise 4**: A uniform distribution on the interval $[a, b]$, where the value of $b$ is known and the value of $a$ is unknown ($b > a$); $T = \min(X_1, \dots, X_n)$.

**Exercise 5**: Suppose that $X_1, \dots, X_n$ form a random sample from a gamma distribution with parameters $\alpha > 0$ and $\beta > 0$, and the value of $\beta$ is known. Show that the statistic $T = \sum_{i=1}^n \log X_i$ is a sufficient statistic for the parameter $\alpha$.

**Exercise 6**: The Pareto distribution has density function
$$f(x \mid x_0, \theta) = \theta x_0^\theta x^{-\theta - 1}, \quad x \ge x_0, \ \theta > 1.$$
Assume that $x_0 > 0$ is given and that $X_1, X_2, \dots, X_n$ is an i.i.d. sample. Find a sufficient statistic for $\theta$ by (a) using the factorization theorem, (b) using the property of the exponential family. Are they the same? If not, why are both of them sufficient?

**Exercise 7**: Verify that the following are members of the exponential family:
(a) Geometric distribution $p(x) = p^{x-1}(1 - p)$;
(b) Poisson distribution $p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$;
(c) Normal distribution $N(\mu, \sigma^2)$;
(d) Beta distribution.
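As a numerical sanity check related to Example 10 (and a pattern reusable for Exercise 7), the exponential-family rewriting of the gamma density can be verified directly. A sketch; the parameter values are illustrative:

```python
from math import exp, log, lgamma

def gamma_pdf(x, alpha, beta):
    """Gamma density: beta^alpha / Gamma(alpha) * x^(alpha-1) * e^(-beta x)."""
    return beta**alpha / exp(lgamma(alpha)) * x**(alpha - 1) * exp(-beta * x)

def gamma_expfam(x, alpha, beta):
    """Exponential-family form from Example 10: c1 = -beta, T1 = x;
    c2 = alpha - 1, T2 = log x; d = alpha log beta - log Gamma(alpha); S = 0."""
    return exp(-beta * x + (alpha - 1) * log(x) + alpha * log(beta) - lgamma(alpha))

# The two expressions agree across a range of x values
for x in (0.5, 1.0, 3.0):
    assert abs(gamma_pdf(x, 2.5, 1.3) - gamma_expfam(x, 2.5, 1.3)) < 1e-12
```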


Lecture 19: Joint Distributions Statistics 14 If X and Y have independent unit normal distributions then their joint distribution f (x, y) is given by f (x, y) φ(x)φ(y) c e 1 (x +y ) Colin Rundel April,

### Bayesian probability: P. State of the World: X. P(X your information I)

Bayesian probability: P State of the World: X P(X your information I) 1 First example: bag of balls Every probability is conditional to your background knowledge I : P(A I) What is the (your) probability

### Homework n o 7 Math 505a

Homework n o 7 Math 505a Two players (player A and player B) play a board game. The rule is the following: both player start at position 0. Then, at every round, a dice is thrown: If the result is different

### Lecture 10: Characteristic Functions

Lecture 0: Characteristic Functions. Definition of characteristic functions. Complex random variables.2 Definition and basic properties of characteristic functions.3 Examples.4 Inversion formulas 2. Applications

### Discrete Binary Distributions

Discrete Binary Distributions Carl Edward Rasmussen November th, 26 Carl Edward Rasmussen Discrete Binary Distributions November th, 26 / 4 Key concepts Bernoulli: probabilities over binary variables Binomial:

### An introduction to the Beta-Binomial model

An introduction to the Beta-Binomial model COMPSCI 316: Computational Cognitive Science Dan Navarro & Amy Perfors University of Adelaide Abstract This note provides a detailed discussion of the beta-binomial

### The Delta Method and Applications

Chapter 5 The Delta Method and Applications 5.1 Linear approximations of functions In the simplest form of the central limit theorem, Theorem 4.18, we consider a sequence X 1, X,... of independent and

### m (t) = e nt m Y ( t) = e nt (pe t + q) n = (pe t e t + qe t ) n = (qe t + p) n

1. For a discrete random variable Y, prove that E[aY + b] = ae[y] + b and V(aY + b) = a 2 V(Y). Solution: E[aY + b] = E[aY] + E[b] = ae[y] + b where each step follows from a theorem on expected value from

### Expectation Discrete RV - weighted average Continuous RV - use integral to take the weighted average

PHP 2510 Expectation, variance, covariance, correlation Expectation Discrete RV - weighted average Continuous RV - use integral to take the weighted average Variance Variance is the average of (X µ) 2

### 18 Generalised Linear Models

18 Generalised Linear Models Generalised linear models (GLM) is a generalisation of ordinary least squares regression. See also Davison, Section 10.1-10.4, Green (1984) and Dobson and Barnett (2008). To

### Answers to some even exercises

Answers to some even eercises Problem - P (X = ) = P (white ball chosen) = /8 and P (X = ) = P (red ball chosen) = 7/8 E(X) = (P (X = ) + P (X = ) = /8 + 7/8 = /8 = /9 E(X ) = ( ) (P (X = ) + P (X = )

### MATH4427 Notebook 3 Spring MATH4427 Notebook Hypotheses: Definitions and Examples... 3

MATH4427 Notebook 3 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 3 MATH4427 Notebook 3 3 3.1 Hypotheses: Definitions and Examples............................

### Contents. TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics. Yuming Jiang. Basic Concepts Random Variables

TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics Yuming Jiang 1 Some figures taken from the web. Contents Basic Concepts Random Variables Discrete Random Variables Continuous Random

### Continuous Random Variables

Continuous Random Variables COMP 245 STATISTICS Dr N A Heard Contents 1 Continuous Random Variables 2 11 Introduction 2 12 Probability Density Functions 3 13 Transformations 5 2 Mean, Variance and Quantiles

### Continuous Distributions

MAT 2379 3X (Summer 2012) Continuous Distributions Up to now we have been working with discrete random variables whose R X is finite or countable. However we will have to allow for variables that can take

### Normal distribution Approximating binomial distribution by normal 2.10 Central Limit Theorem

1.1.2 Normal distribution 1.1.3 Approimating binomial distribution by normal 2.1 Central Limit Theorem Prof. Tesler Math 283 October 22, 214 Prof. Tesler 1.1.2-3, 2.1 Normal distribution Math 283 / October

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Continuous Random Variables

Probability 2 - Notes 7 Continuous Random Variables Definition. A random variable X is said to be a continuous random variable if there is a function f X (x) (the probability density function or p.d.f.)

### Review of Likelihood Theory

Appendix A Review of Likelihood Theory This is a brief summary of some of the key results we need from likelihood theory. A.1 Maximum Likelihood Estimation Let Y 1,..., Y n be n independent random variables

### 2WB05 Simulation Lecture 8: Generating random variables

2WB05 Simulation Lecture 8: Generating random variables Marko Boon http://www.win.tue.nl/courses/2wb05 January 7, 2013 Outline 2/36 1. How do we generate random variables? 2. Fitting distributions Generating

### The distance of a curve to its control polygon J. M. Carnicer, M. S. Floater and J. M. Peña

The distance of a curve to its control polygon J. M. Carnicer, M. S. Floater and J. M. Peña Abstract. Recently, Nairn, Peters, and Lutterkort bounded the distance between a Bézier curve and its control

### Chapter 3: Discrete Random Variable and Probability Distribution. January 28, 2014

STAT511 Spring 2014 Lecture Notes 1 Chapter 3: Discrete Random Variable and Probability Distribution January 28, 2014 3 Discrete Random Variables Chapter Overview Random Variable (r.v. Definition Discrete

### Chapter 1. Probability, Random Variables and Expectations. 1.1 Axiomatic Probability

Chapter 1 Probability, Random Variables and Expectations Note: The primary reference for these notes is Mittelhammer (1999. Other treatments of probability theory include Gallant (1997, Casella & Berger

### Continuous Random Variables and Probability Distributions. Stat 4570/5570 Material from Devore s book (Ed 8) Chapter 4 - and Cengage

4 Continuous Random Variables and Probability Distributions Stat 4570/5570 Material from Devore s book (Ed 8) Chapter 4 - and Cengage Continuous r.v. A random variable X is continuous if possible values

### Sampling Distribution of a Normal Variable

Ismor Fischer, 5/9/01 5.-1 5. Formal Statement and Examples Comments: Sampling Distribution of a Normal Variable Given a random variable. Suppose that the population distribution of is known to be normal,

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### MATH 201. Final ANSWERS August 12, 2016

MATH 01 Final ANSWERS August 1, 016 Part A 1. 17 points) A bag contains three different types of dice: four 6-sided dice, five 8-sided dice, and six 0-sided dice. A die is drawn from the bag and then rolled.

### CHAPTER III - MARKOV CHAINS

CHAPTER III - MARKOV CHAINS JOSEPH G. CONLON 1. General Theory of Markov Chains We have already discussed the standard random walk on the integers Z. A Markov Chain can be viewed as a generalization of

### Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization 1. Function to minimize w.r.t. β 0, β 1 Q = n (Y i (β 0 + β 1 X i )) 2 i=1 2. Minimize this by

### Applied Bayesian Inference

Applied Bayesian Inference Prof. Dr. Renate Meyer 1,2 1 Institute for Stochastics, Karlsruhe Institute of Technology, Germany 2 Department of Statistics, University of Auckland, New Zealand KIT, Winter

### MIDTERM EXAM - MATH 563, SPRING 2016

MIDTERM EXAM - MATH 563, SPRING 2016 NAME: SOLUTION Exam rules: There are 5 problems o this exam. You must show all work to receive credit, state ay theorems ad defiitios clearly. The istructor will NOT

### The Effects of a Square Root Transform on a Poisson Distributed Quantity.

Tina Memo No. 2001-010 Internal Report The Effects of a Square Root Transform on a Poisson Distributed Quantity. N.A. Thacker and P.A. Bromiley Last updated 25 / 01 / 2009 Imaging Science and Biomedical

### FALL 2005 EXAM C SOLUTIONS

FALL 005 EXAM C SOLUTIONS Question #1 Key: D S ˆ(300) = 3/10 (there are three observations greater than 300) H ˆ (300) = ln[ S ˆ (300)] = ln(0.3) = 1.0. Question # EX ( λ) = VarX ( λ) = λ µ = v = E( λ)

### Satistical modelling of clinical trials

Sept. 21st, 2012 1/13 Patients recruitment in a multicentric clinical trial : Recruit N f patients via C centres. Typically, N f 300 and C 50. Trial can be very long (several years) and very expensive

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### Baby Bayes using R. Rebecca C. Steorts

Baby Bayes using R Rebecca C. Steorts Last L A TEX d Tuesday 12 th January, 2016 Copyright c 2016 Rebecca C. Steorts; do not distribution without author s permission Contents 1 Motivations and Introduction

### Statistiek (WISB361)

Statistiek (WISB361) Final exam June 29, 2015 Schrijf uw naam op elk in te leveren vel. Schrijf ook uw studentnummer op blad 1. The maximum number of points is 100. Points distribution: 23 20 20 20 17

### The t Test. April 3-8, exp (2πσ 2 ) 2σ 2. µ ln L(µ, σ2 x) = 1 σ 2. i=1. 2σ (σ 2 ) 2. ˆσ 2 0 = 1 n. (x i µ 0 ) 2 = i=1

The t Test April 3-8, 008 Again, we begin with independent normal observations X, X n with unknown mean µ and unknown variance σ The likelihood function Lµ, σ x) exp πσ ) n/ σ x i µ) Thus, ˆµ x ln Lµ,

### Math Chapter Seven Sample Exam

Math 333 - Chapter Seven Sample Exam 1. The cereal boxes coming off an assembly line are each supposed to contain 12 ounces. It is reasonable to assume that the amount of cereal per box has an approximate

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0

Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0 This primer provides an overview of basic concepts and definitions in probability and statistics. We shall

### The Poisson Distribution

The Poisson Distribution Gary Schurman MBE, CFA June 2012 The Poisson distribution is a discrete probability distribution where a Poisson-distributed random variable can take the integer values [0, 1,

### Notes on Continuous Random Variables

Notes on Continuous Random Variables Continuous random variables are random quantities that are measured on a continuous scale. They can usually take on any value over some interval, which distinguishes

### Exact Confidence Intervals

Math 541: Statistical Theory II Instructor: Songfeng Zheng Exact Confidence Intervals Confidence intervals provide an alternative to using an estimator ˆθ when we wish to estimate an unknown parameter

### Chapter 13. Vehicle Arrival Models : Count Introduction Poisson Distribution

Chapter 13 Vehicle Arrival Models : Count 13.1 Introduction As already noted in the previous chapter that vehicle arrivals can be modelled in two interrelated ways; namely modelling how many vehicle arrive

### Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01 Probability and Statistics for Engineers Winter 2010-2011 Contents 1 Joint Probability Distributions 1 1.1 Two Discrete

### Sample Size Determination for Clinical Trials

Sample Size Determination for Clinical Trials Paivand Jalalian Advisor: Professor Kelly McConville May 17, 2014 Abstract An important component of clinical trials is determining the smallest sample size

### Lecture 8: Signal Detection and Noise Assumption

ECE 83 Fall Statistical Signal Processing instructor: R. Nowak, scribe: Feng Ju Lecture 8: Signal Detection and Noise Assumption Signal Detection : X = W H : X = S + W where W N(, σ I n n and S = [s, s,...,

### Math 461 Fall 2006 Test 2 Solutions

Math 461 Fall 2006 Test 2 Solutions Total points: 100. Do all questions. Explain all answers. No notes, books, or electronic devices. 1. [105+5 points] Assume X Exponential(λ). Justify the following two

### Introduction to Hypothesis Testing. Point estimation and confidence intervals are useful statistical inference procedures.

Introduction to Hypothesis Testing Point estimation and confidence intervals are useful statistical inference procedures. Another type of inference is used frequently used concerns tests of hypotheses.

### ABC methods for model choice in Gibbs random fields

ABC methods for model choice in Gibbs random fields Jean-Michel Marin Institut de Mathématiques et Modélisation Université Montpellier 2 Joint work with Aude Grelaud, Christian Robert, François Rodolphe,

sheng@mail.ncyu.edu.tw 1 Content Introduction Expectation and variance of continuous random variables Normal random variables Exponential random variables Other continuous distributions The distribution

### EE 322: Probabilistic Methods for Electrical Engineers. Zhengdao Wang Department of Electrical and Computer Engineering Iowa State University

EE 322: Probabilistic Methods for Electrical Engineers Zhengdao Wang Department of Electrical and Computer Engineering Iowa State University Discrete Random Variables 1 Introduction to Random Variables