8 Laws of large numbers

8.1 Introduction

The following comes up very often, especially in statistics. We have an experiment and a random variable $X$ associated with it. We repeat the experiment $n$ times and let $X_1, X_2, \dots, X_n$ be the values of the random variable we get. We can think of the $n$-fold repetition of the original experiment as a sort of super-experiment and $X_1, X_2, \dots, X_n$ as random variables for it. We assume that the experiment does not change as we repeat it, so that the $X_j$ are identically distributed. (Recall that this means that $X_i$ and $X_j$ have the same pmf or pdf.) And we assume that the different performances of the experiment do not influence each other. So the random variables $X_1, \dots, X_n$ are independent.

In statistics one typically does not know the pmf or pdf of the $X_j$. The statistician's job is to take the random sample $X_1, \dots, X_n$ and draw conclusions about the distribution of $X$. For example, one would like to know the mean $E[X]$ of $X$. The simplest way to estimate it is to look at the sample mean, which is defined to be
$$\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$$
Note that $\bar X_n$ is itself a random variable. Intuitively we expect that as $n \to \infty$, $\bar X_n$ will converge to $E[X]$. What exactly do we mean by convergence of a sequence of random variables? And what can we say about the rate of convergence and the error? These questions are the focus of this chapter.

We already know quite a bit about $\bar X_n$. Its mean is $\mu = E[X]$, and its variance is $\sigma^2/n$ where $\sigma^2$ is the common variance of the $X_j$. The first theorems in this chapter will say that as $n \to \infty$, $\bar X_n$ converges to the constant $\mu$ in some sense. Results like this are called a law of large numbers. We will see two of them, corresponding to two different notions of convergence.

The other big theorem of this chapter is the central limit theorem. Suppose we shift and rescale $\bar X_n$ to
$$\frac{\bar X_n - \mu}{\sigma/\sqrt{n}}$$
Note that this random variable has mean zero and variance one. The CLT says that it converges to a standard normal under some very mild assumptions on the distribution of $X$.
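
To make this concrete, here is a small R sketch (R is also used later in these notes for normal probabilities) that simulates the sample mean of fair die rolls, where $E[X] = 3.5$; the seed and sample sizes are arbitrary choices.

    # Sample mean of n fair die rolls; as n grows it should settle near E[X] = 3.5.
    set.seed(1)
    for (n in c(10, 100, 10000, 1000000)) {
      rolls <- sample(1:6, n, replace = TRUE)
      cat("n =", n, "  sample mean =", mean(rolls), "\n")
    }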

8.2 Weak law of large numbers

If we roll a fair six-sided die, the mean of the number we get is 3.5. If we roll the die a large number of times and average the numbers we get (i.e., compute $\bar X_n$), then we do not expect to get exactly 3.5, but rather something close. So we could ask how likely it is that $\bar X_n$ is far from 3.5, say $|\bar X_n - 3.5| \ge 0.01$. This is an event (for the super-experiment), so we can consider its probability
$$P(|\bar X_n - 3.5| \ge 0.01)$$
In particular we might expect this probability to go to zero as $n \to \infty$. This motivates the following definition.

Definition 1. Let $Y_n$ be a sequence of random variables, and $Y$ a random variable, all defined on the same probability space. We say $Y_n$ converges to $Y$ in probability if for every $\epsilon > 0$,
$$\lim_{n \to \infty} P(|Y_n - Y| > \epsilon) = 0$$

Theorem 1 (Weak law of large numbers). Let $X_j$ be an i.i.d. sequence with finite mean and variance. Let $\mu = E[X_j]$. Then
$$\bar X_n = \frac{1}{n} \sum_{j=1}^n X_j \to \mu \quad \text{in probability}$$

There are better versions of the theorem in the sense that they have weaker hypotheses (you don't need to assume the variance is finite). There is also a stronger theorem with a stronger form of convergence (the strong law of large numbers). We will eventually prove the theorem, but first we introduce another notion of convergence.

Definition 2. Let $Y_n$ be a sequence of random variables with finite variance and $Y$ a random variable with finite variance. We say that $Y_n$ converges to $Y$ in mean square if
$$\lim_{n \to \infty} E[(Y_n - Y)^2] = 0$$
In analysis this is often called convergence in $L^2$.
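
As an informal illustration of the weak law (not part of the proof), the following R sketch estimates the probability that the sample mean of $n$ die rolls is far from 3.5; to keep the run time modest it uses the tolerance 0.05 rather than 0.01, and the number of repetitions is an arbitrary choice.

    # Monte Carlo estimate of P(|Xbar_n - 3.5| >= 0.05) for fair die rolls.
    # The weak law says this probability tends to 0 as n grows.
    set.seed(2)
    prob_far <- function(n, eps = 0.05, reps = 2000) {
      xbar <- replicate(reps, mean(sample(1:6, n, replace = TRUE)))
      mean(abs(xbar - 3.5) >= eps)
    }
    sapply(c(100, 1000, 10000), prob_far)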

Proposition 1. Let $X_n$ be a sequence of i.i.d. random variables with finite variance. Let $\mu = E[X_n]$. Then $\bar X_n$ converges to $\mu$ in mean square.

Proof. We have to show
$$\lim_{n \to \infty} E[(\bar X_n - \mu)^2] = 0$$
But since the mean of $\bar X_n$ is $\mu$, $E[(\bar X_n - \mu)^2]$ is the variance of $\bar X_n$. We know that this variance is $\sigma^2/n$, which obviously goes to zero as $n \to \infty$.

Next we show that convergence in mean square implies convergence in probability. The tool to show this is the following inequality.

Proposition 2 (Chebyshev's inequality).
$$P(|X| \ge a) \le \frac{E[X^2]}{a^2}$$

Proof. To make things concrete we assume we have a continuous RV. Then letting $f(x)$ be the pdf of $X$,
$$E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\,dx$$
Since the integrand is non-negative,
$$E[X^2] \ge \int_{|x| \ge a} x^2 f(x)\,dx \ge \int_{|x| \ge a} a^2 f(x)\,dx = a^2 \int_{|x| \ge a} f(x)\,dx = a^2 P(|X| \ge a)$$
Thus we have the inequality in the proposition.
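
A quick numerical check of Chebyshev's inequality, taking $X$ to be a standard normal (so $E[X^2] = 1$); the exact left-hand side is available from pnorm, and the values of $a$ are arbitrary.

    # Chebyshev: P(|X| >= a) <= E[X^2]/a^2.  For X standard normal, E[X^2] = 1
    # and the exact probability is P(|X| >= a) = 2*(1 - pnorm(a)).
    a <- c(1, 2, 3, 4)
    cbind(a, exact = 2 * (1 - pnorm(a)), chebyshev_bound = 1 / a^2)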

Proposition 3. Let $Y_n$ be a sequence of random variables with finite variance and $Y$ a random variable with finite variance. Suppose $Y_n$ converges to $Y$ in mean square. Then it converges to $Y$ in probability.

Proof. Let $\epsilon > 0$. We must show
$$\lim_{n \to \infty} P(|Y_n - Y| > \epsilon) = 0$$
By Chebyshev's inequality,
$$P(|Y_n - Y| > \epsilon) \le \frac{E[(Y_n - Y)^2]}{\epsilon^2}$$
By hypothesis $E[(Y_n - Y)^2] \to 0$ as $n \to \infty$. So for a fixed $\epsilon$, $E[(Y_n - Y)^2]/\epsilon^2 \to 0$ as $n \to \infty$.

8.3 Central limit theorem

Let $X_n$ be an i.i.d. sequence with finite variance. Let $\mu$ be their common mean and $\sigma^2$ their common variance. As before we let $\bar X_n = \frac{1}{n} \sum_{j=1}^n X_j$. Define
$$Z_n = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} = \frac{\sum_{j=1}^n X_j - n\mu}{\sqrt{n}\,\sigma}$$
Note that $E[Z_n] = 0$ and $\mathrm{var}(Z_n) = 1$. The best way to remember the definition is that $Z_n$ is $\bar X_n$ shifted and scaled so that it has mean 0 and variance 1. The central limit theorem says that the distribution of $Z_n$ converges to a standard normal. There are several senses in which it might converge, so we have to make this statement more precise. We might ask if the density function of $Z_n$ converges to that of a standard normal, i.e., $\frac{1}{\sqrt{2\pi}} \exp(-z^2/2)$. But we do not assume that $X_n$ is a continuous RV. If it is a discrete RV, then so is $Z_n$, and then $Z_n$ does not even have a density function. Let $\Phi(x)$ be the cdf of a standard normal:
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz$$

Theorem 2 (Central limit theorem). Let $X_n$ be an i.i.d. sequence of random variables with finite mean $\mu$ and variance $\sigma^2$. Define $Z_n$ as above. Then for all $x \in \mathbb{R}$,
$$\lim_{n \to \infty} P(Z_n \le x) = \Phi(x)$$

In words, the theorem says that the cdf of $Z_n$ converges pointwise to the cdf of the standard normal. This is an example of what is called convergence in distribution in probability. However, we caution the reader that the general definition of convergence in distribution involves some technicalities. This is the most important theorem in the course and plays a major role in statistics. In particular, much of the theory of confidence intervals and hypothesis testing is the central limit theorem in disguise.
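
A simulation sketch of the theorem, using exponential(1) summands (so $\mu = \sigma = 1$); the sample size $n$ and the number of repetitions are arbitrary choices.

    # Compare the empirical cdf of Z_n with Phi for exponential(1) summands.
    set.seed(3)
    n <- 50
    reps <- 100000
    zn <- replicate(reps, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))  # (Xbar - mu)/(sigma/sqrt(n))
    for (x in c(-1, 0, 1, 2)) {
      cat("x =", x, "  empirical P(Z_n <= x):", mean(zn <= x), "  Phi(x):", pnorm(x), "\n")
    }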

Example: A computer has a random number generator that generates random numbers uniformly distributed in $[0,1]$. We run it 100 times and let $S_n$ be the sum of the 100 numbers. Estimate $P(S_n \ge 52)$. Let $X_j$ be the numbers. So $E[X_j] = 1/2$ and $\mathrm{var}(X_j) = 1/12$. So
$$Z_n = \frac{S_n - n\mu}{\sqrt{n}\,\sigma} = \frac{S_n - 100 \cdot 1/2}{10\sqrt{1/12}}$$
So $S_n \ge 52$ is the same as
$$Z_n \ge \frac{52 - 100 \cdot 1/2}{10\sqrt{1/12}} = \frac{2\sqrt{12}}{10} = 0.6928$$
So $P(S_n \ge 52) = P(Z_n \ge 0.6928)$. Now we approximate $P(Z_n \ge 0.6928)$ by assuming $Z_n$ has a standard normal distribution. Then we can look up $P(Z_n \le 0.6928)$ in R and use $P(Z_n \ge 0.6928) = 1 - P(Z_n \le 0.6928)$. In R we get $P(Z_n \le 0.6928)$ from pnorm(0.6928). (This is the same as pnorm(0.6928, 0, 1). If you don't specify a mean and standard deviation when you call pnorm(), it assumes a standard normal.) So we find
$$P(S_n \ge 52) = P(Z_n \ge 0.6928) = 1 - P(Z_n \le 0.6928) \approx 1 - 0.7558 = 0.2442$$
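
The same computation in R, together with a Monte Carlo check; the seed and the number of repetitions are arbitrary.

    # CLT approximation to P(S_n >= 52) for the sum of 100 uniform(0,1) numbers.
    z <- (52 - 100 * 0.5) / (sqrt(100) * sqrt(1/12))   # about 0.6928
    1 - pnorm(z)                                       # about 0.2442
    # Monte Carlo check of the exact probability:
    set.seed(4)
    mean(replicate(100000, sum(runif(100)) >= 52))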

Example: We flip a fair coin $n$ times. How large must $n$ be to have $P(|\text{fraction of H} - 1/2| \ge 0.01) \le 0.05$? According to R, qnorm(0.975) = 1.959964. This means that for a standard normal $Z$, $P(Z \le 1.959964) = 0.975$. By symmetry, $P(|Z| \le 1.959964) = 0.95$, so $P(|Z| \ge 1.959964) = 0.05$. Let $X_j$ be 1 if there is H on the $j$th flip and 0 if it is T. Let $\bar X_n$ be the usual sample mean, so $\bar X_n$ is the fraction of heads. For a single flip the variance is $p(1-p) = 1/4$, so $\sigma = 1/2$. So
$$Z_n = \frac{\bar X_n - 1/2}{\sigma/\sqrt{n}} = \frac{\bar X_n - 1/2}{1/(2\sqrt{n})}$$
So $|\bar X_n - 1/2| \ge 0.01$ is the same as $|Z_n| \ge 0.01\sqrt{n}/\sigma$, i.e., $|Z_n| \ge 0.02\sqrt{n}$. So
$$P(|\text{fraction of H} - 1/2| \ge 0.01) \approx P(|Z| \ge 0.02\sqrt{n})$$
where $Z$ is a standard normal. If this is to be at most 0.05 we need $0.02\sqrt{n} \ge 1.96 \approx 2$; rounding up to 2 gives $n \ge 10{,}000$.

End of Nov 9 lecture

Confidence intervals: The following is an important problem in statistics. We have a random variable $X$ (usually called the population). We know its variance $\sigma^2$, but we don't know its mean $\mu$. We have a random sample, i.e., random variables $X_1, X_2, \dots, X_n$ which are independent and all have the same distribution as $X$. We want to use our one sample $X_1, \dots, X_n$ to estimate $\mu$. The natural estimate for $\mu$ is the sample mean
$$\bar X_n = \frac{1}{n} \sum_{j=1}^n X_j$$
How close is $\bar X_n$ to the true value of $\mu$? This is a vague question. We make it precise as follows. For what $\epsilon > 0$ will $P(|\bar X_n - \mu| \le \epsilon) = 0.95$? We say that $[\bar X_n - \epsilon, \bar X_n + \epsilon]$ is a 95% confidence interval for $\mu$. (The choice of 95% is somewhat arbitrary; we could use 98%, for example.) If $n$ is large we can use the CLT to figure out what $\epsilon$ should be. As before we let
$$Z_n = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}}$$
So $|\bar X_n - \mu| \le \epsilon$ is equivalent to $|Z_n| \le \epsilon\sqrt{n}/\sigma$. So we want
$$P(|Z_n| \le \epsilon\sqrt{n}/\sigma) = 0.95$$
The CLT says that the distribution of $Z_n$ is approximately that of a standard normal. If $Z$ is a standard normal, then $P(|Z| \le 1.96) = 0.95$. So $\epsilon\sqrt{n}/\sigma = 1.96$, and we have found that the 95% confidence interval for $\mu$ is $[\bar X_n - \epsilon, \bar X_n + \epsilon]$ where
$$\epsilon = 1.96\,\sigma/\sqrt{n}$$
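
A short R sketch of this confidence interval; the data here are simulated with an assumed $\sigma = 2$ and true mean 10 purely for illustration.

    # 95% confidence interval for mu when sigma is known.
    set.seed(5)
    sigma <- 2
    x <- rnorm(100, mean = 10, sd = sigma)        # stand-in for the observed sample
    eps <- qnorm(0.975) * sigma / sqrt(length(x))
    c(mean(x) - eps, mean(x) + eps)               # [xbar - eps, xbar + eps]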

Example: Lightbulbs

The 1/2 game: Suppose $X_1, X_2, \dots$ are i.i.d. and integer valued. For example, we flip a coin repeatedly and let $X_n$ be 1 if we get heads on the $n$th flip and 0 if we get tails. We want to estimate $P(a \le S_n \le b)$, where $S_n = \sum_{j=1}^n X_j$. We take $a$ and $b$ to be integers. Then
$$P(a - \delta \le S_n \le b + \delta) = P(a \le S_n \le b)$$
for any positive $\delta < 1$. But the CLT will give us different answers depending on which $\delta$ we use. What is the best? Answer: $\delta = 1/2$. So we approximate $P(a \le S_n \le b)$ by
$$P(a - 1/2 \le S_n \le b + 1/2) = P\!\left(\frac{a - 1/2 - n\mu}{\sigma\sqrt{n}} \le \frac{S_n - n\mu}{\sigma\sqrt{n}} \le \frac{b + 1/2 - n\mu}{\sigma\sqrt{n}}\right) \approx P\!\left(\frac{a - 1/2 - n\mu}{\sigma\sqrt{n}} \le Z \le \frac{b + 1/2 - n\mu}{\sigma\sqrt{n}}\right)$$
where $Z$ is a standard normal.

End of Nov 18 lecture
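
A numerical illustration of why $\delta = 1/2$ is the right choice, for $S_n \sim \text{Binomial}(100, 1/2)$ and the event $45 \le S_n \le 55$ (the numbers are arbitrary); the corrected approximation is much closer to the exact binomial probability.

    # Exact probability vs. CLT approximations with and without the 1/2 correction.
    # Here n*mu = 50 and sigma*sqrt(n) = 5.
    exact     <- pbinom(55, 100, 0.5) - pbinom(44, 100, 0.5)
    with_half <- pnorm((55.5 - 50) / 5) - pnorm((44.5 - 50) / 5)
    without   <- pnorm((55 - 50) / 5) - pnorm((45 - 50) / 5)
    c(exact = exact, with_half = with_half, without = without)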

Example: Let $X_n$ be an i.i.d. sequence of standard normals. Let
$$Y_n = \frac{1}{n} \sum_{j=1}^n X_j^2$$
Use the CLT to find $n$ so that $P(Y_n \ge 1.01) = 5\%$. Note that the i.i.d. sequence that we apply the CLT to is $X_n^2$, not $X_n$. The mean of $X_n^2$ is $E[X_n^2] = 1$. The variance is $E[X_n^4] - (E[X_n^2])^2 = 3 - 1 = 2$. The mean of $Y_n$ is 1, and the variance is
$$\mathrm{var}(Y_n) = \frac{1}{n^2} \cdot 2n = \frac{2}{n}$$
So to standardize,
$$Z_n = \frac{Y_n - 1}{\sqrt{2/n}}$$
So
$$P(Y_n \ge 1.01) = P\!\left(\frac{Y_n - 1}{\sqrt{2/n}} \ge \frac{1.01 - 1}{\sqrt{2/n}}\right) \approx P\!\left(Z \ge \frac{0.01}{\sqrt{2/n}}\right)$$
where $Z$ is a standard normal. R says that qnorm(0.95) = 1.645, so $P(Z \ge 1.645) = 0.05$. So we need $0.01\sqrt{n/2} = 1.645$, which gives $n = 2(164.5)^2 \approx 54{,}121$.

8.4 Strong law of large numbers

Suppose $X_n$ is an i.i.d. sequence. As before,
$$\bar X_n = \frac{1}{n} \sum_{j=1}^n X_j$$
The weak law involves probabilities that $\bar X_n$ does certain things. We now ask if $\lim_{n\to\infty} \bar X_n$ exists and what it converges to. Each $\bar X_n$ is a random variable, i.e., a function of $\omega$, a point in the sample space. It is possible that this limit equals $\mu$ for some $\omega$ but not for other $\omega$.

Example: We flip a fair coin infinitely often. Let $X_n$ be 1 if the $n$th flip is heads, 0 if it is tails. Then $\bar X_n$ is the fraction of heads in the first $n$ flips. We can think of the sample space as sequences of H's and T's. So one possible $\omega$ is the sequence with all H's. For this $\omega$, $X_n(\omega) = 1$ for all $n$, so $\bar X_n(\omega) = 1$ for all $n$. So $\lim_{n\to\infty} \bar X_n(\omega)$ exists, but it equals 1, not the mean of $X_n$, which is $1/2$. So it is certainly not true that $\lim_{n\to\infty} \bar X_n = \mu$ for all $\omega$. Our counterexample where we get all heads is atypical, so we might hope that the set of $\omega$ for which the limit of the $\bar X_n$ is not $\mu$ is small.

Definition 3. Let $Y_n$ be a sequence of random variables and $Y$ a random variable. We say $Y_n$ converges to $Y$ with probability one if
$$P(\{\omega : \lim_{n\to\infty} Y_n(\omega) = Y(\omega)\}) = 1$$
More succinctly, $P(Y_n \to Y) = 1$. This is a stronger form of convergence than convergence in probability. (This is not at all obvious.)
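
Here is one realization (one $\omega$) of the running sample mean for fair coin flips, as a sketch of what convergence with probability one looks like; the seed and the number of flips are arbitrary.

    # Running sample means of fair-coin flips along a single omega.
    set.seed(6)
    flips <- rbinom(100000, size = 1, prob = 0.5)
    running_mean <- cumsum(flips) / seq_along(flips)
    running_mean[c(10, 100, 1000, 10000, 100000)]   # drifts toward 1/2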

Theorem 3. If $Y_n$ converges to $Y$ with probability one, then $Y_n$ converges to $Y$ in probability.

Proof. Let $\epsilon > 0$. We must show
$$\lim_{n\to\infty} P(|Y_n - Y| \ge \epsilon) = 0$$
Define
$$E = \{\omega : \lim_{n\to\infty} Y_n(\omega) = Y(\omega)\}$$
Since $Y_n$ converges to $Y$ with probability one, $P(E) = 1$. Let
$$E_n = \bigcap_{k=n}^{\infty} \{\omega : |Y_k(\omega) - Y(\omega)| < \epsilon\}$$
As $n$ gets larger, the intersection has fewer sets and so is larger. So $E_n$ is an increasing sequence. If $\omega \in E$, then $\omega \in E_n$ for large enough $n$. So
$$E \subset \bigcup_{n=1}^{\infty} E_n$$
Since $P(E) = 1$, this implies $P(\bigcup_{n=1}^{\infty} E_n) = 1$. By continuity of the probability measure,
$$1 = P\!\left(\bigcup_{n=1}^{\infty} E_n\right) = \lim_{n\to\infty} P(E_n)$$
Note that $E_n \subset \{|Y_n - Y| < \epsilon\}$, so $P(E_n) \le P(|Y_n - Y| < \epsilon)$. Since $P(E_n)$ goes to 1 and probabilities are bounded by 1, $P(|Y_n - Y| < \epsilon)$ goes to 1. So $P(|Y_n - Y| \ge \epsilon)$ goes to 0.

8.5 Proof of central limit theorem

We give a partial proof of the central limit theorem: we will prove that the moment generating function of $Z_n$ converges to that of a standard normal. Let $X_n$ be an i.i.d. sequence. We assume that $\mu = 0$. (We can always reduce to this case by replacing $X_n$ with $X_n - \mu$; this does not change $Z_n$.) Since the $X_n$ are identically distributed, they have the same mgf. Call it $m(t)$, so
$$m(t) = M_{X_n}(t) = E[e^{tX_n}]$$

As before,
$$Z_n = \frac{S_n}{\sigma\sqrt{n}}, \qquad S_n = \sum_{j=1}^n X_j$$
Now
$$M_{Z_n}(t) = E\!\left[\exp\!\left(t \frac{S_n}{\sigma\sqrt{n}}\right)\right] = E\!\left[\exp\!\left(\frac{t}{\sigma\sqrt{n}} S_n\right)\right] = M_{S_n}\!\left(\frac{t}{\sigma\sqrt{n}}\right)$$
Since the $X_j$ are independent, $M_{S_n}(t) = \prod_{j=1}^n M_{X_j}(t) = m(t)^n$. So the above becomes
$$M_{Z_n}(t) = \left[m\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$
Now we do a Taylor expansion of $m(t)$ about $t = 0$, to second order:
$$m(t) = m(0) + m'(0)t + \tfrac{1}{2} m''(0)t^2 + O(t^3)$$
We have
$$m(0) = 1, \qquad m'(0) = E[X_j] = 0, \qquad m''(0) = E[X_j^2] = \mathrm{var}(X_j) = \sigma^2$$
So the Taylor expansion is
$$m(t) = 1 + \tfrac{1}{2}\sigma^2 t^2 + O(t^3)$$
So
$$\left[m\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n = \left[1 + \frac{\sigma^2 t^2}{2\sigma^2 n} + O\!\left(\frac{t^3}{\sigma^3 n^{3/2}}\right)\right]^n = \left[1 + \frac{t^2}{2n} + O\!\left(\frac{t^3}{\sigma^3 n^{3/2}}\right)\right]^n$$
We want to show that this converges to the mgf of the standard normal, which is $\exp(\tfrac{1}{2}t^2)$. So we need to show that the ln of the above converges to $\tfrac{1}{2}t^2$. The ln of the above is
$$\ln(M_{Z_n}(t)) = n \ln\!\left[1 + \frac{t^2}{2n} + O\!\left(\frac{t^3}{\sigma^3 n^{3/2}}\right)\right] = n\left[\frac{t^2}{2n} + O\!\left(\frac{t^3}{\sigma^3 n^{3/2}}\right) + \cdots\right] = \frac{t^2}{2} + O\!\left(\frac{t^3}{\sigma^3 n^{1/2}}\right) + \cdots$$
which converges to $t^2/2$.
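
A numerical sanity check of this limit in a case where $m(t)$ is explicit: if $X_j = \pm 1$ with probability $1/2$ each, then $\mu = 0$, $\sigma = 1$ and $m(t) = \cosh(t)$; the values of $t$ and $n$ below are arbitrary.

    # Check that [m(t/(sigma*sqrt(n)))]^n -> exp(t^2/2) when m(t) = cosh(t), sigma = 1.
    t <- 1.5
    for (n in c(10, 100, 10000)) {
      cat("n =", n, "  M_Zn(t) =", cosh(t / sqrt(n))^n, "\n")
    }
    exp(t^2 / 2)   # the limiting value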

The following theorem (which we do not prove) completes the proof of the central limit theorem under an additional hypothesis about moment generating functions.

Theorem 4 (Continuity theorem). Let $X_n$ be a sequence of random variables and $X$ a RV. Let $F_{X_n}$ and $F_X$ be their cdf's, and let $M_{X_n}$ and $M_X$ be their moment generating functions. Suppose there is an $a > 0$ such that $M_{X_n}(t)$ and $M_X(t)$ exist for $|t| < a$. Suppose that for $|t| < a$, $M_{X_n}(t) \to M_X(t)$. Then for all $x$ where $F_X(x)$ is continuous,
$$F_{X_n}(x) \to F_X(x)$$

The central limit theorem only requires that the random variables have a finite second moment, not that their mgf exist. To prove the CLT in this case we use characteristic functions instead of mgf's.

Definition 4. The characteristic function of a random variable $X$ is
$$\phi_X(t) = E[e^{itX}] = E[\cos(tX) + i\sin(tX)]$$
Since $|\cos(tX)|, |\sin(tX)| \le 1$, the characteristic function is defined for all $t$ for any random variable. For a continuous random variable with density $f_X(x)$,
$$\phi_X(t) = \int_{-\infty}^{\infty} e^{itx} f_X(x)\,dx$$
This is just the Fourier transform of $f_X(x)$. For a normal random variable the computation of the characteristic function is almost identical to that of the mgf.

We find for a standard normal random variable $X$ that $\phi_X(t) = \exp(-t^2/2)$. If $X$ and $Y$ are independent, then
$$\phi_{X+Y}(t) = \phi_X(t)\,\phi_Y(t)$$
We also note that for a constant $c$, the characteristic function of $cX$ is $\phi_{cX}(t) = \phi_X(ct)$.

Now let $X$ have the Cauchy distribution. An exercise in contour integration shows
$$\phi_X(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{ixt}}{x^2 + 1}\,dx = e^{-|t|}$$
Now let $X_n$ be an i.i.d. sequence of Cauchy random variables. So if we let $S_n$ be their sum, then
$$\phi_{S_n}(t) = e^{-n|t|}$$
This is the characteristic function of $n$ times a Cauchy random variable. So $\frac{1}{n} S_n$ is a Cauchy random variable. Note that the CLT would suggest looking at $\frac{1}{\sqrt{n}} \sum_{j=1}^n X_j$, but a Cauchy random variable has no finite mean or variance, so the CLT does not apply. Instead we see that if we rescale by $1/n$ rather than $1/\sqrt{n}$, then in the limit we get a Cauchy distribution. There are random variables other than Cauchy for which we also get convergence to the Cauchy distribution.

Convergence of Binomial to Poisson

8.6 Proof of strong law of large numbers

Borel-Cantelli
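
A simulation sketch of the failure of the law of large numbers for Cauchy random variables; the seed and sample sizes are arbitrary.

    # Running means of Cauchy random variables do not settle down: S_n/n is again Cauchy.
    set.seed(7)
    x <- rcauchy(100000)
    running_mean <- cumsum(x) / seq_along(x)
    running_mean[c(10, 100, 1000, 10000, 100000)]   # keeps jumping around
    # The quartiles of S_100/100 match those of a single Cauchy draw (-1 and 1):
    quantile(replicate(10000, mean(rcauchy(100))), c(0.25, 0.75))
    qcauchy(c(0.25, 0.75))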