Chapter 6: Variance, the law of large numbers and the Monte-Carlo method




Expected value, variance, and Chebyshev inequality. If X is a random variable, recall that the expected value of X, E[X], is the average value of X:

Expected value of X : E[X] = Σ_α α P(X = α)

The expected value measures only the average of X, and two random variables with the same mean can have very different behavior. For example, the random variable X with

P(X = +1) = 1/2, P(X = −1) = 1/2

and the random variable Y with

P(Y = +100) = 1/2, P(Y = −100) = 1/2

have the same mean

E[X] = E[Y] = 0.

To measure the spread of a random variable X, that is, how likely it is for the value of X to be far away from the mean, we introduce the variance of X, denoted by var(X). Let us consider the distance to the expected value, i.e., |X − E[X]|. It is more convenient to look at the square of this distance, (X − E[X])², to get rid of the absolute value, and the variance is then given by

Variance of X : var(X) = E[(X − E[X])²]
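As a quick numeric check, here is a minimal Python sketch (the helper names are mine) that computes the mean and variance of the two distributions above directly from the definitions; it confirms that X and Y share the mean 0 while their spreads differ enormously.

```python
# Mean and variance of a finite distribution given as (value, probability) pairs.

def mean(dist):
    return sum(p * x for x, p in dist)

def variance(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist)

X = [(+1, 0.5), (-1, 0.5)]
Y = [(+100, 0.5), (-100, 0.5)]

print(mean(X), variance(X))  # 0.0 1.0
print(mean(Y), variance(Y))  # 0.0 10000.0
```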

We summarize some elementary properties of expected value and variance in the following theorem.

Theorem 1. We have

1. For any two random variables X and Y, E[X + Y] = E[X] + E[Y].

2. For any real number a, E[aX] = aE[X].

3. For any real number c, E[X + c] = E[X] + c.

4. For any real number a, var(aX) = a² var(X).

5. For any real number c, var(X + c) = var(X).

Proof. 1. should be obvious: the sum of averages is the average of the sum. For 2. one notes that if X takes the value α with some probability, then the random variable aX takes the value aα with the same probability. 3. is a special case of 1. if we realize that E[c] = c. For 4. we use 2. and we have

var(aX) = E[(aX − E[aX])²] = E[a²(X − E[X])²] = a² E[(X − E[X])²] = a² var(X).

Finally, for 5. note that X + c − E[X + c] = X − E[X], and so the variance does not change.

Using these rules we can derive another formula for the variance:

var(X) = E[(X − E[X])²]
       = E[X² − 2X E[X] + E[X]²]
       = E[X²] + E[−2X E[X]] + E[E[X]²]
       = E[X²] − 2E[X]² + E[X]²
       = E[X²] − E[X]²

So we obtain

Variance of X : var(X) = E[(X − E[X])²] = E[X²] − E[X]²

Example: The 0-1 random variable. Suppose A is an event; the random variable X_A is given by

X_A = 1 if A occurs, 0 otherwise

and let us write p = P(A). Then we have

E[X_A] = 0 · P(X_A = 0) + 1 · P(X_A = 1) = 0 · (1 − p) + 1 · p = p.

To compute the variance, note that

X_A − E[X_A] = 1 − p if A occurs, −p otherwise

and so

var(X_A) = (−p)² P(X_A = 0) + (1 − p)² P(X_A = 1) = p²(1 − p) + (1 − p)² p = p(1 − p).

In summary we have

The 0-1 random variable : P(X = 1) = p, P(X = 0) = 1 − p, E[X] = p, var(X) = p(1 − p)

Chebyshev inequality: The Chebyshev inequality is a simple inequality which allows you to extract information about the values that X can take if you know only the mean and the variance of X.

Theorem 2. We have

1. Markov inequality. If X ≥ 0, i.e. X takes only nonnegative values, then for any a > 0 we have

P(X ≥ a) ≤ E[X]/a

2. Chebyshev inequality. For any random variable X and any ε > 0 we have

P(|X − E[X]| ≥ ε) ≤ var(X)/ε²

Proof. Let us first prove the Markov inequality. Pick a positive number a. Since X takes only nonnegative values, all terms in the sum giving the expectation are nonnegative, and we have

E[X] = Σ_α α P(X = α) ≥ Σ_{α ≥ a} α P(X = α) ≥ a Σ_{α ≥ a} P(X = α) = a P(X ≥ a)

and thus

P(X ≥ a) ≤ E[X]/a.

To prove the Chebyshev inequality we apply the Markov inequality to the random variable Y = (X − E[X])², which is nonnegative and has expected value

E[Y] = E[(X − E[X])²] = var(X).

We have then

P(|X − E[X]| ≥ ε) = P((X − E[X])² ≥ ε²) = P(Y ≥ ε²) ≤ E[Y]/ε² = var(X)/ε²   (1)
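The bound is easy to test by simulation. The following sketch (my own choice of example) compares the Chebyshev bound with the empirical tail probability for a fair die; the bound must hold, but it is typically far from tight.

```python
import random

# Empirical check of the Chebyshev inequality for a fair die:
# E[X] = 3.5, var(X) = 35/12, and we bound P(|X - 3.5| >= 2).
mu, var, eps = 3.5, 35 / 12, 2.0

n = 100_000
hits = sum(abs(random.randint(1, 6) - mu) >= eps for _ in range(n))

print("empirical tail:", hits / n)       # about 0.333 (only faces 1 and 6 qualify)
print("Chebyshev bound:", var / eps**2)  # about 0.729, valid but loose
```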

Independence and sums of random variables: Two random variables are independent if knowledge of Y does not influence the results of X, and vice versa. This can be expressed in terms of conditional probabilities: the (conditional) probability that Y takes a certain value, say β, does not change if we know that X takes a value, say α. In other words, Y is independent of X if

P(Y = β | X = α) = P(Y = β) for all α, β

But using the definition of conditional probability we find that

P(Y = β | X = α) = P(Y = β, X = α)/P(X = α) = P(Y = β)

or

P(Y = β, X = α) = P(X = α) P(Y = β).

This formula is symmetric in X and Y, and so if Y is independent of X, then X is also independent of Y, and we just say that X and Y are independent.

X and Y are independent if P(X = α, Y = β) = P(X = α) P(Y = β) for all α, β

Theorem 3. Suppose X and Y are independent random variables. Then we have

1. E[XY] = E[X]E[Y].

2. var(X + Y) = var(X) + var(Y).

Proof. If X and Y are independent we have

E[XY] = Σ_{α,β} αβ P(X = α, Y = β)
      = Σ_{α,β} αβ P(X = α) P(Y = β)
      = (Σ_α α P(X = α)) (Σ_β β P(Y = β))
      = E[X] E[Y]

To compute the variance of X + Y it is best to note that, by Theorem 1, part 5, the variance is unchanged if we translate the random variable. So we have, for example, var(X) = var(X − E[X]), and similarly for Y and X + Y. Without loss of generality we may therefore assume that E[X] = E[Y] = E[X + Y] = 0, so that var(X) = E[X²], etc. Then

var(X + Y) = E[(X + Y)²]
           = E[X² + 2XY + Y²]
           = E[X²] + E[Y²] + 2E[XY]
           = E[X²] + E[Y²] + 2E[X]E[Y]   (X, Y independent)
           = E[X²] + E[Y²]               (since E[X] = E[Y] = 0)
           = var(X) + var(Y)
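Both identities are easy to test empirically. A minimal sketch with two independent dice (my own choice of example) shows E[XY] ≈ E[X]E[Y] and var(X + Y) ≈ var(X) + var(Y).

```python
import random
import statistics

n = 200_000
xs = [random.randint(1, 6) for _ in range(n)]
ys = [random.randint(1, 6) for _ in range(n)]

# E[XY] vs E[X]E[Y]: both should be close to 3.5 * 3.5 = 12.25.
exy = statistics.fmean(x * y for x, y in zip(xs, ys))
print(exy, statistics.fmean(xs) * statistics.fmean(ys))

# var(X + Y) vs var(X) + var(Y): both close to 2 * 35/12 = 5.83...
print(statistics.pvariance([x + y for x, y in zip(xs, ys)]),
      statistics.pvariance(xs) + statistics.pvariance(ys))
```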

The Law of Large Numbers. Suppose we perform an experiment and a measurement encoded in the random variable X, and that we repeat this experiment n times, each time under the same conditions and each time independently of the others. We thus obtain n independent copies of the random variable X, which we denote

X_1, X_2, ..., X_n

Such a collection of random variables is called an IID sequence of random variables, where IID stands for independent and identically distributed. This means that the random variables X_i all have the same probability distribution. In particular they all have the same mean and variance:

E[X_i] = μ, var(X_i) = σ², i = 1, 2, ..., n

Each time we perform the experiment n times, the X_i provide n (random) measurements, and the average value

(X_1 + ... + X_n)/n

is called the empirical average. The Law of Large Numbers states that for large n the empirical average is very close to the expected value μ with very high probability.

Theorem 4. Let X_1, ..., X_n be IID random variables with E[X_i] = μ and var(X_i) = σ² for all i. Then we have

P(|(X_1 + ... + X_n)/n − μ| ≥ ε) ≤ σ²/(nε²)

In particular, the right-hand side goes to 0 as n → ∞.

Proof. The proof of the law of large numbers is a simple application of the Chebyshev inequality to the random variable (X_1 + ... + X_n)/n. Indeed, by the properties of expectation we have

E[(X_1 + ... + X_n)/n] = (1/n) E[X_1 + ... + X_n] = (1/n)(E[X_1] + ... + E[X_n]) = (1/n) nμ = μ

For the variance we use that the X_i are independent, and so we have

var((X_1 + ... + X_n)/n) = (1/n²) var(X_1 + ... + X_n) = (1/n²)(var(X_1) + ... + var(X_n)) = σ²/n

By the Chebyshev inequality we then obtain

P(|(X_1 + ... + X_n)/n − μ| ≥ ε) ≤ σ²/(nε²)
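A short sketch makes the theorem concrete: the empirical average of fair-coin flips (μ = 1/2) concentrates around 1/2 as n grows.

```python
import random

# Empirical averages of fair-coin flips for increasing n:
# the law of large numbers says they concentrate around mu = 0.5.
for n in (10, 100, 10_000, 1_000_000):
    avg = sum(random.random() < 0.5 for _ in range(n)) / n
    print(n, avg)
```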

Coin flip. Suppose we flip a fair coin 100 times. How likely is it to obtain between 40% and 60% heads? We consider the random variable X which is 1 if the coin lands on heads and 0 otherwise. We have μ = E[X] = 1/2 and σ² = var(X) = 1/4, and by Chebyshev

P(between 40 and 60 heads) = P(40 ≤ X_1 + ... + X_100 ≤ 60)
                           = P(4/10 ≤ (X_1 + ... + X_100)/100 ≤ 6/10)
                           = P(|(X_1 + ... + X_100)/100 − 1/2| ≤ 1/10)
                           = 1 − P(|(X_1 + ... + X_100)/100 − 1/2| > 1/10)
                           ≥ 1 − (1/4)/(100 (1/10)²) = 0.75   (2)

If we now flip a fair coin 1000 times, the probability to obtain between 40% and 60% heads can be estimated by

P(between 400 and 600 heads) = 1 − P(|(X_1 + ... + X_1000)/1000 − 1/2| > 1/10)
                             ≥ 1 − (1/4)/(1000 (1/10)²) = 0.975   (3)

Variance as a measure of risk: In many problems the variance can be interpreted as measuring how risky an investment is. As an example, let us put ourselves in the casino's shoes and try to figure out what is more risky for a casino: a player betting on red/black at roulette, or a player betting on numbers?

Suppose X is the player's win on a $1 bet on red or black. Then we have

E[X] = 18/38 − 20/38 = −2/38 and E[X²] = 18/38 + 20/38 = 1, so var(X) = 1 − (2/38)² ≈ 0.99.

Suppose Y is the player's win on a $1 bet on a number. Then

E[Y] = 35 · (1/38) − 37/38 = −2/38, and E[Y²] = 35² · (1/38) + 37/38 ≈ 33.21, so var(Y) ≈ 33.20.

It is obvious that the riskier bet is to bet on numbers. To estimate the risk taken by the casino, let us estimate, using the Chebyshev inequality, the probability that the casino actually loses money on n bets of, say, $1. This is

P(X_1 + ... + X_n > 0)

Using Chebyshev, and the fact that μ < 0, we have

P(X_1 + ... + X_n > 0) = P(X_1 + ... + X_n − nμ > −nμ)
                       ≤ P(|X_1 + ... + X_n − nμ| > n|μ|)
                       ≤ nσ²/(n²μ²) = σ²/(nμ²)   (4)

Since both bets have the same mean μ = −2/38, for bets on red/black this estimate of the probability that the casino loses money is around 33 times smaller than for bets on numbers. But of course, in any case, the probability that the casino loses at all is tiny, and in addition Chebyshev grossly overestimates these numbers.
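The lower bound 0.75 in (2) is quite pessimistic. A quick simulation sketch suggests the true probability of 40 to 60 heads in 100 fair flips is about 0.96.

```python
import random

# Estimate P(40 <= #heads <= 60) in 100 fair coin flips and
# compare with the Chebyshev lower bound 0.75 from (2).
trials = 100_000
ok = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(100))
    ok += 40 <= heads <= 60
print(ok / trials)  # about 0.96, well above the bound 0.75
```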

Probabilistic algorithms and the Monte-Carlo method: Under the name Monte-Carlo methods, one understands an algorithm which uses randomness and the LLN to compute a certain quantity which might have nothing to do with randomness. Such algorithms are becoming ubiquitous in many applications in statistics, computer science, physics and engineering. We will illustrate the ideas here with some very simple test examples. We start with a probabilistic algorithm which does not use the LLN at all, but uses probability in a surprising manner to make a decision.

Guessing the largest of two numbers: Suppose you pick two distinct integers A < B, let us say between 1 and 100. You can do this in any way you wish. You write the two numbers on two pieces of paper and put them face down. I then pick one of the two pieces of paper and look at the number on it. I should then decide whether this number is the largest of the two or not. We will describe an algorithm which returns the largest of the two with probability greater than 1/2, no matter how you picked the numbers.

To describe the algorithm, let O be the number I observe. I then pick a random number N between 1 and 100, for example uniformly, that is

P(N = n) = 1/100 with n = 1, 2, ..., 100.

I could pick N according to another distribution and it would still work. My answer is then simply:

If O > N, then I guess that O is the largest number.

If O ≤ N, then I switch and guess that the other, unobserved number is the largest.

To see how this works we distinguish three cases:

1. If N < A < B, then N < O, and thus picking O as the largest gives me a probability 1/2 to pick the largest.

2. If N ≥ B > A, then I decide to switch, and again I pick the largest with probability 1/2.

3. If A ≤ N < B, it gets interesting: if O = A, then N ≥ A and so I switch and pick B, which is the largest. On the other hand, if O = B, then N < O and so I guess that O = B is the largest and win. So I always win.

Using conditional probabilities, we find that

P(Win) = P(Win | N < A) P(N < A) + P(Win | A ≤ N < B) P(A ≤ N < B) + P(Win | B ≤ N) P(B ≤ N)
       = (1/2) P(N < A) + P(A ≤ N < B) + (1/2) P(B ≤ N)
       = 1/2 + (1/2) P(A ≤ N < B) > 1/2   (5)

since P(A ≤ N < B) > 0. For example, if N is uniformly distributed, we have

P(Win) = (1/2) (A − 1)/100 + (B − A)/100 + (1/2) (100 − B + 1)/100 = 1/2 + (1/2) (B − A)/100.

A short simulation of this strategy is given in the sketch after this paragraph.
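Here is a minimal sketch of the switching strategy (the function name is mine); it confirms the win probability 1/2 + (B − A)/200 for uniformly chosen N.

```python
import random

def guess_is_largest(a, b, trials=100_000):
    """Simulate the switching strategy for hidden numbers a < b."""
    wins = 0
    for _ in range(trials):
        observed, other = random.choice([(a, b), (b, a)])  # turn over a random paper
        n = random.randint(1, 100)                         # the random threshold N
        guess = observed if observed > n else other        # keep if O > N, else switch
        wins += guess == max(a, b)
    return wins / trials

# For A = 30, B = 70 the theory predicts 1/2 + (70 - 30)/200 = 0.70.
print(guess_is_largest(30, 70))
```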

Random numbers: A computer comes equipped with a random number generator (usually the command rand), which produces a number that is uniformly distributed in [0, 1]. We call such a number U, and it is characterized by the fact that

P(U ∈ [a, b]) = b − a for any interval [a, b] ⊂ [0, 1].

Every Monte-Carlo method should in principle be constructed from random numbers, so as to be easily implementable. For example, we can generate a 0-1 random variable X with P(X = 1) = p and P(X = 0) = 1 − p by using a random number. We simply set

X = 1 if U ≤ p, X = 0 if U > p

Then we have P(X = 1) = P(U ∈ [0, p]) = p.
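In Python the role of rand is played by random.random(); a two-line sketch of this construction:

```python
import random

def bernoulli(p):
    # X = 1 exactly when the uniform number U falls in [0, p].
    return 1 if random.random() <= p else 0

sample = [bernoulli(0.3) for _ in range(100_000)]
print(sum(sample) / len(sample))  # close to 0.3
```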

An algorithm to compute the number π: To compute the number π we draw a square with side length 1 and inscribe in it a circle of radius 1/2. The area of the square is 1, while the area of the circle is π/4. To compute π we generate a random point in the square. If the generated point is inside the circle we accept it, while if it is outside we reject it. Then we repeat the same experiment many times and expect, by the LLN, a proportion of accepted points equal to π/4. (Equivalently, by symmetry, we may test whether the point lies in the quarter disc U² + V² ≤ 1, which also has area π/4.) More precisely, the algorithm goes as follows:

1. Generate two random numbers U_1 and V_1; this is the same as generating a random point in the square [0, 1] × [0, 1].

2. If U_1² + V_1² ≤ 1, then set X_1 = 1, while if U_1² + V_1² > 1, set X_1 = 0.

3. Repeat the two previous steps to generate X_2, X_3, ..., X_n.

We have

P(X_1 = 1) = P(U_1² + V_1² ≤ 1) = (area of the quarter disc)/(area of the square) = π/4

and P(X_1 = 0) = 1 − π/4. We then have

E[X] = μ = π/4, var(X) = σ² = (π/4)(1 − π/4)

So using the LLN and Chebyshev we have

P(|(X_1 + ... + X_n)/n − π/4| ≥ ε) ≤ (π/4)(1 − π/4)/(nε²)

In order to get quantitative information, suppose we want to compute π with an accuracy of ±1/1000. This is the same as computing π/4 with an accuracy of ±1/4000, that is, we take ε = 1/4000. On the right-hand side we have the variance (π/4)(1 − π/4), which is a number we don't know. But we note that the function p(1 − p) on [0, 1] has its maximum at p = 1/2, where it equals 1/4, so we obtain

P(|(X_1 + ... + X_n)/n − π/4| ≥ 1/4000) ≤ (1/4) · 4000²/n = 4,000,000/n

That is, we need to run the algorithm 80 million times to make this probability 5/100.
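A direct sketch of this acceptance-rejection scheme in Python (the function name is mine):

```python
import random

def estimate_pi(n):
    # Accept the point (U, V) when it falls inside the quarter disc
    # U^2 + V^2 <= 1; the acceptance proportion estimates pi/4.
    accepted = 0
    for _ in range(n):
        u, v = random.random(), random.random()
        accepted += u * u + v * v <= 1
    return 4 * accepted / n

print(estimate_pi(1_000_000))  # about 3.14, fluctuating around pi
```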

The Monte-Carlo method to compute the integral ∫_a^b f(x) dx. We consider a bounded function f on the interval [a, b], and we wish to compute

I = ∫_a^b f(x) dx

Without loss of generality we can assume that f ≥ 0, otherwise we replace f by f + c for some constant c. Next we can also assume that f ≤ 1, otherwise we replace f by cf for a sufficiently small c. Finally, we may assume that a = 0 and b = 1, otherwise we make the change of variable y = (x − a)/(b − a). For example, suppose we want to compute the integral

∫_0^1 e^{sin(x³)} / (3(1 + 5x⁸)) dx

This cannot be done by hand, and so we need a numerical method. A standard method would be to use a Riemann sum, i.e., we divide the interval [0, 1] into n subintervals and set x_i = i/n; then we can approximate the integral by

∫_0^1 f(x) dx ≈ (1/n) Σ_{i=1}^n f(x_i)

that is, we approximate the area under the graph of f by the sum of the areas of rectangles with base length 1/n and height f(i/n).

We use instead a Monte-Carlo method. We note that

I = area under the graph of f

and we construct a 0-1 random variable X so that E[X] = I. We proceed as for computing π. More precisely, the algorithm goes as follows:

1. Generate two random numbers U_1 and V_1; this is the same as generating a random point in the square [0, 1] × [0, 1].

2. If V_1 ≤ f(U_1), then set X_1 = 1, while if V_1 > f(U_1), set X_1 = 0.

3. Repeat the two previous steps to generate X_2, X_3, ..., X_n.

We have

P(X = 1) = P(V ≤ f(U)) = (area under the graph of f)/(area of [0, 1] × [0, 1]) = I = ∫_0^1 f(x) dx

and so E[X] = I and var(X) = I(1 − I).
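A sketch of this acceptance-rejection integrator applied to the example integrand above (the function names are mine):

```python
import math
import random

def f(x):
    # The example integrand; note 0 <= f(x) <= e/3 < 1 on [0, 1].
    return math.exp(math.sin(x ** 3)) / (3 * (1 + 5 * x ** 8))

def mc_integral(n):
    # Accept the point (U, V) when it lies under the graph of f;
    # the acceptance proportion estimates I = integral of f over [0, 1].
    hits = 0
    for _ in range(n):
        u, v = random.random(), random.random()
        hits += v <= f(u)
    return hits / n

print(mc_integral(1_000_000))  # Monte-Carlo estimate of the integral
```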