Chapter 6: Variance, the law of large numbers and the Monte-Carlo method

Expected value, variance, and the Chebyshev inequality.

If X is a random variable, recall that the expected value of X, E[X], is the average value of X:

    Expected value of X:   E[X] = Σ_α α P(X = α)

The expected value measures only the average of X, and two random variables with the same mean can have very different behavior. For example, the random variable X with

    P(X = +1) = 1/2,   P(X = −1) = 1/2

and the random variable Y with

    P(Y = +100) = 1/2,   P(Y = −100) = 1/2

have the same mean: E[X] = E[Y] = 0.

To measure the spread of a random variable X, that is, how likely X is to take values far away from its mean, we introduce the variance of X, denoted by var(X). Consider the distance to the expected value, |X − E[X]|. It is more convenient to look at the square of this distance, (X − E[X])², to get rid of the absolute value, and the variance is then given by

    Variance of X:   var(X) = E[(X − E[X])²]

We summarize some elementary properties of expected value and variance in the following

Theorem 1. We have

1. For any two random variables X and Y, E[X + Y] = E[X] + E[Y].
2. For any real number a, E[aX] = aE[X].
3. For any real number c, E[X + c] = E[X] + c.
4. For any real number a, var(aX) = a² var(X).
5. For any real number c, var(X + c) = var(X).

Proof. Part 1 should be obvious: the sum of averages is the average of the sum. For part 2, one notes that if X takes the value α with some probability, then the random variable aX takes the value aα with the same probability. Part 3 is a special case of part 1 once we realize that E[c] = c. For part 4 we use part 2 and we have

    var(aX) = E[(aX − E[aX])²] = E[a²(X − E[X])²] = a² E[(X − E[X])²] = a² var(X).

Finally, for part 5, note that X + c − E[X + c] = X − E[X], and so the variance does not change.

Using these rules we can derive another formula for the variance:

    var(X) = E[(X − E[X])²]
           = E[X² − 2XE[X] + E[X]²]
           = E[X²] − 2E[X]·E[X] + E[X]²
           = E[X²] − E[X]²

So we obtain

    Variance of X:   var(X) = E[(X − E[X])²] = E[X²] − E[X]²
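As a quick sanity check on these formulas, here is a small Python sketch (our illustration, not part of the original notes) that computes the mean and variance of the two-point distributions X and Y from the example above, using both the definition E[(X − E[X])²] and the shortcut E[X²] − E[X]².

    # A discrete distribution is given as a dict {value: probability}.
    def mean(dist):
        return sum(x * p for x, p in dist.items())

    def variance(dist):
        mu = mean(dist)
        # Definition: E[(X - E[X])^2]
        return sum((x - mu) ** 2 * p for x, p in dist.items())

    def variance_shortcut(dist):
        # Equivalent formula: E[X^2] - E[X]^2
        return sum(x ** 2 * p for x, p in dist.items()) - mean(dist) ** 2

    X = {+1: 0.5, -1: 0.5}      # mean 0, variance 1
    Y = {+100: 0.5, -100: 0.5}  # same mean 0, variance 10000

    print(mean(X), variance(X), variance_shortcut(X))  # 0.0 1.0 1.0
    print(mean(Y), variance(Y), variance_shortcut(Y))  # 0.0 10000.0 10000.0

Both variance functions agree, and the huge difference between var(X) and var(Y) quantifies the different spreads of two variables with identical means.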
Example: the 0-1 random variable. Suppose A is an event. The random variable X_A is given by

    X_A = 1 if A occurs,   X_A = 0 otherwise,

and let us write p = P(A). Then we have

    E[X_A] = 0·P(X_A = 0) + 1·P(X_A = 1) = 0·(1 − p) + 1·p = p.

To compute the variance, note that

    X_A − E[X_A] = 1 − p if A occurs,   −p otherwise,

and so

    var(X_A) = (−p)²·P(X_A = 0) + (1 − p)²·P(X_A = 1) = p²(1 − p) + (1 − p)²·p = p(1 − p).

In summary we have

    The 0-1 random variable:   P(X = 1) = p,  P(X = 0) = 1 − p,  E[X] = p,  var(X) = p(1 − p)
Chebyshev inequality: The Chebyshev inequality is a simple inequality which allows you to extract information about the values that X can take if you know only the mean and the variance of X.

Theorem 2. We have

1. Markov inequality. If X ≥ 0, i.e. X takes only nonnegative values, then for any a > 0 we have

    P(X ≥ a) ≤ E[X]/a

2. Chebyshev inequality. For any random variable X and any ε > 0 we have

    P(|X − E[X]| ≥ ε) ≤ var(X)/ε²

Proof. Let us first prove the Markov inequality. Pick a positive number a. Since X takes only nonnegative values, all terms in the sum giving the expectation are nonnegative, and we have

    E[X] = Σ_α α P(X = α) ≥ Σ_{α ≥ a} α P(X = α) ≥ a Σ_{α ≥ a} P(X = α) = a P(X ≥ a),

and thus

    P(X ≥ a) ≤ E[X]/a.

To prove the Chebyshev inequality we apply the Markov inequality to the random variable Y = (X − E[X])², which is nonnegative and has expected value

    E[Y] = E[(X − E[X])²] = var(X).

We have then

    P(|X − E[X]| ≥ ε) = P((X − E[X])² ≥ ε²) = P(Y ≥ ε²) ≤ E[Y]/ε² = var(X)/ε²
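To see how conservative this bound can be, the following sketch (our illustration, not from the notes) simulates rolls of a fair die, for which E[X] = 3.5 and var(X) = 35/12, and compares the empirical value of P(|X − E[X]| ≥ ε) with the Chebyshev bound var(X)/ε².

    import random

    mu, var = 3.5, 35 / 12   # mean and variance of one fair die roll
    eps = 2
    trials = 100_000
    hits = sum(abs(random.randint(1, 6) - mu) >= eps for _ in range(trials))

    # Only the rolls 1 and 6 deviate from 3.5 by at least 2, so the true
    # probability is 1/3, while Chebyshev only guarantees it is below 0.73.
    print("empirical:", hits / trials)        # ≈ 0.333
    print("Chebyshev bound:", var / eps**2)   # ≈ 0.729

The bound holds, but it overshoots the true probability by more than a factor of two; Chebyshev trades sharpness for complete generality.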
Independence and sums of random variables: Two random variables are independent if knowledge of Y does not influence the results of X, and vice versa. This can be expressed in terms of conditional probabilities: the (conditional) probability that Y takes a certain value, say β, does not change if we know that X takes a value, say α. In other words, Y is independent of X if

    P(Y = β | X = α) = P(Y = β)   for all α, β.

But using the definition of conditional probability we find that

    P(Y = β | X = α) = P(Y = β, X = α)/P(X = α) = P(Y = β),

or

    P(Y = β, X = α) = P(X = α)P(Y = β).

This formula is symmetric in X and Y, so if Y is independent of X then X is also independent of Y, and we just say that X and Y are independent.

    X and Y are independent if   P(Y = β, X = α) = P(X = α)P(Y = β)   for all α, β

Theorem 3. Suppose X and Y are independent random variables. Then we have

1. E[XY] = E[X]E[Y].
2. var(X + Y) = var(X) + var(Y).

Proof. If X and Y are independent we have

    E[XY] = Σ_{α,β} αβ P(X = α, Y = β)
          = Σ_{α,β} αβ P(X = α)P(Y = β)
          = (Σ_α α P(X = α)) (Σ_β β P(Y = β)) = E[X]E[Y].

To compute the variance of X + Y it is best to note that, by Theorem 1, part 5, the variance is unchanged if we translate the random variable. So we have, for example, var(X) = var(X − E[X]), and similarly for Y and X + Y. Without loss of generality we may therefore assume that E[X] = E[Y] = E[X + Y] = 0, in which case var(X) = E[X²], etc. Then

    var(X + Y) = E[(X + Y)²]
               = E[X² + 2XY + Y²]
               = E[X²] + E[Y²] + 2E[XY]
               = E[X²] + E[Y²] + 2E[X]E[Y]   (X, Y independent)
               = E[X²] + E[Y²]               (since E[X] = E[Y] = 0)
               = var(X) + var(Y)
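A short simulation (again our own sketch, with arbitrary sample sizes) illustrating part 2 of Theorem 3: for two independent die rolls X and Y, the sample variance of X + Y comes out close to var(X) + var(Y) = 35/12 + 35/12 ≈ 5.83.

    import random

    n = 200_000
    xs = [random.randint(1, 6) for _ in range(n)]
    ys = [random.randint(1, 6) for _ in range(n)]

    def sample_var(data):
        m = sum(data) / len(data)
        return sum((v - m) ** 2 for v in data) / len(data)

    # Each die has variance 35/12 ≈ 2.917; by independence the sum
    # should have variance 35/6 ≈ 5.833.
    print(sample_var(xs), sample_var(ys))               # ≈ 2.92 each
    print(sample_var([x + y for x, y in zip(xs, ys)]))  # ≈ 5.83

Note that additivity would fail without independence: taking Y = X gives var(2X) = 4 var(X), not 2 var(X).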
The Law of Large Numbers

Suppose we perform an experiment with a measurement encoded in the random variable X, and that we repeat this experiment n times, each time under the same conditions and each time independently of the others. We thus obtain n independent copies of the random variable X, which we denote

    X_1, X_2, ..., X_n.

Such a collection of random variables is called an IID sequence of random variables, where IID stands for independent and identically distributed. This means that the random variables X_i all have the same probability distribution; in particular they all have the same mean and variance:

    E[X_i] = µ,   var(X_i) = σ²,   i = 1, 2, ..., n.

Each time we perform the experiment, X_i provides a (random) measurement, and the average value

    (X_1 + ... + X_n)/n

is called the empirical average. The Law of Large Numbers states that for large n the empirical average is very close to the expected value µ with very high probability.

Theorem 4. Let X_1, ..., X_n be IID random variables with E[X_i] = µ and var(X_i) = σ² for all i. Then we have

    P(|(X_1 + ... + X_n)/n − µ| ≥ ε) ≤ σ²/(nε²).

In particular, the right-hand side goes to 0 as n → ∞.

Proof. The proof of the Law of Large Numbers is a simple application of the Chebyshev inequality to the random variable (X_1 + ... + X_n)/n. Indeed, by the properties of expectation we have

    E[(X_1 + ... + X_n)/n] = (1/n) E[X_1 + ... + X_n] = (1/n)(E[X_1] + ... + E[X_n]) = (1/n)·nµ = µ.

For the variance we use that the X_i are independent, and so we have

    var((X_1 + ... + X_n)/n) = (1/n²) var(X_1 + ... + X_n) = (1/n²)(var(X_1) + ... + var(X_n)) = σ²/n.

By the Chebyshev inequality we then obtain

    P(|(X_1 + ... + X_n)/n − µ| ≥ ε) ≤ σ²/(nε²).
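The theorem is easy to see numerically. The sketch below (ours, with arbitrary choices of n) prints the empirical average of n fair die rolls for increasing n; since var of the average is σ²/n, the typical deviation from µ = 3.5 shrinks like σ/√n.

    import random

    mu = 3.5  # expected value of one fair die roll
    for n in [10, 100, 10_000, 1_000_000]:
        avg = sum(random.randint(1, 6) for _ in range(n)) / n
        print(n, avg, abs(avg - mu))

A typical run shows the deviation dropping by roughly a factor of 10 each time n grows by a factor of 100, consistent with the 1/√n behavior.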
Coin flip: Suppose we flip a fair coin 100 times. How likely is it to obtain between 40% and 60% heads? We consider the random variable X which is 1 if the coin lands on heads and 0 otherwise. We have µ = E[X] = 1/2 and σ² = var(X) = 1/4, and by Chebyshev

    P(between 40 and 60 heads) = P(40 ≤ X_1 + ... + X_100 ≤ 60)
                               = P(4/10 ≤ (X_1 + ... + X_100)/100 ≤ 6/10)
                               = P(|(X_1 + ... + X_100)/100 − 1/2| ≤ 1/10)
                               = 1 − P(|(X_1 + ... + X_100)/100 − 1/2| > 1/10)
                               ≥ 1 − (1/4)/(100·(1/10)²) = 0.75.

If we now flip a fair coin 1000 times, the probability to obtain between 40% and 60% heads can be estimated by

    P(between 400 and 600 heads) = P(|(X_1 + ... + X_1000)/1000 − 1/2| ≤ 1/10)
                                 = 1 − P(|(X_1 + ... + X_1000)/1000 − 1/2| > 1/10)
                                 ≥ 1 − (1/4)/(1000·(1/10)²) = 0.975.
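The Chebyshev lower bound of 0.75 is quite pessimistic here. A direct simulation (our sketch) suggests the true probability of between 40 and 60 heads in 100 flips is about 0.96.

    import random

    trials = 100_000
    wins = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(100))
        wins += 40 <= heads <= 60

    # ≈ 0.96, well above the Chebyshev guarantee of 0.75.
    print(wins / trials)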
Variance as a measure of risk: In many problems the variance can be interpreted as measuring how risky an investment is. As an example, let us put ourselves in the casino's shoes and try to figure out what is more risky for a casino: a player betting on red/black at roulette, or a player betting on numbers?

Suppose X is the win on a $1 bet on red or black (of the 38 pockets, 18 win even money). Then we have

    E[X] = 1·(18/38) − 1·(20/38) = −2/38,   E[X²] = 1·(18/38) + 1·(20/38) = 1,

so var(X) = 1 − (2/38)² ≈ 0.99.

Suppose Y is the win on a $1 bet on a single number (which pays 35 to 1). Then

    E[Y] = 35·(1/38) − 1·(37/38) = −2/38,   E[Y²] = 35²·(1/38) + 1·(37/38) ≈ 33.21,

so var(Y) ≈ 33.20. The two bets have the same expected win, but it is clear that the riskier bet is the bet on numbers.

To estimate the risk taken by the casino, let us estimate, using the Chebyshev inequality, the probability that the casino actually loses money on n bets of, say, $1. This is

    P(X_1 + ... + X_n > 0).

Using Chebyshev (with µ = −2/38 < 0) we have

    P(X_1 + ... + X_n > 0) = P((X_1 + ... + X_n)/n − µ > −µ)
                           ≤ P(|(X_1 + ... + X_n)/n − µ| > |µ|)
                           ≤ σ²/(nµ²).

So for bets on red/black the estimate on the probability that the casino loses money is around 33 times smaller than for bets on numbers. But of course, in any case, the probability that the casino loses at all is tiny, and in addition Chebyshev grossly overestimates these numbers.
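A simulation (our own sketch, with an arbitrary choice of n and number of sessions) makes the difference in risk concrete: over sessions of n = 10,000 one-dollar bets, both bet types cost the players about n·2/38 ≈ $526 on average, but the totals against single-number bettors fluctuate far more.

    import random

    def red_black_win():
        # 18 of the 38 pockets win even money.
        return 1 if random.randrange(38) < 18 else -1

    def number_win():
        # 1 of the 38 pockets pays 35 to 1.
        return 35 if random.randrange(38) < 1 else -1

    n = 10_000
    for bet in (red_black_win, number_win):
        totals = [sum(bet() for _ in range(n)) for _ in range(20)]
        # Same mean ≈ -526 per session, much wider spread for number bets.
        print(bet.__name__, min(totals), max(totals))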
Probabilistic algorithms and the Monte-Carlo method: Under the name Monte-Carlo methods one understands algorithms which use randomness and the LLN to compute a certain quantity which might have nothing to do with randomness. Such algorithms are becoming ubiquitous in many applications in statistics, computer science, physics and engineering. We will illustrate the ideas here with some very simple test examples. We start with a probabilistic algorithm which does not use the LLN at all, but uses probability in a surprising manner to make a decision.

Guessing the largest of two numbers: Suppose you pick two distinct integers A < B, let us say between 1 and 100. You can do this in any way you wish. You write the two numbers on two pieces of paper and put them face down. I then pick one of the two pieces of paper and look at the number on it, and I should then decide whether this number is the largest of the two or not. We will describe an algorithm which returns the largest of the two with probability strictly greater than 1/2, no matter how you picked the numbers.

To describe the algorithm, let O be the number I observe. I then pick a random number N between 1 and 100, for example uniformly, that is,

    P(N = n) = 1/100,   n = 1, 2, ..., 100.

(I could pick N according to another distribution and the argument would still work.) My answer is then simply:

- If O > N, I guess that O is the largest number.
- If O ≤ N, I switch and guess that the other, unobserved number is the largest.

To see why this works we distinguish three cases (a simulation follows after the case analysis):

1. If N < A < B, then N < O whichever paper I picked, so I keep O as my guess; this gives me probability 1/2 of picking the largest (namely when O = B).

2. If N ≥ B > A, then O ≤ N whichever paper I picked, so I switch, and again I pick the largest with probability 1/2 (namely when O = A).

3. If A ≤ N < B, it gets interesting: if O = A then O ≤ N, and so I switch and pick B, which is the largest. On the other hand, if O = B then N < O, and so I guess that O = B is the largest and win. So in this case I always win.

Using conditional probabilities, we find that

    P(Win) = P(Win | N < A)P(N < A) + P(Win | A ≤ N < B)P(A ≤ N < B) + P(Win | B ≤ N)P(B ≤ N)
           = (1/2)P(N < A) + P(A ≤ N < B) + (1/2)P(B ≤ N)
           > 1/2.

For example, if N is uniformly distributed, we have

    P(Win) = (1/2)·(A − 1)/100 + (B − A)/100 + (1/2)·(101 − B)/100 = 1/2 + (1/2)·(B − A)/100.
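Here is a simulation of the strategy (our sketch; the choice A = 30, B = 70 is arbitrary, and any fixed pair gives the same formula), confirming the win probability 1/2 + (B − A)/200.

    import random

    def play(A, B):
        O, other = random.choice([(A, B), (B, A)])  # observe one paper at random
        N = random.randint(1, 100)
        guess = O if O > N else other  # keep O if O > N, otherwise switch
        return guess == B              # win if we named the larger number

    A, B = 30, 70
    trials = 200_000
    wins = sum(play(A, B) for _ in range(trials))

    print(wins / trials)        # ≈ 0.70
    print(0.5 + (B - A) / 200)  # predicted: 0.70

The closer together A and B are, the smaller the edge, but it always stays strictly above 1/2.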
Random numbers: A computer comes equipped with a random number generator (usually the command rand) which produces a number uniformly distributed in [0, 1]. We call such a number U, and it is characterized by the fact that

    P(U ∈ [a, b]) = b − a   for any interval [a, b] ⊆ [0, 1].

Every Monte-Carlo method should in principle be constructed from random numbers, so as to be easily implementable. For example, we can generate a 0-1 random variable X with P(X = 1) = p and P(X = 0) = 1 − p by using one random number: we simply set

    X = 1 if U ≤ p,   X = 0 if U > p.

Then we have P(X = 1) = P(U ∈ [0, p]) = p.

An algorithm to compute the number π: To compute the number π we draw a square with side length 1 and inscribe in it a circle of radius 1/2. The area of the square is 1, while the area of the circle is π/4. To compute π we generate a random point in the square: if the point falls inside the circle we accept it, while if it falls outside we reject it. We then repeat the same experiment many times and expect, by the LLN, the proportion of accepted points to be close to π/4. (Equivalently, and slightly more simply, we may test whether the point lands in the quarter disk of radius 1 centered at the corner (0, 0), which also has area π/4; this is what the algorithm below does.)

More precisely, the algorithm goes as follows:

- Generate two random numbers U_1 and V_1; this is the same as generating a random point in the square [0, 1] × [0, 1].
- If U_1² + V_1² ≤ 1, set X_1 = 1, while if U_1² + V_1² > 1, set X_1 = 0.
- Repeat the two previous steps to generate X_2, X_3, ..., X_n.

We have

    P(X = 1) = P(U² + V² ≤ 1) = (area of the quarter disk)/(area of the square) = π/4

and P(X = 0) = 1 − π/4. We have then

    E[X] = µ = π/4,   var(X) = σ² = (π/4)(1 − π/4).

So using the LLN and Chebyshev we have

    P(|(X_1 + ... + X_n)/n − π/4| ≥ ε) ≤ (π/4)(1 − π/4)/(nε²).

To get quantitative information, suppose we want to compute π with an accuracy of ±1/1000. This is the same as computing π/4 with an accuracy of ±1/4000, so we take ε = 1/4000. On the right-hand side we have the variance (π/4)(1 − π/4), which is a number we don't know; but we note that the function p(1 − p) on [0, 1] has its maximum at p = 1/2, where it equals 1/4, so we can obtain

    P(|(X_1 + ... + X_n)/n − π/4| ≥ 1/4000) ≤ (1/4)·4000²/n = 4,000,000/n.

That is, we need to run the algorithm 80 million times to make this probability 5/100.
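Here is a minimal implementation of this algorithm (our sketch; n = 1,000,000 is far below the 80 million needed for the guarantee above, so expect only two or three correct digits).

    import random

    n = 1_000_000
    accepted = 0
    for _ in range(n):
        u, v = random.random(), random.random()  # random point in [0,1] x [0,1]
        if u * u + v * v <= 1:                   # inside the quarter disk
            accepted += 1

    # The accepted fraction estimates pi/4, so multiply by 4.
    print(4 * accepted / n)  # ≈ 3.14, with typical error of order 1/sqrt(n)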
The Monte-Carlo method to compute the integral ∫_a^b f(x) dx: We consider a bounded function f on the interval [a, b] and we wish to compute

    I = ∫_a^b f(x) dx.

Without loss of generality we can assume that f ≥ 0 (otherwise we replace f by f + c for a suitable constant c), that f ≤ 1 (otherwise we replace f by cf for a sufficiently small constant c), and that a = 0 and b = 1 (otherwise we make the change of variables y = (x − a)/(b − a)).

For example, suppose we want to compute the integral

    ∫_0^1 e^{sin(x³)} / (3(1 + 5x⁸)) dx.

This cannot be done by hand, so we need a numerical method. A standard method would be to use a Riemann sum: we divide the interval [0, 1] into n subintervals, set x_i = i/n, and approximate the integral by

    ∫_0^1 f(x) dx ≈ (1/n) Σ_{i=1}^n f(x_i),

that is, we approximate the area under the graph of f by the sum of the areas of rectangles of base length 1/n and height f(i/n).

We use instead a Monte-Carlo method. We note that

    I = area under the graph of f

and we construct a 0-1 random variable X so that E[X] = I. We proceed as for computing π. More precisely, the algorithm goes as follows:

- Generate two random numbers U_1 and V_1; this is the same as generating a random point in the square [0, 1] × [0, 1].
- If V_1 ≤ f(U_1), set X_1 = 1, while if V_1 > f(U_1), set X_1 = 0.
- Repeat the two previous steps to generate X_2, X_3, ..., X_n.

We have

    P(X = 1) = P(V ≤ f(U)) = (area under the graph of f)/(area of [0, 1] × [0, 1]) = I = ∫_0^1 f(x) dx,

and so E[X] = I and var(X) = I(1 − I).
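Since var(X) = I(1 − I) ≤ 1/4, the same Chebyshev analysis as for π applies here. Below is a sketch of this hit-or-miss algorithm applied to the example integral (our code; note that on [0, 1] this f satisfies 0 ≤ f ≤ e/3 < 1, so no rescaling is needed). For comparison it also prints the Riemann sum with the same number of function evaluations.

    import math
    import random

    def f(x):
        return math.exp(math.sin(x ** 3)) / (3 * (1 + 5 * x ** 8))

    n = 1_000_000

    # Monte-Carlo (hit-or-miss): count points (U, V) falling under the graph.
    hits = sum(random.random() <= f(random.random()) for _ in range(n))
    print("Monte-Carlo:", hits / n)

    # Riemann sum with n rectangles, for comparison.
    print("Riemann sum:", sum(f(i / n) for i in range(1, n + 1)) / n)

For a smooth one-dimensional integrand the Riemann sum is far more accurate at equal cost; the appeal of the Monte-Carlo approach is that its 1/√n error rate is unchanged in higher dimensions, where grid methods become impractical.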