Normal Approximation to Binomial Distributions

Charlie Vollmer
Department of Statistics, Colorado State University, Fort Collins, CO
charlesv@rams.colostate.edu
September 18, 2015

Abstract

This document is a supplement to class lectures for STAT 307-003, Fall 2015. It details how the Normal Distribution can approximate the Binomial Distribution as the number of trials, n, gets large. How large does n need to be? How well does the Normal Distribution approximate a Binomial Distribution? Let us find out...

1 Setup: Defining some terms

1.1 Expected Value

If we go to Wikipedia, the following is the very first sentence that we'll see:

    In probability theory, the expected value of a random variable is intuitively the long-run average value of repetitions of the experiment it represents.

Great! It is simply the average of what we would see if we did something over and over and over again!

And if we go down a few more sentences on the Wikipedia page, we find something even more useful:

    The expected value is also known as the expectation, mathematical expectation, EV, mean, or first moment.

Bam! Look at that fourth synonym: the mean! That is exactly the average I would expect to see if I did an experiment over and over and over lots of times!

Note: if a Random Variable is Binomially Distributed with n trials and success probability p, its mean is np.
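To make the "long-run average" idea concrete, here is a minimal simulation sketch. It is not part of the original lecture material: the function names `count_heads` and `long_run_average`, the seed, and the number of repetitions are all illustrative assumptions, and Python's `random` module stands in for a real coin.

```python
import random

def count_heads(n, p, rng):
    """Toss a coin n times; return how many tosses came up heads."""
    return sum(rng.random() < p for _ in range(n))

def long_run_average(n, p, repetitions, seed=0):
    """Average number of heads over many repetitions of the n-toss experiment."""
    rng = random.Random(seed)  # fixed seed (arbitrary choice) so the run is repeatable
    total = sum(count_heads(n, p, rng) for _ in range(repetitions))
    return total / repetitions

n, p = 100, 0.5
average = long_run_average(n, p, repetitions=2000)
print(f"long-run average: {average:.2f}, np = {n * p}")
```

With a couple of thousand repetitions the printed average should land close to np = 50, which is the sense in which the mean is what we "expect".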

1.2 Standard Error, SE

If we go to Wikipedia, the following are the first two sentences that we'll see:

    The standard error (SE) is the standard deviation of the sampling distribution of a statistic, most commonly of the mean. The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate.

OK, this is wordy, but it's actually very accurate and descriptive. It's just saying that the standard error, SE, is the standard deviation of our statistic. So... if our statistic is

\[ S_n = \sum_{i=1}^{n} X_i, \]

where $X_i$ is simply a 1 or 0, in the case of a coin toss (heads or tails), then the SE of this statistic is its standard deviation.

Now, we know (from class) that this statistic, $S_n$, is a Binomially Distributed Random Variable (it follows a Binomial Distribution). In the case of a binomial, we (humans... and now you, too!) know that the variance of a Binomially Distributed Random Variable is simply npq, where q = 1 - p.

Do you remember how to find the standard deviation from the variance? (Take the square root!) Well, if you need the SE, it's just the standard deviation. So, now we know how to get the variance of a binomial, which means we have the standard deviation or, in other words, the standard error: $SE = \sqrt{npq}$.
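As a quick check on the step from variance to standard error, the sketch below computes the square root of npq and compares it with the standard deviation of simulated head counts. The helper name `binomial_se`, the seed, and the 2000 simulated experiments are assumptions made for illustration only.

```python
import math
import random
import statistics

def binomial_se(n, p):
    """Standard error of S_n: the square root of the variance npq."""
    q = 1 - p
    return math.sqrt(n * p * q)

# Simulate 2000 experiments of n = 100 fair tosses (seed chosen arbitrarily).
rng = random.Random(1)
n, p = 100, 0.5
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(2000)]
empirical_sd = statistics.pstdev(draws)

print(f"sqrt(npq) = {binomial_se(n, p)}")
print(f"standard deviation of simulated head counts = {empirical_sd:.2f}")
```

For n = 100 and p = 1/2 the formula gives exactly 5, and the simulated standard deviation should come out near that value.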

2 The Approximation: Toss a coin 100 times

The object of this section is to illustrate that the histogram of the number of heads from repeated coin tosses is well approximated by a Normal Distribution as the number of tosses, n, gets large.

For instance, say that we toss a fair coin 100 times and see how many times we get heads. We could do this and get 88 heads. We could also do it and get 45 heads. Let's say we do the entire experiment (toss the coin 100 times) 50 times. Thus, we'll get 50 different numbers. Let's see what that plot looks like:

[Figure: histogram of the number of heads in 50 experiments of 100 tosses each. x-axis: Number of Heads; y-axis: count.]

It looks like on one experiment we got 40 heads and on another we got 41 heads. On another experiment we got 61 heads. Yet again, on 6 experiments we got 48 heads, and on 6 more experiments we got 59 heads. You get the picture.

So, in this situation, we only did this experiment (toss a coin 100 times) 50 times. And the plot above shows our results from those 50 experiments. What happens if we did this experiment 100 times? Or a thousand times?? Let us see...
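The 50-experiment setup described above can be sketched in a few lines. This is not the code that produced the plot in these notes; `run_experiments` and `text_histogram` are hypothetical helpers, the seed is arbitrary, and the "histogram" is drawn with `#` characters rather than a plotting library.

```python
import random
from collections import Counter

def run_experiments(n_tosses, n_experiments, p=0.5, seed=2015):
    """Repeat the n_tosses-toss experiment n_experiments times; return the head counts."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n_tosses))
            for _ in range(n_experiments)]

def text_histogram(results):
    """One row per possible head count between the smallest and largest observed."""
    counts = Counter(results)
    rows = []
    for heads in range(min(results), max(results) + 1):
        rows.append(f"{heads:3d} | {'#' * counts[heads]}")
    return "\n".join(rows)

results = run_experiments(n_tosses=100, n_experiments=50)
print(text_histogram(results))
```

Changing the seed gives a different set of 50 numbers each time, which is exactly why a single plot of 50 experiments looks so ragged.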

Below, we see what happens when we do this experiment 500, 1000, and 5000 times (alongside the original 50):

[Figure: four histograms of the number of heads in 100 tosses, one panel per number of experiments: fifty, five_hundred, thousand, five_thousand.]

Whoa! We see that our histograms start to look like a bell curve! Clearly, this is no coincidence! This is because a Binomial Random Variable begins to look like a Normally Distributed Random Variable as the number of trials, n, grows large!

Careful!! Take notice that we did NOT increase n yet, only the number of times that we did the experiment! So, now if we increase n, we would expect to see this bell-shaped-looking curve actually start to look more and more like a Normal Distribution. As of now, you can notice that it doesn't quite look like a normal distribution, but rather just a similar-looking curve.
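The "Careful!!" point can be checked numerically: repeating the 100-toss experiment more often pulls the histogram toward the exact Binomial distribution with n = 100, since n itself has not changed. The sketch below is an illustration under assumptions (the helper names `binomial_pmf` and `max_gap` and the seed are made up here); it measures the largest difference between the observed relative frequencies and the exact binomial probabilities.

```python
import math
import random
from collections import Counter

def binomial_pmf(n, k, p=0.5):
    """Exact probability of k heads in n tosses."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def max_gap(n_tosses, n_experiments, p=0.5, seed=0):
    """Largest difference between observed relative frequencies and the exact pmf."""
    rng = random.Random(seed)
    counts = Counter(sum(rng.random() < p for _ in range(n_tosses))
                     for _ in range(n_experiments))
    return max(abs(counts[k] / n_experiments - binomial_pmf(n_tosses, k, p))
               for k in range(n_tosses + 1))

# Same n = 100 tosses each time; only the number of experiments changes.
gaps = {m: max_gap(100, m) for m in (50, 500, 5000)}
for m, gap in gaps.items():
    print(f"{m:5d} experiments: largest gap = {gap:.4f}")
```

The gap should shrink as the number of experiments grows, while the target shape stays that of a Binomial with n = 100.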

3 The Approximation: Toss a coin 1000 times

Now, we do the same thing as above, but each experiment is tossing the coin 1000 times. What do you think this does to the Expected Value?

Ponder this question: Is it easier to get all heads if I only toss the coin 10 times? Would it be harder to get all heads if I tossed the coin 1000 times?

These questions have us think about the expected value and the standard error. As we do more and more trials, do we expect the mean of our sample to get closer to the true mean more often?

So, let's do the experiment where we toss the coin 1,000 times. And let's do this experiment 50 times, as we did before. These are our results:

[Figure: histogram of the number of heads in 50 experiments of 1000 tosses each. x-axis: Number of Heads; y-axis: count.]

And we see that it's centered around 500 heads, as per our intuition of the outcome, and goes from around 450 heads in some experiments to about 550 heads in others. Does it look, upon quick glance, about the same as our first plot??
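Returning to the question posed in this section about getting all heads: for a fair coin with independent tosses (the assumption used throughout these notes), the probability can be written down directly.

\[ P(\text{all heads in } n \text{ tosses}) = \left(\tfrac{1}{2}\right)^{n}, \qquad \left(\tfrac{1}{2}\right)^{10} = \tfrac{1}{1024} \approx 0.00098, \qquad \left(\tfrac{1}{2}\right)^{1000} \approx 9.3 \times 10^{-302}. \]

So all heads is unlikely in 10 tosses and essentially impossible in 1000: extreme outcomes become far harder to reach as n grows.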

4 Examine the difference between n = 100 and n = 1000:

As per the first section of these notes, we know what the variance of a Binomially Distributed Random Variable is: npq. So, if we look at our two different situations, we see that our variances are:

\[ \mathrm{Var}(S_n) = npq = 100 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 25 \]

in our first context of n = 100, and we have:

\[ \mathrm{Var}(S_n) = npq = 1000 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 250 \]

in our second context of n = 1000.

Careful! What we care about is our standard error, SE! We actually have that our standard errors are:

\[ \sqrt{\mathrm{Var}(S_n)} = \sqrt{npq} = \sqrt{100 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}} = \sqrt{25} = 5 \]

in our first context of n = 100, and we have:

\[ \sqrt{\mathrm{Var}(S_n)} = \sqrt{npq} = \sqrt{1000 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}} = \sqrt{250} \approx 15.8 \]

in our second context of n = 1000.

Take a second to examine this further... this is actually striking! We know that most (about 95%) of our data will lie within 2 standard deviations (standard errors) of the mean in the context of a Normal Distribution. And here, that means in our first context that most will lie between 40 and 60 heads, while it will be between roughly 470 and 530 in the second.

However!... An interval of length 20 is actually 20% of the possible values in the first context (since we could get anywhere between 0 and 100 heads in 100 coin tosses), and an interval of length roughly 60 is only about 6% of the possible values in the second context (since we could get anywhere between 0 and 1000 heads if we flip a coin 1000 times). That means our distribution is MUCH tighter about the mean when we made 1000 tosses (as n got larger) than when we only made 100 tosses.
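The arithmetic behind the "20% versus 6%" comparison can be reproduced with a short sketch. The function names `two_se_interval` and `relative_width` are illustrative assumptions, and the width is divided by n (the range 0 to n), following the reasoning in the text above.

```python
import math

def two_se_interval(n, p=0.5):
    """Mean minus and plus two standard errors for the number of heads."""
    mean = n * p
    se = math.sqrt(n * p * (1 - p))
    return mean - 2 * se, mean + 2 * se

def relative_width(n, p=0.5):
    """Width of the two-SE interval as a fraction of the range 0..n."""
    low, high = two_se_interval(n, p)
    return (high - low) / n

for n in (100, 1000):
    low, high = two_se_interval(n)
    print(f"n = {n}: about {low:.1f} to {high:.1f} heads, "
          f"{relative_width(n):.1%} of the range 0..{n}")
```

For a fair coin the relative width works out to 2 divided by the square root of n, so it keeps shrinking as n grows even though the interval itself gets longer.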

5 Visualize 1000 tosses:

Let's see what it looks like when we do the 1000-toss experiment many times. Below is for 50, 500, 1000, and 5000 experiments of 1000 tosses:

[Figure: four histograms of the number of heads in 1000 tosses, one panel per number of experiments: fifty, five_hundred, thousand, five_thousand.]

The important thing to look at is the five-thousand-experiment plot in the lower-right corner. If we compare this to the same plot in the previous 100-toss experiment, this should look more similar to a Normal Distribution. Let's see as n gets even larger...

6 As n gets larger and larger:

We see what happens when n = 10,000 below:

[Figure: histogram. x-axis: Number of Heads; y-axis: count.]

And again for n = 100,000:

[Figure: histogram. x-axis: Number of Heads; y-axis: count.]

And this looks pretty Normal to me!

Note: In fact... we can check that this is extremely close to a Normal Curve.
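One possible way to carry out the check mentioned in the note is sketched below; it is an assumption about what such a check could look like, not the method used to produce these notes. It compares the exact binomial probability of each head count with the height of the Normal curve that has mean np and variance npq, and reports the largest difference for growing n. The helper names are hypothetical, and the binomial probabilities are computed through logarithms (`math.lgamma`) so that large n does not overflow.

```python
import math

def binomial_pmf(n, k, p=0.5):
    """Exact probability of k heads in n tosses, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def normal_pdf(x, mean, sd):
    """Height of the Normal curve with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def max_pointwise_gap(n, p=0.5):
    """Largest difference between the binomial pmf and the matching Normal curve."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    return max(abs(binomial_pmf(n, k, p) - normal_pdf(k, mean, sd))
               for k in range(n + 1))

sizes = (100, 1000, 10000, 100000)
gaps = [max_pointwise_gap(n) for n in sizes]
for n, gap in zip(sizes, gaps):
    print(f"n = {n:6d}: largest gap = {gap:.2e}")
```

The largest gap should drop steadily as n increases, which is the numerical counterpart of the histograms looking "pretty Normal".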

7 Is n = 1000 a good Approximation?

If we perform the n = 1,000 experiment many, many times, we can actually get a good idea of how well it is approximated by a Normal Distribution. We plot the 1000-toss experiment done 100,000 times below:

[Figure: histogram of the number of heads in 1000 tosses, from 100,000 experiments. x-axis: heads.]

What does this show us? Well... if we have a random variable that follows a Binomial Distribution where n is at least 1,000, then we find that it is almost a Normal Distribution! This is a very important discovery of ours!

Careful! Recall that a Normal Distribution is defined by two things: its mean and variance. If that's all we need, the mean and variance... well, we're gold! We have both of those things!
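Since the mean np and variance npq are all we need, here is a minimal sketch of putting them to work: it builds the Normal Distribution with that mean and variance and compares its probability for "between 470 and 530 heads" with the exact binomial answer for n = 1000. The function names are illustrative assumptions, and the exact calculation is written for a fair coin only; `statistics.NormalDist` is part of the Python standard library (3.8 and later).

```python
import math
from statistics import NormalDist

def exact_fair_coin_probability(n, low, high):
    """P(low <= S_n <= high) for a fair coin, using exact integer arithmetic."""
    favourable = sum(math.comb(n, k) for k in range(low, high + 1))
    return favourable / 2**n

def normal_probability(n, low, high, p=0.5):
    """The same probability under a Normal curve with mean np and variance npq."""
    approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
    return approx.cdf(high) - approx.cdf(low)

n = 1000
exact = exact_fair_coin_probability(n, 470, 530)
approx = normal_probability(n, 470, 530)
print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

The two numbers should agree to within about a percentage point. A small gap remains because the binomial counts whole numbers of heads, endpoints included, while the Normal curve is continuous.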