The Normal Curve and The Sampling Distribution

Discrete vs Continuous Data

We have seen examples of probability distributions for discrete variables X, such as the binomial distribution. We could use it to answer questions such as "What is the probability that I win exactly three times in five games?", i.e. P(X=3)=?

Similar questions make less sense with continuous variables, such as a person's height. What is the probability that a randomly selected person is exactly 180cm tall? On a continuous scale, no person is truly exactly 180cm tall. However, we could ask questions such as "How likely is it to select a person that is 180cm or taller?", i.e. P(X≥180)=? or "How likely is it to select a person approximately 180cm tall?", by which we might mean between 179.5cm and 180.5cm tall, i.e. P(179.5<X<180.5)=?

As a result, when working with continuous probability distribution curves (or "density curves"), we only work with probabilities of entire ranges of scores, not individual scores (which by default have probability zero).

Consider Data Set B from the sample data sets. Here a discrete probability distribution (the histogram) is compared to a continuous probability distribution (the curve). One can use one to approximate the other, and in both cases the probability of events is equivalent to the area under the graph.

The Normal Curve

One particular probability distribution is the normal curve (also called the bell curve or Gaussian curve). This particular curve is important to us, as it
a) models many natural phenomena well (e.g. heights, IQ test scores, etc. are often approximately bell shaped),
b) is well understood and studied,
c) is an essential theoretical tool for inferential statistics.

Notes:
* developed by C.F. Gauss
* single peak, symmetrical
* mean=median=mode (all in centre of distribution)
* stretches horizontally to infinity, although probabilities quickly become negligible
* the curve is completely described by just μ (centre) and σ (spread, roughly to the inflection point)
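The point that only ranges of values carry probability on a density curve can be sketched in a few lines of Python. The mean of 175cm and standard deviation of 7cm used here are hypothetical values chosen only for illustration; they do not come from these notes.

```python
from statistics import NormalDist

# Hypothetical parameters, assumed only for this illustration.
heights = NormalDist(mu=175.0, sigma=7.0)


def prob_between(dist, low, high):
    """Area under the density curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)


p_tall = 1 - heights.cdf(180)                        # P(X >= 180)
p_about_180 = prob_between(heights, 179.5, 180.5)    # P(179.5 < X < 180.5)
p_exactly_180 = prob_between(heights, 180.0, 180.0)  # zero-width range

print(f"P(X >= 180)         = {p_tall:.4f}")
print(f"P(179.5 < X < 180.5) = {p_about_180:.4f}")
print(f"P(X = 180)          = {p_exactly_180:.4f}")
```

The zero-width range has no area, which is why the probability of a single exact value is zero.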
Example

Here are two normal curves with different values of mean μ and standard deviation σ:

In a 1980 Australian study of male blood pressure in two populations (one with and one without medication), both data sets were found to be approximately normally distributed.

Population 1: healthy without medication, μ=80.7 mmHg, σ=9.2 mmHg
Population 2: hypertension with medication, μ=94.9 mmHg, σ=11.5 mmHg

Note:
* Both curves have the same area, but a curve with a higher value of σ (Population 2) is wider but flatter.
* A different value of μ simply shifts the curve on the horizontal axis.

Recall: The Empirical Rule

When we introduced z-scores, we defined the position of a data point relative to the mean (μ) in terms of standard deviations (σ). We already know that if data is normally distributed (bell-shaped), then we can approximate the proportion of data that will fall within a certain distance from the mean (i.e. within a certain range of z-scores):
1. Approximately 68% of data will fall within one standard deviation of the mean, i.e. will have z-scores in the range -1<z<1
2. Approximately 95% of data will fall within two standard deviations of the mean, i.e. -2<z<2
3. Approximately 99.7% of data will fall within three standard deviations of the mean, i.e. -3<z<3

Example: What proportion of Australian males without medication (Population 1) has blood pressure
a) between 71.5 and 89.9 mmHg
b) above 99.1 mmHg

Note again that we cannot use a density curve to directly give a proportion for an individual data point (e.g. "have exactly 85 mmHg") but only for a range of values.
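As a rough check of the empirical rule on the Population 1 example, the two proportions can be computed directly. This is a minimal sketch that assumes Python's standard-library `statistics.NormalDist` in place of the rule of thumb; note that 71.5 and 89.9 are exactly one standard deviation either side of the mean, and 99.1 is two standard deviations above it.

```python
from statistics import NormalDist

# Population 1: healthy without medication (values from the example above).
population_1 = NormalDist(mu=80.7, sigma=9.2)

# a) between 71.5 and 89.9 mmHg, i.e. -1 < z < 1
p_a = population_1.cdf(89.9) - population_1.cdf(71.5)

# b) above 99.1 mmHg, i.e. z > 2
p_b = 1 - population_1.cdf(99.1)

print(f"a) P(71.5 < X < 89.9) = {p_a:.4f}")  # empirical rule says about 68%
print(f"b) P(X > 99.1)        = {p_b:.4f}")  # empirical rule says about 2.5%
```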
Finding Other Normal Proportions

While the empirical rule gives us the proportion of population data lying above/below/between z-values of exactly z=1, z=2, z=3, it does not allow us to compute data proportions for other z-values.

e.g. Find the proportion of data in a normal curve that lies between z=-.5 and z=2.1. We will write this as P(-.5 < z < 2.1), i.e. the proportion (and later, probability) of data that has a z-score between -.5 and 2.1.

To do this, we look at a normal distribution with μ=0 and σ=1, called the standard normal distribution. We could use Calculus to find the area under the normal curve, which has the function equation f(z) = (1/√(2π))·e^(−z²/2), for any range of z-values. Thankfully, somebody has already done this for us and compiled the results in a table.

Use the standard normal table to find the following:
i) P(z<1.5)
ii) P(z>-0.85)
iii) P(-2<z<1)
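The table lookups above can be cross-checked numerically. The sketch below assumes the standard-library `statistics.NormalDist`, whose `cdf` method plays the role of a table giving the area to the left of a z-value; a printed table rounded to four decimals may differ in the last digit.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu=0, sigma=1 by default

p_i = z.cdf(1.5)                     # i)   P(z < 1.5)
p_ii = 1 - z.cdf(-0.85)              # ii)  P(z > -0.85), area to the right
p_iii = z.cdf(1) - z.cdf(-2)         # iii) P(-2 < z < 1), difference of areas
p_example = z.cdf(2.1) - z.cdf(-0.5) # P(-.5 < z < 2.1) from the text

for label, value in [("i", p_i), ("ii", p_ii), ("iii", p_iii), ("e.g.", p_example)]:
    print(f"{label:>4}: {value:.4f}")
```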
The table gives areas under the standard normal curve, i.e. the distribution with μ=0 and σ=1. However, we can convert any other normally distributed variable to z-scores and use the same techniques.

Examples:

1. The length of flights from Toronto to Frankfurt is normally distributed with μ=465 minutes (or 7 3/4 hours) and σ=23 minutes.
a) What proportion of flights take longer than 7 hours?
b) If we randomly select 200 flights, how many would we expect to last between 8 and 8½ hours?
c) If a flight's time was in the 80th percentile of all flights, how long did it take?

2. Scores in a college entrance exam are normally distributed with μ=71 and σ=10.2. You got an 85. If only the top 10% of applicants are accepted, is it time to celebrate?

3. For a particular group of patients, systolic blood pressure is normally distributed with a mean of 120 mmHg and a standard deviation of 8 mmHg.
a) What proportion of this population has blood pressure above 135?
b) Find the interquartile range for this data.

4. Assume TOEFL test scores are normally distributed with mean 570 (out of 660). If Andi got a score of 610, and is in the 90th percentile, what % of scores lie above 600?
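The conversion to z-scores can be sketched for the flight example (Example 1). This is one possible way to set up the calculation, again assuming `statistics.NormalDist` in place of the printed table; the helper name `z_score` is introduced here only for illustration.

```python
from statistics import NormalDist

MU, SIGMA = 465.0, 23.0   # flight length in minutes, from Example 1
std_normal = NormalDist()  # stands in for the standard normal table


def z_score(x):
    """Convert a flight length in minutes to a z-score."""
    return (x - MU) / SIGMA


# a) longer than 7 hours (420 minutes): area to the right of z
p_longer_7h = 1 - std_normal.cdf(z_score(420))

# b) between 8 and 8.5 hours (480 to 510 minutes), out of 200 flights
p_8_to_8h30 = std_normal.cdf(z_score(510)) - std_normal.cdf(z_score(480))
expected_flights = 200 * p_8_to_8h30

# c) 80th percentile: find z with area 0.80 to its left, then convert back
z_80 = std_normal.inv_cdf(0.80)
minutes_80 = MU + z_80 * SIGMA

print(f"a) {p_longer_7h:.4f}")
print(f"b) about {expected_flights:.1f} flights")
print(f"c) about {minutes_80:.1f} minutes")
```

Part c) works in the opposite direction to parts a) and b): it starts from an area and recovers the score via x = μ + zσ.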
Using the Normal Curve to Approximate the Binomial

At a union rally, 60% of members strongly support strike action, 30% somewhat support strike action, and 10% do not support strike action. If 60 members are selected at random, what is the probability that more than half of them strongly support strike action?

This is actually a binomial distribution question.
- repeated trials: n=60
- success or failure: p=.6, q=.4 (lump the other two possibilities into a general "failure")
- independent trials

Hence we would like to know the probability P(X>30).

Hmmm... this is going to take a while! P(X>30) = P(X=31) + P(X=32) +...+ P(X=60).

However, for large enough n, the binomial curve looks very much like a normal curve, hence we can use the normal to approximate the binomial.

Condition: Need np ≥ 10 and nq ≥ 10. In our case, np=36 and nq=24, so we're OK!

We want to convert the binomial P(X>30) into a normal probability question, i.e. convert X=30 into a z-score. We need to

Step 1: Convert the discrete variable into a continuous variable by ±0.5, the continuity correction factor. A discrete X>30 converts to a continuous X>30.5.

Step 2: Get the mean and standard deviation. From earlier: μ = np = 36 and σ = √(npq) = 3.79.

Step 3: Convert to a z-score, using z = (x − μ)/σ = (30.5 − 36)/3.79 = −1.45.

Step 4: Now find the normal probability, as per usual. P(z > -1.45) = 1 - .0735 = .9265

Hence there is a 92.65% chance that more than half are strong supporters.

Note: Some texts do not use the continuity correction factor, instead just using X=30 in the above.
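The four steps above can be compared against the "long way" of adding up the individual binomial probabilities. The sketch below is only an illustration: it assumes `math.comb` for the exact binomial terms and `statistics.NormalDist` in place of the table, and it does not round z to two decimals, so the approximation may differ slightly from .9265.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 60, 0.6
q = 1 - p

# Exact binomial: P(X > 30) = P(X=31) + P(X=32) + ... + P(X=60)
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(31, n + 1))

# Condition for using the normal approximation
condition_ok = n * p >= 10 and n * q >= 10

# Steps 1 to 4: continuity correction, mean and sd, z-score, normal area
x_corrected = 30.5                 # discrete X > 30 becomes continuous X > 30.5
mu = n * p                         # 36
sigma = sqrt(n * p * q)            # about 3.79
z = (x_corrected - mu) / sigma     # about -1.45
approx = 1 - NormalDist().cdf(z)   # area to the right of z

print(f"condition satisfied: {condition_ok}")
print(f"exact binomial:      {exact:.4f}")
print(f"normal approx:       {approx:.4f}")
```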
Example

According to Statistics Canada, the mother tongue of 22.7% of Canadians is French. If a random sample of 100 Canadians is selected, find the probability that the number of people with French as their mother tongue
a) is more than 20?
b) is between 20 and 30?
Inferential Statistics: Introduction to the Sampling Distribution and the Central Limit Theorem

Our main goal in inferential statistics is to make predictions about the value of population parameters using a sample statistic.

For example, we wish to estimate the average exposure to lead through tap water in Regina households (Health Canada has established a maximum safe concentration of 10 parts per billion). We cannot easily access all Regina households (the entire population with size N) to test the water and calculate the population mean. Instead, we can only consider a small sample (of size n) of Regina households and calculate a sample mean.

Our goal will be to make an inference on the population parameter from the collected sample statistic. For example, we may wish to estimate the value of μ given just our calculated value of x̄. The first step will be to consider the nature and (mathematical) behaviour of possible samples.

First, let's look at a tiny population of just N=10 houses.

House       #1  #2  #3  #4  #5  #6  #7  #8  #9  #10
Lead (ppb)   4   2   7   3  10  12   1   5   2    7

The population average can be calculated as μ = ____

If we take a random sample of three households, say #3, #6 and #9, we get a sample average of x̄ = ____

Let's add a fourth house to the sample (e.g. #8). Now x̄ = ____

Add a fifth house (e.g. #2). Now x̄ = ____

Clearly, x̄ ≠ μ. Even with this small population, we can see that as the size of the sample grows, its sample mean will (in general) tend closer and closer to the value of the population mean (with some "ups and downs"). This is known as the law of large numbers.
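The growing-sample calculation for the ten houses can be sketched as follows. The data are taken from the table above; the variable names are chosen here only for illustration.

```python
from statistics import fmean

# Lead content in ppb for houses #1 to #10 (from the table above).
lead_ppb = {1: 4, 2: 2, 3: 7, 4: 3, 5: 10, 6: 12, 7: 1, 8: 5, 9: 2, 10: 7}

population_mean = fmean(lead_ppb.values())

# Start with houses #3, #6, #9, then add #8, then add #2.
samples = [[3, 6, 9], [3, 6, 9, 8], [3, 6, 9, 8, 2]]
sample_means = [fmean(lead_ppb[h] for h in houses) for houses in samples]

print(f"population mean: {population_mean}")
for houses, mean in zip(samples, sample_means):
    print(f"n={len(houses)}: houses {houses} -> sample mean {mean}")
```

For this particular sequence of houses, the sample means are 7.0, 6.5 and 5.6, moving towards the population mean of 5.3 as houses are added.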
Let's now consider an even smaller scale example in far more detail. Instead of a city of 200,000 people, we will look at a population with only four households. The data for lead content (in parts per billion) is given as:

Household       #1  #2  #3  #4
Lead (in ppb)    8   2   3  12

Again, we can compute the population mean without much difficulty. μ = ____

Can we find this value by just considering samples (since, in large populations, we might not be able to calculate μ)? Consider, for example, a (tiny) sample of size n=2. What is a possible sample? Perhaps we selected household #1 and #4. The average for this sample is x̄ = ____

Clearly, for this particular sample, x̄ ≠ μ. Pick a different sample and calculate its average:

Let's now consider all (!) possible samples of size n=2 taken from this population, that is, any combination of 2 households selected from the population of 4 households. How many possible samples do we have?

List the samples and their averages:

Sample    Houses    Sample ppb    Sample Average

This distribution, listing all possible sample means of a given size (here n=2), is called the sampling distribution of the mean (or simply the sampling distribution).

Note:
1) None of the sample averages equal μ (in this case).
2) However,...
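Listing every possible sample by hand is feasible here, and it can also be sketched in code. The example below assumes that a "sample" means an unordered selection of distinct households (a combination), as the wording above suggests; the function name `sampling_distribution` is introduced only for this illustration.

```python
from itertools import combinations
from statistics import fmean

# Lead content in ppb for households #1 to #4 (from the table above).
lead_ppb = {1: 8, 2: 2, 3: 3, 4: 12}


def sampling_distribution(population, size):
    """Map every possible sample (tuple of households) to its sample mean."""
    return {
        houses: fmean(population[h] for h in houses)
        for houses in combinations(sorted(population), size)
    }


population_mean = fmean(lead_ppb.values())
distribution = sampling_distribution(lead_ppb, 2)
mean_of_sample_means = fmean(distribution.values())

for houses, mean in distribution.items():
    print(f"houses {houses}: ppb {[lead_ppb[h] for h in houses]} -> mean {mean}")
print(f"population mean:      {population_mean}")
print(f"mean of sample means: {mean_of_sample_means}")
```

Calling `sampling_distribution(lead_ppb, 3)` gives the corresponding list for samples of size n=3.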
Let's try this again. Construct the sampling distribution for samples of size n=3 taken from this population, that is, any combination of 3 households selected from the population of 4 households. Now there are ____ possible samples.

Again, list the samples and their averages:

Sample    Houses    Sample ppb    Sample Average

Calculate the average of all four sample averages:

Again, it appears, and this will always be the case, that the average of all sample averages equals the population average, i.e. the mean of the sampling distribution of means equals the population mean. We write this as μ_x̄ = μ

If we consider larger populations, with larger possible samples, a second trend emerges. For example, consider a population of 10,000 randomly generated numbers between 0.0 and 10.0. We would expect their average to be very close to μ=5.0. Now consider sampling distributions for sizes n=1, n=2, n=5, and n=100:
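A simulation along these lines can be sketched as follows. It is only an approximation of the idea: instead of listing all possible samples (far too many for a population of 10,000), it draws 2,000 random samples of each size, and both the seed and the number of repetitions are arbitrary choices made here.

```python
import random
from statistics import fmean, pstdev

random.seed(1)  # arbitrary seed so the run is repeatable

# Population of 10,000 randomly generated numbers between 0.0 and 10.0.
population = [random.uniform(0.0, 10.0) for _ in range(10_000)]


def simulate_sample_means(population, n, repetitions=2_000):
    """Draw `repetitions` random samples of size n and return their means."""
    return [fmean(random.sample(population, n)) for _ in range(repetitions)]


summary = {}
for n in (1, 2, 5, 100):
    means = simulate_sample_means(population, n)
    summary[n] = (fmean(means), pstdev(means))
    print(f"n={n:>3}: mean of sample means = {summary[n][0]:.3f}, "
          f"standard deviation = {summary[n][1]:.3f}")
```

Plotting a histogram of `means` for each n would show how the shape of each simulated sampling distribution changes with the sample size.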
We can now see the following:
- each sampling distribution really does have mean μ_x̄ = 5.0 = μ
- as the sample size increases, the shape of the distribution seems to increasingly approximate the normal distribution.
- as the sample size increases, the standard deviation of the sampling distribution decreases.

In fact, the standard deviation of the sampling distribution, σ_x̄, can be expressed in terms of the population standard deviation σ as well: σ_x̄ = σ/√n. This term is also called the Standard Error of the Mean.

These results are known as the Central Limit Theorem: If a sample of size n is selected from any population, the distribution of sample means will increasingly approximate a normal curve with mean μ_x̄ = μ and standard deviation σ_x̄ = σ/√n, as n increases.

Note again that the Central Limit Theorem holds regardless of the shape of the population distribution! Even if the original population is very skewed, the sampling distribution will approximate a normal curve. In practice, we can assume that if the sample size is large enough (and we will often use n>30), the sampling distribution will be close enough to the normal distribution N(μ, σ/√n) for us to use the normal tables.

Careful:
Sample Distribution: the distribution of values in one particular sample.
Sampling Distribution: the distribution of the mean values of all possible samples with a given size.

Also note: While theoretically we can find μ (the value we're after) by finding a sampling distribution and calculating its mean, the latter is far, far more cumbersome than just calculating μ. While this may seem frustrating at first ("then why are we studying this?!?!?"), the idea behind the sampling distribution becomes a central tool for our future work.
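The claim that the Central Limit Theorem holds even for a very skewed population can be probed with a small simulation. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a strongly right-skewed choice made here purely for illustration) and checks how often a sample mean of n=40 values falls within one standard error of μ; for a normal curve this should happen roughly 68% of the time.

```python
import random
from math import sqrt
from statistics import fmean

random.seed(2)  # arbitrary seed so the run is repeatable

n = 40                 # sample size (larger than 30)
repetitions = 5_000    # number of simulated samples
pop_mean = 1.0         # mean of the assumed exponential population
pop_sd = 1.0           # standard deviation of that population

standard_error = pop_sd / sqrt(n)  # sigma / sqrt(n)

# Each sample mean averages n draws from the skewed population.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(repetitions)
]

within_one_se = sum(
    abs(m - pop_mean) < standard_error for m in sample_means
) / repetitions

print(f"mean of sample means:        {fmean(sample_means):.3f}")
print(f"standard error sigma/sqrt(n): {standard_error:.3f}")
print(f"fraction within one SE of mu: {within_one_se:.3f}")
```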
An Example

We wish to investigate average phone bills for cellphone users. Suppose, for the sake of this example, that we already know the population mean μ=$102.30 and standard deviation σ=$32.10.

a) Assuming the population is normally distributed, find the probability that a randomly selected phone bill will be above $100.

b) In reality, we can be almost certain that this distribution won't be normal. In fact, it will likely be...

Without the assumption of normality we couldn't actually calculate the probability that a random phone bill is above $100, since the normal table doesn't apply. That is, our answer in part (a) is likely incorrect.

c) Suppose we now take a random sample of 35 cellphone users. Describe the nature of the sampling distribution:
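Parts a) and c) can be sketched numerically. For part a) the code takes the normality assumption at face value, as the question asks; for part c) it only computes the standard error σ/√n that describes the spread of the sampling distribution. `statistics.NormalDist` is assumed in place of the normal table.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 102.30, 32.10  # population mean and standard deviation in dollars
n = 35                     # sample size in part c)

# a) a single bill above $100, assuming a normal population
p_single_above_100 = 1 - NormalDist(mu, sigma).cdf(100)

# c) the sampling distribution is approximately normal with
#    mean mu and standard deviation sigma / sqrt(n)
standard_error = sigma / sqrt(n)

print(f"a) P(X > 100) = {p_single_above_100:.4f}")
print(f"c) sampling distribution: mean {mu:.2f}, standard error {standard_error:.2f}")
```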
d) Now find the probability that a sample of 35 phone bills will have a sample mean above $100.

e) What is the probability that the mean of a sample of 35 phone bills will be within $5 of the population mean?

Again, while we cannot calculate probabilities for individuals under these circumstances, we can investigate probabilities for sample means (of sufficiently large samples), regardless of the shape of the original distribution!

Another Example

Assume that the weight a certain type of plastic bag can carry is normally distributed with μ=12.2 kg and σ=1.9 kg.

a) How likely is it that a randomly selected bag will be capable of holding 15 kg?

b) If a sample of 40 bags is selected for testing, find the probability that the mean carrying weight is between 12 kg and 13 kg.
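Questions about sample means in both examples follow the same pattern: replace σ by the standard error σ/√n and proceed as with any normal probability. The sketch below is one way to organise this; the helper name `sampling_distribution_of_mean` is introduced here for illustration, and "capable of holding 15 kg" is read as a carrying weight of at least 15 kg.

```python
from math import sqrt
from statistics import NormalDist


def sampling_distribution_of_mean(mu, sigma, n):
    """Normal curve for sample means: mean mu, standard deviation sigma/sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n))


# Phone bills: mu = 102.30, sigma = 32.10, samples of n = 35
bills = sampling_distribution_of_mean(102.30, 32.10, 35)
p_mean_above_100 = 1 - bills.cdf(100)                          # sample mean above $100
p_mean_within_5 = bills.cdf(102.30 + 5) - bills.cdf(102.30 - 5)  # within $5 of mu

# Plastic bags: mu = 12.2 kg, sigma = 1.9 kg
p_single_bag_15 = 1 - NormalDist(12.2, 1.9).cdf(15)  # a) one bag (population is normal)
bags = sampling_distribution_of_mean(12.2, 1.9, 40)
p_mean_12_to_13 = bags.cdf(13) - bags.cdf(12)        # b) mean of 40 bags

print(f"bills, mean above $100:   {p_mean_above_100:.4f}")
print(f"bills, mean within $5:    {p_mean_within_5:.4f}")
print(f"bags a), single bag:      {p_single_bag_15:.4f}")
print(f"bags b), mean of 40 bags: {p_mean_12_to_13:.4f}")
```

Part a) of the bag example uses the population curve directly because it concerns one bag from a normally distributed population, whereas the other three probabilities concern sample means.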