Statistics 522: Sampling and Survey Techniques. Topic 6. There are situations where this can give much better results.

Transcription

1 Topic Overview Statistics 522: Sampling and Survey Techniques This topic will cover Sampling with unequal probabilities Sampling one primary sampling unit One-stage sampling with replacement Topic 6 Unequal probabilities Recall π i is the probability that unit i is selected as part of the sample. Most designs we have studied so far have the π i equal. Now we consider general designs where the π i can vary with i. There are situations where this can give much better results. Example 6.1 Survey of nursing home residents in Philadelphia to determine preferences on lifesustaining treatments 294 nursing homes with a total of 37,652 beds (number of residents not known at the planning stage) Use cluster sampling Suppose we choose an SRS of the 294 nursing homes and then an SRS of 10 residents of each selected home. A nursing home with 20 beds has the same probability of being sampled as a nursing home with 1000 beds. 10 residents from the 20 bed home represent fewer people than 10 residents from 1000 bed home. 1

2 Self-weighting This procedure gives a sample that is not self-weighted. Alternatives that are self-weighted. A one-stage cluster sample Sample a fixed percentage of the residents of each selected nursing home. The two-stage cluster design The two-stage cluster design (SRS of homes, then equal proportion SRS of residents in each selected home) Gives a mathematically valid estimator SRS at first stage Three shortcomings: We would expect t i to be proportional to the number of beds in nursing home i, so estimators will have large variance (M i ). Equal percentage sampling in each selected home may be difficult to administer. Cost is not known in advance (dont know if you will get large or small homes in sample). The study They drew a sample of 57 nursing homes with probabilities proportional to the number of beds. Then they took an SRS of 30 beds (and their occupants) from a list of all beds within each selected nursing home. Properties Each bed is equally likely to be in the sample (note beds vs occupants). The cost is known before selecting the sample. The same number of interviews is taken at each nursing home. The estimators will have smaller variance 2

3 Key ideas When sampling with unequal probabilities, we deliberately vary the selection probabilities. We compensate by using weights in the estimation. The key is that we know the selection probabilities Notation The probability that psu i is in the sample is π i. The probability that psu i is selected on the first draw is ψ i. We will consider an artificial situation where n =1,soπ i = ψ i. Sampling one psu Sample size is n =1. Suppose we are interested in estimating the population total. t i is the total for psu i. To illustrate the ideas, we will assume that we know the whole population. The Example N = 4 supermarkets Size (in square meters) varies. Select n = 1 with probabilities proportional to size. Record total sales Using the data from one store we want to estimate total sales for the four stores in the population. The population Store Size ψ i t i A 100 1/16 11 B 200 2/16 20 C 300 3/16 24 D / Total

4 Weights The weights w i are the inverses of the selection probabilities ψ i. The weighted estimator of the population total is ˆt ψ = w i t i. There are four possible samples. We calculate ˆt ψ for each. The samples Sample ψ i w i t i ˆt ψ A 1/ B 2/ C 3/16 16/ D 10/16 16/ Sampling distribution of the estimate ˆt ψ Sample ψ i ˆt ψ 1 1/ / / / Mean of the sampling distribution of ˆt ψ So ˆt ψ is unbiased. This will always be true. Eˆt ψ = = 300 = t 16 Eˆt ψ = ψ i w i t i = t i Variance of the sampling distribution ˆt ψ Var(ˆt ψ )= 1 16 ( ) ( ) ( ) ( )2 = Compare with the variance for an SRS: Var(ˆt SRS )= 1 4 ( ) ( ) ( ) ( )2 =

5 Interpretation Store D is the largest and we expect it to account for a large portion of the total sales. Therefore, we give it a higher probability of being in the sample (10/16) than it would have with an SRS (1/4). If it is selected, we multiply its sales by (16/10) to estimate total sales. One-stage sampling with replacement Suppose n>1 and we sample with replacement. This implies π i =1 (1 ψ i ) n. Probability that item i is selected on the first draw is the same as the probability that item i is selected on any other draw. Sampling with replacement gives us n independent estimates of the population total, one for each unit in sample. We average these n estimates. Estimated variance is variance of the estimates divided by n Example 6.2 N = 15 classes of elementary stat M i students in class i (i = 1 to 15) Values of M i range from 20 to 100. We want a sample of 5 classes. Each student in the selected classes will fill out a questionnaire. (It is possible for the same class to be selected more than once.) Randomization There are a total of 647 students in these classes. Select 5 random numbers between 1 and 647. Think about ordering the students by class. Each random number corresponds to a student and the corresponding class will be in the sample. 5

6 This method This method is called the cumulative-size method. It is based on M 1, M 1 + M 2, M 1 + M 2 + M 3,... Analternativeistousethecumulativesumsoftheψ i and select random numbers between 0 and 1. For this example, ψ i = M i /647 Alternative Systematic sampling is often used as an alternative in this setting. The basic idea is the same. Not technically sampling with replacement Works well as systematic sampling works well. See page 186 for details. Lahiris method Involves two stages of randomization Rejection sampling: corresponds to classroom problem in Problem Set 2. Can be inefficient. See page 187 for details Estimation Theory Let Q i be the number of times unit i occurs in the sample. Then ˆt ψ = 1 n Qi t i /ψ i. The estimated variance of ˆt i is 1 Qi ( t i ˆt ψ ) 2 n(n 1) ψ i The estimate and its estimated variance are both unbiased. Choosing the selection probabilities We want small variance for our estimator. Often, t i is related to the size of the psu. We can take ψ i proportional to M i or some other measure of the size of psu i. 6

7 PPS This procedure is called sampling with probability proportional to size (pps). The formulas for the estimate and variance can be simplified for this special case. See page 190 for details See Example 6.5 on pages ψ i = M i K t i ψ i = Kȳ i Two-stage sampling with replacement Basic ideas are very similar to one-stage sampling. ψ i is the probability that psu i is selected on the first (or any) draw. We take a sample of m i ssus from each selected psu. Sampling ssu s Usually we use an SRS. Alternatives include systematic sampling any other probability sampling method Note if a psu is selected more than once, a separate independent second stage sample is required. Estimates and SE s Weights are used to make the estimators unbiased. Formulas are similar to those for one-stage. See (6.8) and (6.9) on page 192 7

8 Outline of the procedure 1. Determine the ψ i. 2. Select the n psus (with replacement). 3. Select the ssus. 4. Estimate the t for each selected psu, ˆt ψ = weight ˆt 5. The average of these is ˆt ψ. 6. SE is the standard error of these (sd/ n). Unequal probability sampling without replacement ψ i is the probability of selection on the first draw. The probability of selection on later draws depends on which units were selected on earlier draws. Estimation π i is called the inclusion probability. ( pop π i = n) π i,j is the probability that both psu i and psu j are in the sample. ( j i π i,j =(n 1)π i ) Weights (inverse of selection probability) we use π i /n in place of ψ i (with replacement) The recommended procedure is to use the Horvitz-Thompson (HT) estimator and the associated SE. (ˆt HT = sam ˆt i /π i ) See page for details. This estimator can be generalized to other designs that do not use replacement. Randomization Theory Framework is Probability sampling without replacement for the psus for the first stage Sampling at the second stage is independent of sampling at the first stage 8

9 Horvitz-Thompson Randomization theory can be used to prove the Horvitz-Thompson Theorem. Expected value of the estimator is t. Formula for the variance of the estimator The estimator ˆt HT = ˆt i /π i where the sum is over the psu s selected in the first stage. Idea behind proofs is to condition on which psus are in the sample. Study pages Model One-way random effects anova model where Y i,j = A i + ɛ i,j the A i are random variables with mean µ and variance σa 2 the ɛ i,j are random variables with mean 0 and variance σ 2. the A i and the ɛ i,j are uncorrelated The pps estimator π i = nm i /K the inclusion probability We rewrite this as a weighted estimator. where w i,j = K nm i ˆT P = K nm i ˆTi ˆt i = M i m i Yi,j ˆt P = w i,j Y i,j Take expected values to show that the estimator is unbiased. 9

10 Variance The variance can be computed. See page 211 The variance depends on which psu s are selected through the M i. The variance is smallest when psu s with the largest M i are chosen. Recall Estimate of population total is the weighted average of the ˆt i for the selected psus. The weights w i are the inverses of the probabilities of selection. Elephants A circus needed to ship its 50 elephants. They needed to estimate the total weight of the animals. It is not easy to weigh 50 elephants and they were in a hurry. They had data from three years ago. Sample NO The owner wanted to base the estimate on a sample. Dumbo had a weight equal to the average three years ago. The owner wanted to weigh Dumbo and multiply by 50. The statistician said: You have to use probability sampling and the Horvitz-Thompson estimator. They compromised: The probability of selecting Dumbo was set as 99/100. The probability of selecting each of the other elephants was 1/

11 Who was selected NO Dumbo, of course. The owner was happy and said now we can estimate the weight of the 50 elephants as 50 times Dumbos weight, 50y. The statistician said The estimate of the total weight of the 50 elephants should be Dumbos weight divided by his probability of selection. This is y/(99/100) or 100y/99. The theory behind this estimator is rigorous What if The owner asked What if the randomization had selected Jumbo the largest elephant in the herd? The statistician replied 4900y, where y is Jumbos weight. Conclusion The statistician lost his circus job and became a teacher of statistics. bad model; highly variable estimator Due to Basu (1971). 11