Weighting Methods in Survey Sampling

Setion on Survey Researh Methods JSM 01 Weighting Methods in Survey Sampling Chiao-hih Chang Ferry Butar Butar Abstrat It is said that a well-designed survey an best prevent nonresponse. However, no matter how well a survey is designed, in pratie, nonresponse almost always ours. The easiest way to deal with nonresponse is to ignore it, but frequently, ignoring nonresponse results in poor survey quality. Item nonresponse and unit nonresponse are two types of nonresponse. Imputation proedures are popular remedies for the former while weighting methods are ommonly used to ompensate for the latter. This paper fouses on weighting methods that help redue nonresponse bias. Key Words: nonresponse, adjustment, alibration, missing data 1. Introdution In survey sampling, a good sample is desirable for making good inferenes about a population. The bias of a properly alulated estimate results from sampling error and nonsampling errors. Pratially, sampling error always exists beause of sample-to-sample variation. The only sure way to avoid sampling error is to study the entire population, i.e., ensus, and that is impratial for a huge population. Seletion bias is an example of nonsampling error, and it inludes nonresponse whih might greatly bias estimates alulated without adjustments. If there is nonresponse together with under-overage in survey sampling, missing data will result. Under-overage ours when not all of the elements in the target population are inluded in the sampling frame. Nonresponse an be divided into item nonresponse and unit nonresponse. In item nonresponse, at least one but not all of the measurements of interest are obtained from the sampled unit. In unit nonresponse, the sampled unit provides nothing. However, some information usually is available by other means. Example 1.1. Suppose a telephone survey is onduted to estimate household eletriity onsumption in a ertain region, and a sample of n individuals is drawn from the residential phone diretory. Therefore, the target population inludes all households in the region and the sampling frame is the list of people in the telephone diretory. Eah of the n individuals is asked the type of housing unit, the number of bedrooms, and the average monthly eletrial use see Table 1. Lohr 009 stated that missing data and haphazard mistakes in data olletion are often the biggest auses of error in a survey. Designing a good survey and questionnaire, and being areful in olleting data an prevent poor response rates and mistakes in data olletion. Follow-up and allbak are ommon proedures used to inrease response rate. In the telephone survey example, it is possible that the seleted person unavailable at the first all is ontated and there is a response at the kth allbak. k is a positive integer. Sine making phone alls may be expensive and time-onsuming, it is impratial to try to ontat someone a number of times. Furthermore, it is usually the ase that some people refuse to be interviewed no matter how many times an interviewer tries. Department of Mathematis and Statistis Sam Houston State University, Huntsville, TX 77341 Department of Mathematis and Statistis Sam Houston State University, Huntsville, TX 77341 4768

Setion on Survey Researh Methods JSM 01 Table 1: Hypothetial data for example Housing Unit Number of Average Monthly Individual Type Bedrooms Eletri Usage kw h a House 3,000 b House - - - - 3,000 d Apartment - 500 f - - - g - 1,000 h Other - i - 3 -.. Unit nonresponse: f. Item nonresponse: b,, d, g, h, i... Sine nonresponse almost always ours in a survey, one should be autioned if the nonrespondents differ ritially from the respondents, espeially when the response rate is low. Due to the presene of nonresponse and the plausible differene between respondents and nonrespondents, the sample is onsidered to be representative for the population of people who will answer survey questions but not for the target population. Therefore, inferenes based only on the respondents do not seem to be valid. When nonresponse is inevitable and non-ignorable, there are ways to ompensate.. Simple Random Sampling SRS Simple random sampling is the most ommon probability sampling proedure. Although it is referred to as simple random sampling, this proedure is very important beause fully understanding simple random sampling is a prerequisite for further studies in other sampling tehniques. In simple random sampling, every element in the population has an equal probability of being seleted in the sample. Intuitively, eah subset of a fixed number of elements in the population has an equal probability of being seleted as a sample..1 Estimation in Simple Random Sampling In many sampling studies, the most ommon objetive is to estimate the population total, mean, or proportion. The estimation of population total is emphasized here beause estimated population total divided by the population size yields the estimated population mean and estimating proportion is a speial form of estimating mean. To derive the estimator of the population total, let U = {1,,..., N} and let S be a set of n elements hosen from U. Let U y = {y i i U} denote the population of size N, then y i is the survey harateristi of the ith unit and the simple random sample is denoted by S y = {y i i S}. Let t and t s be population total and sample total, respetively, then t = N y i and t s = y i. The i=1 i S 4769

Setion on Survey Researh Methods JSM 01 unbiased estimator of the population mean, Y = t N, 1 is simply the sample mean, y = t s. The population total is estimated by substituting y for n Y in 1, so ˆt = N n t s. Note that t and Y are unknown sine, in a survey, a sample is a small portion of the population, i.e., n < N. In fat, n is muh smaller than N, and this is the idea of sampling using n units to represent N units, i.e, one unit in the sample represents N/n units in the population. Suppose that the perfet representative sample is given, then a logial relation between the sample total and the population total should be t s : t = n : N. From this relation, one an see that the estimation of the population total is Let w i = N/n for all i S, then may be rewritten as ˆt = N n t s. ˆt = i S w i y i, 3 where w i is alled the sampling weight. Sampling weights are alulated to help simplify the alulation in many sample surveys. More importantly, making adjustments on sampling weights may help redue nonresponse bias. Example.1. A simple random sample of size 40 drawn from a population of size 10 is illustrated in Figure.1. In this sample design, w i = 10 40 for all i S, i.e., one seleted unit represents 3 units inluding itself in the population. Cornfield 1944 introdued a useful method of deriving the expeted value and the variane of an estimator in sampling without replaement from a finite population. In order to use the method the researher must show that ˆt is an unbiased estimator of t and to obtain the variane of ˆt. Let Z i be a random variable suh that { 1, i S Z i = 0, i S. Note that E Z i = n/n, and E Z i Z j = n N N 1. By definition, given n and N, the variane of ˆt is V ar ˆt n, N [ = E ˆt t ] n, N, and it an be shown V ar ˆt n, N = N 1 n N S n 1 n, where S = N 1 1 N y i Y, where the unknown S an be i=1 4770

Setion on Survey Researh Methods JSM 01 estimated by the sample variane, s = n 1 1 y i y. Thus, i S V ar ˆt n, N = N 1 n s N n 4 is an unbiased estimator of V ar ˆt n, N.. Nonresponse in Simple Random Sampling Two widely disussed remedies for nonresponse are imputation and weighting adjustment. In imputation, the researher may fill in missing values by plausible values whih are generated from observed variables, hene imputation is the ommon remedy for item nonresponse. There are many imputation proedures used suh as hot-dek imputation, olddek imputation, regression imputation, multiple imputation, and frational imputation. However, our researh fouses on weighting adjustment, in whih sampled units are lassified into lasses by auxiliary information, and then eah responding sampled unit is assigned an adjustment weight alulated by taking the inverse of the response rate of the orresponding lass. Responding sampled units in a lass with low response rate are assigned higher weights than those in a lass with a high response rate. Weighting adjustment is the ommon remedy for unit nonresponse, and it does not require filling in gaps in the data. A weighted estimator is unbiased under the assumption the probability of nonresponse is the same for all units within a lass. Although the assumption is usually not satisfied in pratie, it is more reasonable than to assume that the probability of nonresponse is the same for all units in the sample, hene nonresponse is ignorable. As a result, weighting adjustment might not eliminate nonresponse bias but it is useful for reduing nonresponse bias. Example.. The small data set in Table. is reated to mimi the situation in Example 1.1 in whih one an see both item nonresponse and unit nonresponse. This sample of size, n = 40, is drawn from the population of size, N = 10. The goal is to estimate t, the total average monthly eletriity onsumption in the region of interest. Y is the study variable; X 1, X, and X 3 are the auxiliary variables. X 3 is obtained from other available soures but not from the sampled units, hene it is always observed. Y, X 1, and X are subjet to nonresponse. Suppose that nonresponse an be totally ignored and only observed Y s are onsidered in the estimation, then ˆt and V ar ˆt n, N an be alulated by using 3 and 4, respetively. Although ˆt may not be affeted by the nonresponse, V ar ˆt n, N beomes large beause n is 6 instead of 40. Consequently, the onfidene interval may be too wide and does not provide muh information. Ignoring nonresponse is usually not a good idea. The following are some possible approahes to handling this data set. Fill in every gap in the data set by some imputation tehniques. Form lasses by X 3 and perform weighting adjustment. Fill in the gaps in auxiliary information and form lasses by the best possible set of auxiliary variables. Then, perform weighting adjustment. Fill in the gaps for the data points with suffiient auxiliary information. Weighting adjustment an then ompensate for the remaining nonresponse. Note that item nonresponse may be treated as unit nonresponse when a respondent answers too few questions. 4771

Setion on Survey Researh Methods JSM 01 Table : Hypothetial sample of 40 from a population of 10 ID X 1 X X 3 Y 1 3-1 500 1 1 500 9 3 3 1-10 1-1 - 1 1 4 4000 19 3 1 1 1500 3 3-8 4 1 500 30 3-1 1000 3 1 1 000 33 1 3000 36 3 1 3000 37 - - 1-43 - - - 44 1 3 1 000 5 3 1 1 1000 53 1-1 4500 58 4-1 500 60 3 1 000 63 3 1 1500 ID X 1 X X 3 Y 65 1 1 1500 66 3 1 500 70 3 1 1 500 73 3 1-75 1-000 79 4-1 - 83 1 3 1 3500 84 1 3-85 3 1 1 1000 86 3 1 3000 88 - - - 89-95 1 - - 96 1 1 500 99 - - - 101 3 1 1 000 104 3 1 1 1500 110-1 500 11 1-1 - 117 3 - Note: X 1 =1 house, = mobile home, =3 apartment,=4 other; X =1 one bedroom, = two bedrooms,= 3 three bedrooms,=4 at least four bedrooms; X 3 =1 the ommunity was establishd after 1950, = otherwise; Y denotes average monthly eletrial usage in kw h. Lohr 009 noted that response rates an be manipulated by defining them differently and presented several formulas that are used in surveys for alulating response rates. Here, for demonstration purposes, assume that only one variable is to be observed in a survey. Therefore, the alulation of response rates is straightforward. Let Y be the study variable subjet to nonresponse, and the response rates are alulated by dividing the number of observed Y s by the number of units in sample. Let m be the number of observations obtained from the simple random sample of size n drawn from the population of size N and let r 100% denote the response rate, where r = m n. Figure.1 displays the sample with 100%, a response rate that is nearly impossible to ahieve in surveys. The following example is an altered Example.1 in whih nonresponse ours. It shows the effet of ignoring nonresponse. Example.3. Consider the situation in Example 1.1 again. The goal is to estimate the total average monthly eletriity onsumption in the region of interest. Let Y denote the average monthly eletrial usage. Suppose that the sample is shown in Figure.3 in whih and represent seleted persons who do not give information about their average monthly eletrial usage while and represent seleted persons who answer the question about their average monthly eletrial usage. The response rate is 65% beause m = 6 and n = 40. Sine the sample is drawn from a telephone diretory, seleted persons addresses are usually available notwithstanding nonresponse. This additional information enables the researher to find out in what ommunity a person lives. Suppose it is observed that housing units in old ommunities are generally larger than those in new ommunities. Also, it is reasonable to state that people who live in large properties normally have higher eletrial 477

Setion on Survey Researh Methods JSM 01 usage than those who live in small properties. Ideally, one would like to form lasses by some harateristis and then view the group of respondents in eah lass as a simple random sample of the orresponding lass. Consider only Y and X 3, then the data set in Table. an be lassified into two groups by the age of the ommunity, X 3. One an visually tell the differene in response rate between the two lasses in Figure.4. Class I is omprised of n 1 = 30 people and m 1 = 4 out of 30 people are respondents. Thus, the response rate in lass I is 80%, i.e., r 1 = 0.8. Class II is omprised of n = 10 people and only m = out of 10 people are respondents. Thus, the response rate in lass II is 0%, i.e., r = 0.. Suppose that people in lass II onsume more eletriity than people in lass I, then ignoring nonresponse an seriously bias the estimate. To show that, use the data of the 6 respondents to estimate the total average monthly eletriity onsumption, t. This an be thought of as taking a simple random sample of size, m = 6, drawn from the population of size, N = 10, so here, S is a set ontaining 6 respondents IDs. The sampling weight is N/m. Use 3 to find ˆt 49, 31. As an example, assume that the sample is representative and the lassifiation is perfet, then the group of respondents in eah lass an be viewed as a simple random sample obtained from a orresponding subpopulation. Let N 1 and N be the number of members in subpopulation I and the number of members in subpopulation II, respetively. Sine the sample is representative, then n N = n N holds for = 1,. Therefore, N 1 = 90 and N = 30. Let y 1 and y be the mean average monthly eletrial usage in lass I and II, respetively. So, y 1 =, 000 and y = 3, 000. Let Y 1 and Y be the mean average monthly eletrial usage in subpopulation I and II, respetively, then t = N 1 Y 1 + N Y. The reasonable estimate of t should be: ˆt adj = N 1 y 1 + N y = 70, 000. If y is lose to Y for = 1,, then the differene between ˆt adj and t will be small. Assume that y 1 = Y 1 and y = Y, then ˆt adj = t. Compare ˆt with ˆt adj, one may onlude that ˆt underestimates t, and ˆt t is large if: for a fixed r r 1 0, Y Y 1 is large. for a fixed Y Y 1 0, r r 1 is large. If r = r 1 or Y = Y 1, then ˆt = t..3 Weighting Adjustment in Simple Random Sampling In general, suppose that the lassifiation forms C lasses. The set, S, ontaining the identifiation numbers of sampled units an be partitioned so that S = C S. Let U = 4773

Setion on Survey Researh Methods JSM 01 Table 3: The omposition of sample response and nonresponse 1... C Note m m 1 m m C m = C n m n 1 m 1 n m n C m C n m n n 1 n n C n = C m n N N 1 N N C N = C r m 1 n 1 m n m C n C m n N S S. There exist partitions U 1, U,, U C, suh that S U for = 1,,, C. Let A be the set that ontains the identifiation numbers of responding units in lass, so A S for = 1,,, C. Let N, n, and m denote the number of elements in U, S, and A, respetively, and assume that m > 1 for = 1,,, C. So, m = C m, n = C n, and N = C N. The overall response rate and response rate for eah lass an be alulated and ompared by onstruting a table similar to Table 3. Suppose that eah of the n units has the same probability of responding, then m responding units may be interpreted as a simple random sample drawn from the n sampled units. To interpret in terms of weighting, the m units is to represent the n units, so eah of the m units has a weight of n. For = 1,,, C, let y i be the observation of the study having the m identifiation number, i, and let t s = y i. Let t = C t and t = y i, then t i S i U an be estimated by N t s. Sine not every y i where i S is available, t s should be n estimated. Let t a = y i, then ˆt s = n t a. If N is known for = 1,,, C, i A m then ˆt = N ˆt s = N n t a = N y i. n n m m i A It an be regarded as using a simple random sample of size m to represent the population of size N. Therefore, the adjusted estimator of t is ˆt adj = C i A w i y i 5 where w i = N /m, and it an be expressed as w i = w bi w adji where w bi = N /n is alled the base weight in lass and w adji = n /m is alled the nonresponse adjustment weight in lass. In 5, N should be known for all = 1,,, C, and ˆt adj is also alled the post-stratified estimator of t that may also be employed in adjusting under-overage. Generally, ˆt adj should provide a better estimate than ˆt = N i A m y i where nonresponse is totally ignored. To show that ˆt is an unbiased estimator of t, follow the method proposed by Cornfield 4774

Setion on Survey Researh Methods JSM 01 1944 and define: Z i = { 1, i S 0, i S { and A i = 1, i A 0, i A. Thus, E Z i = n /N, E A i = m /n, and E Z i A i = m /N. For = 1,,, C, let u = m, n, N, then it an be shown that E ˆt u = t. Sine the m responding units an be regarded as a simple random sample of size m representing the population of size N for = 1,,, C, then N V ar ˆt u = N 1 m N S m 6 where S = N 1 1 y i t /N. Note that 6 is just the analogous to the i U variane of t seen in Setion.1. Therefore, E ˆt adj u C N = E i A y i u = m C E ˆt C u = t = t, and V ar ˆt adj u C = V ar ˆt u = C V ar ˆt u = C 1 m N S m, where u = m 1,, m C, n 1,, n C, N 1,, N C. In the poststratified estimator, the estimated variane 5 is where s = m 1 1 V ar ˆt adj u = C N 1 m s 7 N m = S. y i t a /m ; E s i A So far, the estimation an be done when N is available for = 1,,, C. However, if this is not the ase, onsider the example of the perfet simple random sample in Example.3 in whih the ratio of eah sub sample size to its orresponding subpopulation size is equal to the ratio of the sample size to the population size. Although this ondition is almost impossible to satisfy in a single sampling ativity, the sampling distribution of n /N should be entered at n/n. That is En /N = n/n and EN /n = N/n. Therefore, t an be estimated by t adj = C i A w iy i 8 where wi = N n and t adj is alled the weighting lass estimator of t. n m Let u = m 1,, m C, n 1,, n C ; therefore, it an be shown that the expeted value of t adj given u, that is E t adj u = t + C n N t t t. The variane n N of t adj given u an be estimated by substituting n n N for N in 7. Thus, V ar t adj u = C N n n 1 n N m s. 9 n m Consider the weighting lass estimator. Oh and Sheuren 1983 proposed that an approximate 1001 α% onfidene interval for t may be onstruted by t adj ± z α/ MSE t adj u 10 4775

Setion on Survey Researh Methods JSM 01 where z α/ is the 1 α/th perentile of the standard normal distribution and MSE t adj u = V ar t adj u [ + Bias t adj u ] C = N n 1 n n N m s + N n n m N 1 3. Stratified Random Sampling ST R C n n N ta t adj. m In simple random sampling, it is possible that some ertain groups of sampling units in the population are underrepresented and hene, will result a bad sample. Stratified random sampling enables one to avoid having a bad sample and to improve the preision of estimates for a fixed sample size or to ahieve the same level of preision with a sample size not as large as in simple random sampling. In stratified random sampling, the population of size N is divided into H subpopulations alled strata. Let stratum h have size N h for h = 1,,, H. The sample is obtained by randomly seleting n h sampling units from stratum h for h = 1,,, H. Therefore, H simple random samples are obtained in the stratified random sampling. Stratified random sampling works well when the variation in the measurement of interest is low within the strata but is high between the strata. Suppose that the total sample size is n so n = H n h. To alloate the sample among the strata may depend on the sample design or some onstraints suh as osts. Disussions of the alloation of sampling units among the strata an be found in Cohran 1977 and Lohr 009. 3.1 Estimation in Stratified Random Sampling Sine a proedure of simple random sampling is implemented in eah stratum, the results of the preeding setions an be applied diretly to eah stratum. Let index set, U = {1,,, N}, denote the population of size N. Let U h denote stratum h of size N h for h = 1,,, H suh that U = H U h, N = H N h, and U h U k = for h k. Let S h denote the simple random sample of size n h drawn form U h for h = 1,,, H and let S = H S h. Therefore the total sample size is n = H n h. Let y hi be the measurement of interest in stratum h, then t h = y hi and t str = H t h are the total in stratum h and the i U h population total, respetively. The unbiased estimator of t h is ˆt h = i Sh w strhi y hi 11 where w strhi = N h /n h, and the variane of ˆt h given n h and N h is V ar ˆt h n h, N h = Nh 1 n h/n h Sh /n h where Sh = N h 1 1 yhi Y h and Y h = t h /N h. i U h The sample variane obtained from stratum h is s h = n h 1 1 y hi y h where i S h y h = n 1 h y hi and E s h = S h. Therefore, i S h V ar ˆt h n h, N h = N h 1 n h s h. 1 N h n h 4776

Setion on Survey Researh Methods JSM 01 Sine the population total is t str = H t h, then it an be estimated by ˆt str = H ˆt h. Thus, ˆt str = H w strhi y hi, and E ˆt str q = H E ˆt H h n h, N h = t h = t str, where i S h q = n 1,, n H, N 1,, N H. The variane of ˆt str given q an be estimated by V ar ˆt str q = H beause V ar ˆt str q H = V ar ˆt h q N h = H 1 n h s h 13 N h n h Nh 3. Nonresponse in Stratified Random Sampling 1 n h S h. N h n h In stratified random sampling, the researher deals with multiple simple random samples. Suppose that there are H strata and a simple random sample of size n h is obtained from eah of the H strata having size N h. Thus, there are H simple random samples. Consider only unit nonresponse and for h = 1,,, H, let m h denote the number of the observations obtained from the n h sampled units. If m h < n h whih is almost always the ase in surveys then nonresponse ours in the sample obtained from stratum h and estimation without adjustments might not produe the best results for making inferenes about the subpopulation as well as the population. 3.3 Weighting Adjustment in Stratified Random Sampling Suppose that a stratified random sampling proedure is arried out and there exists nonresponse. Assume that it is possible to divide the sample seleted from stratum h into C h lasses by some available auxiliary information for h = 1,,, H suh that every sampled units in the same lass has an equal probability of responding. Let set, S h, denote lass having size n h in the sample seleted from stratum h for = 1,,, C h and h = 1,,, H, then S h = C h S h where S h S hk = for h k and n h = C h n h. For h = 1,,, H, let U h1, U h,, and U hch be the substrata of U h and eah of them has size N h1, N h,, and N hch, respetively, so that N h = C h N h and S h U h. Let A h denote the group of responding units in lass of the sample drawn from stratum h and let m h be the size of A h. For h = 1,,, H, r h denotes the response rate in S h. In post-stratified approah, the base weight, w bhi = N h /n h, indiates that one unit in S h is to represent N h /n h units in substratum h. The nonresponse adjustment weight, w adjhi = n h /m h, indiates that one unit in A h is to represent n h /m h units in S h. The final weight, w hi = N h /m h is w hi = w bhi w adjhi. 14 This indiates that one unit in A h represents N h /m h units in substratum h. Using the tehniques in the preeding setions, one an find an unbiased estimator, given u h = m h, n h, N h, for t h, the total of substratum h: ˆt h = i A h w hi y hi. 15 4777

Setion on Survey Researh Methods JSM 01 The variane of ˆt h given u h may be estimated by V ar ˆt h u h = N h 1 m h s h N h m h where s h = 1 y hi t a h and t m h 1 i A h m = ah y hi. Moreover, for h = h i A h 1,,, H, the post-stratified estimator of the total of stratum h is C h ˆt h adj = i A h w hi y hi 16 and E [ ] ˆt h adj u h = th where u h = m h1,, m hch, n h1,, n hch, N h1,, N hch sine E ˆt h u h = th. Thus, for h = 1,,, H, the variane of ˆt h adj given u h [ ] an be estimated by V ar ˆt h adj u h, by summing V ar ˆt h u h from = 1 to C. Therefore, using a post-stratified approah, the population total an be estimated by C h ˆt str adj H = i A h w hi y hi 17 [ ] and it is unbiased given u = u 1 ; u ; ; u H sine E ˆt h adj u h = t h. The variane of ˆt str given u may be estimated by adj [ ˆt V ar adj ] str u = H C h N h 1 m h s h. 18 N h m h Suppose that the N h is not available for = 1,,, C h and h = 1,,, H, then onsider the weighting lass estimator: t h = i A h w hi y hi where w hi = w strhi w adjhi. That is replaing w bhi, = 1,,, C h with w strhi t h adj = C h t h and t str adj = H t h the population total may be estimated by adj = N h n h in 14. Furthermore,. Therefore, using weighting lass approah, C h t str adj H = i A h w hi y hi 19 Let u h [ = m h1,, ] m hch, n h1,, n hch and let u = u 1 ; u ; ; u H. To show that E t str adj u t str, define: Z hi = { 1, i S h 0, i S h { and A hi = 1, i A h 0, i A. h It an be shown that E Z hi = n h, E A hi = m h, E Z hi A hi = m h N h n h [ 1,,, C h and h = 1,,, H, and E t str adj u = t str + H C h n h n h for = N h Nh t h t h N h. 4778

Setion on Survey Researh Methods JSM 01 The variane of t str in 18. That is [ t V ar adj ] str u = adj given u may be estimated by replaing N h with n h n h H C h N h N h nh 1 n h m h s h. 0 n h N h n h m h If proportional alloation is used in the sample design, then n h = n for all h = 1,,, H. N h N In 19, w hi beomes N n w adj hi and 0 an be written as [ t V ar adj ] str u = N n H C h n h 1 n N [ t Oh and Sheuren 1983 proposed MSE adj ] str u as [ t MSE adj ] [ t str u = V ar adj ] str u + { [ ]} and Bias t str adj H u = N h n h N h 1 C h n h n h { Bias 4. Two-Stage Cluster Sampling m h s h. n h m h [ ]}, t str adj u [ N h ta h t ] h. m adj h A multi-stage random sample is onstruted by seleting a sample in at least two stages. Sampling in stages enables one to redue the population and to ombine sampling proedures. Suppose the population an be divided into a number of subpopulations. Contrary to stratifiation, it is expeted that the variane with respet to the measurements of interest is high within eah subpopulation, whereas the variane with respet to the measurements of interest is low between subpopulations. The subpopulations are alled lusters here, and they are usually naturally formed from, for example, ommunities, ity bloks, and shools. In the first stage, a random sample of lusters is obtained, and a number of sub-lusters are formed within eah seleted luster for seletion in the next stage. The subgroups formed for seletion in eah stage prior to the final stage an be alled the population units in general, and the proess of seleting population units within eah population unit obtained in the previous stage an be repeated until the sizes of the population units meet the requirements. In the last stage, a random sample of sampling units is obtained from eah seleted population unit. 4.1 Estimation in Two-Stage Cluster Sampling Suppose that the population onsists of L lusters, then in the first stage, let a simple random sample of l lusters be obtained. Let τ be the population total and let t g be the total of luster g for g = 1,, L. Let index set U = {1,,, L} denote the population, then τ = t g. Let S denote the simple random sample of size l drawn from U, then an g U unbiased estimator of τ is ˆτ = w g t g 1 4779

Setion on Survey Researh Methods JSM 01 where w g = L/l. The variane of ˆτ given l and L an be estimated by V ar ˆτ l, L = L 1 l/l s t where s t = l 1 1 tg t and t = l 1 t g. These estimators l are usable in one-stage luster sampling, but additional work is required in two-stage luster sampling. Sine, in the seond stage, simple random sample g is to be drawn from luster g for all g S, then t g in 1 is not observed but should be estimated. Let U g denote luster g having size N g, and let S g denote simple random sample g having size n g drawn from U g for g S. Let y gi be the measurement of interest obtained from sampling unit i in luster g, then for g S, an unbiased estimator of t g is ˆt g = i Sg w gi y gi where w gi = N g. The variane of ˆt g given n g and N g an be estimated by V ar ˆt g n g, N g n g Combining 1 and, the unbiased estimator of τ for two-stage luster sampling is given by ˆτ = w g w gi y gi. 3 i S g Lohr 009 proved the variane of ˆτ has two omponents: the variability between lusters and the variability of sampling units within eah luster. The variane of ˆτ an be estimated by V ar ˆτ l, L, q = L 1 L l s ˆt l + L l N g 1 n g s g 4 N g n g where q = { n g, N g g S }, s = l 1 1 ˆt ˆt, ˆt g and ˆt = l 1 ˆt g. 4. Nonresponse in Two-Stage Cluster Sampling Nonresponse appears in the seond stage of two-stage luster sampling. If a single luster is seleted in the first stage, then the nonresponse problem in the sample drawn from this luster is idential to that in a regular simple random sampling. However, here one deals with l simple random samples with nonresponse. For g S, suppose that there are m g responding units in simple random sample g, then the response rate for sample g is r g = m g n g. Nonresponse appears if r g < 1. 4.3 Weighting Adjustment in Two-Stage Cluster Sampling Assume that it is possible to divide sample g into C g lasses by some available auxiliary information for all g S so that eah unit in a lass has an equal probability of responding. Let S g denote lass of sample g onsisting of n g units and let A g S g denote a set of m g responding units in lass of sample g for = 1,,, C g within eah g S, C g C g then n g = n g and m g = m g. For g S, let U g1, U g,, and U gcg be the sub-luster of U g having size N g1, N g,, and N gcg, respetively, so that N g = C g N g 4780

Setion on Survey Researh Methods JSM 01 and S g U g. Suppose that N g are known for all g S and = 1,,, C g. The post-stratified estimator for the total of luster g where g S is ˆt g adj = Cg i A g w gi y gi 5 where w gi = w bgi w adjgi, w bgi = N g, and w adjgi = n g. Thus, w gi = N g. The n g m g m g variane of ˆt g adj given u g = m g1,, m gcg, n g1,, n gcg, N g1,, N gcg is estimated by V ar [ ] C g ˆt g adj u g = Ng 1 m g s g where s g = 1 ygi y N g m g m g g i A g and y g = 1 y gi. m g i A g Sine ˆτ = w gˆt g, then ˆτ adj = w g ˆt g. When using post-stratified estimators adj in the seond stage to ompensate for nonresponse, the unbiased estimator of the population total is ˆτ adj = C g w g w gi y gi. 6 i A g The variane of ˆτ adj given u = {u g g S} may be estimated by V ar ˆτ adj l, L, u = L 1 l s ˆt L l + L l C g N g where s ˆt = l 1 1 [ ˆt g adj ˆt adj ], and ˆt adj = l 1 ˆt g 1 m g s g N g m g If using weighting lass adjustment, then one an replae w gi with w gi = N g n g w adjgi in 5. Thus, the weighting lass estimator of t g for g S is t g adj = Cg adj. i A g w giy gi. 7 Let u g =. m g1,, m gcg, n g1,, n gcg Given u g, for g S, the variane of t g adj and squared bias of t g may be estimated by adj and [ V ar t g { [ Bias t g adj u g adj u g ] C g = N g ng 1 n g m g s g n g N g n g m g ]} = N g n g N g 1 C g n g n g [N g y g t g adj], respetively. When using weighting lass estimators in the seond stage to ompensate for nonresponse, the estimator of τ is τ adj = C g w g i A g w giy gi. 8 4781

Setion on Survey Researh Methods JSM 01 Given u = { u g g S }, the variane of τ adj and squared bias of τ adj may be estimated respetively by V ar τ adj l, L, u = L 1 l s t + L L l l and C g [ Bias τ adj l, L, u ] L N g n g = l N g 1 N g ng 1 n g m g s g n g N g n g m g C g where s t = l 1 1 [ t g adj t adj ], and t adj = l 1 t g n g n g [N g y g t g An approximate 100 1 α % onfidene interval for τ may be onstruted by τ adj ± z α/ MSE τ adj l, L, u 9 where MSE τ adj l, L, u = V ar τ adj l, L, u [ + Bias τ adj l, L, u ] and z α/ is the 1 α/th perentile of the standard normal distribution. 5. Conlusion Two very ommon weighting adjustments for unit nonresponse, post-stratifiation and weighting lass methods, are desribed in simple random sampling, stratified random sampling, and two-stage luster sampling. It is expeted that these methods an be used to redue nonresponse bias. It is inappropriate to fully rely on these weighting adjustments beause the assumption of forming a group in whih the nonresponse is ompletely negligible is rarely satisfied in the real world. A researher should always try to avoid low response rates. Furthermore, item nonresponse and unit nonresponse normally appear together; thus, some imputation tehniques may be neessary as omplementary remedies sine filling in missing values by using some known information ould be effiient and useful for obtaining desirable lassifiation in weighting adjustments. adj. adj ] REFERENCES Cohran, W. G. 1977, Sampling Tehniques, third edition. Wiley, New York. Cornfield, J. 1944. On Samples from Finite Populations, Journal of the Amerian Statistial Assoiation, 39: 6, 36-39. Holt, D., and Elliot, D. 1991, Methods of Weighting for Unit Non-Response, Journal of the Royal Statistial Soiety. Series D The Statistiian, 40: 3, 333-34. Holt, D., and Smith, T. M. F. 1979, Post Stratifiation, Journal of the Royal Statistial Soiety. Series A General, 14: 1, 33-46. Little, R. J. A., and Rubin, D. B. 00, Statistial Analysis with Missing Data, seond edition. Wiley, New York. Lohr, S. L. 009, Sampling: Design and Analysis, seond edition. Brooks/Cole, Cengage Learning, Boston, MA. Oh, H. L., and Sheuren, F. J. 1983, Weighting Adjustment for Unit Nonresponse, In W. G. Madow, I. Olkin, and D. B. Rubin Eds., Inomplete Data in Sample Surveys, Volume : Theory and Bibliographies pp. 143-184. Aademi Press, New York. Smith, T. M. F. 1991, Post-Stratifiation Journal of the Royal Statistial Soiety. Series D The Statistiian, 40: 3, 315-33. 478