On stochastic optimization in sample allocation among strata

METRON - International Journal of Statistics
2010, vol. LXVIII, n. 1, pp. 95-103

MARCIN KOZAK, HAI YING WANG

Summary - The usefulness of stochastic optimization for sample allocation in stratified sampling is studied. Three models of stochastic optimization are compared: the E-model, Modified E-model and V-model, recently presented by Díaz-García and Garay-Tápia (Comput. Statistics Data Anal., 51, 3016-3026, 2007), with the classical sample allocation, which distributes the costs among strata in such a way that the variance of an estimator is minimized. To make the comparison, a simulation study was conducted. None of the methods was the most efficient for all cases, but usually the classical allocation was the most efficient, followed by the E-model, which was quite similar to the former.

Key Words - Stratified sampling; Optimization; Cost allocation; Optimum allocation.

1. Introduction

Sample allocation in stratified sampling has been the subject of many studies (see, e.g., Särndal et al. 1992 and citations therein), and it still is. Recently, Díaz-García and Garay-Tápia (2007) have proposed to use stochastic programming to allocate a sample among strata. They rightly noticed that the classical formulation of sample allocation suffers from the unrealistic assumption that an allocation variable is also a survey variable, quite an unlikely situation in practice. For this reason the stratum variances to be used in allocation are unknown and need to be estimated. In practice this is done based on the results of a previous survey or the knowledge of an auxiliary variable, which should be strongly correlated with the survey variable. This problem has also been indicated by many others, including Särndal et al. (1992). In Díaz-García and Garay-Tápia's (2007) paper these unknown variances of the survey variable are replaced with their estimates from a sample, which are random variables.
Taking this into account, the authors presented three stochastic programming models to find the optimum allocation, namely the E-model, Modified E-model and V-model. All these models rely on the unknown stratum variances and, in the case of the last two models, stratum fourth moments. Thus the authors used the results of a previous survey to replace these quantities, as is done in the classical allocation. They showed that stochastic programming provides a different allocation from the one the classical method does; they did not show, however, which of the approaches was better. By better allocation we understand allocation that provides more precise estimates of the parameter of interest.

Therefore, this research aims to choose the best allocation among the following: (1) classical allocation (hereafter also called the C-allocation), (2) allocation based on the E-model (the E-allocation), (3) allocation based on the Modified E-model (the ME-allocation), and (4) allocation based on the V-model (the V-allocation). In this paper we will focus only on the allocation that aims to minimize the variance of the estimator of the population mean subject to fixed survey costs. We believe, however, that the results and conclusions will apply also to the dual problem, in which a survey cost is minimized subject to variance constraints.

Received June 2009 and revised March 2010.

2. Sample allocation

Consider a finite population $U$ of size $N$ subdivided into $H$ non-overlapping strata $U_h$ ($h = 1, \ldots, H$) of sizes $\mathbf{N} = (N_1, \ldots, N_H)^T$; $Y$ is a survey variable. A problem of interest is to allocate a survey cost among the strata so that the variance of the estimator $\hat{\theta}$ of some parameter $\theta$ studied is minimized. Hereafter we will focus on estimation of the population mean, so $\theta = \bar{Y}$. In this case the variance to be minimized is

$$V(\hat{\theta}) = \sum_{h=1}^{H} \frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} \frac{W_h S_h^2}{N}, \qquad (1)$$

where $W_h = N_h/N$, $S_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2$ is the population variance of the study character in the $h$th stratum, $y_{hi}$ is the $Y$-value of the $i$th element in the $h$th stratum, $\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} y_{hi}$, and $\mathbf{n} = (n_1, \ldots, n_H)^T$ is the vector of stratum sample sizes.

In classical allocation the following problem is solved:

$$\min_{\mathbf{n}} \; \sum_{h=1}^{H} \frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} \frac{W_h S_h^2}{N} \quad \text{s.t.} \quad \mathbf{c}^T \mathbf{n} = C - C_0, \quad \mathbf{2} \le \mathbf{n} \le \mathbf{N},$$

where $C$ is the total permissible survey cost, $C_0$ is the fixed survey cost, $\mathbf{c} = (c_1, \ldots, c_H)^T$ is the vector of costs of selecting one unit from the strata, and $\mathbf{2}$ is the $H$-vector of twos.

Díaz-García and Garay-Tápia (2007) introduced three models of stochastic programming for sample allocation. They can be presented by the following general model:

$$\min_{\mathbf{n}} \; k_1 \left[ \sum_{h=1}^{H} \frac{W_h^2 S_h^2}{n_h - 1} - \sum_{h=1}^{H} \frac{W_h n_h S_h^2}{N (n_h - 1)} \right] + k_2 \left[ \sum_{h=1}^{H} \frac{W_h^4}{n_h (n_h - 1)^2} \left( C_{y_h}^4 - S_h^4 \right) - \sum_{h=1}^{H} \frac{W_h^2 n_h}{N^2 (n_h - 1)^2} \left( C_{y_h}^4 - S_h^4 \right) \right]^{1/2}$$

$$\text{s.t.} \quad \mathbf{c}^T \mathbf{n} = C - C_0, \quad \mathbf{2} \le \mathbf{n} \le \mathbf{N},$$

where $C_{y_h}^4 = \frac{1}{N_h} \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^4$, $h = 1, \ldots, H$, is the fourth moment of $Y$ in the $h$th stratum, and $k_1$ and $k_2$ are nonnegative constants, the sum of which equals 1. For $k_1 = 1$ and $k_2 = 0$, the model becomes the E-model, for $k_1 = 0$ and $k_2 = 1$ the V-model, and for other choices the Modified E-model. As Díaz-García and Garay-Tápia (2007) did, in this paper we consider the Modified E-model with $k_1 = k_2 = 0.5$.

To arrive at these models, the asymptotic distributions of the sample variances $s_h^2$ are used, for which both $n_h$ and $N_h - n_h$ need to go to infinity. Thus the E-model is very similar to the classical model for large populations. If the stratum variances are known, the classical allocation will evidently be the best one because it will reach the minimum variance of the estimator. Nonetheless, since the stratum variances are unknown and their estimates are used instead, we are not able to say which method is the best.

Díaz-García and Garay-Tápia (2007) considered Lagrangian-based allocation, in which the objective function is minimized subject to $\mathbf{c}^T \mathbf{n} = C - C_0$ via the Lagrange multipliers. The values obtained are rounded to integers, an operation that may provide a non-optimum allocation. To overcome this difficulty the authors employed nonlinear integer programming. To allocate a sample, in this paper we will use the random search algorithm, recently shown to be very efficient for the optimum sample allocation (Kozak 2006).
Note, however, that for the purpose of this paper it does not matter which optimization procedure one chooses, provided that it yields optimality for the objective function subject to the specified constraints. Our interest lies not in an allocation procedure itself, but in how an allocation problem is stated, namely via the classical approach or via stochastic programming.
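The objective functions of this section can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions: the function names are hypothetical, the E-model objective is written as the general model with k_1 = 1 and k_2 = 0, and the classical allocation is the continuous Lagrangian solution (n_h proportional to W_h S_h / sqrt(c_h)), which ignores the bounds 2 <= n_h <= N_h and integrality. It is not the random search algorithm used in the paper. The input values are those reported for Population 2, H = 3 in Table 1.

```python
import math

def variance_eq1(n, W, S2, N):
    """Variance of the estimator of the population mean, Eq. (1)."""
    first = sum(w * w * s2 / nh for w, s2, nh in zip(W, S2, n))
    second = sum(w * s2 for w, s2 in zip(W, S2)) / N
    return first - second

def e_model_objective(n, W, S2, N):
    """General stochastic model with k1 = 1, k2 = 0 (E-model objective)."""
    return sum((w * w / (nh - 1) - w * nh / (N * (nh - 1))) * s2
               for w, s2, nh in zip(W, S2, n))

def continuous_classical_allocation(W, S2, c, budget):
    """Lagrangian solution: n_h proportional to W_h * S_h / sqrt(c_h).

    Assumption: bounds 2 <= n_h <= N_h and integer rounding are ignored.
    """
    denom = sum(w * math.sqrt(s2 * ch) for w, s2, ch in zip(W, S2, c))
    return [budget * w * math.sqrt(s2 / ch) / denom
            for w, s2, ch in zip(W, S2, c)]

# Population 2, H = 3 (stratum sizes and variances as given in Table 1).
N = 5000
sizes = [1000, 3000, 1000]
S2 = [8.72e-5, 0.0468, 0.6773]
W = [Nh / N for Nh in sizes]
c = [1.0, 1.0, 1.0]      # equal unit costs, as in the simulation study
budget = 0.1 * N          # C - C_0, here simply the total sample size

n_classic = continuous_classical_allocation(W, S2, c, budget)
n_prop = [0.1 * Nh for Nh in sizes]   # proportional allocation, for contrast

v_classic = variance_eq1(n_classic, W, S2, N)
v_prop = variance_eq1(n_prop, W, S2, N)
e_classic = e_model_objective(n_classic, W, S2, N)
print(n_classic, v_classic, v_prop, e_classic)
```

With known stratum variances, the classical allocation gives a smaller value of Eq. (1) than proportional allocation, which is the sense in which it is the best allocation when the variances are not estimated.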

3. Simulation study

To compare the four allocations, we conducted a simulation experiment designed as follows. Two populations (Y variables), one with N = 1000 and the other with N = 5000, were generated according to a gamma distribution with shape parameter 0.5. They were then sorted and stratified into H = 3 and H = 5 strata as given in Table 1, which also presents information about other parameters of the generated populations.

Table 1: Description of the two artificial populations studied.

  h    N_h     S_h^2            C_{y_h}^4
  Population 1 (N = 1000), H = 3
  1    250     2.54 × 10^-4     1.27 × 10^-7
  2    500     0.02653          1.49 × 10^-3
  3    250     0.79064          6.3019
  Population 1 (N = 1000), H = 5
  1    50      2.33 × 10^-7     9.86 × 10^-14
  2    300     7.32 × 10^-4     1.03 × 10^-6
  3    300     8.62 × 10^-3     1.45 × 10^-4
  4    300     0.1509           0.0582
  5    50      1.0445           5.7016
  Population 2 (N = 5000), H = 3
  1    1000    8.72 × 10^-5     1.56 × 10^-8
  2    3000    0.0468           5.45 × 10^-3
  3    1000    0.6773           6.3181
  Population 2 (N = 5000), H = 5
  1    500     6.18 × 10^-6     7.79 × 10^-11
  2    1500    1.34 × 10^-3     3.67 × 10^-6
  3    1500    0.0126           3.14 × 10^-4
  4    1000    0.0554           6.11 × 10^-3
  5    500     0.7156           7.5046

To simulate the randomness of the variances used in the allocation procedures, for each such population and number of strata, 100,000 stratified samples were taken using simple random sampling without replacement. The overall sample size was n = 0.1N and stratum sample sizes were proportionally distributed among strata. Based on each such sample, $S_h^2$ and $C_{y_h}^4$ were estimated and then used to allocate among strata the sample of size n = 0.1N using the four allocations compared. The costs of selecting one unit were assumed to be the same in each stratum and equal to 1. Then for each allocation the variance of the estimator was calculated using Eq. (1) based on the known population values of $S_h^2$. These variances were compared; of two allocations, the one with the smaller variance was the more efficient.

The following notes explain why this simulation design is correct. Denote the optimal $(n_1, \ldots, n_H)^T$ by $\mathbf{n}$. The true variance of the estimator in the general case is

$$V(\hat{\theta}) = V(\bar{y}) = E\left(V(\bar{y} \mid \mathbf{n})\right) + V\left(E(\bar{y} \mid \mathbf{n})\right) = E\left( \sum_{h=1}^{H} \frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} \frac{W_h S_h^2}{N} \right) + V(\bar{Y}) = E\left( \sum_{h=1}^{H} \frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} \frac{W_h S_h^2}{N} \right),$$

where the last equality holds because $\bar{Y}$ is a constant, so $V(\bar{Y}) = 0$. Note, then, that this equation differs from Eq. (1) in the expectation only. Eq. (1) is in fact $V(\bar{y} \mid \mathbf{n}) = V(\bar{y} \mid \text{sample})$ (because the optimal $\mathbf{n}$ is determined by the sample drawn). So for a particular sample this can be seen as a Monte Carlo estimate of the true variance based on this one sample. In our simulation we have 100,000 samples, so we have 100,000 Monte Carlo values of $V(\bar{y} \mid \mathbf{n})$ for a particular allocation method. We can then treat the mean of these 100,000 values of $V(\bar{y} \mid \mathbf{n})$ as the estimate of the true variance of the estimator for this particular allocation method. Note that these $V(\bar{y} \mid \mathbf{n})$ are i.i.d., so by the law of large numbers the mean will converge to the expectation, that is, the true variance of the estimator. Since we have 100,000 samples, this estimate of the mean is very close to the true value. Note also that the expressions for the true variances of the estimators are all the same. That is to say, under both the E and V models, the variance is still

$$E\left( \sum_{h=1}^{H} \frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} \frac{W_h S_h^2}{N} \right).$$

Therefore, although we do not compare true variances of the estimators, we compare their Monte Carlo estimates, which are obtained based on the known true stratum variances $S_h^2$. For a particular iteration, a sample is drawn to estimate these variances $S_h^2$ in order to derive the optimum sample sizes for each allocation; based on these random sample sizes we estimate the true variances of the estimator for each allocation (this time we use the known stratum variances $S_h^2$). As discussed above, these are Monte Carlo estimates of the true variances, so they are comparable among the allocations. The allocation with the smallest variance estimated in that way will be the best for a particular sample. Hence finding which allocation is the best in most situations in these terms may be thought of as finding the most efficient allocation.

In this simulation, it may seem that we use the same values of the survey variable in the first and second survey. But note that we only need the stratum variances in the second survey, and in real life the variances can be assumed more or less stable between a previous survey and the new survey, so it is reasonable to use the same populations.

Tables 2, 3 and 4 summarize the results of the experiment. Table 2 shows that the C-allocation and E-allocation were on average comparably efficient and were the two most efficient allocations, although the former reached the optimum allocation most often. Moreover, from Tables 3 and 4 it follows that the C-allocation was in most cases more efficient than the E-allocation. Note, however, that there were cases in which these allocations were less efficient than the ME-allocation and V-allocation. Still, it was the C-allocation and E-allocation that were the most efficient both on average and most often. Note also that the results of the ME-allocation and V-allocation were visibly less precise than those of the other two allocations. In addition, for many cases, especially when the populations were stratified into three strata, the ME-allocation and V-allocation provided the same stratum sample sizes and the same variances, showing a close similarity between the two allocations.

Interestingly, for N = 1000 and H = 5, the variability in the variances obtained via all the allocations was very large, especially compared to the other situations studied (note the coefficients of variation and the maximum variances obtained).
This resulted from the specificity of the population and its stratification; the variation came mainly from the very small ($N_5 = 50$) fifth stratum. Under the proportional allocation rule used to estimate the variance and fourth moment of the survey variable, we drew only five units from this stratum, so the estimates were not stable. As a result, the number of units each method allocated to this stratum varied dramatically among the 100,000 samples. Unfortunately, the variances of the estimator were strongly influenced by this stratum's variance. Having said that, it is worth noting that for this situation the same pattern in the performance of the allocation methods was obtained as for the other situations.
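The Monte Carlo procedure described in this section can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the gamma scale parameter is set to 1, only the classical allocation is evaluated, the allocation is a continuous bounded Lagrangian-type heuristic (strata violating 2 <= n_h <= N_h are fixed at the bound and the rest re-allocated) rather than the random search algorithm, each variance is divided by the variance obtained when the true stratum variances drive the allocation, and only 200 replications are run instead of 100,000. All function names are hypothetical.

```python
import math
import random
import statistics

def eq1_variance(n, W, S2, N):
    """Eq. (1), evaluated with the known stratum variances S2."""
    return (sum(w * w * s2 / nh for w, s2, nh in zip(W, S2, n))
            - sum(w * s2 for w, s2 in zip(W, S2)) / N)

def classical_allocation(W, S2, sizes, n_total):
    """Continuous allocation, equal unit costs, bounds 2 <= n_h <= N_h."""
    fixed = {}
    while True:
        free = [h for h in range(len(W)) if h not in fixed]
        budget = n_total - sum(fixed.values())
        denom = sum(W[h] * math.sqrt(S2[h]) for h in free)
        if denom > 0:
            alloc = {h: budget * W[h] * math.sqrt(S2[h]) / denom for h in free}
        else:  # degenerate case: no variability left, split equally
            alloc = {h: budget / len(free) for h in free}
        # Strata outside the bounds are fixed at the nearest bound.
        out = {h: min(max(a, 2.0), float(sizes[h]))
               for h, a in alloc.items() if a < 2.0 or a > sizes[h]}
        if not out:
            alloc.update(fixed)
            return [alloc[h] for h in range(len(W))]
        fixed.update(out)

def simulate(N=1000, sizes=(250, 500, 250), reps=200, seed=1):
    rng = random.Random(seed)
    # Gamma population with shape 0.5 (scale 1 is an assumption), sorted.
    pop = sorted(rng.gammavariate(0.5, 1.0) for _ in range(N))
    strata, start = [], 0
    for Nh in sizes:
        strata.append(pop[start:start + Nh])
        start += Nh
    W = [Nh / N for Nh in sizes]
    S2_true = [statistics.variance(s) for s in strata]
    n_total = N // 10
    n_first = [Nh // 10 for Nh in sizes]  # proportional sample, n = 0.1N
    v_ref = eq1_variance(
        classical_allocation(W, S2_true, sizes, n_total), W, S2_true, N)
    ratios = []
    for _ in range(reps):
        # Estimate stratum variances from a stratified SRSWOR sample.
        S2_hat = [statistics.variance(rng.sample(s, nh))
                  for s, nh in zip(strata, n_first)]
        n_hat = classical_allocation(W, S2_hat, sizes, n_total)
        # Evaluate the resulting allocation with the TRUE variances.
        ratios.append(eq1_variance(n_hat, W, S2_true, N) / v_ref)
    return statistics.mean(ratios)

mean_ratio = simulate()
print(mean_ratio)
```

The mean of the ratios plays the role of the Monte Carlo estimate discussed above: each replication yields one value of V(ȳ | n) computed from the known stratum variances, and averaging over replications approximates the true variance of the estimator for the allocation method considered.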

Table 2: Summary statistics of the simulation results obtained for the two populations studied via the four allocations compared. Mean, median and maximum values are given for variances divided by the optimum (minimum) variance.

                       Classic    E-model    ME-model    V-model
  Population 1, H = 3
  Mean                 1.027      1.030      1.055       1.055
  Median               1.014      1.014      1.032       1.032
  Reached min (1)      6207       5520       2633        2633
  Max                  1.553      1.593      1.635       1.635
  CV (2)               0.037      0.039      0.058       0.058
  Population 1, H = 5
  Mean                 1.149      1.129      1.565       1.565
  Median               1.052      1.046      1.098       1.098
  Reached min          772        270        425         424
  Max                  4.502      4.510      5.211       5.198
  CV                   0.274      0.211      0.588       0.589
  Population 2, H = 3
  Mean                 1.009      1.009      1.063       1.063
  Median               1.004      1.005      1.029       1.029
  Reached min          1752       2          107         107
  Max                  1.157      1.157      1.371       1.371
  CV                   0.011      0.011      0.076       0.076
  Population 2, H = 5
  Mean                 1.022      1.022      1.089       1.089
  Median               1.011      1.011      1.039       1.039
  Reached min          28         31         23          24
  Max                  1.335      1.335      1.512       1.512
  CV                   0.026      0.026      0.104       0.104

(1) Number of times the corresponding allocation reached the minimum variance of the estimator.
(2) Coefficient of variation.

Table 3: Results of the simulation study for N = 1000 and H = 3 (below diagonal) and H = 5 (above diagonal). The number of cases is given for which the column allocation was more efficient (i.e., had smaller variance) than the corresponding row allocation; in brackets, the number of cases is given for which the column and row allocations provided the same results.

              Classic          E-model          ME-model         V-model
  Classic     -                46917 (3163)     27800 (1262)     27805 (1263)
  E-model     48823 (30098)    -                27100 (1306)     27104 (1300)
  ME-model    64718 (2139)     59450 (1343)     -                3085 (93773)
  V-model     64718 (2139)     59450 (1343)     0 (100000)       -

Table 4: Results of the simulation study for N = 5000 and H = 3 (below diagonal) and H = 5 (above diagonal). The number of cases is given for which the column allocation was more efficient (i.e., had smaller variance) than the corresponding row allocation; in brackets, the number of cases is given for which the column and row allocations provided the same results.

              Classic          E-model          ME-model         V-model
  Classic     -                38465 (10462)    30706 (3)        30711 (3)
  E-model     88172 (1878)     -                31217 (1)        31192 (1)
  ME-model    75477 (45)       74680 (23)       -                16042 (68122)
  V-model     75477 (45)       74680 (23)       1 (99998)        -

4. Conclusion

From the experiment it follows that among the four allocation methods compared, none could be considered the best one for every situation. Nonetheless, the C-allocation was most often the best, followed by the E-allocation, a result that should not be surprising given the similarity between them. The V-allocation and ME-allocation, the performance of which was very similar, were usually the worst.

The results do not mean that stochastic programming should be discounted in further research on sample allocation among strata. It is possible that for some other situations the allocation based on one of the models of stochastic programming (maybe a model not considered in this paper) might turn out to be more efficient than the classical allocation.
In other words, there may be some other equivalent deterministic problems that will work more efficiently than the classical method. Note that stochastic optimization offers the opportunity to find many different deterministic problems in sample allocation. Take the Modified E-model, for example. Intuitively, the E-model focuses on minimizing the variance of the estimator of the population mean, while the V-model focuses on minimizing the variability of this variance. The Modified E-model attempts to strike a happy medium, minimizing a linear combination of the variance (the E-model's focus) and the variance's variance (the V-model's focus). Only the case $k_1 = k_2 = 0.5$ was considered in this paper, but better $k_1$ and $k_2$ may exist. With their optimal values, the Modified E-model may work more efficiently than both the E-model and V-model. Nonetheless, studying how to search for the optimal Modified E-model or other efficient deterministic problems is beyond the scope of this work. From our results it follows that at the moment we cannot claim that stochastic programming offers an allocation that would perform better than the classical allocation method.

REFERENCES

Díaz-García, J. D. and Garay-Tápia, M. M. (2007) Optimum allocation in stratified surveys: Stochastic programming, Computational Statistics & Data Analysis, 51, 3016-3026.
Kozak, M. (2006) Multivariate sample allocation: application of random search method, Statistics in Transition, 7 (4), 889-900.
Särndal, C. E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling, Springer-Verlag.

MARCIN KOZAK
Department of Experimental Design and Bioinformatics
Faculty of Agriculture and Biology
Warsaw University of Life Sciences
Nowoursynowska 159
02-776 Warsaw, Poland
nyggus@gmail.com

HAI YING WANG
Academy of Mathematics and Systems Sciences
Chinese Academy of Sciences
China