On stochastic optimization in sample allocation among strata

Similar documents
Geometric Stratification of Accounting Data

The EOQ Inventory Formula

Catalogue no XIE. Survey Methodology. December 2004

SAMPLE DESIGN FOR THE TERRORISM RISK INSURANCE PROGRAM SURVEY

Derivatives Math 120 Calculus I D Joyce, Fall 2013

Verifying Numerical Convergence Rates

FINITE DIFFERENCE METHODS

Instantaneous Rate of Change:

M(0) = 1 M(1) = 2 M(h) = M(h 1) + M(h 2) + 1 (h > 1)

ACT Math Facts & Formulas

How To Ensure That An Eac Edge Program Is Successful

College Planning Using Cash Value Life Insurance

Note nine: Linear programming CSE Linear constraints and objective functions. 1.1 Introductory example. Copyright c Sanjoy Dasgupta 1

Unemployment insurance/severance payments and informality in developing countries

The modelling of business rules for dashboard reporting using mutual information

- 1 - Handout #22 May 23, 2012 Huffman Encoding and Data Compression. CS106B Spring Handout by Julie Zelenski with minor edits by Keith Schwarz

Projective Geometry. Projective Geometry

Math 113 HW #5 Solutions

Section 3.3. Differentiation of Polynomials and Rational Functions. Difference Equations to Differential Equations

An inquiry into the multiplier process in IS-LM model

2.23 Gambling Rehabilitation Services. Introduction

ANALYTICAL REPORT ON THE 2010 URBAN EMPLOYMENT UNEMPLOYMENT SURVEY

In other words the graph of the polynomial should pass through the points


SAT Subject Math Level 1 Facts & Formulas

Optimized Data Indexing Algorithms for OLAP Systems

Determine the perimeter of a triangle using algebra Find the area of a triangle using the formula

Can a Lump-Sum Transfer Make Everyone Enjoy the Gains. from Free Trade?

1.6. Analyse Optimum Volume and Surface Area. Maximum Volume for a Given Surface Area. Example 1. Solution

Lecture 10: What is a Function, definition, piecewise defined functions, difference quotient, domain of a function

Research on the Anti-perspective Correction Algorithm of QR Barcode

THE NEISS SAMPLE (DESIGN AND IMPLEMENTATION) 1997 to Present. Prepared for public release by:

Guide to Cover Letters & Thank You Letters

Training Robust Support Vector Regression via D. C. Program


OPTIMAL FLEET SELECTION FOR EARTHMOVING OPERATIONS

Math Test Sections. The College Board: Expanding College Opportunity

A system to monitor the quality of automated coding of textual answers to open questions

Comparison between two approaches to overload control in a Real Server: local or hybrid solutions?

Writing Mathematics Papers

f(a + h) f(a) f (a) = lim

MATHEMATICS FOR ENGINEERING DIFFERENTIATION TUTORIAL 1 - BASIC DIFFERENTIATION

A strong credit score can help you score a lower rate on a mortgage

Area-Specific Recreation Use Estimation Using the National Visitor Use Monitoring Program Data

SWITCH T F T F SELECT. (b) local schedule of two branches. (a) if-then-else construct A & B MUX. one iteration cycle

The Derivative as a Function

ON LOCAL LIKELIHOOD DENSITY ESTIMATION WHEN THE BANDWIDTH IS LARGE

Distances in random graphs with infinite mean degrees

What is Advanced Corporate Finance? What is finance? What is Corporate Finance? Deciding how to optimally manage a firm s assets and liabilities.

Chapter 10: Refrigeration Cycles

Cyber Epidemic Models with Dependences

Shell and Tube Heat Exchanger

Theoretical calculation of the heat capacity

Strategic trading in a dynamic noisy market. Dimitri Vayanos

2.12 Student Transportation. Introduction

Notes: Most of the material in this chapter is taken from Young and Freedman, Chap. 12.

Model Quality Report in Business Statistics

CHAPTER 8: DIFFERENTIAL CALCULUS

Schedulability Analysis under Graph Routing in WirelessHART Networks

f(x + h) f(x) h as representing the slope of a secant line. As h goes to 0, the slope of the secant line approaches the slope of the tangent line.

Staffing and routing in a two-tier call centre. Sameer Hasija*, Edieal J. Pinker and Robert A. Shumsky

To motivate the notion of a variogram for a covariance stationary process, { Ys ( ): s R}

Binary Search Trees. Adnan Aziz. Heaps can perform extract-max, insert efficiently O(log n) worst case

Nonlinear Programming Methods.S2 Quadratic Programming

2 Limits and Derivatives

Bonferroni-Based Size-Correction for Nonstandard Testing Problems

Predicting the behavior of interacting humans by fusing data from multiple sources

THE ROLE OF LABOUR DEMAND ELASTICITIES IN TAX INCIDENCE ANALYSIS WITH HETEROGENEOUS LABOUR

Environmental Policy and Competitiveness: The Porter Hypothesis and the Composition of Capital 1

13 PERIMETER AND AREA OF 2D SHAPES

SAT Math Must-Know Facts & Formulas

Research on Risk Assessment of PFI Projects Based on Grid-fuzzy Borda Number

Average and Instantaneous Rates of Change: The Derivative

SHAPE: A NEW BUSINESS ANALYTICS WEB PLATFORM FOR GETTING INSIGHTS ON ELECTRICAL LOAD PATTERNS

Welfare, financial innovation and self insurance in dynamic incomplete markets models

New Vocabulary volume

Factoring Synchronous Grammars By Sorting

Chapter 7 Numerical Differentiation and Integration

Multivariate time series analysis: Some essential notions

Practical Guide to the Simplex Method of Linear Programming

SAT Math Facts & Formulas

A survey of Quality Engineering-Management journals by bibliometric indicators

Government Debt and Optimal Monetary and Fiscal Policy

TRADING AWAY WIDE BRANDS FOR CHEAP BRANDS. Swati Dhingra London School of Economics and CEP. Online Appendix

Heat Exchangers. Heat Exchanger Types. Heat Exchanger Types. Applied Heat Transfer Part Two. Topics of This chapter

Tangent Lines and Rates of Change

The Dynamics of Movie Purchase and Rental Decisions: Customer Relationship Implications to Movie Studios

Pioneer Fund Story. Searching for Value Today and Tomorrow. Pioneer Funds Equities

100 Austrian Journal of Statistics, Vol. 32 (2003), No. 1&2,

Qualitative Possibilistic Mixed-Observable MDPs

The Demand for Food Away From Home Full-Service or Fast Food?

Torchmark Corporation 2001 Third Avenue South Birmingham, Alabama Contact: Joyce Lane NYSE Symbol: TMK

CHAPTER 7. Di erentiation

Pressure. Pressure. Atmospheric pressure. Conceptual example 1: Blood pressure. Pressure is force per unit area:

Tis Problem and Retail Inventory Management

Artificial Neural Networks for Time Series Prediction - a novel Approach to Inventory Management using Asymmetric Cost Functions

Equilibria in sequential bargaining games as solutions to systems of equations

a joint initiative of Cost of Production Calculator

Article. Variance inflation factors in the analysis of complex survey data. by Dan Liao and Richard Valliant

A Model of Optimum Tariff in Vehicle Fleet Insurance

Transcription:

METRON - International Journal of Statistics 2010, vol. LXVIII, n. 1, pp. 95-103 MARCIN KOZAK HAI YING WANG On stocastic optimization in sample allocation among strata Summary - Te usefulness of stocastic optimization for sample allocation in stratified sampling is studied. Tree models of stocastic optimization are compared: E-Model, Modified E-model and V-model, recently presented by Díaz-García and Garay-Tápia (Comput. Statistics Data Anal., 3016 3026, 51, 2007), wit te classical sample allocation, wic distributes te costs among strata in suc a way tat te variance of an estimator is minimized. To make te comparison, a simulation study was conducted. None of te metods was te most efficient for all cases, but usually te classical allocation was te most efficient, followed by te E-model, quite similar to te former. Key Words - Stratified sampling; Optimization; Cost allocation; Optimum allocation. 1. Introduction Sample allocation in stratified sampling as been te subject of many studies (see, e.g., Särndal et al. 1992 and citations terein), and it still is. Recently, Díaz-García and Garay-Tápia (2007) ave proposed to use stocastic programming to allocate a sample among strata. Tey rigtly noticed tat te classical formulation of sample allocation suffers from an unrealistic assumption tat an allocation variable is also a survey variable, quite an unlikely situation in practice. For tis reason te stratum variances to be used in allocation are unknown and need to be estimated. In practice it is done based on te results of a previous survey or te knowledge of an auxiliary variable, wic sould be strongly correlated wit te survey variable. Tis problem as also been indicated by many oters, including Särndal et al. (1992). In Díaz-García and Garay-Tápia s (2007) paper tese unknown variances of te survey variable are replaced wit teir estimates from a sample, wic are random variables. Taking tis into account, te autors presented tree stocastic programming models to find te optimum allocation, namely te E- Received June, 2009 and revised Marc, 2010.

96 MARCIN KOZAK HAI YING WANG model, Modified E-model and V-model. All tese models rely on te unknown stratum variances and, in te case of te last two models, stratum fourt moments. Tus te autors used te results of a previous survey to replace tese quantities, as is done in te classical allocation. Tey sowed tat stocastic programming provides different allocation from te one te classical metod does; tey did not sow, owever, wic of te approaces was better. By better allocation we understand allocation tat provides more precise estimates of te parameter of interest. Terefore, tis researc aims to coose te best allocation among te following: (i) classical allocation (ereafter also called te C-allocation), (2) allocation based on te E-model (te E-allocation), (2) allocation based on te Modified E-model (te ME-allocation), and (4) allocation based on te V- model (te V-allocation). In tis paper we will focus only on te allocation tat aims to minimize variance of te estimator of te population mean subject to fixed survey costs. We believe, owever, tat te results and conclusions will refer also to te dual problem, in wic a survey cost is minimized subject to variance constraints. 2. Sample allocation Consider a finite population U of size N subdivided into H non-overlapping strata U of sizes N = (N 1,..., N H ) T ; Y is a survey variable. A problem of interest is to allocate a survey cost among te strata so tat te variance of te estimator ˆθ of some parameter θ studied is minimized. Hereafter we will focus on estimation of te population mean, so θ = Ȳ. In tis case te variance to be minimized is ( H W V ˆθ) 2 H = S2 W S 2 n N, (1) N ( y i Ȳ ) 2 were W = N /N, S 2 = 1 N 1 is te population variance of i=1 te study caracter in te t stratum, y i is te Y -value of te it element in N te t stratum, and Ȳ = y i. i=1 In classical allocation te following problem is solved: min n H W 2 S2 n H s.t. c T n = C C 0, 2 n N, W S 2 N

On stocastic optimization in sample allocation among strata 97 were C is te total permissible survey cost, C 0 is te fixed survey cost, c = (c 1,..., c H ) T is te vector of costs of selecting one unit from te strata, and 2 is te H-vector of twos. Díaz-García and Garay-Tápia (2007) introduced tree models of stocastic programming for sample allocation. Tey can be presented by te following general model [ H W 2 min k S2 H ] 1 n n 1 W S 2 n N n 1 + k 2 [ H were C 4 y = 1 N W 4 ( ) n (n 1) 2 C 4 y S4 N i=1 H s.t. c T n = C C 0, 2 n N, W 2 n ( ) ] 1 / 2 N 2 (n 1) 2 C 4 y S4 ( y i Ȳ ) 4, = 1,..., N, is te fourt moment of Y in te t stratum, and k 1 and k 2 are nonnegative constants, te sum of wic equals 1. For k 1 =1 and k 2 =0, te model becomes te E-model, for k 1 =0 and k 2 =1 te V-model, and for oter coices te Modified E-model. As Díaz-García and Garay-Tápia (2007) did, in tis paper we consider te Modified E-model wit k 1 = k 2 = 0.5. To arrive at tese models, te asymptotic distributions of sample variances s 2 are used, for wic bot n and N n need to go to infinity. Tus te E-model is very similar to te C-model, for large populations. If te stratum variances are known, apparently te classical allocation will be te best one because it will reac te minimum variance of te estimator. Noneteless, since te stratum variances are unknown and teir estimates are used instead, we are not able to say wic metod is te best. Díaz-García and Garay-Tápia (2007) considered Lagrangian-based allocation, in wic te objective function is minimized subject to c T n = C C 0 via te Lagrange multipliers. Te values obtained are rounded to integers, an operation tat may provide non-optimum allocation. To overcome tis difficulty te autors employed nonlinear integer programming. To allocate a sample, in tis paper we will use te random searc algoritm, recently proved very efficient for te optimum sample allocation (Kozak 2006). Note, owever, tat for te purpose of tis paper it does not make difference wic optimization procedure one cooses, provided tat it yields optimality for te objective function subject to specified constraints. Our interest lies not in an allocation procedure itself, but in ow an allocation problem is stated, namely via te classical approac or te stocastic programming.

98 MARCIN KOZAK HAI YING WANG 3. Simulation study To compare te four allocations, we conducted a simulation experiment designed as follows. Two populations (Y variables), one wit N = 1000 and te oter wit N = 5000, were generated according to a gamma distribution wit sape parameter 0.5. Tey were ten sorted and stratified into H = 3 and 5 strata as given in Table 1, wic also presents information about oter parameters of te generated populations. Table 1: Description of te two artificial populations studied. N S 2 C 4 Population 1 (N = 1000), H =3 1 250 2.54 10 4 1.27 10 7 2 500 0.02653 1.49 10 3 3 250 0.79064 6.3019 Population 1 (N = 1000), H =5 1 50 2.33 10 7 9.86 10 14 2 300 7.32 10 4 1.03 10 6 3 300 8.62 10 3 1.45 10 4 4 300 0.1509 0.0582 5 50 1.0445 5.7016 Population 2 (N = 5000), H =3 1 1000 8.72 10 5 1.56 10 8 2 3000 0.0468 5.45 10 3 3 1000 0.6773 6.3181 Population 2 (N = 5000), H =5 1 500 6.18 10 6 7.79 10 11 2 1500 1.34 10 3 3.67 10 6 3 1500 0.0126 3.14 10 4 4 1000 0.0554 6.11 10 3 5 500 0.7156 7.5046 To simulate te randomness of te variances used in te allocation procedures, for eac suc population and number of strata, 100,000 stratified samples were taken using simple random sampling witout replacement. Te overall sample

On stocastic optimization in sample allocation among strata 99 size was n = 0.1N and stratum sample sizes were proportionally distributed among strata. Based on eac suc sample, S 2 and C 4 were estimated and ten used to allocate among strata te sample of size n = 0.1N using te four allocations compared. Stratum costs of selecting one unit in eac stratum were assumed to be te same and equal to 1. Ten for eac allocation te variance of te estimator was calculated using Eq. (1) based on known population values of S 2. Tese variances were compared, and between two allocations te one was more efficient for wic te variance was smaller. Te following notes need to be made to make tis simulation study correct. Denote te optimal (n 1,..., n H ) T by n. Te true variance of te estimator in te general case is ( V ˆθ) = V (ȳ) = E (V (ȳ n)) + V (E (ȳ n)) ( H ) W 2 H = E S2 W S 2 ) + V (Ȳ n N = E ( H W 2 S2 n H W S 2 N Note, ten, tat te difference of tis equation from equation (1) lies in te expectation only. Equation (1) is in fact V (ȳ n) = V (ȳ sample) (because te optimal n is determined by te sample drawn). So for a particular sample tis can be seen as a Monte Carlo estimate of te true variance based on tis one sample. In our simulation, we ave 100,000 samples, so we ave 100,000 Monte Carlo values of V (ȳ n) for a particular allocation metod. Ten we can treat te mean of tese 100,000 values of V (ȳ n) as te estimate of te true variance of te estimator, also for tis particular allocation metod. Note tat tese V (ȳ n) are i.i.d., so by te law of large numbers te mean will converge to te expectation, tat is, te true variance of te estimator. Since we ave 100,000 samples, tis estimate of te mean is very close to te true value. Note also tat te expressions of te true variances of te estimators are all te same. Tat is to say, under bot E and V models, te variance is still ( H ) W 2 H E S2 W S 2. n N Terefore, altoug we do not compare true variances of te estimators, we compare teir Monte Carlo estimates tat are obtained based on known true stratum variances S 2. For a particular iteration, a sample is drawn to estimate tese variances S 2 in order to derive te optimum sample sizes for eac allocation; based on tese random sample sizes we estimate te true variances ).

100 MARCIN KOZAK HAI YING WANG of te estimator for eac allocation (tis time we use known stratum variances S 2 ). As we discussed above, tese are Monte Carlo estimates of te true variances, so tey are comparable among te allocations. Te allocation wit te smallest variance estimated in tat way will be te best for a particular sample. Hence finding wic allocation is te best in most situations in tese terms may be tougt of as finding te most efficient allocation. In tis simulation, it seems tat we use te same values of survey variable in te first and second survey. But note tat we only need te stratum variances in te second survey, and in real life te variances can be assumed more or less stable between a previous survey and te new survey, so it is reasonable to use te same populations. Tables 2, 3 and 4 summarize te results of te experiment. Table 2 sows tat te C-allocation and E-allocation were on average comparably efficient and te two most efficient allocations, altoug te former reaced te optimum allocation most often. Moreover, from Tables 3 and 4 it follows tat te C- allocation was in most cases more efficient tan te E-allocation. Note, owever, tat tere were cases in wic tese allocations were less efficient tan te MEallocation and V-allocation. Still tese were te C-allocation and E-allocation tat were te most efficient bot on average and most often. Note also tat te results of te ME-allocation and V-allocation were visibly less precise tan tose of te oter two allocations. In addition, for many cases, especially wen te populations were stratified into tree strata, te ME-allocation and V-allocation provided te same sample sizes from strata and te same variances, sowing a close similarity between te allocations. Interestingly, for N = 1000 and H = 5, te variability in te variances obtained via all te allocations was very large, especially compared to te oter situations studied (note te coefficients of variation and te maximum variances obtained). It resulted from te specificity of te population and its stratification; tis variation came mainly from te very small (N 5 =50) fift stratum. According to te proportional allocation rule used to estimate te variance and fourt moment of te survey variable, we draw only five samples from tis stratum, so te estimates were not stable. As a result, te number of units eac metod allocated to tis stratum varied dramatically among te 100000 samples. Unfortunately, te variances of te estimator were strongly influenced by tis stratum s variance. Having said tat, it is wort noting tat for tis situation te same scenario for te performance of te allocation metods was obtained as for te oter situations.

On stocastic optimization in sample allocation among strata 101 Table 2: Summary statistics of te simulation results obtained for two populations studied via te four allocations compared. Mean, median and maximum values are given for variances divided by te optimum (minimum) variance. Classic E-model M E-model V-model Population 1, H = 3 Mean 1.027 1.030 1.055 1.055 Median 1.014 1.014 1.032 1.032 Reaced min (1) 6207 5520 2633 2633 Max 1.553 1.593 1.635 1.635 CV (2) 0.037 0.039 0.058 0.058 Population 1, H = 5 Mean 1.149 1.129 1.565 1.565 Median 1.052 1.046 1.098 1.098 Reaced min 772 270 425 424 Max 4.502 4.510 5.211 5.198 CV 0.274 0.211 0.588 0.589 Population 2, H = 3 Mean 1.009 1.009 1.063 1.063 Median 1.004 1.005 1.029 1.029 Reaced min 1752 2 107 107 Max 1.157 1.157 1.371 1.371 CV 0.011 0.011 0.076 0.076 Population 2, H = 5 Mean 1.022 1.022 1.089 1.089 Median 1.011 1.011 1.039 1.039 Reaced min 28 31 23 24 Max 1.335 1.335 1.512 1.512 CV 0.026 0.026 0.104 0.104 (1) Number of times for te corresponding allocation for wic te minimum variance of te estimator was reaced. (2) Coefficient of variation.

102 MARCIN KOZAK HAI YING WANG Table 3: Results of te simulation study for N = 1000 and H = 3 (below diagonal) and H = 5 (above diagonal). Number of cases are given for wic te column allocation was more efficient (i.e., ad smaller variance) tan te corresponding row allocation; in brackets, number of cases are given for wic te column and row allocations provided te same results. Classic E-model M E-model V-model Classic 46917 (3163) 27800 (1262) 27805 (1263) E-model 48823 (30098) 27100 (1306) 27104 (1300) M E-model 64718 (2139) 59450 (1343) 3085 (93773) V-model 64718 (2139) 59450 (1343) 0 (100000) Table 4: Results of te simulation study for N = 5000 and H = 3 (below diagonal) and H = 5 (above diagonal). Number of cases are given for wic te column allocation was more efficient (i.e., ad smaller variance) tan te corresponding row allocation; in brackets, number of cases are given for wic te column and row allocations provided te same results. Classic E-model M E-model V-model Classic 38465 (10462) 30706 (3) 30711 (3) E-model 88172 (1878) 31217 (1) 31192 (1) M E-model 75477 (45) 74680 (23) 16042 (68122) V-model 75477 (45) 74680 (23) 1 (99998) 4. Conclusion From te experiment it follows tat among te four allocation metods compared, none could be considered te best one for every situation. Noneteless, te C-allocation was most often te best followed by te E-allocation, a result tat sould not surprise given te similarity between tem. Te V- allocation and ME-allocation, te performance of wic was very similar, were usually te worst. Te results do not mean tat stocastic programming sould be discounted in furter researc on sample allocation among strata. It is possible tat for some oter situations te allocation based on one of te models of stocastic programming (maybe a model not considered in tis paper) migt occur to be more efficient tan te classical allocation. In oter words, tere may be some oter equivalent deterministic problems tat will work more efficiently tan te classical metod. Note tat stocastic optimization offers te opportunity to find many different deterministic problems in sample allocation. Take te Modified E-model, for example. Intuitively, te E-model focuses on minimizing

On stocastic optimization in sample allocation among strata 103 te variance of te estimator of te population mean, wile te V-model on minimizing te variation of tis variation. Te Modified E-model attempts to strike a appy medium, minimizing a linear combination of te variance (te E-model s focus) and te variance s variance (te V-model s focus). Only te case k 1 = k 2 = 0.5 was considered in tis paper, but better k 1 and k 2 may exist. Wit teir optimal values, te Modified E-model may work more efficiently tan bot te E-model and V-model. Noneteless, studying te way of searcing for te optimal E-model or oter efficient deterministic problems is beyond te scope of tis work. From our results it follows tat at te moment we cannot claim tat stocastic programming offers allocation tat would perform better tan te classical allocation metod. REFERENCES Díaz-García, J. D. and Garay-Tápia, M. M. (2007) Optimum allocation in stratified surveys: Stocastic programming, Computational Statistics & Data Analysis, 51, 3016 3026. Kozak, M. (2006) Multivariate sample allocation: application of random searc metod, Statistics in Transition, 7 (4), 889 900. Särndal, C. E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling, Springer- Verlag. MARCIN KOZAK Department of Experimental Design and Bioinformatics Faculty of Agriculture and Biology Warsaw University of Life Sciences Nowoursynowska 159 02-776 Warsaw, Poland nyggus@gmail.com HAI YING WANG Academy of Matematics and Systems Sciences Cinese Academy of Sciences Cina