Model Quality Report in Business Statistics
Mats Bergdahl, Ole Black, Russell Bowater, Ray Chambers, Pam Davies, David Draper, Eva Elvers, Susan Full, David Holmes, Pär Lundqvist, Sixten Lundström, Lennart Nordberg, John Perry, Mark Pont, Mike Prestwood, Ian Richardson, Chris Skinner, Paul Smith, Ceri Underwood, Mark Williams

General Editors: Pam Davies, Paul Smith

Volume II: Comparison of Variance Estimation Software and Methods
Preface

The Model Quality Report in Business Statistics project was set up to develop a detailed description of the methods for assessing the quality of surveys, with particular application in the context of business surveys, and then to apply these methods in some example surveys to evaluate their quality. The work was specified and initiated by Eurostat following on from the Working Group on Quality of Business Statistics. It was funded by Eurostat under SUP-COM 1997, lot 6, and has been undertaken by a consortium of the UK Office for National Statistics, Statistics Sweden, the University of Southampton and the University of Bath, with the Office for National Statistics managing the contract.

The report is divided into four volumes, of which this is the second. This volume deals with the software available for variance estimation in sample surveys, comparing a range of packages and methods, and evaluating some of their properties through a simulation study using a known population. Other volumes of the report contain: a review and development of the theory and methods for assessing quality in business surveys (volume I); example assessments of quality for an annual and a monthly business survey from Sweden and the UK (volume III); guidelines for and experiences of implementing the methods (volume IV). An outline of the chapters in the report is given on the following pages.

Acknowledgements

Apart from the authors, several other people have made large contributions without which this report would not have reached its current form. In particular we would like to mention Tim Jones, Anita Ullberg, Jeff Evans, Trevor Fenton, Jonathan Gough, Dan Hedlin, Sue Hibbitt and Steve James, and we would also like to thank all the other people who have been so helpful and understanding while our attention has been focussed on this project!
Outline of Model Quality Report Volumes

Volume I
1. Methodology overview and introduction
Part 1: Sampling errors
2. Probability sampling: basic methods
3. Probability sampling: extensions
4. Sampling errors under non-probability sampling
Part 2: Non-sampling errors
5. Frame errors
6. Measurement errors
7. Processing errors
8. Non-response errors
9. Model assumption errors
Part 3: Other aspects of quality
10. Comparability and coherence
Part 4: Conclusions and References
11. Concluding remarks
12. References

Volume II
1. Introduction
2. Evaluation of variance estimation software
3. Simulation study of alternative variance estimation methods
4. Variances in STATA/SUDAAN compared with analytic variances
5. References

Volume III
1. Introduction
Part 1: Annual statistics
2. Quality assessment of the 1995 Swedish Annual Production Volume Index
3. Quality assessment of the 1996 UK Annual Production and Construction Inquiries
Part 2: Short-term statistics
4. Quality assessment of the Swedish Short-term Production Volume Index
5. Quality assessment of the UK Index of Production
6. Quality assessment of the UK Monthly Production Inquiry
Part 3: The UK's Sampling Frame
7. Sampling frame for the UK

Volume IV
1. Introduction
2. Guidelines on implementation
3. Implementation report for Sweden
4. Implementation report for the UK
5. Visit to Statistisches Bundesamt, Wiesbaden, Germany, 3-4 March
6. Visit to CSO, Cork, Ireland, 3 April
7. Visit to INE, Madrid, Spain, 6 July 1998
Contents

1 Introduction
2 Evaluation of variance estimation software
  2.1 Requirements on software for business statistics
    2.1.1 Introduction
    2.1.2 Parameters
    2.1.3 Point estimators
    2.1.4 Variance estimation methods
      2.1.4.1 The Taylor linearisation method
      2.1.4.2 The Jackknife method
      2.1.4.3 The Bootstrap method
      2.1.4.4 The Balanced Repeated Replication (BRR) method
    2.1.5 Summary of requirements
  2.2 Critical comparison of software packages
    2.2.1 Sample designs
    2.2.2 Nonresponse models and outlier treatment
    2.2.3 Parameters
    2.2.4 Estimators
    2.2.5 Variance estimators
    2.2.6 Interfaces, documentation and help
    2.2.7 Initial reactions of new users to the software
    2.2.8 Correctness and speed
    2.2.9 Ease of integration with processing systems
    2.2.10 Costs
  2.3 Recommendations for variance estimation software for use in EU member states
3 Simulation study of alternative variance estimation methods
    The simulated population
    A model for data generation
    Domains and estimators
    Data features
    Processing
    Results
    Comparison of estimators
    Comparison of variance estimators
    Naïve variance estimators
    Comparison of software package outputs
    General conclusions
4 Variances in STATA/SUDAAN compared with analytical variances
    Expansion estimator
    Ratio estimator
    What does SUDAAN do?
5 References
1 Introduction

Paul Smith, Office for National Statistics

One of the key indicators of quality in sample surveys is the sampling variance arising from the random sampling mechanism through the randomisation distribution. This indicates the variability introduced by choosing a sample instead of enumerating the whole population, assuming that the information collected in the survey is otherwise exactly correct. For a discussion of the theory underlying these calculations, see chapters M2 [1] and M3 of the methodology report (volume I).

For any given survey, an estimator of this sampling variance can be evaluated and used to indicate the accuracy of the estimates. The forms of these estimators are often complex, especially when the design contains strata or clusters, and when the estimation model uses auxiliary information to improve the accuracy. In order to make these calculations feasible, appropriate software is required, and although it is possible to construct a program within most survey processing systems to do this for a specific survey, there has been a recent trend towards the production of generalised software which will calculate the appropriate variances in a wide range of commonly met survey situations. These must then be incorporated into the survey process. Sampling variances are often not time-critical information, and any difficulties with data transfer to or setup of this software are offset by the generalised nature of the programs.

In this paper we evaluate five generalised packages which are publicly available: CLAN, GES, SUDAAN, STATA and WesVar PC. There are four main variance estimation methods, Taylor, jackknife, bootstrap and balanced repeated replication (these are explained in section 2.1.4), and between them these packages cover all the available methods except the bootstrap (Table 1.1). These are the packages which were available at the time of putting together the tender for this study, with the exception of PC-CARP, which was available but has not been studied. Other packages are being developed; those known to the Model Quality Report team are BASCULA and POULPE, but neither of these seems to be fully functional in its current version.

Method                          | Software
Direct + Taylor series methods  | CLAN, GES, STATA, SUDAAN
Jackknife                       | GES, SUDAAN, WesVarPC
Bootstrap                       | None
Balanced repeated replication   | SUDAAN, WesVarPC

Table 1.1: Variance estimation methods available in the evaluated software packages.

[1] Reference is made throughout this document to the Methodology report by prefixing section references with an M.
The requirements for a variance estimation package are discussed in section 2.1, and there is a comparative description of the packages in section 2.2. Section 2.3 draws conclusions about the suitability of the packages for general use in business surveys in EU member states, and makes recommendations for which should be adopted. A separate simulation study has been undertaken to look at the properties of the available variance estimators, and this is presented in chapter 3 of this report. A more detailed description of the differences in underlying methods between STATA/SUDAAN and the other packages for the Taylor linearisation approach to ratio estimation is given in chapter 4.
2 Evaluation of variance estimation software

Paul Smith, Office for National Statistics
Sixten Lundström, Statistics Sweden
Ceri Underwood, Office for National Statistics

2.1 Requirements on software for business statistics

2.1.1 Introduction

The units in business surveys can be of various types, such as enterprises and kind-of-activity units. Mostly a Business Register (BR) is used as the frame for the survey. There is a set of units on the BR, such as enterprises, legal units, local units, and possibly kind-of-activity units. There is a set of variables for each type of unit, some common to other types of unit, some unique. Ordinarily, the BR contains information on which industry each unit belongs to and a measure of the size of the unit. The size variable is often the number of employees, or perhaps a measure of turnover (depending on unit level). These variables and their reference dates affect the use of auxiliary information in the sampling design and in the estimation process.

In business surveys two typical kinds of probability sampling design can be identified, namely (i) one-step element and (ii) one-step cluster. Typical examples are (i) surveys with the enterprise as both the sampling unit and observation unit, and (ii) surveys with the enterprise as the sampling unit and all its kind-of-activity units or all its local units as the observation units. The population is often stratified by industry and size, and from each stratum a simple random sample is drawn. The stratification variable industry is used with regard to the domains of estimation, which are mostly defined by industry. Size is usually an effective variable for reducing the sampling variability (see chapter M2).

Business surveys are ordinarily carried out continuously, either annually, quarterly or monthly. The samples may be co-ordinated over time, using a panel system or possibly a technique based on permanent random numbers (Ohlsson 1995).
Units in business statistics typically change fairly rapidly; they can die, they can merge with another unit and they can split into several units. The industrial classification may change, and the size of the unit can vary.

2.1.2 Parameters

Let us look at the various types of finite population parameters that are typical for a business survey. Consider the finite population of $N$ units $U = \{u_1, \ldots, u_k, \ldots, u_N\}$. Sometimes we are interested in the population total

$$t_y = \sum_U y_k \qquad (2.1)$$

where $y_k$ is the value of the study variable, $y$, for the $k$th element. Moreover, totals for domains, typically industries, are also common. Let us denote the domain set by $U_d$, $d = 1, \ldots, D$, and set

$$y_k(d) = \begin{cases} y_k & \text{if unit } k \in U_d \\ 0 & \text{otherwise.} \end{cases}$$

Then the total for domain $d$ is

$$t_{y_d} = \sum_U y_k(d) = \sum_{U_d} y_k \qquad (2.2)$$

Ratios of different types are common in business statistics. To define these types let $z$ be another study variable and let the population total for $z$ be denoted $t_z$ and the domain total $t_{z_d}$. One type of ratio is

$$R_d = t_{y_d} / t_{z_d} \qquad (2.3)$$

A typical example here is production per head with industry as domain. Another type of ratio is

$$R = t_{y_d} / t_y \qquad (2.4)$$

showing for example the production of an industry, relative to the whole population. Another parameter of interest is

$$I_d = t_{y_d} / t'_{z_d} \qquad (2.5)$$

where prime ($'$) indicates relative to another population. A typical application of (2.5) is the relative change in production (say) by industry from one period to another, that is, the totals in the numerator and the denominator have different reference times, but otherwise relate to the same variable and domain. The sample units (involved in the numerator and denominator) are partly the same, partly different, and units that contribute to the total on both occasions may have changed domain (industry) in between.

Indices of production (say) are examples of complex sets of parameters, typically built up from components like (2.5), and usually also deflated by price indices. Yet (2.5) is already a challenge for the available software. The complexity also depends on the way samples are coordinated over time.

2.1.3 Point estimators

To estimate the parameters defined in section 2.1.2, a sample $s$ of size $n$ is drawn from $U$ (or actually from the frame). Stratification is commonly used in business surveys, that is, a simple random sample $s_h$ of size $n_h$ is drawn from the stratum $U_h$, $h = 1, \ldots, H$, where $U = \bigcup_{h=1}^{H} U_h$. Let the stratum sizes be $N_h$, $h = 1, \ldots, H$; the design weights are then $d_k = N_h / n_h$ for $k \in s_h$.
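As a minimal sketch of how the parameters (2.1)-(2.4) are estimated with the design weights $d_k = N_h / n_h$, the following Python fragment computes expansion estimates of a total, a domain total and the two ratios. It assumes full response; the toy data and the names `design_weights` and `weighted_total` are hypothetical and are not taken from any of the packages discussed in this report.

```python
from collections import Counter

def design_weights(sample, stratum_sizes):
    """Return d_k = N_h / n_h for every sampled unit (stratified SRS)."""
    n_h = Counter(unit["stratum"] for unit in sample)
    return [stratum_sizes[u["stratum"]] / n_h[u["stratum"]] for u in sample]

def weighted_total(sample, weights, var, domain=None):
    """Expansion estimate of a total; `domain` restricts the sum as in (2.2)."""
    return sum(w * u[var] for u, w in zip(sample, weights)
               if domain is None or u["domain"] == domain)

# Hypothetical toy sample: two size strata, two industries (domains A and B).
stratum_sizes = {"small": 100, "large": 10}
sample = [
    {"stratum": "small", "domain": "A", "y": 4.0, "z": 2.0},
    {"stratum": "small", "domain": "B", "y": 6.0, "z": 3.0},
    {"stratum": "large", "domain": "A", "y": 50.0, "z": 20.0},
    {"stratum": "large", "domain": "A", "y": 70.0, "z": 30.0},
]

d = design_weights(sample, stratum_sizes)            # [50.0, 50.0, 5.0, 5.0]
t_y = weighted_total(sample, d, "y")                 # estimate of (2.1)
t_y_A = weighted_total(sample, d, "y", domain="A")   # estimate of (2.2)
t_z_A = weighted_total(sample, d, "z", domain="A")
R_A = t_y_A / t_z_A                                  # estimate of (2.3)
share_A = t_y_A / t_y                                # estimate of (2.4)
```

Estimating functions of totals by plugging in the estimated totals is the approach the text describes for (2.1)-(2.4); it does not carry over directly to (2.5), where two overlapping samples are involved.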
However, nonresponse occurs in the survey process, and the response set $r$ of size $m$ is obtained, where $r \subseteq s$. There are two main ways of treating this problem, namely weighting and imputation. In weighting, the nonresponse compensation adjustment weight $v_k$ is constructed primarily with the aim of reducing the nonresponse bias, but is also used to reduce the additional component of sampling error caused by nonresponse (see chapter M8). When using the weighting approach the estimator consists of the sum of the weighted values for elements in $r$, where the weight is the product of $d_k$ and $v_k$: $v_k$ is the tool for making the inference from $r$ to $s$, and $d_k$ from $s$ to $U$. When imputation is used, values for all $n$ elements are used in the estimation, but now $n - m$ of these values are estimates (approximations) of the real values.

Neither of these methods is expected to eliminate the bias completely. When a substantial nonresponse bias is still present, the variance estimate and the confidence interval will be an irrelevant and incomplete measure of the quality of the point estimate. As indicated above, nonresponse will also cause an additional component of sampling error. This is obvious in weighting, since the number of observations is reduced from $n$ to $m$.

In the following, we describe estimators used in business surveys. Here we describe the estimator using a nonresponse compensation adjustment weight, which has a more complex form than the estimator based on imputation.

The nonresponse compensation adjustment weight is an approximation of the inverse of the response probability. That is, one seeks a relevant model of the response probabilities. Commonly, this model consists of a grouping of the sample $s$. Särndal, Swensson & Wretman (1992) call these groups Response Homogeneity Groups (RHGs). In the following we will choose among three different types of RHGs, namely

(i) strata and RHGs coincide
(ii) RHGs are subgroups of strata
(iii) RHGs cut across the strata      (2.6)

The simplest estimator is the Horvitz-Thompson estimator, combined with nonresponse model (i). That means that we find it plausible that each sampled element in the stratum responds with the same probability.
In this case the nonresponse compensation weight is $v_k = n_h / m_h$ and, since $d_k = N_h / n_h$, the resulting weight is $N_h / m_h$ and the estimator has the form

$$\hat{t}_y = \sum_{h=1}^{H} N_h \bar{y}_{r_h} \qquad (2.7)$$

where $\bar{y}_{r_h} = \frac{1}{m_h} \sum_{r_h} y_k$. A somewhat more complex estimator is obtained when using nonresponse model (ii), namely

$$\hat{t}_y = \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{q=1}^{L_h} n_{hq} \bar{y}_{r_{hq}} \qquad (2.8)$$

where $L_h$ is the number of RHGs in stratum $h$; $n_{hq}$ is the size of the part of $s_h$ that falls into RHG $q$; $m_{hq}$ is the size of $r_{hq}$, the response set in RHG $q$; and $\bar{y}_{r_{hq}} = \frac{1}{m_{hq}} \sum_{r_{hq}} y_k$.

When using nonresponse model (iii) an even more complex estimator is obtained. Let us here express it by the general version

$$\hat{t}_y = \sum_r d_k v_k y_k \qquad (2.9)$$

Frames used in the Member States regularly contain more information than industry and number of employees, for example, the turnover from a previous time of reference. Moreover, geographical information for the local units is commonly available. Thus, there may be register information which is correlated with the study variables and/or the response probabilities, but not used in the estimator of the form (2.9). A simple version of such information is a partition of the population. To demonstrate estimators based on such a partition we let $U_1, \ldots, U_p, \ldots, U_P$ be groups that form a mutually exclusive and exhaustive partition of the population. Assume that we know the sizes of these groups, $N_1, \ldots, N_p, \ldots, N_P$. Then they can be used as poststrata. Such an estimator, using the nonresponse model (i) mentioned above, has the form

$$\hat{t}_{yr} = \sum_{p=1}^{P} \frac{N_p}{\hat{N}_p} \sum_{h=1}^{H} \frac{N_h}{m_h} \sum_{r_{hp}} y_k \qquad (2.10)$$

with $\hat{N}_p = \sum_{h=1}^{H} \hat{N}_{hp}$, where $\hat{N}_{hp} = N_h m_{hp} / m_h$; $m_{hp}$ is the size of $r_{hp}$, the response set that belongs to the intersection of $U_h$ and $U_p$.

Estimator (2.10) is a special case of the following general estimator

$$\hat{t}_{yr} = \sum_r d_k v_k g_k y_k \qquad (2.11)$$

where

$$g_k = 1 + \left( \sum_U \mathbf{x}_k - \sum_r d_k v_k \mathbf{x}_k \right)^T \left( \sum_r d_k v_k \mathbf{x}_k \mathbf{x}_k^T / \sigma_k^2 \right)^{-1} \mathbf{x}_k / \sigma_k^2 \qquad (2.12)$$

By choosing the positive factors $\sigma_k^2$ the approach can be made very flexible. This will become apparent in subsequent sections. The vector $\mathbf{x}_k$ is called the auxiliary vector in what follows.

Estimator (2.11) is based on a general approach to regression for two-phase sampling following Särndal & Swensson (1987). It is here used in the nonresponse situation, but since we do not know the response probabilities the second-phase inclusion probabilities have to be estimated in some way (see also M2.3.1.5). The inverse of this estimate is denoted by $v_k$. In what follows the estimator (2.11) is called the GREG estimator.
In the case of poststratification the auxiliary vector is defined by $\mathbf{x}_k = (\gamma_{1k}, \ldots, \gamma_{pk}, \ldots, \gamma_{Pk})^T$ where, for $p = 1, \ldots, P$,

$$\gamma_{pk} = \begin{cases} 1 & \text{if unit } k \in U_p \\ 0 & \text{otherwise} \end{cases}$$

and $\sigma_k^2 = 1$ for all $k$. This poststratification approach gives us one simple method of dealing with outlying observations in a survey, since they can be moved into an appropriate poststratum for estimation.

Most of the classical estimators can be derived as special cases from the GREG estimator. For example, if $\mathbf{x}_k = x_k$ for all $k$ and $\sigma_k^2 \propto x_k$, where $x$ is a continuous variable, and when nonresponse model (i) is used, then the following estimator is obtained:

$$\hat{t}_{yr} = \frac{\sum_{h=1}^{H} N_h \bar{y}_{r_h}}{\sum_{h=1}^{H} N_h \bar{x}_{r_h}} \sum_U x_k \qquad (2.13)$$

Estimator (2.13) is sometimes called the combined ratio estimator.

Sometimes the group totals $\sum_{U_p} x_k$ are known and, in this general case, the p-groups are called model groups. Let us present a simple example. As before assume that $x$ is a continuous variable, but here we know the quantities $\sum_{U_p} x_k$; $p = 1, \ldots, P$. Let $\mathbf{x}_k = (\gamma_{1k} x_k, \ldots, \gamma_{pk} x_k, \ldots, \gamma_{Pk} x_k)^T$, $\sigma_k^2 \propto x_k$ for each p-group, and let the RHGs coincide with strata (nonresponse model (i)); then the GREG estimator takes the form

$$\hat{t}_{yr} = \sum_{p=1}^{P} \frac{\sum_{h=1}^{H} \hat{N}_{hp} \bar{y}_{r_{hp}}}{\sum_{h=1}^{H} \hat{N}_{hp} \bar{x}_{r_{hp}}} \sum_{U_p} x_k \qquad (2.14)$$

If strata and model groups coincide then estimator (2.14) can be written

$$\hat{t}_{yr} = \sum_{h=1}^{H} \frac{\bar{y}_{r_h}}{\bar{x}_{r_h}} \sum_{U_h} x_k \qquad (2.15)$$

Estimator (2.15) is sometimes called the separate ratio estimator.

When $\mathbf{x}_k = (1, x_k)^T$ for all $k$ and $\sigma_k^2$ is constant, then the classical regression estimator is obtained.

Many business surveys are subject to occasional unusual observations, or outliers, which can have a large effect on the estimates. In these cases, robust versions of point estimators are often used, with the simplest being the poststratification estimator with the outliers in their own (completely enumerated) poststratum. This follows from the method above (2.13). Other methods involve adjusting the weights or the responding values, and winsorisation is becoming widely used within the UK for treating outliers. This leads to a different estimator, which does not necessarily fit completely into the GREG framework.
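The way the classical estimators fall out of the GREG estimator can be checked numerically. The sketch below is an illustration only, not the implementation used in CLAN or GES: it computes the g-weights of (2.12) for a given auxiliary vector and combined weight $w_k = d_k v_k$, then confirms on hypothetical toy data that $\sigma_k^2 \propto x_k$ reproduces the ratio form and that indicator vectors with $\sigma_k^2 = 1$ reproduce poststratification. The function names, the small linear solver and all data values are assumptions made for the example.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def greg_total(y, x, w, sigma2, x_totals):
    """Return (t_hat, g) with t_hat = sum_r w_k g_k y_k and g_k as in (2.12).

    w holds the combined weights d_k * v_k; x is a list of auxiliary vectors.
    """
    p = len(x_totals)
    # T = sum_r w_k x_k x_k^T / sigma_k^2
    T = [[sum(wk * xk[i] * xk[j] / s for wk, xk, s in zip(w, x, sigma2))
          for j in range(p)] for i in range(p)]
    # Known auxiliary totals minus their weighted sample estimates.
    diff = [x_totals[i] - sum(wk * xk[i] for wk, xk in zip(w, x))
            for i in range(p)]
    lam = solve(T, diff)  # T is symmetric, so diff^T T^-1 equals lam^T
    g = [1.0 + sum(l * xi for l, xi in zip(lam, xk)) / s
         for xk, s in zip(x, sigma2)]
    return sum(wk * gk * yk for wk, gk, yk in zip(w, g, y)), g

# Hypothetical toy response set with combined weights N_h / m_h.
y = [4.0, 6.0, 50.0, 70.0]
x = [2.0, 3.0, 20.0, 30.0]
w = [50.0, 50.0, 5.0, 5.0]

# Scalar auxiliary variable with sigma_k^2 proportional to x_k: ratio form.
t_ratio, g_ratio = greg_total(y, [[xi] for xi in x], w, x, [400.0])
ratio_direct = 400.0 * sum(wk * yk for wk, yk in zip(w, y)) \
    / sum(wk * xk for wk, xk in zip(w, x))

# Poststratum indicators with sigma_k^2 = 1: poststratified estimator.
gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
t_post, g_post = greg_total(y, gamma, w, [1.0] * 4, [66.0, 44.0])
```

In both cases the g-weights are constant within the relevant group, which is why the general expression collapses to the familiar closed forms.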
The parameters (2.1)-(2.4) are totals or functions of totals from the same period of reference. Estimators for these parameters can be obtained by replacing these totals by their estimators. Parameter (2.5) is much more complex since it contains totals from two periods of reference. In most surveys two consecutive samples are drawn in such a way that they overlap each other. That makes it possible to construct combined estimators that are more effective than just replacing the totals by their estimators. However, variance estimation becomes complicated. We do not go deeper into this problem but just refer to Nordberg (1998), who has found a solution to the special sampling procedure used at Statistics Sweden.

So far we have only discussed one-step element sampling designs, but it is easy to see how the one-step cluster alternative affects the formulas. Auxiliary information can be known at the cluster level or at the unit level. In the latter case we can choose to use the auxiliary information either at cluster level or at unit level. When the auxiliary information is known only at the cluster level the model groups are, of course, defined for that level.

2.1.4 Variance estimation methods

There are four principal ways of calculating variances (Wolter 1985), each unbiased or asymptotically unbiased in most widely-used design-estimation strategies if full response is assumed, but each (in general) producing a different value for the unbiased estimate:

- direct calculation and Taylor linearisation;
- jackknife;
- bootstrap;
- balanced repeated replication method.

Before we discuss these methods, just a few words about variance estimation when imputation is used, following the discussion in section 2.1.3. The literature describes many imputation methods such as nearest neighbour donor, current ratio, current mean, auxiliary trend, etc. However, the theoretical development of variance estimators when data contain imputations is still in its initial phase. Two examples of articles on this problem are Särndal (1992) and Deville & Särndal (1994). In surveys where the completed data set is treated as if it were the full-response set, however, this will commonly underestimate the variance (see, for example, Rubin 1986).

2.1.4.1 The Taylor linearisation method

Direct calculation involves application of (normally) the Sen-Yates-Grundy estimator (Sen 1953, Yates & Grundy 1953) to form the variances of simple survey estimates. More complex survey estimates are first linearised by taking the first-order terms in an appropriate Taylor-series expansion, and then the SYG estimates are inserted into the linearised formula. This is basically a set of appropriate linear expressions for the variances of estimators, which has to be coded into the software. Every different design-estimand combination requires a different formula which must be (essentially) hard-coded; separate formulae are not required for different estimation models if the GREG estimator (see equation (2.11)) is present, as all the commonly used models are either GREG or special cases of it.
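To make the linearisation step concrete, the sketch below estimates a ratio of two totals and its Taylor-linearised variance. It assumes stratified simple random sampling without replacement and full response, in which case the direct variance estimator of a total takes the familiar form $\sum_h N_h^2 (1 - n_h/N_h) s_h^2 / n_h$. The function names and the toy data are hypothetical, and this is not a description of what any of the evaluated packages does internally.

```python
from statistics import variance

def stratified_total_variance(values_by_stratum, N_h, fpc=True):
    """Direct variance estimate for an expansion total under stratified SRS."""
    v = 0.0
    for h, vals in values_by_stratum.items():
        n = len(vals)
        f = n / N_h[h] if fpc else 0.0   # sampling fraction n_h / N_h
        v += N_h[h] ** 2 * (1.0 - f) * variance(vals) / n
    return v

def ratio_with_linearised_variance(y_by_stratum, z_by_stratum, N_h, fpc=True):
    """Ratio of two estimated totals and its first-order Taylor variance."""
    t_y = sum(N_h[h] * sum(v) / len(v) for h, v in y_by_stratum.items())
    t_z = sum(N_h[h] * sum(v) / len(v) for h, v in z_by_stratum.items())
    R = t_y / t_z
    # Linearised variable u_k = (y_k - R z_k) / t_z; its total is then
    # treated like any other total in the direct variance formula.
    u = {h: [(y - R * z) / t_z
             for y, z in zip(y_by_stratum[h], z_by_stratum[h])]
         for h in y_by_stratum}
    return R, stratified_total_variance(u, N_h, fpc)

# Hypothetical data: a sampled stratum and a take-all stratum.
N_h = {"small": 100, "large": 2}
y = {"small": [4.0, 6.0, 5.0, 9.0], "large": [50.0, 70.0]}
z = {"small": [2.0, 3.0, 2.0, 4.0], "large": [20.0, 30.0]}

R_hat, v_fpc = ratio_with_linearised_variance(y, z, N_h, fpc=True)
_, v_no_fpc = ratio_with_linearised_variance(y, z, N_h, fpc=False)
```

With `fpc=True` the take-all stratum contributes nothing to the variance, which is the behaviour the requirements in section 2.1.5 ask for; with `fpc=False` it does contribute, so the estimate is larger.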
2.1.4.2 The Jackknife method

The jackknife involves dropping an observation and recalculating the estimates from the remaining observations, repeating successively until all observations have been dropped, and then finding the variance of the resulting series of estimates (with a suitable multiplier to give approximate unbiasedness). The drop-one jackknife is usually used, as it can be shown to give the variance estimate with the smallest sampling variability, although it is possible to drop pairs of observations (or even more) too; this strategy is usually adopted to speed up processing, since drop-one is the most processor-intensive method. We consider only drop-one methods here. More information on the jackknife estimator is given in section M2.4 of the methodology report.

It should be noted that the jackknife is only strictly applicable in with-replacement designs. It can be used in without-replacement designs where the sampling fractions are sufficiently small (Wolter 1985, p168), but in many business survey designs the sampling fractions are relatively large. The dangers of this approach are illustrated in the simulation in chapter 3 below.

2.1.4.3 The Bootstrap method

The bootstrap involves resampling a number of times with replacement from the sampled observations, and calculating an estimate for each of the bootstrap samples. The variance of these bootstrap estimates is then calculated, again with a suitable multiplier to ensure unbiasedness. The method is described in more detail in the methodology report (volume I).

2.1.4.4 The Balanced Repeated Replication (BRR) method

This is derived from the balanced half samples (BHS) method, which has a very specific application in cluster designs where each cluster has exactly two final stage units. By successively deleting one of these units and changing the weight of the other to compensate, a range of estimates can be produced whose variance can be calculated and suitably adjusted to give an appropriate variance estimator (Wolter 1985).
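The drop-one jackknife of section 2.1.4.2 can be sketched for a stratified design as follows. This is an illustration under assumptions rather than the algorithm of any particular package: each replicate drops one unit and reweights the remaining units in that stratum to $N_h/(n_h - 1)$, the multiplier is $(n_h - 1)/n_h$ per stratum, and the optional `fpc` flag applies an additional factor $(1 - n_h/N_h)$. All names and data are hypothetical.

```python
def expansion_total(data, N_h):
    """Expansion estimate of the y total; data[h] is a list of (y, z) pairs."""
    return sum(N_h[h] * sum(y for y, _ in units) / len(units)
               for h, units in data.items())

def ratio(data, N_h):
    """Ratio of the estimated y total to the estimated z total."""
    t_z = sum(N_h[h] * sum(z for _, z in units) / len(units)
              for h, units in data.items())
    return expansion_total(data, N_h) / t_z

def jackknife_variance(data, N_h, estimator, fpc=False):
    """Stratified drop-one jackknife variance of `estimator`."""
    v = 0.0
    for h, units in data.items():
        n = len(units)  # needs at least two units in every stratum
        reps = []
        for k in range(n):
            reduced = dict(data)
            reduced[h] = units[:k] + units[k + 1:]   # drop unit k of stratum h
            reps.append(estimator(reduced, N_h))
        mean_rep = sum(reps) / n
        factor = (n - 1) / n
        if fpc:
            factor *= 1.0 - n / N_h[h]
        v += factor * sum((r - mean_rep) ** 2 for r in reps)
    return v

# Hypothetical sample with a large sampling fraction in the second stratum.
N_h = {"small": 100, "large": 4}
data = {
    "small": [(4.0, 2.0), (6.0, 3.0), (5.0, 2.0), (9.0, 4.0)],
    "large": [(50.0, 20.0), (70.0, 30.0)],
}

v_total_wr = jackknife_variance(data, N_h, expansion_total)
v_total_fpc = jackknife_variance(data, N_h, expansion_total, fpc=True)
v_ratio_wr = jackknife_variance(data, N_h, ratio)
```

For a linear estimator such as the expansion total, this jackknife reproduces the with-replacement formula $\sum_h N_h^2 s_h^2 / n_h$ exactly, so without the correction it overstates the without-replacement variance whenever sampling fractions are not small.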
Various adaptations of this can be applied in designs where the clusters have variable numbers of units, based on dividing these into two groups. Recent research (Rao & Shao 1996) shows that only by using repeated divisions ("repeatedly grouped balanced half samples" (RGBHS)) can an asymptotically correct estimator be obtained. This method, then, can only be used for the usual stratified designs in business surveys if we are prepared to treat a stratum as if it were a cluster, and to run the package a number of times with different divisions of the elements into two groups; where there is an odd number of elements in the stratum the results are biased, and ways of reducing this bias (but not eliminating it) are described in Slootbeek (1998). There are ways in which this can be done, but the results are typically unsatisfactory and the manipulation of both data and software becomes very involved.

2.1.5 Summary of requirements

There are a number of requirements for point and variance estimation in business surveys which any software should satisfy. We have pointed out several such requirements in the previous sections. However, in order to simplify the evaluation we will here present a structured summary of these requirements. The demands on the software will certainly vary between Member States (MS). Consequently, packages which meet only some of the requirements listed below may be sufficient for a particular MS, provided that they meet the requirements of this MS. The packages will be evaluated with respect to their ability to cope with the following situations.

Sampling designs: One-step stratified sampling of units or clusters. In each stratum a simple random sample is drawn. In some strata the finite population correction (fpc) has a large effect; in take-all strata it reduces the sampling variance to zero. Panels or random number techniques are used in the sampling procedure.

Nonresponse models and outlier adjustment: Weighting within RHGs (i)-(iii), as described in 2.1.3 and equation (2.6), or imputation as described in sections 2.1.3 and 2.1.4, and outlier treatment using poststratification or winsorisation as described in 2.1.3.

Parameters: Parameters for measuring levels as in (2.1)-(2.4) and parameters for measuring change as in (2.5). More complex parameters such as indices are also of great interest.

Estimators: Estimators for totals as defined in (2.7) to (2.15). Ratios and other functions of these estimators are also of interest. Point estimates and the corresponding variance estimates for parameters such as (2.5), for example measures of change between two consecutive periods (a demanding task for the packages), are of interest.

Variance estimators: availability of different variance estimation methods (Taylor, Jackknife, BRR, Bootstrap).

The packages will also be evaluated with respect to: interface, documentation and help functions; whether computations are correctly done; execution time; simplicity of integration into production systems; cost of purchase or licences.

[2] Estimand: the thing to be estimated.
2.2 Critical comparison of software packages

The software packages evaluated here fall into two distinct groups based on the way they are designed and the type of situations in which they can be used. It makes sense to structure the discussion around these two groups, as the methods employed within the packages are very similar within groups, and quite different between them.

Group I: CLAN and GES are designed for stratified designs with estimation models up to the complexity of the generalised regression (GREG) estimator. They are characterised by having two parts to their processing, one in which the appropriate weights are calculated for the survey observations, and then a second phase where the estimates and their associated variances are produced. The variances specifically take account of these weights, and are based on the variances of the residuals from the GREG model (or a specific (simpler) case).
Group II: STATA, SUDAAN and WesVar are designed principally for cluster designs with versions of the Horvitz-Thompson (HT) estimator (in most cases optionally involving poststratification); the key here is that GREG-type estimators (including most of the simpler cases such as ratio and regression estimation) are not supported. STATA and SUDAAN both work in a straightforward way with stratified designs, but WesVar needs clusters at the penultimate sampling stage in order to work effectively (mainly because of the BRR variance estimation method employed). This group is characterised by not having a weight calculation phase and requiring the (HT) weight to be input. In some cases the software can be made to produce valid or approximately valid results for estimators other than HT, but this is typically not easy and may require the package to be run more than once for each survey.

2.2.1 Sample designs

CLAN and GES have the following designs built in:
1. simple random sampling;
2. stratified designs;
3. probability proportional to size (with replacement) designs;
4. one stage cluster designs (optionally with the clusters in strata).

These cover the main probability designs used for business surveys in Member States, but do not extend to the more complex designs used in some social surveys. It is possible to force more complex designs through CLAN and GES by accepting some assumptions about variances at lower stages; one option is to set appropriate jackknife adjustment weights within GES for two-stage designs. All of these methods, however, are vanishingly rare in business surveys, and require considerable expertise and input from the user, so they are not considered further here. Statistics Canada have just begun to develop two-stage cluster sampling for inclusion in the next version of GES (version 5.0).

STATA and SUDAAN have the following designs built in:
1. simple random sampling;
2. stratified designs;
3. one stage cluster designs;
4. two- and multi-stage cluster designs.

These cover a wider range of designs, but the complex cluster designs are not typically used for business surveys, and we know of no examples of their current use in business surveys in member states. However, this does give some added flexibility in the use of the package for various surveys.

WesVar has the following two designs available:
1. simple random sampling;
2. two-stage cluster designs with exactly two primary sampling units in each cluster.

These designs are very restrictive in the context of business surveys, where clusters are rarely used and where treating a stratum as if it were a cluster typically gives more than two primary sampling units in each cluster. For this reason we will not concentrate much discussion on WesVar.
17 Te finite population correction (fpc) can ave a large effect on te variance estimates; witin GES and CLAN it is included automatically (except for te jacnife estimator in GES). In STATA a specific command option must be used to get te fpc, and in SUDAAN it depends on te design weter te fpc is included or not. GES and SUDAAN alie include te fpc automatically in witout-replacement designs, and exclude it in wit-replacement designs. However, it can in some circumstances be reasonable to use wit-replacement variance estimators as approximate variance estimators in witout-replacement designs, wen inclusion of te fpc can become important; inclusion of te fpc is unliely, owever, to solve all te difficulties of tis approac... Nonresponse models and outlier treatment CLAN is te only software pacage to include te specification of non-response models. Tis is done by defining response omogeneity groups, wic can be defined differently from te stratification and model groups, and provide a flexible way of defining te weigting adjustment for non-response in line wit equations (.7)-(.10). Tis additional option witin CLAN is similar to te sort of metodology wic would arise in a two-stage stratified design, wit first stage selection being sampling from te frame and te second pase being sampling respondents from te selected sample. Tis means tat te extra functionality can be used to mae CLAN give appropriate answers in some complex designs if tere is (or can be assumed to be) no non-response. For te oter software pacages considered ere, only two alternatives are available, eiter to assume tat non-responding units were not sampled, wic is equivalent to imputing teir value wit te mean under te estimation model for te stratum in wic tey were selected, or to fill in te missing values using some imputation procedure and ten use te completed dataset. In bot tese cases (but particularly te second), it is very liely tat te calculated variance underestimates te true variability. 
The only reasonable method of calculating variances with packages other than CLAN would be to use a stochastic imputation procedure to create multiple datasets (multiple imputation, Rubin 1987) and use the packages to produce a series of estimates which can then be suitably combined. This approach involves a lot of additional processing not available within the packages, and has not been attempted here.

Outlier treatment by moving outliers into a poststratum can be appropriately set up in most of the software described here (in GES and CLAN by setting up appropriate model groups, and in SUDAAN by using the poststratification options). Exact variance calculations for other methods, specifically winsorisation (Kokic & Smith 1999a, b), are not available in any package, but a good (first-order) approximation can be obtained by using the winsorised values as if they were the survey values.
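The combination step mentioned above for multiple imputation was not attempted in this study, but it can be sketched briefly. The example below is only an illustration, assuming that each of the m completed datasets has already been run through one of the packages to give a point estimate and a variance estimate; the function name and argument names are hypothetical. It applies the standard combining rules of Rubin (1987): the combined estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
from statistics import mean, variance


def combine_multiple_imputation(estimates, variances):
    """Combine results from m completed (imputed) datasets.

    estimates -- point estimates, one per completed dataset
    variances -- the corresponding variance estimates from the package
    Returns (combined estimate, total variance) using Rubin's rules.
    Hypothetical helper: none of the packages reviewed supplies this step.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    combined = mean(estimates)          # average of the m point estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return combined, total


if __name__ == "__main__":
    est, var = combine_multiple_imputation([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
    print(est, var)
```

The between-imputation term is what a single completed dataset cannot supply, which is why the variance from one imputed dataset is typically too small.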
2.2.3 Parameters

The parameters which can be estimated in GES are: (a) count (an estimate of domain size); (b) total (equations (2.1) and (2.2)); (c) mean; (d) ratio (equations (2.3) and (2.4)). Within CLAN, the user needs to construct several macros to specify the estimation to be undertaken, and at this stage it is possible to include arbitrary rational functions of totals, so that purpose-built estimands can be constructed and their sampling variances calculated explicitly within the package. GES allows only the four estimands described above, but in a similar way the variances of linear combinations can be found afterwards outside the package. In general, however, this will require more expertise and effort than setting up the appropriate macros in CLAN. The PC-CARP documentation suggests that it estimates quantiles (with the appropriate variances) too, a facility not available in either GES or CLAN.

STATA and SUDAAN have: (a) count; (b) mean; (c) total; (d) ratio; (e) regression parameters; (f) Wald statistics; (g) logistic regression parameters; (h) quantiles; and for STATA only (i) arbitrary linear combinations of parameters. Some of these are not currently widely used in business surveys, but there seems to be some development in the field of estimating distributions, which will make the estimation of quantiles more important, and the facility to produce estimates and variance estimates for arbitrary linear combinations of parameters can be used to assist in the estimation of variances of complex population parameters such as changes, index numbers and so on (see chapter M3). WesVar produces a similar range of statistics to STATA and SUDAAN, including arbitrary linear and non-linear combinations of statistics. The sampling variances of the non-linear statistics can be found because WesVar relies on replication methods.

Of particular interest in repeating business surveys are estimates of movement or change.
Where the units are exactly common between two periods (almost never true, even if the design is set up in this way, because of differential non-response), then any of the packages here can be used to estimate the movement by including the responses for different periods as two survey variables. When the units are not the same, then it becomes very challenging to produce an appropriate estimate of change and its variance. Within CLAN this can be achieved by including the union of the two samples as the sample, and specifying the
response homogeneity groups in such a way that weighting adjustments are made for the units which were not sampled because of the sample rotation as well as those units which did not respond. Because the variance estimation reflects the additional uncertainty due to imputation, it gives an approximately correct variance for the estimate of change taking account of the substitution of units (if the non-response weighting completely adjusts for bias). A similar imputation can be done to fill in the missing data for rotated (and non-responding) units before entry into the other packages, but because the packages do not appropriately account for imputation when estimating the sampling variance, it will typically be underestimated.

More complex statistics are also of interest, for example deflated index numbers. None of the software is currently able to tackle such combinations of information, and the only reasonable approaches are (i) linearisation of the target statistic and calculation of the appropriate components of the linear combination in CLAN or STATA or from results produced by any of the software packages, or (ii) a sensitivity-type analysis showing the effect of sampling errors on the overall statistic (see M3.4 and Kokic (1998)).

2.2.4 Estimators

A range of estimators is available for use in business surveys, depending on the range of auxiliary information available from the business register. The simplest estimation method is Horvitz-Thompson (HT) estimation (also called simple raising, expansion estimation and number raised estimation), which involves weighting each unit by the inverse of its selection probability. This estimator is available in CLAN, GES, STATA and SUDAAN, but is not given in WesVar, which is designed purely for variance estimation and does not provide point estimates. This estimator is unusual in ONS business surveys, although there are some examples of its use in recent years; in other member states, for example at Statistics Sweden, it is widely used.
The only information which is normally required is the number of units (although HT for πps sampling has already used additional information in setting up the selection probabilities).

Where additional auxiliary information is available from the business register, more complex estimators are often used. In the ONS the ratio estimator (separate or combined, equation (2.13) and the simplification of it with a single stratum) is almost ubiquitous. The true ratio estimator is available only in CLAN and GES, where it is handled appropriately with the correct model used to calculate residuals to feed into the sampling variance calculation. In SUDAAN and STATA only the HT estimator is available. However, it is possible to obtain approximately correct variances for (one-variable) ratio estimation by (i) calculating the ratio of the survey variable to the auxiliary value, within strata (for separate ratio estimation), taking account of the selection probabilities; (ii) constructing an additional variable as the residual between the observed value and the ratio applied to the auxiliary value; and (iii) calculating the variance of this residual within strata, again taking account of the selection weights. This involves two passes through the software with some additional manipulation and produces only the variance directly: there is no point estimate, and if this is required it
needs some additional processing after the ratios have been calculated to produce it. Cross-stratum ratio estimation can naturally be done in the same way by defining appropriate groups within which to calculate the ratio. The additional feature of choosing the variance function is not available; for estimation of ratios in SUDAAN only the ratio of averages method ($\hat{r} = \sum w y / \sum w x$, with appropriate weights $w$) is supplied (that is, other ratios such as the average ratio $\tilde{r} = \frac{1}{\sum w} \sum w \frac{y}{x}$ are not available). It is naturally also possible to supply different weights to the expansion estimator, such as those taken from ratio, regression and GREG estimators, but naïve application of these weights in the standard HT estimator does not give the correct variances (more detail is given in Chapter 4). Nevertheless the effects of using this scheme are investigated in the simulation in chapter 3.

Further complexity in the estimator can be introduced by using more variables within a regression estimation framework, although there are very few current examples of this sort of estimation in business surveys in the UK and Sweden (only the Annual Employment Survey uses this method in the UK). However, it seems likely that these methods will become more important in the future. As before, CLAN and GES cover these methods directly, whereas SUDAAN and STATA do not include the direct estimator, but can be used to estimate the regression parameters and hence calculate residuals to use in calculating the sampling variance. We have not attempted to verify that this works using classical regression estimation (that is, with the variance approximately constant with size). Getting an appropriate (non-constant) variance function in regression may be extremely involved (especially where there is more than one explanatory variable), but this is properly dealt with under full calibration (the GREG estimator), which is described next.
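The three-step residual approach to the variance of a separate ratio estimator, described earlier in this section for packages that supply only the HT estimator, can be sketched outside any package as follows. This is a minimal illustration rather than the method used by any of the packages: it assumes stratified simple random sampling without replacement (so that the weights are constant within a stratum), the dictionary keys N, X, y and x are hypothetical names, and the point estimate is produced in a separate step from an auxiliary total X assumed known from the register. The simple linearisation is used, without any g-weight adjustment.

```python
from statistics import variance


def separate_ratio_estimate(strata):
    """Separate ratio estimate of a total and its residual-based variance.

    Each stratum is a dict with hypothetical keys:
      N -- number of units in the stratum population
      X -- known auxiliary total for the stratum (from the register)
      y -- sampled survey values
      x -- auxiliary values for the same sampled units
    Each stratum needs at least two sampled units.
    """
    total_estimate = 0.0
    variance_estimate = 0.0
    for s in strata:
        n = len(s["y"])
        f = n / s["N"]                                   # sampling fraction
        # Step (i): ratio of survey variable to auxiliary within the stratum.
        r = sum(s["y"]) / sum(s["x"])
        # Step (ii): residual between observed value and ratio times auxiliary.
        residuals = [y - r * x for y, x in zip(s["y"], s["x"])]
        # Step (iii): expansion-type variance of the residual total, with fpc.
        variance_estimate += s["N"] ** 2 * (1 - f) * variance(residuals) / n
        # Point estimate needs the extra step noted in the text.
        total_estimate += r * s["X"]
    return total_estimate, variance_estimate


if __name__ == "__main__":
    example = [{"N": 4, "X": 4.0, "y": [1.0, 3.0], "x": [1.0, 1.0]}]
    print(separate_ratio_estimate(example))
```

When the survey variable is exactly proportional to the auxiliary variable in every stratum, the residuals vanish and the estimated variance is zero, which is the sense in which the ratio model "explains" the variability.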
The most general estimator, the GREG estimator, which allows calibration to many auxiliary totals and provides a facility to add constraints to bound the weights, is available in only CLAN and GES, and cannot be incorporated into STATA or SUDAAN. We know of no business surveys in member states which rely on this technology at the moment. One side effect of the inclusion of the GREG estimator is that the variance function for the ratio and regression estimators can be defined by the user, by supplying suitable values to the software (normally $\sigma^2 x^{\alpha}$ where $x$ is one of the auxiliary variables and $\alpha = 1$, see (.1), or sometimes for some other value of $\alpha$). By making the variance proportional to an extremely large number for any particular observation, its effect can be removed from estimation (that is, its g-weight will be 1), giving a rudimentary outlier treatment/robust estimation methodology.

2.2.5 Variance estimators

The use of BRR with business surveys is typically difficult, as described in section 0. WesVar relies almost entirely on the method of BRR, and so is not a serious contender for recommendation for business surveys. SUDAAN also has this method available as one option among several, but there seems to be little to commend it over the other methods in the current context.
The four main packages investigated (CLAN, GES, STATA, SUDAAN) all include the direct ("Taylor") method of variance estimation (the SYG estimator). The implementation is basically a set of appropriate expressions for the variance of estimators, which has to be coded into the software. This is the way in which most business survey variances are calculated, and as such each of the four software packages fulfils our requirement for a basic design-based variance estimator. For the simpler cases of expansion and ratio estimation with model groups corresponding with strata, where the full complexity is not needed, there can be little to be gained from the software; in these cases, purpose-written programmes may be perfectly adequate. Most packages include the finite population correction automatically within their variance calculation for without-replacement designs, but STATA requires it to be specified explicitly as a command option if it is required.

Jackknife variance estimators are available in GES, SUDAAN and WesVar. It should be noted that the jackknife is only strictly applicable in with-replacement designs, and the documentation for the packages points this out. It can be used in without-replacement designs where the sampling fraction is sufficiently small, but in many business survey designs the sampling fractions are high. A further adjustment can be made by including the fpc, but none of the packages do this automatically. In GES it is not obvious from the documentation that this is missing. The validity of the outputs is discussed as part of the results of the simulation exercise (chapter 3).

In GES the jackknife option requires the user to set up jackknife groups explicitly. The drop-one jackknife is the most efficient variance estimator, and the easiest and quickest set-up is to use this method, by making every element a jackknife group, and giving each group an equal jackknife adjustment weight. Although this is fairly intuitive, it is a shame that the software does not contain a default to allow it to happen automatically.
If speed of processing is vital it is possible to set up jackknife groups containing several elements (faster, less efficient and less intuitive), in which case there are also several ways to form appropriate jackknife adjustment weights: usually the weight is equal to the number of elements in a group, but for multi-stage designs the weights can be set to the number of secondary sampling units to give a variance estimate under the complex design. This flexibility is useful in concept but unlikely to be applied in practice in business surveys.

SUDAAN provides a default jackknife method by simply choosing the keyword for jackknife variances; this is in fact the drop-one method. There is no facility for user-defined jackknife groups.

In WesVar two forms of the jackknife estimator are provided: one is dependent on the specific design with two elements in each final stage cluster, and the other is the drop-one jackknife, which is available only for simple random sampling. By processing strata separately and using the drop-one jackknife it is possible to force the software to deal with some business surveys, but it is not in general suited to them.

None of the software packages considered implements a bootstrap variance estimator.
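To make the relationship between the direct variance formula, the drop-one jackknife and the fpc concrete, the following sketch computes both variance estimators for the expansion estimator of a total. It is an illustration only, assuming stratified simple random sampling without replacement with at least two sampled units per stratum; the function names and the dictionary keys N and y are hypothetical, and the code does not reproduce the internals of GES, SUDAAN or WesVar. For this linear estimator the two methods agree exactly once the same fpc is applied to each, and omitting the fpc inflates the variance by the factor 1/(1 - f) in each stratum.

```python
from statistics import variance


def expansion_total(strata):
    """Number raised (expansion) estimate of the total: sum of N_h * mean(y_h)."""
    return sum(s["N"] * sum(s["y"]) / len(s["y"]) for s in strata)


def direct_variance(strata, use_fpc=True):
    """Direct variance formula for the expansion estimator under stratified SRS."""
    v = 0.0
    for s in strata:
        n = len(s["y"])
        f = n / s["N"] if use_fpc else 0.0
        v += s["N"] ** 2 * (1 - f) * variance(s["y"]) / n
    return v


def jackknife_variance(strata, use_fpc=True):
    """Drop-one jackknife: every sampled element is its own jackknife group."""
    v = 0.0
    for h, s in enumerate(strata):
        n = len(s["y"])
        f = n / s["N"] if use_fpc else 0.0
        replicates = []
        for i in range(n):
            # Recompute the estimate with unit i removed from stratum h.
            reduced = {"N": s["N"], "y": s["y"][:i] + s["y"][i + 1:]}
            replicates.append(expansion_total(strata[:h] + [reduced] + strata[h + 1:]))
        centre = sum(replicates) / n
        v += (1 - f) * (n - 1) / n * sum((t - centre) ** 2 for t in replicates)
    return v


if __name__ == "__main__":
    sample = [{"N": 10, "y": [1.0, 2.0, 4.0]}, {"N": 5, "y": [3.0, 5.0]}]
    print(direct_variance(sample), jackknife_variance(sample))
    print(direct_variance(sample, use_fpc=False))
```

With the high sampling fractions common in business surveys, the difference between the last two printed values can be substantial, which is why the absence of an automatic fpc in the jackknife options matters.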
2.2.6 Interfaces, documentation and help

In many NSIs it seems that SAS is becoming the main tool for survey analysis, and this is reflected in the software seen here. CLAN and GES are both written as a series of SAS macros, so that the SAS package is required to use them. CLAN uses only CORE and BASE SAS, whereas GES uses CORE, BASE, AF, FSP and IML. SUDAAN is available in two versions, one free-standing and one which can be called directly from SAS. WesVarPC is designed to look somewhat like SAS but otherwise has no connection with it. Following an agreement between the authors, SPSS versions will now include the WesVar software. STATA is stand-alone (but provides a complete statistical package), and only available for Windows 95, Windows NT or later operating systems.

There are two basic approaches to setting up the data and commands for the software, and these are not related to the groupings described at the head of section 2.2. The first is to provide appropriate commands and leave the user to construct a programme or script which is then submitted to the software, which returns with the completed calculations, and this is the basis for CLAN and SUDAAN. CLAN in fact goes a stage further and requires the user to construct several macros as well as putting together the code to produce the final outputs.

CLAN is basically a series of macros, which accept data and other macros as input. Once the user-defined parts are written, the user calls the macros in the appropriate order and combination in order to get the results. Because the program is written in SAS, the entire interface is supplied by SAS. This method makes it relatively easy for the software to be flexible and to cope with cases where unusual estimates are required; it also, by dint of requiring the user to know a fair amount about the way in which the package is constructed in order to use it, prevents the mindless application of default methods in situations where they are not appropriate.
By the same token, however, a reasonable amount of expertise in estimation theory and in SAS programming is required to use the package. Fortunately the recently produced CLAN manual (Andersson & Nordberg 1998) is very clearly written and shows in a very straightforward way how to set up the appropriate macros and data. There is no on-line help system available with CLAN. Output is sent only to a SAS dataset, which can then be printed, exported or further manipulated using SAS. There is no formal support system for CLAN, but informal support from Statistics Sweden is available on a case by case basis.

SUDAAN can be viewed in a similar way, except that the macros are called procedures, and in the SAS-callable version they behave like SAS procedures. In stand-alone SUDAAN there are Program Editor and Output windows (the output here doubles as both Log window and Output window according to SAS's view of the world). All that is required is for the user to learn the appropriate syntax and to type in these commands. The submit button is then clicked, and the package processes the data as required, sending results to the output window (and/or an appropriate file). Most of the syntax is easily learnt, but there are a few oddities: two procedures have different names in the SAS version to avoid reserved keywords; the formatting statements in SUDAAN are notoriously long-winded and do not have short forms.
There is a two-volume manual which describes the syntax and the basic usage of SUDAAN, which is a very useful guide for the beginning user. However, it does not contain any explanation of the theory used in the software, and in several places there are bald statements from which it is almost impossible to work out exactly what the software is doing (for example, a poststratified estimator is mentioned for several procedures). There is in fact a methodology guide (Shah et al. 1995), but this wasn't sent out as part of the documentation to accompany the license. With this guide to hand the package is well-documented. The on-line help system covers only the main user-guide part of the manual. SUDAAN has the advantage of reading and writing files in several formats, including SAS files (in the stand-alone and SAS-callable versions), text, and SPSS (in the stand-alone version only). The SAS-callable version is particularly useful when combining SUDAAN processing with other operations, for example in producing a ratio estimator or doing experiments or simulations where the procedure can be embedded in a macro or loop. There is support for SUDAAN, and an e-mail support address, during office hours on the West Coast of the USA (approximately 1600 to 2400 GMT).

The second approach is one which provides an interface to lead the user through the stages of setting up the appropriate files, meanwhile writing the commands either in the foreground or behind the scenes.

GES has an interface which leads the user through the stages. From v4.0 (the latest version) most of the information for a single run of the package is contained on one screenful; the catch is that a 17-inch (43cm) screen is required to be able to view all the appropriate buttons, and this does not seem to be mentioned in the documentation(!). At any stage the input files must first be defined to GES, so that they must be selected even if they already exist as SAS datasets, and otherwise imported to SAS; the import facility is built in to GES so that there is no need to exit and return.
At the same time as a file is defined, the variables corresponding to certain key definitions (strata, etc) are chosen. All the identifiers are intended to be text variables, and although numerics can be used in their place in some (but not all) parts of the software, they can't be chosen from lists of available variables unless they are text. This is frequently frustrating where, for example, the stratum is identified by a number in a numeric field, which must be converted to a string containing the number. Once GES is running it is also not possible to run any code from the program editor without exiting GES (the only way to amend a dataset without exiting is to use ASSIST). GES does contain facilities to generate input files in most of the cases which one would use in practice, normally using a SAS By statement.

GES maintains its previous settings and data files from run to run, which can be convenient when several similar surveys are to be analysed, or several alternative models are compared for the same dataset. It also has a good system of survey organisation; each survey is individually labelled, and within each survey multiple periods can be held, with the files for each period stored in an individually named directory (the same directory can be reused for several periods as long as file names are not duplicated). This makes it very easy to produce results for repeating surveys when they are using the same definitions and procedures. It also means that by selecting a new survey the previous information for that survey is available on the definition screen. The SAS versions of output files are constructed to contain (meta-)information on the input files which have
been used to produce them, and this is displayed on screen when the outputs are to be viewed; this avoids the need to code in the information in an 8-character SAS name. An option also allows the results to be written to a text file; further output options can be obtained by manipulation from SAS, but not from GES directly. The outputs viewer and the definition screens include procedures to sort, browse and edit data without the need to exit GES. Because of the large amount of information which needs to be supplied to GES, the input screens are not really intuitive, but they do provide a common way of defining all the necessary input files.

The written documentation sent with GES is fairly basic: enough to get started and have an idea of what the package requires. The main documentation effort is on-line, where there are three types of help: the usual specific help for particular procedures, choices and actions; a list of GES error messages with their meanings and likely causes (quite a lot of the causes are not filled in); and GES methodology help, which explains the methods and gives the formulae in use within GES. This latter is very useful, and (if printed) would form a methodological guide to GES. The methodology has also been published in Estevao, Hidiroglou & Särndal (1995). Support is available for GES, up to 30 days per year with a (compulsory) maintenance contract, and responses are normally available during Canadian east coast office hours (approximately 1400 to 2200 GMT). There is no dedicated support person or e-mail address.

STATA is also command-driven, with commands entered in the command window. These must be learnt as there is no facility for selecting them from pull-down menus, but there is a review window which shows previously used commands, and these can be reselected. The syntax of commands is relatively straightforward, and there is a particular series beginning svy which are designed for survey analysis.
There are two other windows in the STATA interface, an output window for results and messages, and a variable window which shows the names of the variables in the current dataset. The documentation available with STATA is copious, but the amount dealing with survey methods is relatively small, although the commands are clearly described. On-line help for specific commands (but not describing the background theory) is also available. Support is available by phone, fax and e-mail, again during USA office hours (approximately 1600 to 2400 GMT). Uniquely among the packages considered here there is also a STATA listserver to which queries can be sent; the existence of this probably reflects the wider range of functions available in STATA.

WesVarPC also has an interface which leads the user through the setting-up stages, normally making choices from lists as to what should be next in the syntax statement. When the whole set of code has been constructed, it is submitted. The code is visible during the set-up process, and can be typed in directly for speed if the syntax is already known. There is a comprehensive user guide which is available for downloading with the software over the internet, which describes how to use the packages. Output is sent to text files, and inputs can be read from files in a range of formats including SAS (up to version 6.04; transfer files must be used for later versions), text, SPSS and dbase.
2.2.6.1 Initial reactions of new users to the software

During the study of these software products, the initial reactions of new users have been monitored, and these are summarised here:
CLAN: horror
GES: complicated
STATA: nice
SUDAAN: basically straightforward but a bit confusing in places
WesVar: nice interface but difficult to work out how to set up data to get the desired outputs.

2.2.7 Correctness and speed

CLAN, GES, STATA and SUDAAN all produce the same point and variance estimates for the Taylor-type variances of totals using number-raised estimation. CLAN and GES also agree on the ratio and regression estimators, with rounding error differences at only about the 10th decimal place. The artificial variance of a ratio estimator from SUDAAN as described in section 2.2.4 differs rather more from the CLAN/GES results, possibly from a double dose of rounding error, or possibly from some minor difference in the ultimate methodology used within SUDAAN in doing something it was not designed for. The simulation study gives no grounds to suggest that any of the software produces incorrect answers, and independent checking of GES at Statistics Sweden confirms this.

We did not try to run exact comparability trials, but instead give an overview of the speed of processing of these packages in the context of their use. SUDAAN and STATA are both relatively quick, taking a minute or less to produce estimates and variance estimates for a survey the size of the UK's Annual Business Inquiry (ABI), based on the simulation example, on a Pentium 166MHz PC with 18 Mb RAM (networked). Asking for jackknife estimates from SUDAAN increases the processing time slightly, but this is still of the order of one minute. CLAN and GES take considerably longer; both have weight calculation and estimation phases; within CLAN the weight calculation phase is long (around half an hour), and then survey estimation proceeds in several minutes; for GES weight estimation takes about two minutes, but estimation takes around an hour.
For GES the use of the jackknife variance estimator approximately doubles the processing time. In the context of producing survey results these times are broadly acceptable, since sampling errors are not normally a critical part of the production process. For very heavy processing or simulation work, both CLAN and GES (and GES in particular) are rather slow.

2.2.8 Ease of integration with processing systems

The ease with which the software can be integrated with processing depends very much on the actual processing system. SAS is becoming common as a tool for processing in NSIs, and where this is used the interfaces to CLAN, GES and SUDAAN are very straightforward. The ability of SAS to access databases directly for common database-platform combinations could be useful in this regard, but does not seem to be widely used in NSIs. However, GES has client-server operation which allows this to be set up from within the software, and additionally allows processing on a larger machine away from the PC. Away from integrated
SAS-based systems, files must be transferred, and this usually requires some manual intervention. The range of file types supported for input and output within the packages reviewed here is sufficient that the data can be transferred fairly readily, although some reformatting may be required. In these cases there is no good automated procedure.

2.2.9 Costs

The costs of the various packages are given in the following table.

Package | Initial license | Annual maintenance
CLAN | free | free
GES | C$30,000 for a site license (unlimited number of users) for one platform; licenses for additional platforms cost C$7,500 each | C$3,000 for site license (unlimited number of users); additional platforms cost C$750 each
STATA | US$975 | optional
SUDAAN stand-alone | US$995 | none, but upgrades must be purchased
SUDAAN SAS-callable | US$800 (+US$60 each additional user) | US$400 (+US$130 each additional user)
WesVar PC | free | free

Table 2.1 The costs of a single license for the evaluated software packages (information correct at 1 January 1999).

2.3 Recommendations for variance estimation software for use in EU member states

The current position with variance estimation software is confusing. There are no clearly superior packages, and each has advantages and disadvantages which vary according to the situation in the particular survey to be processed. The group II software packages (STATA and SUDAAN) are only really appropriate when expansion estimation is used. In situations where this is the only (or perhaps predominant) method, they offer several advantages including fast processing, additional survey analysis features and a reasonably friendly interface.

Where survey estimators are more complex, from ratio estimators to GREG estimation, only CLAN and GES are really suitable in that they provide the correct variance estimators. They also produce the appropriate survey weights, which are not available from the other software. GES is very expensive and relatively slow, but has a reasonable user interface which leads through the set-up process in a logical way.
CLAN is free and slightly quicker, but requires SAS programming experience and has no user interface beyond what SAS provides. This user-unfriendliness could be seen as a feature to prevent people with insufficient
knowledge from using the software in an inappropriate way, but is not helpful in a package designed for general use.

So as a general recommendation for all-purpose processing of the types of designs typical in business surveys, CLAN and GES are the main contenders, but in specific cases with expansion estimation, STATA and SUDAAN are equally acceptable.
3 Simulation study of alternative variance estimation methods

Paul Smith, Susan Full & Ceri Underwood, Office for National Statistics
Ray Chambers & David Holmes, University of Southampton

The tender suggested a simulation study of the variance estimation methods available in the software, to assess the properties of the variance estimators: bias, coverage, variability, relation to the size of the estimate. As a bonus this has the effect of demonstrating that the various software packages do or do not produce the same solutions with the same model formulation and the same input data, as reported in section 2.2.7.

The combinations of features (estimation method and variance calculation approach) which are available in the software considered are shown in Table 3.1.

Estimator | Taylor | Jackknife
Number raised estimation | CLAN, GES, STATA, SUDAAN | GES, SUDAAN
Ratio estimation | CLAN, GES, STATA*, SUDAAN*+ | GES, SUDAAN*+
Regression estimation | CLAN, GES | GES
Constrained-weight regression estimation | STATA* |

Table 3.1 Combinations of estimators and variance estimation techniques used in the simulation study, with the packages which have been used. * variance estimation uses weights in a manner which is not strictly valid (see section 2.2.4); + valid variances can be produced but only by using the software in a non-standard way (see section 2.2.4).

The simulation process has turned out to be a long one, and not all of the results obtained are presented here; instead we concentrate on the main messages to have emerged. Some of the results presented here seem to lack internal consistency, and on the whole it seems that the whole area will benefit from further detailed study in the future. It is hoped that the study will continue past the end of the present contract.
3.1 The simulated population

A model for data generation

For the purposes of the simulation study, data were taken from the UK's Annual Business Inquiry (ABI), which is a sample survey, cross-stratified by 5-digit industries of the SIC(92) (approximately four-digit NACE classes but slightly more detailed in places) and employment size (more detailed information on this survey is contained in the Model Quality Report, volume III chapter 3). The information on employment comes from the Inter-Departmental Business Register (IDBR) (Perry 1995), the UK's frame for business surveys. The survey data have been used to fit a model of the form
$$\log(y_i) = \beta_0 + \beta_1 \log(x_{i1} + 1) + \beta_2 \log(x_{i2})$$

where $y_i$, $x_{i1}$ and $x_{i2}$ are respectively the survey value, register employment and register turnover (available from the IDBR) for unit $i$, and the $\beta_j$ are regression parameters to be estimated within each stratum. This model is then used to generate fitted values for the whole population of manufacturing businesses based on the values of $x_{i1}$ and $x_{i2}$ from the IDBR. The residuals from the model are hot-decked so that the survey outcomes are stochastic, and reflect the data which might be obtained if a real census of manufacturing industry could be undertaken. Any negative survey values which arise from this procedure are set to 0, which in fact slightly over-represents the proportion of zero responses in the simulated population. Then a collection of 1000 samples was made by repeatedly sampling from this population using the ABI design, and the information from these samples was used in the various software packages to produce estimates of totals and their corresponding sampling variance estimates.

Domains and estimators

The domains used were:
the whole survey;
the 2-digit industries of the SIC92, each of which corresponds to an amalgamation of strata/model groups (but see also below);
the standard statistical regions in the UK, of which there are 11. These cut completely across the stratification and the model groups, and so provide a good test of the ability of the packages to deal with domains whose size is unknown.

Estimation used three principal methods: number raised estimation, ratio estimation using register employment as the auxiliary variable, and regression estimation using one auxiliary variable (again register employment). In the last two cases the variance of the residuals was taken as proportional to the register employment value.

In this way the whole population is known, and hence the true population and domain totals are easily calculated, so that the overall error (with contributions from bias and variance) of the estimates from each of the simulated samples can be found.
Using this overall error to calculate the root mean square error, and comparing it with the variance of these estimates, allows us to deduce the bias in estimation. Also the variance of the estimates should correspond to the sampling variance, and the distribution of the point estimates of the sampling variance from the simulated samples can be compared with this; an additional useful piece of information here is the variability of the point estimates of sampling variance. Some further analysis looking at the relationship between the size of estimates and their estimated sampling errors may also be interesting, although we do not pursue this particular avenue of research any further in this report.

3.1.3 Data features

The simulated dataset has a number of features which are worth mentioning because they raise certain issues about the variance estimation process or the way in which the software works.

The two-digit industry domains are amalgamations of strata, and it is at stratum level that we are controlling for known auxiliary values in estimation, so we expect these to be relatively accurately estimated. There was a misspecification of the population data variables which resulted in one business in each two-digit industry having the wrong two-digit code, so the two-digit industries are not quite a strict amalgamation of the strata.

The region-level domains are such that the regional totals of auxiliary variables are not controlled as a byproduct of estimation, so the estimation includes an implicit estimation of the domain size, which adds an additional component of variability.

Some of the domains are very sparse. The misspecification described above resulted in only one business in the simulated population being in industry 9, so we have the most variable case possible (essentially binomial), with two possible estimates (one of which is zero).

The population dataset contains a few extreme values which, in an ordinary survey situation, would be adjusted or treated in some way. In this case they have been left without adjustment, which means, for example, that the variability in the total population is dominated by the variability in industry 07 and region 03. Taking examples from domains not so grossly affected allows us to see how well the different estimators perform in different situations.

3.2 Processing

The speed of processing has been an issue in undertaking these simulations. STATA and SUDAAN run relatively quickly, but are quite restricted in the range of estimation models which can be used. CLAN and GES run slower, but have the weight calculation functions which are required to produce appropriate data for some of the naïve applications of STATA and SUDAAN methodology (see the section on naïve variance estimators).
In the case of GES the slow speed and the wide variety of estimator and variance estimator combinations has meant that the whole study has not been completed, and we will present only the preliminary results from less than the complete number of simulated samples.

3.3 Results

3.3.1 Comparison of estimators

The properties of the point estimators from the three different estimation models (expansion, ratio and regression) are summarised for three domains in Table 3.2. The estimates of the population total are affected by the extreme value in industry 07 and region 03. When this outlier is included in the sample it causes a huge overestimate, and when it is not included, a substantial underestimate. Because representation of this element in the simulations is not exactly in accord with its selection probability, but subject to random variation, this gives rise to some large biases in the point estimators. In other parts of the population where there are no such extreme outliers, such as in Industry 01 and Region 01 in Table 3.2, the biases are very small and the mean square error is dominated by the variability of the estimates around their expectation. This is a more typical and much more expected pattern.
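The three summary measures used for the point estimators (average bias, standard deviation and mean squared error over the simulated samples) are linked by the identity MSE = bias² + variance. A minimal sketch of the calculation follows; the function name `mc_summary` and the five replicate estimates are invented for illustration, and the variance uses the divisor R (the number of replicates) so that the identity holds exactly.

```python
import math


def mc_summary(estimates, true_total):
    """Bias, standard deviation and root MSE of replicate estimates,
    each also expressed as a percentage of the true total."""
    r = len(estimates)
    mean_est = sum(estimates) / r
    bias = mean_est - true_total
    # Divisor r (not r - 1) so that mse == bias**2 + var exactly.
    var = sum((e - mean_est) ** 2 for e in estimates) / r
    mse = sum((e - true_total) ** 2 for e in estimates) / r
    pct = 100.0 / true_total
    return {
        "bias": bias,
        "var": var,
        "mse": mse,
        "bias_pct": bias * pct,
        "sd_pct": math.sqrt(var) * pct,
        "rmse_pct": math.sqrt(mse) * pct,
    }


# Invented replicates: four ordinary samples and one that happens to
# contain an extreme unit, giving a large overestimate.
summary = mc_summary([90.0, 110.0, 105.0, 95.0, 300.0], true_total=100.0)
```

With so few replicates a single sample containing the extreme unit dominates both the bias and the spread, which is the same mechanism described above for industry 07 and region 03.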
The main conclusion seems to be that there is very little difference in the bias and variance properties of the expansion, ratio and regression estimators for this population in strata where the population is well-behaved (in the mathematical sense); possibly the expansion estimator is slightly worse in general.

Table 3.2 A comparison of the mean squared error characteristics of the point estimators from three different estimation models with two different estimators. Results are all taken from GES.
Columns: estimator; number of simulations; average bias of point estimator (% of true total); standard deviation of point estimators (% of true total); mean squared error of point estimator (% of true total).
Rows, for each of the domains Total, Industry 01 and Region 01: SYG expansion; SYG ratio; SYG regression; JK expansion; JK ratio; JK regression.

3.3.2 Comparison of variance estimators

Table 3.3, below, compares the standard deviation of the point estimates from the simulations with the average estimated standard error (the variances are averaged and then the root is taken). The SYG variance estimators are very close to the standard deviation of the point estimates, even in the samples which are affected by the extreme outlier. There are some configurations of sample data where the estimators work less well, for example with the regression estimator in industry 01, where the estimator underestimates the true variability.

The biases in the SYG standard error estimators for total, industry and region domains combined for the three estimation schemes are shown in Figure 3.1. Note that Industry 9 has been omitted; this is the (spurious) industry with a single member, and the ratio estimator has a very large bias in this case (755%).
Table 3.3 Summary of the variance estimator properties from GES outputs, for Sen-Yates-Grundy ("Taylor") variance estimators and (drop one) jackknife variance estimators for three example domains.
Columns: estimator; number of simulations; standard deviation of point estimators (% of true total); average estimated standard error (% of true total); bias of standard error estimate (% of sd of point estimators).
Rows, for each of the domains Total, Industry 1 and Region 1: SYG expansion; SYG ratio; SYG regression; JK expansion; JK ratio; JK regression.

[Figure 3.1: boxplots; vertical axis: bias of standard error estimate; one box each for the expansion estimator, the ratio estimator and the regression estimator.]

Figure 3.1 Boxplots of the bias of the SYG standard error estimators (expressed as a percentage of the standard error of the point estimates). Note that the biases for Industry 9 have been omitted as they swamp the rest of the information.
The jackknife estimators are extremely biased most of the time, as can be seen in Table 3.3. This is because the jackknife estimator is only strictly valid in with-replacement designs, and our design is stratified without replacement. It can be used as an approximation, and the approximation will be quite good where the sampling fraction is small (Wolter 1985, p168). In many business surveys the sampling fraction is, however, large, and in these cases the approximation can be dreadful.

All is not lost, because a further approximate variance estimator can be obtained for this situation by introducing the finite population correction into the jackknife (Wolter 1985, p169). This is not an option within either GES or SUDAAN (the two packages with implementations of the stratified jackknife variance estimator), and must be included manually. An investigation of this estimator is still underway, but preliminary results from 150 simulations are shown in Table 3.4.

Table 3.4 Properties of jackknife sampling error estimates from runs of GES with the finite population correction included at the stratum level. Industries 6 and 5 are respectively the best and worst cases, and the sd of the point estimators in these domains may indicate a data problem.
Columns: domain; number of simulations; standard deviation of point estimators (% of true total); average estimated standard error (% of true total; see footnote 3); bias of standard error estimate (% of sd of point estimators).
Rows: five industry domains, all for the JK expansion estimator.

Given Wolter's assertion that this adjusted variance estimator is unbiased, the information on the biases of the standard error estimates in Table 3.4 doesn't seem credible.
It seems that some further work should be done to investigate whether this is driven by certain aspects of the data, or is an artifact of the (relatively) small number of replicates on which this table is based.

Footnote 3: Includes the finite population correction in the calculation.

3.3.3 Naïve variance estimators

The variance estimators considered so far are basically the appropriate ones for the estimation methods and sample design under consideration. However, other combinations of inputs and packages are possible, and we have called these naïve variance estimators because, although the inputs seem reasonable at first glance, the combination of weights and software gives an inappropriate estimator. Under some circumstances this is nevertheless the easiest approach, and it is worthwhile looking to see whether these estimators provide a sound approximation and hence whether they are practical alternatives.

First, taking the ratio estimator weights from the calibration software and using those in STATA, we discover that the variance estimate is the same as for the expansion estimator.
This is because the weights are still constant within strata. Using the same (ratio) weights in SUDAAN produces a variance estimator different from the number raised one, but not very different. Averaged over 50 samples, the ratio of the variance estimates (naïve estimator to true variance estimator) is from 0.53 to 1.3, but this translates into a ratio of cvs of only 0.88 to …. Nevertheless this suggests that the naïve ratio variance is too close to the expansion variance, and not appropriate to ratio estimation.

Following the same approach with regression estimation is not possible because it produces negative weights in some strata, and neither STATA nor SUDAAN allows negative weights. An alternative is to use the constraining options within GES (in this case) to produce strictly positive weights wherever a solution to the calibration equations exists, and to replace the weights (for a whole stratum) with the expansion weights where such a solution does not exist. This guarantees no negative weights, but the extent of replacement and constraining can potentially cause a large increase in the variance. The results are shown in Table 3.5; note, however, that the standard deviation of the point estimates of total seems inconsistent with others presented here for the regression estimator. This may be due to the constraining process (see Hedlin, Falvey, Chambers & Kokic 1998). The main indication from this table seems to be that the naïve regression variance severely underestimates the actual variability of the estimates obtained with regression weighting.

None of the naïve variance estimators considered here seems to offer a good approximation to the underlying variance of the estimates. This means that, where ratio or regression estimators are used in business surveys, appropriate software which explicitly takes account of the estimation model is necessary, and that software which uses only techniques for expansion weighting cannot be safely used.
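To make the negative-weight problem concrete, the following sketch computes unconstrained regression weights within a single stratum and applies the whole-stratum fallback to expansion weights. It is a simplification under stated assumptions, not the GES procedure: the regression is assumed to include an intercept and register employment with residual variance proportional to employment, the intermediate step of searching for bounded (strictly positive) calibration weights is skipped entirely, and the function names and data are invented.

```python
def regression_weights(x, N_h, t_x):
    """Unconstrained regression (calibration) weights in one stratum.

    x: auxiliary values of the sampled units (all > 0, not all equal).
    N_h: stratum population size; t_x: known stratum total of x.
    Assumes an intercept plus x, with residual variance proportional to x.
    """
    n = len(x)
    d = N_h / n                      # expansion (design) weight
    q = [1.0 / xk for xk in x]       # inverse of the assumed variance
    # M = sum_k d * q_k * (1, x_k)(1, x_k)'
    m11 = d * sum(q)
    m12 = d * sum(qk * xk for qk, xk in zip(q, x))
    m22 = d * sum(qk * xk * xk for qk, xk in zip(q, x))
    det = m11 * m22 - m12 * m12
    # Shortfall of the expansion estimates of (N_h, t_x).
    e1 = N_h - d * n
    e2 = t_x - d * sum(x)
    lam1 = (m22 * e1 - m12 * e2) / det
    lam2 = (-m12 * e1 + m11 * e2) / det
    return [d * (1.0 + qk * (lam1 + lam2 * xk)) for qk, xk in zip(q, x)]


def safe_weights(x, N_h, t_x):
    """Fall back to expansion weights for the whole stratum if any
    regression weight is not strictly positive."""
    w = regression_weights(x, N_h, t_x)
    if min(w) <= 0:
        return [N_h / len(x)] * len(x)
    return w


# Invented stratum: the sample contains one unusually large unit.
x_sample = [5.0, 6.0, 50.0]
w_bad = regression_weights(x_sample, N_h=30, t_x=150.0)   # one weight negative
w_fallback = safe_weights(x_sample, N_h=30, t_x=150.0)    # expansion weights
w_ok = safe_weights(x_sample, N_h=30, t_x=600.0)          # calibration kept
```

In the first case the sample mean of x is far above the population mean, so the largest unit is pushed to a negative weight even though both calibration constraints are met; after the fallback the weights no longer reproduce the known total of x, which is one reason the replacement can change the variance of the estimates.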
Table 3.5 Comparison of the properties of variance estimators obtained from STATA using the usual expansion weights, and using constrained or adjusted weights (see text for full description) to give a pseudo-regression estimator.
Columns: domain; number of simulations; standard deviation of point estimators (% of true total); average estimated standard error (% of true total); bias of standard error estimate (% of sd of point estimators); each measure is given for expansion and for regression.
Rows: total; ind1; ind2; reg1.

3.3.4 Comparison of software package outputs

Taylor variances: We will first look at the only estimation method common to all the packages, expansion estimation, and the Taylor (SYG) variance estimator. All the packages produce identical solutions for the variance with the expansion estimator. The two packages which have appropriate processing for the ratio and regression estimators, CLAN and GES, also produce identical estimates for these estimators in standard cases, although the treatment of samples containing zero values of auxiliary variables can give rise to slight differences.

Using SUDAAN twice as described in section 2.2.4 to produce a quasi-ratio estimator gives a theoretically correct variance estimator (although it does not give a point estimate at all), but with a rather indirect implementation. However, the solution is not the same as the CLAN/GES one. The differences are shown in Table 3.6. The general impression that the industry estimates are quite close whereas the region estimates are way out is broadly indicative of the trends in the remaining parts of the dataset. In general it seems that SUDAAN, even when apparently using a fix to give the correct form of the variance, is not appropriate software for calculating the variance of ratio estimates.

Jackknife variances: Only SUDAAN and GES allow the use of jackknife variance estimators in stratified designs. The default in SUDAAN is the drop-one jackknife estimator, and in GES the user must set up jackknife groups. Since the drop-one estimator is preferred (Wolter 1985, p164), this has been used here. In this case, the jackknife estimators of variance from SUDAAN and GES are identical for the number-raised estimator; neither includes the fpc, which must be added later if it is required (see section 3.3.2, above).
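Adding the fpc by hand to a drop-one jackknife is straightforward for the number-raised estimator. The sketch below uses one common form of the stratified jackknife, with the factor (1 - f_h)(n_h - 1)/n_h applied to the squared deviations of the replicate estimates in each stratum; the function names and the two-stratum sample are invented. For this linear estimator the corrected jackknife coincides algebraically with the usual stratified variance estimator, which gives a convenient check.

```python
def expansion_total(sample, N):
    """Number raised estimate; sample maps stratum -> list of y values."""
    return sum(N[h] * sum(y) / len(y) for h, y in sample.items())


def analytic_variance(sample, N):
    """Usual stratified estimator: sum_h N_h^2 (1 - f_h) S_h^2 / n_h."""
    total = 0.0
    for h, y in sample.items():
        n = len(y)
        mean = sum(y) / n
        s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
        total += N[h] ** 2 * (1 - n / N[h]) / n * s2
    return total


def jackknife_variance(sample, N, use_fpc):
    """Drop-one stratified jackknife, optionally with the fpc by stratum."""
    total = 0.0
    for h, y in sample.items():
        n = len(y)
        replicates = []
        for j in range(n):
            reduced = dict(sample)
            reduced[h] = y[:j] + y[j + 1:]   # drop unit j of stratum h
            replicates.append(expansion_total(reduced, N))
        rep_mean = sum(replicates) / n
        ss = sum((r - rep_mean) ** 2 for r in replicates)
        fpc = 1 - n / N[h] if use_fpc else 1.0
        total += fpc * (n - 1) / n * ss
    return total


# Invented sample: a lightly sampled stratum and a heavily sampled one.
sample = {"small": [3.0, 5.0, 4.0, 8.0], "large": [40.0, 55.0, 70.0]}
N = {"small": 40, "large": 4}

v_analytic = analytic_variance(sample, N)
v_jk_plain = jackknife_variance(sample, N, use_fpc=False)
v_jk_fpc = jackknife_variance(sample, N, use_fpc=True)
```

Here the "large" stratum has a sampling fraction of 0.75, so the uncorrected jackknife is markedly larger than the analytic estimate, while the corrected version agrees with it up to rounding.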
Table 3.6 Summary of the properties of SYG variance estimators for ratio estimation from GES and using an apparently correct fix in SUDAAN.
Columns: domain; package and estimator; number of simulations; standard deviation of point estimators (% of true total); average estimated standard error (% of true total); bias of standard error estimate (% of sd of point estimators; see footnote 4).
Rows, for each of the domains Total, Industry 1, Industry 2 and Region 1: GES ratio; SUDAAN ratio (150 simulations; standard deviation of point estimators not available).

Footnote 4: The SUDAAN figures have been compared with the variance of the GES point estimators; that is, the divisor for biases from both packages is the same in this column.

3.4 General conclusions

1. The variance estimators which are common to several packages do in fact produce the same results in each case, with rounding error contributing only after many significant figures.
2. The Sen-Yates-Grundy variance estimators are generally very close to the actual variation in the estimates over repeated sampling.
3. The jackknife variance estimators are inappropriate in business surveys with high sampling fractions, and do not seem to be corrected by application of the finite population correction.
4. The use of software packages for estimators for which they are not designed, or the use of naïve variance estimators through using the right weights in the wrong formula, both produce variance estimates which are very biased. These approaches are not recommended. Hence an appropriate package must be used when ratio, regression or more complex estimators are in use.
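4 Variances in STATA/SUDAAN compared with analytical variances

David Holmes, University of Southampton

4.1 Expansion estimator

The usual estimator of a total in stratified sampling is

$$\hat{t}_y = \sum_{h=1}^{H} N_h \bar{y}_{s_h} \tag{4.1}$$

where $\bar{y}_{s_h} = n_h^{-1} \sum_{k \in s_h} y_k$, and the variance is given by

$$V(\hat{t}_y) = \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} S^2_{yU_h} \tag{4.2}$$

where $f_h = n_h / N_h$ is the sampling fraction and $S^2_{yU_h}$ is the stratum variance. This variance is estimated by

$$\hat{V}(\hat{t}_y) = \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} S^2_{ys_h} \tag{4.3}$$

where $S^2_{ys_h}$ is the sample variance in stratum $h$, $S^2_{ys_h} = (n_h - 1)^{-1} \sum_{k \in s_h} (y_k - \bar{y}_{s_h})^2$.

Note: (4.1) can be written as $\hat{t}_y = \sum_{h} \sum_{k \in s_h} w_k y_k$, where $w_k = N_h / n_h$.

4.2 Ratio estimator

The separate ratio estimator of a total in stratified sampling (used by ABI) is given by

$$\hat{t}_{y,rat} = \sum_{h=1}^{H} t_{x_h} \frac{\bar{y}_{s_h}}{\bar{x}_{s_h}} \tag{4.4}$$

where $\bar{x}_{s_h} = n_h^{-1} \sum_{k \in s_h} x_k$ and $t_{x_h}$ is the stratum total of the $x_k$. The variance of the estimator is

$$V(\hat{t}_{y,rat}) = \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} S^2_{yU_h,rat} \tag{4.5}$$

where

$$S^2_{yU_h,rat} = \frac{1}{N_h - 1} \sum_{k \in U_h} (y_k - R_h x_k)^2 = S^2_{yU_h} - 2 R_h S_{xyU_h} + R_h^2 S^2_{xU_h} \tag{4.6}$$

is the stratum variance of the variable $y_k - R_h x_k$, and $R_h = t_{y_h} / t_{x_h}$. See Cochran (1977).

This variance can be estimated by

$$\hat{V}_1(\hat{t}_{y,rat}) = \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \left( S^2_{ys_h} - 2 \hat{R}_h S_{xys_h} + \hat{R}_h^2 S^2_{xs_h} \right) \tag{4.7}$$

where $\hat{R}_h = \bar{y}_{s_h} / \bar{x}_{s_h}$ (that is, the stratified version of equation 6.11 in Cochran). An alternative variance estimator (see equations 6.12 and 6.13) is

$$\hat{V}_2(\hat{t}_{y,rat}) = \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \left( \frac{\bar{x}_{U_h}}{\bar{x}_{s_h}} \right)^2 \left( S^2_{ys_h} - 2 \hat{R}_h S_{xys_h} + \hat{R}_h^2 S^2_{xs_h} \right) \tag{4.8}$$

Note: (4.4) can be written $\hat{t}_{y,rat} = \sum_{h} \sum_{k \in s_h} w_k y_k$, where $w_k = t_{x_h} / \sum_{k \in s_h} x_k$.

4.3 What does SUDAAN do?

For stratified random sampling, the variance formula used is

$$\hat{V} = \sum_{h=1}^{H} (1 - f_h)\, n_h\, S^2_{zs_h} \tag{4.9}$$

where $S^2_{zs_h} = (n_h - 1)^{-1} \sum_{k \in s_h} (z_k - \bar{z}_{s_h})^2$, and $z_k$ is the appropriate linearised value. Note that the variance formula corresponds to the design option DESIGN = STRWOR.

So, if we want to estimate the variance of the usual expansion estimator (see (4.1)), we use DESCRIPT. The linearised value is $z_k = w_k y_k$, and so long as $w_k = N_h / n_h$ (that is, the sampling weight), the variance formula in (4.9) gives the correct variance estimator of (4.3).

What about the variance estimator for the ratio estimator defined in (4.4)? Can SUDAAN be tricked by defining the weight to be $w_k = t_{x_h} / \sum_{k \in s_h} x_k$? The answer is no. The ratio estimator itself will be correct, but the variance formula in (4.9) with $z_k = w_k y_k = \left( t_{x_h} / \sum_{k \in s_h} x_k \right) y_k$ will give

$$\hat{V} = \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \left( \frac{\bar{x}_{U_h}}{\bar{x}_{s_h}} \right)^2 S^2_{ys_h} \tag{4.10}$$

This is not the variance given in (4.7) or (4.8).

The answer is to use the RATIO procedure. In general, we can estimate the ratio for any subgroup $d$ as

$$\hat{R}_d = \frac{\sum_{h} \sum_{k \in s_h} \delta_k(d)\, w_k y_k}{\sum_{h} \sum_{k \in s_h} \delta_k(d)\, w_k x_k} \tag{4.11}$$

where $\delta_k(d) = 1$ if sample unit $k$ is in subgroup $d$, and 0 otherwise. The linearised value

$$z_k(d) = \frac{\delta_k(d)\, w_k \left( y_k - \hat{R}_d x_k \right)}{\sum_{h} \sum_{k \in s_h} \delta_k(d)\, w_k x_k} \tag{4.12}$$

is substituted into (4.9) to obtain the variance estimate.

So, in the special case where the strata ($h$) are defined as the subgroups ($d$), we have from (4.11) $\hat{R}_h = \bar{y}_{s_h} / \bar{x}_{s_h}$. The linearised value in (4.12) becomes

$$z_k = \frac{N_h}{n_h} \cdot \frac{y_k - \hat{R}_h x_k}{N_h \bar{x}_{s_h}} = \frac{y_k - \hat{R}_h x_k}{n_h \bar{x}_{s_h}}$$

and substituting this in (4.9) we get

$$\hat{V}(\hat{R}_h) = \frac{1 - f_h}{n_h \bar{x}_{s_h}^2} \left( S^2_{ys_h} - 2 \hat{R}_h S_{xys_h} + \hat{R}_h^2 S^2_{xs_h} \right) \tag{4.13}$$

If this is multiplied by $(N_h \bar{x}_{U_h})^2$ and summed over $h$, we get the variance estimate obtained in (4.8). If, instead, this is multiplied by $(N_h \bar{x}_{s_h})^2$ and summed over $h$, we get the variance estimate obtained in (4.7).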
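The algebra of this chapter can be checked numerically. The sketch below is plain Python rather than SUDAAN code: it applies formula (4.9) to different linearised values and compares the results with (4.7), (4.8) and (4.10). The two strata, their population sizes and their auxiliary totals are invented for the purpose, and the helper names are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance with divisor n - 1; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def v_49(z, N_h):
    """One stratum's term of (4.9): (1 - f_h) * n_h * S^2_z."""
    n = len(z)
    return (1 - n / N_h) * n * cov(z, z)


# Invented data: stratum -> (N_h, t_xh, sampled x, sampled y).
strata = {
    "h1": (20, 290.0, [10.0, 12.0, 15.0, 20.0], [21.0, 25.0, 29.0, 41.0]),
    "h2": (6, 760.0, [100.0, 150.0, 120.0], [180.0, 310.0, 250.0]),
}

v47_direct = v47_via_ratio = v48_via_ratio = v410_trick = v43_expansion = 0.0
for N_h, t_x, x, y in strata.values():
    n = len(y)
    r = mean(y) / mean(x)
    s2_resid = cov(y, y) - 2 * r * cov(x, y) + r * r * cov(x, x)
    # (4.7) written out directly.
    v47_direct += N_h ** 2 * (1 - n / N_h) / n * s2_resid
    # (4.12) with w_k = N_h / n_h and the stratum as the subgroup ...
    z_ratio = [(yk - r * xk) / (n * mean(x)) for xk, yk in zip(x, y)]
    v_ratio = v_49(z_ratio, N_h)                  # ... gives (4.13)
    v47_via_ratio += (N_h * mean(x)) ** 2 * v_ratio
    v48_via_ratio += t_x ** 2 * v_ratio           # t_x = N_h * xbar_U
    # The "trick": ratio weights passed off as sampling weights, giving (4.10).
    w_trick = t_x / sum(x)
    v410_trick += v_49([w_trick * yk for yk in y], N_h)
    # Expansion weights, giving (4.3).
    v43_expansion += v_49([N_h / n * yk for yk in y], N_h)
```

In each stratum (4.10) is simply the expansion variance (4.3) rescaled by the squared ratio of the population to the sample mean of x, so when y is roughly proportional to x the "tricked" variance stays close to the expansion variance and far above (4.7), in line with the behaviour of the naïve ratio variance reported in section 3.3.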
5 References

ANDERSSON, C. & NORDBERG, L. (1998) A user's guide to CLAN 97: a SAS-program for computation of point- and standard error estimates in sample surveys. Stockholm: Statistics Sweden.

COCHRAN, W.G. (1977) Sampling techniques, third edition. New York: Wiley.

DEVILLE, J.-C. & SÄRNDAL, C.-E. (1994) Variance estimation for the regression imputed Horvitz-Thompson estimator. Journal of Official Statistics, 10.

ESTEVAO, V., HIDIROGLOU, M.A. & SÄRNDAL, C.-E. (1995) Methodological principles for a generalized estimation system at Statistics Canada. Journal of Official Statistics, 11.

HEDLIN, D., FALVEY, H., CHAMBERS, R. & KOKIC, P. (1998) The effective use of auxiliary information in a business survey. In NTSS 98 International Seminar on New Techniques and Technologies for Statistics, Contributed Papers.

KOKIC, P.N. (1998) Estimating the sampling variance of the UK Index of Production. Journal of Official Statistics, 14.

KOKIC, P.N. & SMITH, P.A. (1999a) Winsorisation of outliers in business surveys. Submitted to Journal of the Royal Statistical Society, Series D.

KOKIC, P.N. & SMITH, P.A. (1999b) Outlier-robust estimation in sample surveys using two-sided winsorisation. Submitted to JASA.

NORDBERG, L. (1998) On variance estimation for measures of change when samples are coordinated by a permanent random number technique. R&D Report 1998:6, Statistics Sweden.

OHLSSON, E. (1995) Coordination of samples using permanent random numbers. In Business survey methods (eds. B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge & P.S. Kott). New York: Wiley.

PERRY, J. (1995) The Inter-Departmental Business Register. Economic Trends 505, November.

RAO, J.N.K. & SHAO, J. (1996) On balanced half-sample variance estimation in stratified sampling. Journal of the American Statistical Association, 68.

RUBIN, D.B. (1986) Basic ideas of multiple imputation for nonresponse. Survey Methodology, 12.

RUBIN, D.B. (1987) Multiple imputation for nonresponse in surveys. New York: Wiley.

SÄRNDAL, C.-E. (1992) Methods for estimating the precision of survey estimates when imputation has been used. Survey Methodology, 18.

SÄRNDAL, C.-E. & SWENSSON, B. (1987) A general review of estimation for two phases of selection with applications to two-phase sampling and non-response. International Statistical Review, 55.

SÄRNDAL, C.-E., SWENSSON, B. & WRETMAN, J. (1992) Model-assisted survey sampling. New York: Springer-Verlag.

SEN, A.R. (1953) On the estimate of variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5.

SHAH, B.V., FOLSOM, R.E., LAVANGE, L.M., WHEELESS, S.C., BOYLE, K.E. & WILLIAMS, R.L. (1995) Statistical methods and mathematical algorithms used in SUDAAN. North Carolina: Research Triangle Institute.

SLOOTBEEK, G.T. (1998) Bias correction in the balanced half sample method if the number of sampled units in some strata is odd. Journal of Official Statistics, 14.

WOLTER, K.M. (1985) Introduction to variance estimation. New York: Springer-Verlag.

YATES, F. & GRUNDY, P.M. (1953) Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society, Series B, 15.
SAMPLE DESIGN FOR THE TERRORISM RISK INSURANCE PROGRAM SURVEY
ASA Section on Survey Researc Metods SAMPLE DESIG FOR TE TERRORISM RISK ISURACE PROGRAM SURVEY G. ussain Coudry, Westat; Mats yfjäll, Statisticon; and Marianne Winglee, Westat G. ussain Coudry, Westat,
Geometric Stratification of Accounting Data
Stratification of Accounting Data Patricia Gunning * Jane Mary Horgan ** William Yancey *** Abstract: We suggest a new procedure for defining te boundaries of te strata in igly skewed populations, usual
How To Ensure That An Eac Edge Program Is Successful
Introduction Te Economic Diversification and Growt Enterprises Act became effective on 1 January 1995. Te creation of tis Act was to encourage new businesses to start or expand in Newfoundland and Labrador.
The EOQ Inventory Formula
Te EOQ Inventory Formula James M. Cargal Matematics Department Troy University Montgomery Campus A basic problem for businesses and manufacturers is, wen ordering supplies, to determine wat quantity of
Catalogue no. 12-001-XIE. Survey Methodology. December 2004
Catalogue no. 1-001-XIE Survey Metodology December 004 How to obtain more information Specific inquiries about tis product and related statistics or services sould be directed to: Business Survey Metods
Verifying Numerical Convergence Rates
1 Order of accuracy Verifying Numerical Convergence Rates We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, suc as te grid size or time step, and
An inquiry into the multiplier process in IS-LM model
An inquiry into te multiplier process in IS-LM model Autor: Li ziran Address: Li ziran, Room 409, Building 38#, Peing University, Beijing 00.87,PRC. Pone: (86) 00-62763074 Internet Address: [email protected]
2 Limits and Derivatives
2 Limits and Derivatives 2.7 Tangent Lines, Velocity, and Derivatives A tangent line to a circle is a line tat intersects te circle at exactly one point. We would like to take tis idea of tangent line
Comparison between two approaches to overload control in a Real Server: local or hybrid solutions?
Comparison between two approaces to overload control in a Real Server: local or ybrid solutions? S. Montagna and M. Pignolo Researc and Development Italtel S.p.A. Settimo Milanese, ITALY Abstract Tis wor
2.23 Gambling Rehabilitation Services. Introduction
2.23 Gambling Reabilitation Services Introduction Figure 1 Since 1995 provincial revenues from gambling activities ave increased over 56% from $69.2 million in 1995 to $108 million in 2004. Te majority
Can a Lump-Sum Transfer Make Everyone Enjoy the Gains. from Free Trade?
Can a Lump-Sum Transfer Make Everyone Enjoy te Gains from Free Trade? Yasukazu Icino Department of Economics, Konan University June 30, 2010 Abstract I examine lump-sum transfer rules to redistribute te
College Planning Using Cash Value Life Insurance
College Planning Using Cas Value Life Insurance CAUTION: Te advisor is urged to be extremely cautious of anoter college funding veicle wic provides a guaranteed return of premium immediately if funded
A system to monitor the quality of automated coding of textual answers to open questions
Researc in Official Statistics Number 2/2001 A system to monitor te quality of automated coding of textual answers to open questions Stefania Maccia * and Marcello D Orazio ** Italian National Statistical
What is Advanced Corporate Finance? What is finance? What is Corporate Finance? Deciding how to optimally manage a firm s assets and liabilities.
Wat is? Spring 2008 Note: Slides are on te web Wat is finance? Deciding ow to optimally manage a firm s assets and liabilities. Managing te costs and benefits associated wit te timing of cas in- and outflows
A strong credit score can help you score a lower rate on a mortgage
NET GAIN Scoring points for your financial future AS SEEN IN USA TODAY S MONEY SECTION, JULY 3, 2007 A strong credit score can elp you score a lower rate on a mortgage By Sandra Block Sales of existing
Optimized Data Indexing Algorithms for OLAP Systems
Database Systems Journal vol. I, no. 2/200 7 Optimized Data Indexing Algoritms for OLAP Systems Lucian BORNAZ Faculty of Cybernetics, Statistics and Economic Informatics Academy of Economic Studies, Bucarest
Schedulability Analysis under Graph Routing in WirelessHART Networks
Scedulability Analysis under Grap Routing in WirelessHART Networks Abusayeed Saifulla, Dolvara Gunatilaka, Paras Tiwari, Mo Sa, Cenyang Lu, Bo Li Cengjie Wu, and Yixin Cen Department of Computer Science,
Computer Science and Engineering, UCSD October 7, 1999 Goldreic-Levin Teorem Autor: Bellare Te Goldreic-Levin Teorem 1 Te problem We æx a an integer n for te lengt of te strings involved. If a is an n-bit
Theoretical calculation of the heat capacity
eoretical calculation of te eat capacity Principle of equipartition of energy Heat capacity of ideal and real gases Heat capacity of solids: Dulong-Petit, Einstein, Debye models Heat capacity of metals
Instantaneous Rate of Change:
Instantaneous Rate of Cange: Last section we discovered tat te average rate of cange in F(x) can also be interpreted as te slope of a scant line. Te average rate of cange involves te cange in F(x) over
To motivate the notion of a variogram for a covariance stationary process, { Ys ( ): s R}
4. Variograms Te covariogram and its normalized form, te correlogram, are by far te most intuitive metods for summarizing te structure of spatial dependencies in a covariance stationary process. However,
- 1 - Handout #22 May 23, 2012 Huffman Encoding and Data Compression. CS106B Spring 2012. Handout by Julie Zelenski with minor edits by Keith Schwarz
CS106B Spring 01 Handout # May 3, 01 Huffman Encoding and Data Compression Handout by Julie Zelenski wit minor edits by Keit Scwarz In te early 1980s, personal computers ad ard disks tat were no larger
1. Case description. Best practice description
1. Case description Best practice description Tis case sows ow a large multinational went troug a bottom up organisational cange to become a knowledge-based company. A small community on knowledge Management
ACT Math Facts & Formulas
Numbers, Sequences, Factors Integers:..., -3, -2, -1, 0, 1, 2, 3,... Rationals: fractions, tat is, anyting expressable as a ratio of integers Reals: integers plus rationals plus special numbers suc as
Survey Data Analysis in Stata
Survey Data Analysis in Stata Jeff Pitblado Associate Director, Statistical Software StataCorp LP Stata Conference DC 2009 J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44 Outline 1 Types of
THE NEISS SAMPLE (DESIGN AND IMPLEMENTATION) 1997 to Present. Prepared for public release by:
THE NEISS SAMPLE (DESIGN AND IMPLEMENTATION) 1997 to Present Prepared for public release by: Tom Scroeder Kimberly Ault Division of Hazard and Injury Data Systems U.S. Consumer Product Safety Commission
Distances in random graphs with infinite mean degrees
Distances in random graps wit infinite mean degrees Henri van den Esker, Remco van der Hofstad, Gerard Hoogiemstra and Dmitri Znamenski April 26, 2005 Abstract We study random graps wit an i.i.d. degree
Tis Problem and Retail Inventory Management
Optimizing Inventory Replenisment of Retail Fasion Products Marsall Fiser Kumar Rajaram Anant Raman Te Warton Scool, University of Pennsylvania, 3620 Locust Walk, 3207 SH-DH, Piladelpia, Pennsylvania 19104-6366
Tangent Lines and Rates of Change
Tangent Lines and Rates of Cange 9-2-2005 Given a function y = f(x), ow do you find te slope of te tangent line to te grap at te point P(a, f(a))? (I m tinking of te tangent line as a line tat just skims
M(0) = 1 M(1) = 2 M(h) = M(h 1) + M(h 2) + 1 (h > 1)
Insertion and Deletion in VL Trees Submitted in Partial Fulfillment of te Requirements for Dr. Eric Kaltofen s 66621: nalysis of lgoritms by Robert McCloskey December 14, 1984 1 ackground ccording to Knut
Derivatives Math 120 Calculus I D Joyce, Fall 2013
Derivatives Mat 20 Calculus I D Joyce, Fall 203 Since we ave a good understanding of its, we can develop derivatives very quickly. Recall tat we defined te derivative f x of a function f at x to be te
OPTIMAL FLEET SELECTION FOR EARTHMOVING OPERATIONS
New Developments in Structural Engineering and Construction Yazdani, S. and Sing, A. (eds.) ISEC-7, Honolulu, June 18-23, 2013 OPTIMAL FLEET SELECTION FOR EARTHMOVING OPERATIONS JIALI FU 1, ERIK JENELIUS
Improved dynamic programs for some batcing problems involving te maximum lateness criterion A P M Wagelmans Econometric Institute Erasmus University Rotterdam PO Box 1738, 3000 DR Rotterdam Te Neterlands
Math 113 HW #5 Solutions
Mat 3 HW #5 Solutions. Exercise.5.6. Suppose f is continuous on [, 5] and te only solutions of te equation f(x) = 6 are x = and x =. If f() = 8, explain wy f(3) > 6. Answer: Suppose we ad tat f(3) 6. Ten
Lecture 10: What is a Function, definition, piecewise defined functions, difference quotient, domain of a function
Lecture 10: Wat is a Function, definition, piecewise defined functions, difference quotient, domain of a function A function arises wen one quantity depends on anoter. Many everyday relationsips between
The modelling of business rules for dashboard reporting using mutual information
8 t World IMACS / MODSIM Congress, Cairns, Australia 3-7 July 2009 ttp://mssanz.org.au/modsim09 Te modelling of business rules for dasboard reporting using mutual information Gregory Calbert Command, Control,
Free Shipping and Repeat Buying on the Internet: Theory and Evidence
Free Sipping and Repeat Buying on te Internet: eory and Evidence Yingui Yang, Skander Essegaier and David R. Bell 1 June 13, 2005 1 Graduate Scool of Management, University of California at Davis ([email protected])
Math Test Sections. The College Board: Expanding College Opportunity
Taking te SAT I: Reasoning Test Mat Test Sections Te materials in tese files are intended for individual use by students getting ready to take an SAT Program test; permission for any oter use must be sougt
Pioneer Fund Story. Searching for Value Today and Tomorrow. Pioneer Funds Equities
Pioneer Fund Story Searcing for Value Today and Tomorrow Pioneer Funds Equities Pioneer Fund A Cornerstone of Financial Foundations Since 1928 Te fund s relatively cautious stance as kept it competitive
FINITE DIFFERENCE METHODS
FINITE DIFFERENCE METHODS LONG CHEN Te best known metods, finite difference, consists of replacing eac derivative by a difference quotient in te classic formulation. It is simple to code and economic to
Research on the Anti-perspective Correction Algorithm of QR Barcode
Researc on te Anti-perspective Correction Algoritm of QR Barcode Jianua Li, Yi-Wen Wang, YiJun Wang,Yi Cen, Guoceng Wang Key Laboratory of Electronic Tin Films and Integrated Devices University of Electronic
