ASA Section on Survey Research Methods

SAMPLE DESIGN FOR THE TERRORISM RISK INSURANCE PROGRAM SURVEY

G. Hussain Choudhry, Westat; Mats Nyfjäll, Statisticon; and Marianne Winglee, Westat
G. Hussain Choudhry, Westat, 1650 Research Boulevard, Rockville, Maryland 20850

Key Words: Stratified Sample Design, Optimum Sample Allocation, Non-linear Programming, Systematic Sampling, Composite Selection Probabilities

We describe in this paper the sample design for the terrorism risk insurance program survey that Westat conducted for the U.S. Department of the Treasury to estimate, at the national level and for a number of domains, the uptake rate and the average premium paid for terrorism risk insurance. The sampling frame for the private sector was constructed from the Dun and Bradstreet listing of businesses, and that for the state and local governments and special districts was compiled from the 2002 Census of Governments. The sample design was a stratified single-stage design with systematic sampling of business entities. We used a non-linear programming technique to determine the optimum sample allocation that minimizes the total sample while achieving the required precision levels for the survey estimates. For the systematic sampling procedure, we determined the composite selection probability that at least one member (headquarters or a subsidiary) of the business was selected, and constructed the sampling weights based on these composite selection probabilities.

1. Introduction

The terrorism risk insurance program survey was conducted in 2003 and 2004 for the U.S. Department of the Treasury to estimate, at the national level and for a number of domains, the uptake rate and the average premium paid for terrorism risk insurance. The Terrorism Risk Insurance Act (TRIA) of 2002 mandates several studies by the Treasury Department as administrator of the Terrorism Risk Insurance Program.
The Program was established to protect consumers by addressing market disruptions and ensuring the continued widespread availability and affordability of property and casualty insurance for terrorism risk, and to allow for a transition period for the private markets to stabilize, resume pricing of such insurance, and build capacity to absorb any future losses. To fulfill the requirements of the TRIA, the study was designed to assess the effectiveness of the program, and the likely capacity of the property and casualty insurance industry to offer terrorism risk insurance in workers' compensation, other casualty, and property insurance lines after the program sunsets, by law, on December 31, 2005. Data collection involved three surveys: a demand-side survey of businesses; a supply-side survey of insurance companies (insurers); and a separate supply-side survey of re-insurers, i.e., the companies that insure the insurance companies. We discuss in this paper the sample design and weighting methodology of the demand-side survey of businesses, which was designed to be representative of the entire industrial and governmental composition of the U.S. economy.

2. Sampling Frame

The target population for the demand-side survey of insurance purchasers consists of all private sector businesses, and state and local governments, with 10 or more employees. The frame for the private sector included headquarters and subsidiaries of domestic businesses, and the U.S.-located subsidiaries of foreign-owned businesses with headquarters outside the United States. The description of the sampling frames for the private sector and for the governments follows.

2.1 Creation of Sampling Frame for the Private Sector

We constructed the private sector sampling frame using the Dun and Bradstreet (D&B) business directory. We included all business headquarters and subsidiaries located in the United States, including subsidiaries of foreign businesses with headquarters outside the U.S.
Branches of businesses and businesses with fewer than 10 employees were excluded from the frame. The initial frame consisted of 1,53,635 businesses. We removed as many public sector records as possible from the D&B frame before combining it with the Census of Governments (CoG) frame. After removing the public sector records, the final private sector sampling frame consisted of 1,476,746 business entities.

Geographic location, industry class, and size were the stratification variables. We defined 15 geographic location strata as follows. The first seven geographic locations were the seven high-risk cities identified by the Treasury. The businesses that were not in the high-risk cities were assigned to eight geographic locations defined by region by urban/non-urban status. We followed the Census Bureau definitions to define four regions, and used the
Census 2000 city population to define urban/non-urban status. Businesses located in cities with a population of at least 350,000 were assigned urban status, and businesses located in cities or places with fewer than 350,000 people were assigned non-urban status.

We defined the industry classes following the 1997 North American Industry Classification System (NAICS) codes. The D&B businesses were classified by detailed Standard Industrial Classification codes (the SIC 2+2 Codes). To construct the industry groups by NAICS codes, we used conversion tables provided by the Census Bureau for mapping SIC codes to NAICS codes. We mapped most of the businesses into NAICS groups by applying the conversion table. For a small number of remaining businesses, we resolved the mapping manually by looking up the 8-digit SIC codes on the D&B frame and the corresponding descriptions from the U.S. Bureau of the Census.

We defined the four size categories based on total assets (< $10 million; $10-100 million; $100 million-$1 billion; and > $1 billion). This classification was approximate because total assets were missing for about 78 percent of the businesses in the D&B frame. To circumvent this problem, we used the distribution of assets on the 2000 corporate tax returns to obtain the number of businesses with assets between $10 million and $100 million. The number of businesses with assets above $100 million on the corporate tax returns was divided into two categories, $100 million to $1 billion and above $1 billion, by using the count of businesses with assets over $1 billion from the D&B list and the distribution of employment (number of employees) to obtain cutpoints that match the asset distribution from the corporate tax returns.

2.2 Creation of Sampling Frame for Governments

All data files used to create a sampling frame for the governments were downloaded from the U.S. Bureau of the Census home page. Table 2-1 lists all the files used in the creation of the sampling frame for the governmental entities.
Note that the government frame included only the independent governmental units; the dependent governmental units were excluded. The number of employees (full-time equivalent), also obtained from the Census Bureau home page, was then merged onto the files in Table 2-1. However, employment data were missing for some governments. We imputed the employment figures for the missing cases since employment is used in the stratification. We defined the four size categories based on employment: fewer than 150; 150-699; 700-3,999; and 4,000 or more.

Table 2-1. Files downloaded from the U.S. Census Bureau home page

No.  File name                    Type of government             No. of records
1    2002GID_Counties             County                          3,034
2    2002GID_Cities               Municipal (or City)            19,429
3    2002GID_Towns                Township                       16,504
4    2002GID_Special_Districts    Special District               35,052
5    2002GID_Schools              Independent School Districts   13,506

None of the files in Table 2-1 contains the 50 state governments, which were also eligible for the study. We created the final Census of Governments (CoG) frame, consisting of 40,375 records, by processing files 1 through 5 and adding the state governments.

3. Sample Design

The sample design for the demand-side survey is a single-stage stratified sample of businesses with systematic sampling of businesses within strata. The sample was allocated to the strata to minimize the total sample size while satisfying the required precision levels for the national and domain level estimates of uptake rates and total employment. The stratification and sample allocation are discussed in this section.

3.1 Stratification

First, we created two special strata with 100 percent sampling (certainty strata). One of these special strata was the 50 state governments, and the other was the businesses with more than 300,000 employees. There were 12 such businesses (i.e., with more than 300,000 employees), and one of these was a state government. Therefore, the two certainty strata contained 61 entities.
In addition, a number of businesses that own high-risk buildings were sampled with certainty.

The sampling strata for the remainder of the frame were defined by cross-classification of three categorical variables: Industry (10 categories), Geography (15 categories), and Size (4 categories). Among the 10 industry categories there were five high-risk industries, and five categories for the remainder of the industries. Among the 15 geography categories, there were seven high-risk cities, and eight categories defined as region by urban/non-urban without the seven high-risk cities. There were 5 non-empty strata, and
the number of sampling entities within strata varied from 1 to 149,605.

3.2 Sample Allocation

The sample allocations were determined for the combined frame constructed from the Dun and Bradstreet (D&B) listing and the Census of Governments (CoG). The combined D&B and CoG frame contained 1,517,121 business entities. The survey estimates were required for 28 domains including the national level estimates. The 28 domains of interest are: 4 Census Regions, 7 High Risk Cities, 5 High Risk Industries, 5 Major Industries, 2 Urban/Non-Urban locations, 4 Size Categories, and the National level.

We used non-linear programming to obtain the minimum sample size that would satisfy the CV requirement for the estimated employment and the 95 percent confidence interval half-width requirement for the estimated uptake rate for each of the 28 domains given above. We applied the additional constraint that the maximum sampling rate would be 40 percent because the assumed response rate would not exceed 40 percent. The total sample was then minimized under these constraints:

$$\min f(\alpha) = \sum_{h=1}^{H} N_h \alpha_h, \qquad (3\text{-}1)$$

where $h$ denotes the sampling stratum, $N_h$ is the number of entities in stratum $h$, and $\alpha_h$ is the sampling rate for stratum $h$. We minimize (3-1) under the constraints that $0 < \alpha_h \le 0.40$ and that the precision requirements are satisfied for the $D$ domains of interest. We used 40 percent as the upper threshold for the sampling fraction because we did not expect the response rate to be more than 40 percent. Thus, we would sample 2.5 times the required sample from each of the sampling strata.

We let $n_h = N_h \alpha_h$ denote the sample size for sampling stratum $h$. We also define $W_h = N_h / n_h = 1/\alpha_h$, where $W_h$ is the design weight for stratum $h$. Then equation (3-1) becomes

$$\min f(W) = \sum_{h=1}^{H} \frac{N_h}{W_h}, \qquad (3\text{-}2)$$

subject to the constraints that the weights $W_h \ge 2.5$, and that the corresponding CV and 95 percent confidence interval half-width constraints for the $D$ domains of interest are satisfied. We computed the 95 percent confidence interval half-widths by assuming that the uptake rates would be 50 percent for the businesses in the largest size category, and 20 percent for the other 3 size categories. The CV constraints and the 95 percent confidence interval half-width constraints were obtained using the variances with finite population correction factors.

CV Constraints

The relative variance of the estimated total $\hat{Y}_d$ (or mean), that is, the square of the coefficient of variation (CV) of the estimate, is given as

$$\mathrm{Rel.Var.}(\hat{Y}_d) = \frac{\mathrm{Var}(\hat{Y}_d)}{Y_d^2} = \frac{1}{Y_d^2} \sum_{h \in d} N_h (W_h - 1) S_{dh}^2, \qquad (3\text{-}3)$$

where $S_{dh}^2$ is the within-stratum variance of the study variable for domain $d$ in stratum $h$. Thus, the CV constraints can be expressed as linear constraints if they are expressed in terms of the squares of the CVs as a function of the sampling weights instead of the sampling fractions.

95% Confidence Interval Half-width Constraints

The estimate of a proportion for the domain of interest $d$ can be written as

$$\hat{p}_d = \frac{\sum_{h \in d} W_h x_h}{\sum_{h \in d} W_h n_h} = \frac{1}{N_d} \sum_{h \in d} W_h x_h, \qquad (3\text{-}4)$$

where $x_h$ is the number of observed cases from stratum $h$ that belong to the category "yes" (e.g., take terrorism risk insurance) and $N_d = \sum_{h \in d} N_h$. Then, the variance of the estimated proportion is given by

$$\mathrm{Var}(\hat{p}_d) = \frac{1}{N_d^2} \sum_{h \in d} N_h (W_h - 1) \left[ p_h (1 - p_h) \right], \qquad (3\text{-}5)$$

where $p_h$ is the proportion in stratum $h$.
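To illustrate why working with weights makes the precision requirements linear, the following Python sketch evaluates the objective (3-2) and the variance expressions (3-3) and (3-5) for a candidate weight vector. It is only a sketch and not the optimizer used for the survey: it checks feasibility rather than solving the program, the function names and the toy stratum values are hypothetical, the variance terms assume the stratified simple random sampling form $N_h (W_h - 1)$ times the within-stratum variance, and the half-width is taken as twice the standard error.

```python
def total_sample_size(N, W):
    """Objective (3-2): total sample size, the sum of N_h / W_h."""
    return sum(n_h / w_h for n_h, w_h in zip(N, W))


def relvar_total(N, W, S2, Y):
    """Squared CV of an estimated domain total, as in (3-3); linear in W."""
    return sum(n_h * (w_h - 1) * s2 for n_h, w_h, s2 in zip(N, W, S2)) / Y ** 2


def var_proportion(N, W, p):
    """Variance of an estimated domain proportion, as in (3-5); linear in W."""
    N_d = sum(N)
    return sum(n_h * (w_h - 1) * p_h * (1 - p_h)
               for n_h, w_h, p_h in zip(N, W, p)) / N_d ** 2


def is_feasible(N, W, S2, Y, p, max_cv, max_half_width):
    """Check W_h >= 2.5 plus the squared CV and squared half-width limits."""
    half_width_sq = 4 * var_proportion(N, W, p)  # (2 * sqrt(Var))^2, an assumed form
    return (all(w_h >= 2.5 for w_h in W)
            and relvar_total(N, W, S2, Y) <= max_cv ** 2
            and half_width_sq <= max_half_width ** 2)


# Hypothetical two-stratum domain: few large businesses, many small ones.
N = [100, 400]        # stratum sizes
W = [2.5, 10.0]       # candidate design weights
p = [0.5, 0.2]        # assumed uptake rates, as stated in the text
S2 = [900.0, 25.0]    # made-up within-stratum variances of employment
Y = 20000.0           # made-up domain total of employment

print(total_sample_size(N, W))
print(is_feasible(N, W, S2, Y, p, max_cv=0.05, max_half_width=0.10))
```

Because both variance functions are linear in the weights, a solver only has to handle the non-linear objective; the constraints themselves can be passed as linear inequalities.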
The 95 percent confidence interval half-width $HW(\hat{p}_d)$ is given by $2\sqrt{\mathrm{Var}(\hat{p}_d)}$. Thus, the squared 95 percent confidence interval half-width is also a linear function of the sampling weights. Both the CV and the 95 percent confidence interval half-width constraints therefore become linear if expressed as squared quantities in terms of sampling weights instead of sampling fractions. The vector of gradients can also be computed analytically.

It should be noted that the sample allocation is a trade-off between obtaining smaller CVs and smaller confidence interval half-widths. Smaller CVs (of employment, assets, etc.) can be obtained by sampling the larger businesses at higher rates, but smaller half-widths of the confidence intervals require that the smaller businesses be sampled at higher rates. After the sample allocation to the primary strata, we further stratified the sample on the basis of size in those primary strata for which the sample size was 50 or more in order to obtain additional stratification gains.

4. Sample Weighting

The sample design for the terrorism risk insurance program survey is a stratified single-stage sample of business entities. The probability of selecting a business is the composite probability that at least one of its component businesses is selected. The base weights were computed as the reciprocal of the selection probabilities, and these weights were adjusted to account for the nonrespondent businesses.

4.1 Base Weights

The sampling unit is a subsidiary (or a headquarters), but the business may not always be able to report the insurance data separately for all of its subsidiaries. For example, a business may only report aggregate data at the ultimate headquarters level that will account for all the subsidiaries owned by the business. On the other hand, a business may report insurance data for a subset of its subsidiaries as a group, for a number of such groups that will collectively account for the entire business.
Thus, headquarters-level aggregate data are a special case where the entire business is a single group for reporting insurance data. The weight assigned to the headquarters or a group of subsidiaries (hereafter referred to as the reporting unit) will be based on the composite probability of selection of the group (or the reporting unit), which is the probability that at least one entity (headquarters or subsidiary) belonging to the reporting unit will be selected. Even if the headquarters was not selected and a subsidiary belonging to the reporting unit that contains the headquarters was selected, data will be collected for the entire reporting unit.

We denote by $h$, $N_h$, and $n_h$ respectively the stratum, the number of business entities in the stratum, and the number sampled from the stratum. We selected the sample of business entities within each stratum with systematic sampling from a sorted list of these entities. For the sake of simplicity, the reporting unit (group of subsidiaries) will be called a business, which may or may not be reported by the ultimate headquarters. For example, a large corporation may own several businesses, each responsible for its own insurance, while the headquarters insures itself and any establishments directly reporting to the headquarters. We use the symbol $i$ to denote a business.

For a business that is a single entity (i.e., with no subsidiaries) the probability of selection is given by

$$\pi_i = \frac{n_h}{N_h}; \quad i \in h.$$

For a business with multiple entities (headquarters and subsidiaries), we compute the composite probability that at least one of these entities will be selected. Suppose that the business (reporting unit) denoted by $i$ is in stratum $h$, and there are a number of entities (subsidiaries) both in stratum $h$ and in other strata for which data are reported through the business $i$ in stratum $h$. The number of entities (sampling units) in stratum $h$ is $N_h$. We denote by $N_{hi}$ the number of entities in stratum $h$ for which the data are reported through the business denoted by $i$, where $0 \le N_{hi} \le N_h$; $h = 1, 2, 3, \ldots, H$.
If $P_{hi}$ is the probability that at least one of the $N_{hi}$ subsidiaries of business $i$ will be sampled from stratum $h$, then the composite probability of selecting the business $i$ (i.e., the probability that at least one of the entities associated with business $i$ will be selected) is given by

$$\pi_i = 1 - \prod_{h=1}^{H} (1 - P_{hi}); \quad i \in h, \qquad (4\text{-}1)$$

where $P_{hi} = 0$ if $N_{hi} = 0$ and $P_{hi} = n_h / N_h$ if $N_{hi} = 1$. The case when $N_{hi} \ge 2$ is discussed below.
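The composite probability in (4-1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the production procedure: the function names are hypothetical, selections in different strata are treated as independent (as the product in (4-1) implies), and for strata holding two or more entities of the business the code uses the interval construction set out in the case that follows, read here as mapping list position $J$ to the unit-length interval starting at $(J - 1) \bmod (N_h / n_h)$ and wrapping around the sampling interval. Exact fractions are used to avoid rounding in the modulo step.

```python
from fractions import Fraction


def stratum_probability(N_h, n_h, positions):
    """P_hi: chance that a systematic sample of n_h from N_h hits any listed position.

    positions holds the 1-based positions, in the sorted stratum list, of the
    entities that report through the business.
    """
    if not positions:
        return Fraction(0)
    K = Fraction(N_h, n_h)  # sampling interval N_h / n_h (assumed >= 1)
    pieces = []
    for J in positions:
        L = Fraction(J - 1) % K
        if L + 1 <= K:
            pieces.append((L, L + 1))
        else:  # interval wraps past the end of the sampling interval
            pieces.append((L, K))
            pieces.append((Fraction(0), L + 1 - K))
    # Length of the union of the pieces (merge overlapping intervals).
    pieces.sort()
    covered = Fraction(0)
    cur_lo, cur_hi = pieces[0]
    for lo, hi in pieces[1:]:
        if lo <= cur_hi:
            cur_hi = max(cur_hi, hi)
        else:
            covered += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
    covered += cur_hi - cur_lo
    return covered / K


def composite_probability(strata):
    """pi_i = 1 - prod_h (1 - P_hi); strata is a list of (N_h, n_h, positions)."""
    miss = Fraction(1)
    for N_h, n_h, positions in strata:
        miss *= 1 - stratum_probability(N_h, n_h, positions)
    return 1 - miss


# Hypothetical business with entities in two of three strata.
pi_i = composite_probability([(10, 2, [1, 6]), (20, 5, [4]), (8, 2, [])])
base_weight = 1 / pi_i
print(pi_i, base_weight)
```

In the first toy stratum, positions 1 and 6 are exactly one sampling interval apart, so they are always selected together and contribute a probability of 1/5 rather than 2/5; this is the situation the composite interval is meant to capture.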
Case when $N_{hi} \ge 2$

The sampling entities in stratum $h$ are labeled $1, 2, 3, \ldots, N_h$. Let $J_{h,k}$ be the index of the $k$-th subsidiary of business $i$ in stratum $h$, where $k = 1, 2, \ldots, N_{hi}$. We sampled $n_h$ out of the $N_h$ business entities from stratum $h$ with a systematic sampling procedure. We need to calculate the probability $P_{hi}$ that at least one of the $N_{hi}$ entities (subsidiaries) of business $i$ will be sampled from stratum $h$. For calculating the probability $P_{hi}$, we define the intervals $I_{h,k}$ and $I_h^{(i)}$ as follows:

$$I_{h,k} = \emptyset, \quad k = 0, \qquad (4\text{-}2)$$

and

$$I_{h,k} = \begin{cases} \left( L_{h,k},\; L_{h,k} + 1 \right), & \text{if } L_{h,k} + 1 \le \dfrac{N_h}{n_h}, \\[2ex] \left( L_{h,k},\; \dfrac{N_h}{n_h} \right) \cup \left( 0,\; L_{h,k} + 1 - \dfrac{N_h}{n_h} \right), & \text{otherwise}, \end{cases} \qquad (4\text{-}3)$$

where $L_{h,k} = (J_{h,k} - 1) \bmod \dfrac{N_h}{n_h}$. We compute the $N_{hi}$ intervals corresponding to the subsidiaries of the business $i$ that are in stratum $h$, where $h = 1, 2, 3, \ldots, H$. Next, we compute the union of the intervals that correspond to the subsidiaries of business $i$ that are in stratum $h$ and denote the composite interval by

$$I_h^{(i)} = \bigcup_{k=1}^{N_{hi}} I_{h,k}. \qquad (4\text{-}4)$$

The composite interval $I_h^{(i)}$ is the interval that corresponds to the composite probability that at least one of the $N_{hi}$ subsidiaries of business $i$ will be sampled from stratum $h$. The corresponding probability $P_{hi}$ is then given by

$$P_{hi} = \frac{n_h}{N_h} \left| I_h^{(i)} \right|, \qquad (4\text{-}5)$$

where $|\cdot|$ denotes the length of the interval. Note that $N_h / n_h$ is the sampling interval for stratum $h$. We also note that $P_{hi} = \frac{n_h}{N_h} \left| I_h^{(i)} \right|$ reduces to $n_h / N_h$ when $N_{hi} = 1$, because the length of the interval $\left| I_h^{(i)} \right| = 1$ for $N_{hi} = 1$, i.e., there is only one subsidiary of business $i$ in stratum $h$.

The composite probability that at least one entity belonging to the business $i$ will be selected is then given by

$$\pi_i = 1 - \prod_{h=1}^{H} (1 - P_{hi}); \quad i \in h, \qquad (4\text{-}6)$$

where $P_{hi} = 0$ if $N_{hi} = 0$, and $P_{hi}$ is given by (4-5) for $N_{hi} \ge 1$. The base weight assigned to business $i$ will be the reciprocal of the probability of selection, i.e., $w_i = 1 / \pi_i$; $i \in h$.

4.2 Sampling Weights for the High Risk Buildings

After the sample had been selected, Westat identified more than 200 high risk buildings, and the businesses that owned these high risk buildings were selected with certainty. Some of these businesses were in the original sample, and those not already in the sample were added to it.
Thus, the sample size becomes random. We considered both the conditional and unconditional approaches for constructing the sampling weights that account for the supplementary sample of certainty businesses. These two approaches are discussed below.

Conditional Approach

We condition on the achieved sample size, and the randomization is over all possible samples of size equal to the achieved sample (Särndal and Hidiroglou, 1989). If $M_h$ out of the $N_h$ entities in stratum $h$ are high risk (HR) entities, and $m_h$ of the high risk entities are in the initial sample, then the achieved sample size from stratum $h$ is $n_h - m_h + M_h$ business entities
including the high risk businesses. The weighting under the conditional approach is conditionally unbiased, and hence it is unconditionally unbiased as well.

Unconditional Approach

Under the unconditional approach the randomization is over all possible samples that could be selected irrespective of the achieved sample size. Therefore, the selection probabilities of the businesses that are not at high risk do not depend on the achieved sample size. It should be noted that the weights under the unconditional approach are conditionally biased.

We constructed the sampling weights under the conditional approach because the estimates are conditionally unbiased and hence unconditionally unbiased as well. Moreover, the conditional variance is also unconditionally unbiased. Kalton (2002) also recommended that if a subset of the sampling distribution in which the estimator is conditionally approximately unbiased can be identified, then a conditional analysis should be employed in analyzing an actual sample.

4.3 Final Sample Weights

The final sample weights were constructed by applying an adjustment to account for nonresponse (Elliot, 1991). The sample cases can be divided into respondents and nonrespondents. Further, the respondents can be either eligible or ineligible (out of scope) for the survey. The eligibility of the nonrespondent businesses could not always be determined. For example, a sampled business that did not cooperate could be very small (fewer than 10 employees) and hence ineligible for the survey. Therefore, the nonrespondent businesses were classified into two categories: (1) eligible nonrespondents and (2) nonrespondents with unknown eligibility. In order to apply the adjustments for unknown eligibility and nonresponse, the sample cases were grouped into four response status categories:

1. Eligible Respondents;
2. Ineligible or Out of Scope;
3. Eligible Nonrespondents; and
4. Unknown Eligibility.

In a typical application, the nonresponse adjustment can be carried out in two stages.
At the first stage, the base weights of those with unknown eligibility (Category 4) are allocated proportionally to those whose eligibility is known (Categories 1, 2, and 3), and the weights of those with unknown eligibility are set to zero. In the second stage, the adjusted weights of eligible nonrespondents (Category 3) are redistributed among the respondents (Category 1).

Since additional information on the activity status (active versus inactive) is available on a subset of those with unknown eligibility, this information can be used for making the adjustment for unknown eligibility. As suggested by Nathan (2003), the unknown eligibility adjustment can itself be made in two stages by making use of this information. In the first stage, we allocated the weight of those for whom the activity status was unknown among those whose activity status was known, and the weight of those with unknown activity status was set to zero. In the second stage, the weights of those who were known to be active with unknown size (number of employees) were allocated among those who were known to be active with known sizes, and the weights of those with unknown size were set to zero.

Choudhry et al. (2002) also implemented a similar strategy for constructing weights for the Random Digit Dial (RDD) sample for the National Survey of Veterans, where the adjustment for unknown eligibility was made in two stages. First, the unknown eligibility adjustment was made for those telephone numbers for which residential status (residential versus nonresidential) was not known. Second, the unknown eligibility adjustment was made for those numbers known to be residential but for which it was not known whether there was a veteran in the household.

In all the nonresponse adjustments described above, the adjustments were carried out within weighting classes defined by the combination of the stratification variables: Geo-location, Size and NAICS code.
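The typical two-stage adjustment described above (not the refined variant that also uses activity status and size) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name, the tuple layout, and the toy weights are hypothetical, and ineligible cases are assumed to keep their stage-one weight in the second stage, which the text does not spell out.

```python
from collections import defaultdict

RESPONDENT, INELIGIBLE, ELIGIBLE_NONRESPONDENT, UNKNOWN = 1, 2, 3, 4


def adjust_weights(cases):
    """Return final weights for cases given as (weighting_class, status, base_weight).

    Each weighting class is assumed to contain at least one respondent;
    otherwise the weight of that class cannot be redistributed.
    """
    # Stage 1: spread the weight of unknown-eligibility cases (Category 4)
    # proportionally over the cases whose eligibility is known.
    total = defaultdict(float)
    known = defaultdict(float)
    for cls, status, w in cases:
        total[cls] += w
        if status != UNKNOWN:
            known[cls] += w
    stage1 = [0.0 if status == UNKNOWN else w * total[cls] / known[cls]
              for cls, status, w in cases]

    # Stage 2: move the adjusted weight of eligible nonrespondents (Category 3)
    # onto the respondents (Category 1) in the same class.
    eligible = defaultdict(float)
    responding = defaultdict(float)
    for (cls, status, _), w1 in zip(cases, stage1):
        if status in (RESPONDENT, ELIGIBLE_NONRESPONDENT):
            eligible[cls] += w1
        if status == RESPONDENT:
            responding[cls] += w1

    final = []
    for (cls, status, _), w1 in zip(cases, stage1):
        if status == RESPONDENT:
            final.append(w1 * eligible[cls] / responding[cls])
        elif status == INELIGIBLE:
            final.append(w1)  # assumption: ineligibles keep their stage-1 weight
        else:
            final.append(0.0)
    return final


# Hypothetical weighting class "A" with five sampled cases of base weight 10.
cases = [("A", 1, 10.0), ("A", 1, 10.0), ("A", 2, 10.0), ("A", 3, 10.0), ("A", 4, 10.0)]
final_weights = adjust_weights(cases)
print(final_weights)
```

Within each class the sum of the final weights equals the sum of the base weights, so the adjustment redistributes weight without changing the class total.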
Since the cross-classification by all three stratification variables would have resulted in a sparse table with too many cells, standard CHAID (Chi-square Hierarchical Automatic Interaction Detector) analysis was used to form the weighting classes by combining classes without significant differences in response or eligibility propensities (Kass, 1980). The CHAID analysis was done separately for each nonresponse adjustment type, i.e., the first stage of the unknown eligibility adjustment, the second stage of the unknown eligibility adjustment, and the nonresponse adjustment for eligible nonrespondents.

The final survey weight was defined as the product of the base weight and the nonresponse adjustment factors as described above. These weights were used to obtain survey estimates at the national level and for the domains of interest.

5. References

Choudhry, G.H., Park, I., Kudela, M.S., and Helmick, J.C. (2002). 2001 National Survey of Veterans Design and Methodology Final Report. Westat, Rockville, Maryland.

Elliot, D. (1991). Weighting for Nonresponse: A Survey Researcher's Guide. Office of Population Censuses and Surveys, Social Surveys Division, London.
Kalton, G. (2002). Models in the Practice of Survey Sampling (Revisited). Journal of Official Statistics, 18, 129-154.

Kass, G. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 119-127.

Nathan, G. (2003). Nonresponse Adjustment for the TRIP Establishment Sample. Westat Memorandum Number S-34 dated December 13, 2003.

Särndal, C.E., and Hidiroglou, M.A. (1989). Small Domain Estimation: A Conditional Approach. Journal of the American Statistical Association, 84, 266-275.