Empirical Study on the Second-stage Sample Size

ASA Section on Survey Research Methods

Yan Liu, Mary Batcher, Ryan Petska and Amy Luo
Yan Liu, Ernst & Young LLP, 1225 Connecticut Ave, NW, Washington, DC 20036

Abstract

In a typical research setting, two-stage stratified sampling is usually done in situations where both the populations and the samples are large. In an audit setting, however, where business records are sampled and reviewed, sampling is typically done on relatively small populations and samples. For this setting, there are two common methods of variance estimation: the classical design-based approach and a resampling approach. The classical design-based approach directly incorporates the second-stage sample size into the variance formula, while the typical resampling approach does not express the second-stage sample size explicitly, although it is implied in the variance formula. It is known that as the second-stage sample size increases, the overall variance decreases; but how large a second-stage sample size is large enough? In this paper, we investigate the impact the second-stage sample size has on the overall estimation under different estimation approaches in the two-stage stratified, audit sampling setting.

Key words: Jackknife; Ratio Estimation; Stratified Sampling; Two-Stage Sampling

1. Background

Much of the research on multi-stage stratified sampling is limited to settings with both large populations and large samples. But in audit practice, business sampling is typically done on relatively small samples due to time and cost constraints. For example, in two-stage business sampling the first-stage unit could be a location. If a large first-stage sample is taken, we may need to travel to many different places in order to pull the necessary records. This can be very costly and time-consuming. At the second stage, the cost of pulling and reviewing a single record could also be high. So, we want to minimize the sample, at both stages, as much as possible.

In a typical business sampling situation, there exists a list which consists of a relatively small number of entities, and a corresponding list, for each entity, that contains a large number of business records. In other words, the first-stage population is small and the second-stage populations are large. The quantity to be estimated may be, for example, the amount subject to sales tax, the amount deductible from income tax, or an amount that is in error. The estimates for these quantities have a lower bound of zero but can take on large positive values, sometimes millions of dollars. In addition, there are always requirements to minimize the impact of the sampling on company operations and to keep the sample size as small as possible, while still achieving good precision. If an entity is selected, that entity will then provide its list of invoices with corresponding dollar amounts. For those entities not in the sample, only the total invoice amounts at the first stage are available from the financial report.

In this type of problem, a two-stage stratified sample design is often used. The classical design-based approach does give us a closed form of variance estimation, but the formula is very complicated for two-stage stratified sampling and gets even more difficult for additional stages. On the other hand, resampling approaches are fairly straightforward and easy to implement in multistage sample designs. For the classical design-based approach, the second-stage sample size is explicitly expressed in the variance formula, while for resampling approaches the contribution to variance from the second-stage sample size is implied in the variance calculation but not explicitly expressed.

In the classical design-based approach, the second-stage sample size can be calculated using assumptions about the costs and the ratio of the variance components of the two stages (Lohr, 1999), though this may become extremely difficult for sample designs with more than two stages. The second-stage sample size cannot be calculated from a variance formula for resampling approaches. In general, knowledge of the statistical properties of the variance estimators for resampling approaches is limited to simulation or empirical studies (Särndal et al., 1992).

In this paper, we compare two methods of variance estimation: the closed form of the design-based approach and the Jackknife, one of the most common resampling methods. Specifically, we look at the impact the second-stage sample size has on the overall variance estimation using simulations. (A model-based approach is a good choice if there is a good model fit; see Valliant, Dorfman and Royall (2000). In this paper, we only intend to compare the two design-based methods.)

2. Simulated Data - Typical Auditing Situation

Our hypothetical, typical population consists of 31 entities (PSUs), and within each entity there exist hundreds to thousands of invoices (SSUs). For each invoice, there is an invoice amount (x), which is known, and a qualified amount (y), which could be anywhere between zero and the full invoice amount. Our goal is to estimate the total qualified amount in the population. The total invoice amount for each of the population entities is known, and the distribution of these totals is highly skewed. The distribution of invoice amounts within entities is also very skewed. Figure 1 shows the distribution of the total invoice amounts for all 31 entities. Figure 2 shows an example of a distribution of the invoice amounts within a single entity. Figure 3 is the scatterplot of the total qualified amount against the total invoice amount for the 31 entities in the first-stage population. Figure 4 is the scatterplot of the qualified amounts against the invoice amounts for invoices within a single entity. The plot in Figure 4 shows the typical relationship between the qualified amount and the invoice amount at the SSU level.

Because the distribution of the design variable x (the invoice amount) is skewed at both the first and second stages, a sample design that is stratified at both stages and sampled without replacement is appropriate. We expect some changes in the qualified percentages across entities, but the changes may not be substantial, as shown in Figure 3. Therefore, the combined ratio estimation method is used.

Given the known values of the invoice amounts (x), the qualified amount (y) is simulated using

$$
y = \begin{cases}
\beta x + u(1-\beta)x, & \text{with probability } \beta \\
\beta x - u\beta x, & \text{with probability } (1-\beta)
\end{cases}
\tag{3.1}
$$

where $u$ is a random number from Uniform(0, 1). For each entity, a value is assigned to $\beta$ that can be viewed as the approximate qualified percent per entity. For the 20 entities in our population with a relatively small total invoice amount, $\beta$ is randomly assigned a value of 0.5 or 0.6; for the ten entities with a medium to large total invoice amount, $\beta$ is randomly assigned 0.6 or 0.7; and for the one largest entity, $\beta$ is set to 0.7. The simulated values of y are scattered around the line $\beta x$ within the range (0, x).
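The simulation model above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `simulate_qualified`, the seed, and the example invoice amount are introduced here only for the example. With probability β the draw lies between βx and x, otherwise between 0 and βx, so the expected value works out to E(y) = βx, which is why β can be read as the approximate qualified percent.

```python
import random


def simulate_qualified(x, beta, rng):
    """Draw one qualified amount y for invoice amount x under the mixture model.

    With probability beta, y falls between beta*x and x;
    otherwise y falls between 0 and beta*x.
    """
    u = rng.random()  # u ~ Uniform(0, 1)
    if rng.random() < beta:
        return beta * x + u * (1.0 - beta) * x
    return beta * x - u * beta * x


if __name__ == "__main__":
    rng = random.Random(2005)  # arbitrary seed, for reproducibility only
    x, beta = 1000.0, 0.6      # hypothetical invoice amount and qualified percent
    draws = [simulate_qualified(x, beta, rng) for _ in range(10000)]
    print(min(draws) >= 0.0, max(draws) <= x)  # checks that draws stay within [0, x]
    print(sum(draws) / len(draws))             # sample mean, to compare with beta * x
```

Two independent uniform draws are used, one for the branch and one for u, so that the size of the deviation from βx does not depend on which branch is taken.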
Figure 4 gives the scatterplot of simulated y against x at the SSU level for a single entity. Figure 3 gives the scatterplot of simulated y against x at the PSU level for all entities.

[Figure 1. Frequency Distribution of Total Invoice Amount Per Entity (PSU). Horizontal axis: Total Invoice Amount.]

[Figure 2. Typical Frequency Distribution of Invoice Amount (SSU). Horizontal axis: Invoice Amount.]

[Figure 3. Relationship of Qualified Amount (y) and Invoice Amount (x), PSU Level. Horizontal axis: Total Invoice Amount (x); vertical axis: Total Qualified Amount (y).]

[Figure 4. Typical Relationship of Qualified Amount (y) and Invoice Amount (x), SSU Level. Horizontal axis: Invoice Amount (x); vertical axis: Qualified Amount (y).]

The other method to consider is Probability Proportionate to Size (PPS) sampling at the first stage and stratified sampling at the second stage. This method was not explored in this paper.

Table 1. The Population Summary by Stratum

PSU         Total Invoice Amount Per Entity (PSU)   Number     Total Invoice   Number     Number of SSU
Stratum     Minimum         Maximum                 of PSUs    Amount          of SSUs    Strata Per PSU
1           0,557           64,770                  20         5,673,319       28,241     2
2           77,36           1,598,935               10         9,479,716       43,584     3
Certainty   2,103,016       2,103,016               1          2,103,016       13,340     3
Total                                               31         17,256,051      85,165

3. Sample Design on the Simulated Data

Some notation is defined as follows. The population is stratified at both stages. The strata are called PSU strata at the first stage and SSU strata at the second stage.

The population of PSU units (entities) is divided into $L$ strata; $h = 1, 2, \ldots, L$.
Within each stratum $h$, there are $N_h$ PSU units; $i = 1, 2, \ldots, N_h$.
Within the $i$-th PSU of stratum $h$, the SSU units (invoices) are divided into $L_{hi}$ strata; $k = 1, 2, \ldots, L_{hi}$.
Within stratum $k$ of PSU $(h, i)$, there are $M_{hik}$ elementary units.
From the $M_{hik}$ SSU units of cell $(h, i, k)$, $m_{hik}$ elementary units are randomly selected.
At the SSU level, $x_{hikj}$ is the known invoice amount and $y_{hikj}$ is the qualified amount; $j = 1, 2, \ldots, M_{hik}$.
$X$ is the total invoice amount of the population PSU units.

The 31 entities (PSUs) are stratified into three strata by the total invoice amount per entity, as shown in Table 1. Within each PSU, the invoices are also stratified based upon the invoice amount: into two strata if the PSU falls within stratum 1, or into three strata if the PSU falls within stratum 2 or in the certainty stratum. The SSU stratum boundaries are created independently within each entity using the Dalenius-Hodges method for stratification.

For each sample, nine PSUs are sampled: one is taken with certainty and four are randomly selected from each of the random strata. Then, for each of the sampled PSUs, a number of SSUs are randomly selected from each SSU stratum.

To compare different second-stage sample sizes, four scenarios are used, as summarized in Table 2. There are 23 SSU strata in a sample of nine PSUs. In scenario 1, ten SSUs are randomly selected from each SSU stratum and the total number of sampled SSUs is 230. Similarly, there are 690 and 1,380 sampled SSUs if the SSU sample size per SSU stratum is 30, as in scenario 2, or 60, as in scenario 3. In scenario 4, all SSUs are taken for each sampled PSU. At this point, the sample design becomes a one-stage design and serves as a benchmark for our comparisons. The number of sampled SSUs in scenario 4 depends on the selected PSUs, and averaged approximately 36,400 SSUs per sample.

Table 2. Scenarios of Second-Stage Sample Size

Scenario   Sample Size Per SSU Stratum ($m_{hik}$)   Total SSU Sample Size ($\sum_{h,i,k} m_{hik}$)
1          10                                        230
2          30                                        690
3          60                                        1,380
4          Full                                      Average approx. 36,400

The above simulation process is repeated 1,000 times with a different seed for sample selection every time.

4. Estimation Formula

To obtain our simulation results, we use a two-stage design stratified at both stages. The qualified amount is estimated using the combined ratio estimator, and the variance is estimated using both the closed form and the Jackknife methods.

Point Estimator

The combined ratio estimator for a two-stage stratified sample design is

$$
\hat{Y}_{Rc} = \frac{\hat{Y}_{st}}{\hat{X}_{st}}\, X, \tag{4.1}
$$

where, with $n_h$ denoting the number of PSUs sampled from stratum $h$,

$$
\hat{Y}_{st} = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{i=1}^{n_h} \sum_{k=1}^{L_{hi}} M_{hik}\, \bar{y}_{hik}
$$

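and

$$
\bar{y}_{hik} = \frac{1}{m_{hik}} \sum_{j=1}^{m_{hik}} y_{hikj}.
$$

$\hat{X}_{st}$ is defined similarly.

Variance Estimator

In this paper, we compare two methods of variance estimation: the closed form of the design-based approach and the Jackknife, a common resampling method.

Estimated Variance - Closed Form

There is a closed form for the variance of $\hat{Y}_{Rc}$. We refer to Cochran (1977) and elaborate the standard variance formula as follows:

$$
V(\hat{Y}_{Rc}) \doteq \sum_{h=1}^{L} \left[ \frac{N_h^2 (1 - f_{1h})}{n_h}\, S_{1h}^2 + \frac{N_h}{n_h} \sum_{i=1}^{N_h} \sum_{k=1}^{L_{hi}} \frac{M_{hik}^2 (1 - f_{2hik})}{m_{hik}}\, S_{dhik}^2 \right], \tag{4.2}
$$

where

$$
f_{1h} = \frac{n_h}{N_h} \quad \text{(subscript 1 means the first stage)}, \qquad
f_{2hik} = \frac{m_{hik}}{M_{hik}} \quad \text{(subscript 2 means the second stage)},
$$

$$
S_{1h}^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} \left[ (Y_{hi} - R X_{hi}) - (\bar{Y}_h - R \bar{X}_h) \right]^2, \qquad
S_{dhik}^2 = \frac{1}{M_{hik} - 1} \sum_{j=1}^{M_{hik}} \left[ (y_{hikj} - R x_{hikj}) - (\bar{Y}_{hik} - R \bar{X}_{hik}) \right]^2.
$$

Here $R = Y/X$ is the population ratio, $Y_{hi}$ and $X_{hi}$ are the PSU totals, $\bar{Y}_h$ and $\bar{X}_h$ are their means over the PSUs of stratum $h$, and $\bar{Y}_{hik}$ and $\bar{X}_{hik}$ are the population means within cell $(h, i, k)$.

The estimated variance is

$$
v(\hat{Y}_{Rc}) = \sum_{h=1}^{L} \left[ \frac{N_h^2 (1 - f_{1h})}{n_h}\, s_{1h}^2 + \frac{N_h}{n_h} \sum_{i=1}^{n_h} \sum_{k=1}^{L_{hi}} \frac{M_{hik}^2 (1 - f_{2hik})}{m_{hik}}\, s_{dhik}'^2 \right], \tag{4.3}
$$

where

$$
s_{1h}^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left[ (\hat{Y}_{hi} - \hat{R} \hat{X}_{hi}) - \frac{1}{n_h} \sum_{i'=1}^{n_h} (\hat{Y}_{hi'} - \hat{R} \hat{X}_{hi'}) \right]^2
$$

(note $\hat{Y}_{hi} = \sum_{k} M_{hik}\, \bar{y}_{hik}$, $\hat{X}_{hi} = \sum_{k} M_{hik}\, \bar{x}_{hik}$ and $\hat{R} = \hat{Y}_{st} / \hat{X}_{st}$), and

$$
s_{dhik}'^2 = \frac{1}{m_{hik} - 1} \sum_{j=1}^{m_{hik}} \left[ (y_{hikj} - \hat{R} x_{hikj}) - (\bar{y}_{hik} - \hat{R} \bar{x}_{hik}) \right]^2.
$$

Estimated Variance - Jackknife

As one of the resampling methods, the Jackknife is flexible and simple to implement in complex sample designs. We refer to Wolter (1985) for the Jackknife method, where $\hat{\theta}$ is the estimate from the full sample, i.e.,

$$
\hat{\theta} = \frac{\hat{Y}_{st}}{\hat{X}_{st}}\, X,
$$

the same as (4.1). Let $\hat{\theta}_{(hi)}$ denote the estimator of the same functional form as $\hat{\theta}$ obtained after deleting the $i$-th PSU in the $h$-th stratum from the sample. Define the pseudovalue $\hat{\theta}_{hi}$ as

$$
\hat{\theta}_{hi} = (w_h + 1)\, \hat{\theta} - w_h\, \hat{\theta}_{(hi)}, \qquad \text{where } w_h = (n_h - 1)\left(1 - \frac{n_h}{N_h}\right).
$$

(Note that if the dropped PSU is in a certainty stratum, the pseudovalue $\hat{\theta}_{hi}$ is the same as the value of $\hat{\theta}$ calculated from the full sample, i.e., $\hat{\theta}_{hi} = \hat{\theta}$, since $w_h = (1 - 1)(1 - 1/1) = 0$.)

The Jackknife estimator of $\theta$ is defined by

$$
\bar{\theta} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \hat{\theta}_{hi} \Big/ n, \qquad n = \sum_{h=1}^{L} n_h. \tag{4.4}
$$

One version of the Jackknife variance estimator is defined as

$$
v_J(\hat{\theta}) = \sum_{h=1}^{L} \frac{w_h}{n_h} \sum_{i=1}^{n_h} \left( \hat{\theta}_{(hi)} - \hat{\theta}_{(h\cdot)} \right)^2, \qquad \text{where } \hat{\theta}_{(h\cdot)} = \sum_{i=1}^{n_h} \hat{\theta}_{(hi)} \Big/ n_h. \tag{4.5}
$$

$v_J(\hat{\theta})$ is approximately unbiased for both $\mathrm{Var}(\hat{\theta})$, the variance of the full-sample estimator in (4.1), and $\mathrm{Var}(\bar{\theta})$, the variance of the Jackknife estimator.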
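As a rough illustration of how the combined ratio estimator and the delete-one-PSU Jackknife variance fit together, the sketch below computes both from a small nested data structure. The data layout (a list of `(N_h, psus)` pairs, where each PSU is a list of `(M_hik, xs, ys)` cells) and the function names are assumptions made for this example, not the authors' implementation. The weight used for stratum h is w_h = (n_h - 1)(1 - n_h/N_h), so a stratum with a single sampled PSU, such as a certainty stratum, contributes nothing to this variance.

```python
def psu_totals(psu):
    """Estimated (Y_hi, X_hi) for one PSU: sum over SSU strata of M_hik * sample mean."""
    y_hat = sum(M * sum(ys) / len(ys) for M, xs, ys in psu)
    x_hat = sum(M * sum(xs) / len(xs) for M, xs, ys in psu)
    return y_hat, x_hat


def combined_ratio(strata, X_total):
    """Combined ratio estimate X * Y_st / X_st.

    strata: list of (N_h, psus), where psus is the list of sampled PSUs in
    stratum h and each PSU is a list of (M_hik, xs, ys) cells.
    """
    y_st = 0.0
    x_st = 0.0
    for N_h, psus in strata:
        n_h = len(psus)
        for psu in psus:
            y_hi, x_hi = psu_totals(psu)
            y_st += N_h / n_h * y_hi  # first-stage expansion weight N_h / n_h
            x_st += N_h / n_h * x_hi
    return X_total * y_st / x_st


def jackknife_variance(strata, X_total):
    """Delete-one-PSU Jackknife variance of the combined ratio estimate."""
    variance = 0.0
    for h, (N_h, psus) in enumerate(strata):
        n_h = len(psus)
        if n_h < 2:
            continue  # single-PSU (certainty) stratum: w_h = 0
        w_h = (n_h - 1) * (1 - n_h / N_h)
        replicates = []
        for i in range(n_h):
            reduced = list(strata)
            # Drop PSU i; the remaining PSUs are reweighted by N_h / (n_h - 1)
            # automatically, because combined_ratio uses len(psus).
            reduced[h] = (N_h, psus[:i] + psus[i + 1:])
            replicates.append(combined_ratio(reduced, X_total))
        mean_rep = sum(replicates) / n_h
        variance += w_h / n_h * sum((r - mean_rep) ** 2 for r in replicates)
    return variance
```

Recomputing the full estimator for each replicate keeps the sketch short. It is slower than updating stratum totals incrementally, but with only a handful of sampled PSUs the cost is negligible.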

5. Two Issues in the Calculations

In our setting, there are two issues in the standard estimation of variance. The first involves how to treat the certainty PSU when estimating the variance using the Jackknife approach. In the Jackknife approach, the second-stage variance of the certainty PSU is often ignored. But in our business sampling case, the second-stage variance of the certainty PSU can be relatively large and have a significant influence on the overall variance, and therefore should be included. In order to do so, this portion of the variance must be calculated separately. The variance of the certainty PSU can be calculated using either a closed form, if possible, or a resampling method. At this point, this becomes fairly easy because it is simply a one-stage design. This is analogous to sample designs with more than two stages, where the units used at the first stage of subsampling are the basis for the formation of replicates in order to calculate the variance of the PSU (Wolter, 1985).

The second issue deals with the calculation of the degrees of freedom. Typically, for larger samples and populations, a rule-of-thumb method is used to calculate the degrees of freedom: the number of sampled PSUs minus the number of PSU strata. A better estimate of the degrees of freedom here is to use the Satterthwaite adjustment (additional references can be found in Rust and Rao (1996)). We calculate the Satterthwaite degrees of freedom under the usual assumptions of normality and independence:

$$
DF = \frac{2 \left[ E\big(v(\hat{\theta})\big) \right]^2}{V\big(v(\hat{\theta})\big)}, \qquad \text{where } v(\hat{\theta}) = \sum_{h} v_h(\hat{\theta}) \tag{5.1}
$$

and $v_h(\hat{\theta})$ is the estimated variance of the $h$-th PSU stratum. The normal assumption may not hold, but the Satterthwaite approximation still seems to work well in our simulation.

6. Simulation Results

For each scenario in Table 2, we drew 1,000 samples and calculated the estimated qualified amount and its corresponding estimated variance using both the closed form and Jackknife methods.

Bias Comparison

The relative bias, $(\hat{Y} - Y)/Y$, is calculated for each of the 1,000 samples. Here $Y$ is the true qualified amount and $\hat{Y}$ is the estimate. The 1,000 samples are first arranged in order of increasing relative bias and then grouped into ten sets of 100 samples. The first group consists of the 100 samples with the largest relative negative biases, and the last group contains the 100 samples with the largest relative positive biases. The average bias for each of these ten groups is then calculated and compared across each scenario and estimation method.

In general, the biases of the Jackknife estimates are slightly larger than those of the closed form estimates for each scenario, but overall the two methods perform very similarly in terms of bias. The significant bias differences occur across different scenarios, that is, across SSU sample sizes. Therefore, Figure 5 only presents the bias comparison of the four scenarios from the closed form calculation. Compared to the benchmark of the full SSU sample, the scenario of 10 units per SSU stratum has a significantly larger average bias: about 1.5 times that of the benchmark for the largest negative and positive bias groups. The scenarios consisting of 30 and 60 units per SSU stratum are much closer to the benchmark in terms of the average relative bias.

[Figure 5. Relative Bias Comparison of Four Scenarios (closed form calculations). Series: Sample Size 10, Sample Size 30, Sample Size 60, Sample Size All. Vertical axis: average relative bias, from -12% to 12%.]

Relative Precision Comparison

Another way of measuring the closeness between the estimated qualified amount and the true qualified amount is to use the relative width of the confidence interval, or relative precision, defined as

$$
\frac{t(df)\, \sqrt{v(\hat{Y})}}{\hat{Y}}.
$$

The df is 6 by the rule of thumb, which is used in a large sample setting. In the small sample situation, the Satterthwaite approximation method (5.1) is used to obtain better, or more conservative, coverage. The relative precision at the 90 percent confidence level was calculated for each of the 1,000 samples. Following the same methodology as in our bias comparisons, the 1,000 samples were first arranged in order of increasing relative precision and then grouped into ten sets of 100 samples. The first group consists of the

100 samples with the smallest relative precision, and the last group contains the 100 samples whose relative precision levels are the largest. Then, for each of the ten groups, we calculated the average relative precision

$$
\frac{1}{100} \sum_{i=1}^{100} \frac{t(df)\, \sqrt{v(\hat{Y}_i)}}{\hat{Y}_i}.
$$

Table 3 displays the relative precision by group for both the closed form and the Jackknife methods.

Table 3. Average Relative Precision by Group for Different SSU Sample Sizes and Different Variance Estimation Methods

         10                   30                   60                   Full
Group    Closed   Jackknife   Closed   Jackknife   Closed   Jackknife   Closed   Jackknife
1        7.9%     7.0%        6.5%     6.0%        6.?%     5.9%        5.9%     5.9%
2        9.7%     8.9%        8.?%     7.8%        7.8%     7.8%        7.6%     7.8%
3        10.7%    10.?%       9.0%     8.7%        8.7%     8.6%        8.4%     8.5%
4        11.6%    11.0%       9.7%     9.4%        9.3%     9.3%        8.9%     9.0%
5        12.4%    11.8%       10.3%    10.?%       9.8%     9.8%        9.4%     9.5%
6        13.3%    12.6%       11.0%    10.8%       10.4%    10.4%       9.8%     10.0%
7        14.?%    13.5%       11.6%    11.6%       11.0%    11.0%       10.3%    10.5%
8        15.3%    14.7%       12.4%    12.4%       11.6%    11.6%       10.9%    11.0%
9        17.0%    16.4%       13.3%    13.?%       12.4%    12.5%       11.5%    11.6%
10       20.7%    19.8%       15.6%    15.5%       14.0%    14.3%       12.4%    12.8%

As shown in Table 3, the relative precision decreases (that is, the confidence intervals become narrower) as the SSU sample size increases. The settings of 30 and 60 produce relative precisions that are very close to those of the full SSU sample size setting. But the total SSU sample sizes for the settings of 30 and 60 differ greatly from that of the full SSU setting. As shown in Table 2, the sample sizes for the settings of 30 and 60 are 690 units and 1,380 units respectively, while the full SSU sample size is about 36,400 units on average.

Coverage Rate Comparison

The coverage rate is a measure closely related to relative precision. Table 4 gives the coverage rate, the proportion of confidence intervals that contain the true population total, for our simulation results calculated for a 90 percent confidence interval.

Table 4. Coverage Rate for Different SSU Sample Sizes and Different Variance Estimation Methods

Estimation   SSU Stratum Sample Size
Method       10       30       60       Full
Closed       89.6%    88.6%    89.0%    90.3%
Jackknife    87.7%    88.0%    89.?%    90.6%

As shown in Table 4, the different estimation methods and different SSU sample sizes result in minor differences in the coverage rate. The Jackknife results were calculated by using both the point estimate and the variance estimate from the Jackknife method. In practice, it is often the case that the point estimate is calculated by use of a closed form calculation and the variance of the estimate is calculated by using the Jackknife. This combined use causes essentially no change in Table 4.

Effect of Adjustments Based on the Two Issues

The coverage rates shown in Table 4 are calculated using the Satterthwaite adjustment for degrees of freedom along with the additional variance adjustment for the certainty PSU in the Jackknife variance calculation. To see the impact of these adjustments individually, Table 5 presents the coverage rates calculated with and without the Satterthwaite adjustment to the degrees of freedom, and with and without accounting for the second-stage variance of the certainty PSU. The table shows that the Satterthwaite adjustment improves the coverage rate for both the closed form and the Jackknife methods. It also shows that the variance adjustment for the certainty PSU improves the coverage rate for the Jackknife estimation.
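The degrees-of-freedom adjustment, the relative precision, and the coverage rate discussed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the Satterthwaite formula is written in its usual form, (sum of v_h)^2 divided by the sum of v_h^2 / (n_h - 1), taken over the PSU strata with at least two sampled PSUs (how the paper treats the certainty stratum here is not stated, so skipping it is an assumption), and the t quantile is supplied by the caller rather than computed.

```python
import math


def satterthwaite_df(stratum_variances, stratum_sizes):
    """Satterthwaite degrees of freedom from per-stratum variance estimates.

    stratum_variances: v_h for each PSU stratum
    stratum_sizes:     n_h, the number of sampled PSUs in each stratum
    Strata with n_h < 2 are skipped (an assumption of this sketch).
    """
    pairs = [(v, n) for v, n in zip(stratum_variances, stratum_sizes) if n >= 2]
    numerator = sum(v for v, _ in pairs) ** 2
    denominator = sum(v * v / (n - 1) for v, n in pairs)
    return numerator / denominator


def relative_precision(estimate, variance, t_value):
    """Half-width of the confidence interval relative to the estimate."""
    return t_value * math.sqrt(variance) / estimate


def coverage_rate(estimates, variances, t_values, true_total):
    """Share of simulated confidence intervals that contain the true total."""
    hits = 0
    for est, var, t in zip(estimates, variances, t_values):
        half_width = t * math.sqrt(var)
        if est - half_width <= true_total <= est + half_width:
            hits += 1
    return hits / len(estimates)
```

With equal variances in two strata of four sampled PSUs each, `satterthwaite_df` returns 6, the rule-of-thumb value for this design; unequal variances in that configuration give a smaller value, which widens the interval. For a two-sided 90 percent interval with 6 degrees of freedom, the t quantile is about 1.943.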

Table 5. Coverage Rate for Different Settings

Estimation     Degrees of       Variance      SSU Stratum Sample Size
Method         Freedom          Adjustment    10       30       60       Full
Closed Form    6                -             88.4%    87.6%    88.3%    88.6%
Closed Form    Satterthwaite    -             89.6%    88.6%    89.0%    90.3%
Jackknife      6                No            85.6%    85.9%    87.8%    88.9%
Jackknife      Satterthwaite    No            87.0%    87.7%    88.8%    90.6%
Jackknife      6                Yes           86.3%    86.?%    88.0%    88.9%
Jackknife      Satterthwaite    Yes           87.7%    88.0%    89.?%    90.6%

7. Conclusion

Simulations were also performed on population data that has less variation at both stages. The outcome of this analysis gave similar results to those presented in this paper. From the results of our analysis, we can conclude that:

- Both the closed form and the Jackknife estimation methods perform similarly, especially as the SSU sample sizes get larger. The variance estimation of the closed form is very complicated for two-stage stratified sampling and gets even more difficult for additional stages. On the other hand, the Jackknife formula for variance estimation is more straightforward for a multi-stage sample design. Therefore, a Jackknife estimation method seems to be a good choice for a multistage sample design.
- The choice of a second-stage sample size depends on the population distribution itself. In this type of setting, sampling somewhere between 30 and 60 units per SSU stratum seems to provide the reasonable estimates needed for our business sampling situation. In other words, less than 4 percent of the SSU units need to be reviewed, compared to a 100 percent review, in order to achieve reasonable estimates.
- The Satterthwaite approximation for degrees of freedom definitely helps improve the coverage rate and should be used.
- The second-stage variance for the certainty PSU should be counted in the Jackknife variance estimation in order to get a more accurate variance estimate.

8. References

1. Cochran, W.G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.
2. Lohr, S. (1999). Sampling: Design and Analysis. Duxbury Press.
3. Rust, K.F. & Rao, J.N.K. (1996). Variance Estimation for Complex Surveys Using Replication Techniques. Statistical Methods in Medical Research, 5: 283-310.
4. Särndal, C.E., Swensson, B. & Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
5. Valliant, R., Dorfman, A.H. & Royall, R.M. (2000). Finite Population Sampling and Inference: A Prediction Approach. New York: Wiley.
6. Wolter, Kirk M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.