Confidence Intervals for Capture-Recapture Data With Matching


Executive summary

Capture-recapture data is often used to estimate populations. The classical application, for animal populations, is to take two samples and count how many individuals are present in both. The classical method involves assuming stochastic independence (i.e. that having been in one sample has no effect on the probability of being in the other). For animals this may be a reasonable assumption, but for humans it is much less credible. People who do not return their census forms (and hence are not in sample 1) are more likely to be less than co-operative when the enumerator calls (and hence less likely to be in the coverage survey, which is sample 2). This paper presents a Bayesian method which expresses the probability distribution of the population not only as a function of the three numbers of people (those in both samples, in sample 1 only and in sample 2 only) but also of the correlation coefficient φ between the samples. The median estimate of the population size and the upper and lower 2½ per cent points can be modelled as functions (i) of φ (to which they are quite sensitive) and (ii) of the form of the prior distribution (to which they are not very sensitive). The method is illustrated using Scottish data from the 2011 Census. Computer code, sample output and tips for improving efficiency are presented.

1 Introduction

The use of capture-recapture data to estimate animal populations dates back to the early 20th century. One current example of its use to estimate human populations is in the census, where it is common to follow up a general census enumeration (which is known not to cover all members of a population) with a census coverage survey (in which intensive effort is expended in achieving accurate counts in a minority of geographical areas). If this minority is representative, comparing the census and coverage counts allows an impression to be formed of the adjustments which must be made to the census counts to make them better reflect the actual population. However, to do this, it is necessary to count how many persons were in both the census and the coverage survey, and how many were in one but not the other. Finally, it is necessary to estimate how many persons did not feature in either enumeration. Classically, this was done by assuming that inclusion in the two samples was independent, even though there are good grounds for assuming that this is not the case and that, in particular, those with an incentive not to participate in the census are also more likely to be unwilling to take part in the coverage survey. In this report, a Bayesian argument is used to model how the estimate of the number of people missing from both samples, and its credible interval, change as a function of the extent of stochastic dependence between the two samples. First the problem is stated formally. Then the Bayesian solution is stated and the various components needed to implement the argument are developed, along with a discussion of matters relevant to computational efficiency. Finally the method is applied to sample data sets with different assumptions about the Bayesian prior distribution.

2 The problem: A random sample of size A + B cases is taken. A second random sample of size A + C is taken, of which A were also in the first sample. What is the probability that there are exactly N = A + B + C + k cases in the total population?

                S1 In   S1 Out
    S2 In         A        C
    S2 Out        B        k

3 A solution: A complete solution is not possible solely on the basis of the information given. However, a partial solution can be developed as follows. First, we note that N differs from k only by the sum of the three known constants A, B and C. We therefore require the probability density of k conditional on A, B and C. We can write

    p(k | A, B, C) = p(A, B, C | k) p(k) / p(A, B, C)     (1)

This is the classical Bayesian solution, in which the posterior probability is proportional to the product of the likelihood and the prior probability.

3a The likelihood function: The vector (A, B, C, k) follows a multinomial distribution with integer parameter A + B + C + k of the form

    p(A, B, C | k) = [(A + B + C + k)! / (A! B! C! k!)] pA^A pB^B pC^C pk^k     (2)

The four probabilities must be constant over k since, while the cell entries vary probabilistically, the parameters on which they are based are functions of the samples and of the stochastic relationship between them, and these are fixed. The probabilities are unknown and must be based on assumptions; the method will be as credible as these assumptions. It seems credible to assume that pA, pB and pC will stand in the same proportions as A, B and C, which gives us two constraints. A third is provided by the requirement that the four probabilities sum to one. Instead of imposing a fourth constraint, we express the result in terms of the degree of stochastic dependence between the samples. This is measured by the Pearson correlation coefficient for dichotomous variables, which we can define in terms of the four probabilities as

    φ = (pA pk − pB pC) / √[(pA + pB)(pC + pk)(pA + pC)(pB + pk)]

Substituting the three constraints into the expression for φ and solving for pA yields the rather ungainly result

    pA = [1 − (φ²(ρB + ρC) + φ √(4 ρB ρC + φ²(ρB − ρC)²)) / (2(1 − φ²))] / [(1 + ρB)(1 + ρC)]     (3)

where ρB = B / A and ρC = C / A. Then pB = ρB pA, pC = ρC pA and pk = 1 − pA − pB − pC. These can then be substituted directly into (2) to calculate the likelihood.
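The algebra in (3) can be checked numerically. The following Python sketch (illustrative only; the function names are ours, and the paper's own implementation in appendix B is in SAS) computes the four cell probabilities from φ and confirms that they reproduce the phi coefficient:

```python
import math

def cell_probabilities(A, B, C, phi):
    """Solve for (pA, pB, pC, pk) under the constraints pB/pA = B/A,
    pC/pA = C/A and pA + pB + pC + pk = 1, with phi coefficient phi
    (equation (3) of the text)."""
    rB, rC = B / A, C / A
    t2 = math.sqrt(4 * rB * rC + phi**2 * (rB - rC)**2)
    pA = (1 - (phi**2 * (rB + rC) + phi * t2) / (2 * (1 - phi**2))) \
         / ((1 + rB) * (1 + rC))
    pB, pC = pA * rB, pA * rC
    pk = 1 - pA - pB - pC
    return pA, pB, pC, pk

def phi_coefficient(pA, pB, pC, pk):
    """Pearson correlation for a dichotomous 2x2 table of probabilities."""
    num = pA * pk - pB * pC
    den = math.sqrt((pA + pB) * (pC + pk) * (pA + pC) * (pB + pk))
    return num / den
```

For φ = 0 the expression collapses to pA = 1 / ((1 + ρB)(1 + ρC)), the value implied by independence.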

We assume that there is positive correlation between the samples, so that if a case has been included in one of the samples, the probability of inclusion in the other is increased. The value of φ can be varied systematically from zero upwards to explore the effect on the confidence intervals of changing the degree of stochastic dependence between the samples. To implement these calculations it is safer to use the natural logarithm of the likelihood, as this removes the danger of numeric overflows and underflows. It is useful to note that the natural logarithm of k! is returned by the function GAMMALN(k+1) in Excel and the function lgamma(k+1) in Enterprise Guide. The value for the first term of the likelihood, where k = 0, can be found by substituting this value in (2). Thereafter it is easier to calculate each term as a function of the preceding one, using the relationship

    p(A, B, C | k) = p(A, B, C | k − 1) (N / k) pk,   where N = A + B + C + k

3b The prior distribution: Use of the multiplication rule for probabilities in (1) implies that the prior and the likelihood must be stochastically independent, and so the numbers A, B and C cannot be used in the definition of the prior. A safe approach would be to adopt an uninformative prior with a large variance, representing little prior knowledge about the value of N. However, it may be argued that this does not use all the information available, and that other relevant data (particularly previous estimates of N, including Mid Year Estimates or MYEs) may be used. This approach is adopted in the next section. Priors can be divided into one-parameter models, such as the Poisson distribution, and two-parameter models, such as the normal distribution. The difference in practice is that two-parameter models allow the mean and variance of the prior to be controlled independently.

4 Examples: We now illustrate these points with hypothetical, though realistic, data.(1) In all cases the algorithm was run until the log likelihood fell below a value equal to one millionth of its highest value.

4a An example using a Poisson prior: This illustration of the method uses a one-parameter Poisson prior with parameter λ. The prior density is given by f(N) = e^−λ λ^N / N!, and the mean and variance are both equal to the parameter λ. In this example it was assumed that 388 cases were found in both samples, 56 in the first one only and 75 in the second one only. It was also assumed that a previous study had returned a population estimate of 550, so a Poisson prior with parameter λ = 550 was used.

Table 1:
                                      2.5    50   97.5
  Complete independence (φ = 0.00)      4    11     18
  Low dependence (φ = 0.10)            12    21     31
  Low dependence (φ = 0.20)            22    33     46
  Low dependence (φ = 0.30)            36    49     64
  Medium dependence (φ = 0.40)         54    71     89
  Medium dependence (φ = 0.50)         81   101    122
  Medium dependence (φ = 0.60)        123   146    171

Footnotes
1) The code for all the calculations in this section is given in appendix B to this note.
2) Implementing the Poisson form for the prior is eased by noting that each term can be derived simply by multiplying the previous one by λ/N.

  High dependence (φ = 0.70)          193   222    253
  High dependence (φ = 0.80)          336   374    413
  High dependence (φ = 0.84)          444   487    532

Table 1 gives the 2.5, 50 and 97.5 points of the posterior distribution of the number k of cases included in neither sample. The rows represent increasing degrees of dependence. The table shows that the degree of dependence makes a large difference to the outcome throughout the range of values, and that the effect increases the higher the value of φ. As φ assumes higher values,(3) the calculation becomes unstable and the results cannot be relied on for values above 0.80. Informal experience suggests that the dependence is in the low range (φ equals 0.4 or below), but even within this range there is considerable variation.

4b An example using a normal prior: This requires both a mean and a variance. The MYE can serve as the mean, as it did for the Poisson. Confidence intervals are not published for MYEs (otherwise, the variance parameter could be derived from them), so another approach must be used. One such approach is to link the variance to the mean. The one-parameter Poisson prior above in effect did this because, with a parameter as high as 550, the Poisson closely approximates the normal distribution with mean and variance both equal to 550. Table 1 was recompiled using the normal density with these parameters, but the result was too close to Table 1 to merit replication. Only at the bottom end of the table did the results differ perceptibly, with the normal giving slightly lower cut-offs: for φ = 0.8, for example, the figures were 301, 334 and 367. Setting the variance equal to the mean is equivalent to assuming that the estimate of 550 has a standard deviation of 23.4, or 4.3%, which could well be reasonable for MYEs. To explore the effect of changing the variance parameter, two further simulations were undertaken using the same mean of 550 but variances of 450 and 700 (equivalent to standard deviations of 3.9% and 4.8% respectively). The results are given in Table 2, which shows that the sensitivity of the posterior distribution cut-offs to changes in the variance of the normal prior increases with the degree of dependency between the samples, but is only important for degrees which are higher than those likely to be met in practice.

Table 2: Using a normal prior with μ = 550
                                          σ² = 450            σ² = 700
                                      2.5    50   97.5    2.5    50   97.5
  Complete independence (φ = 0.00)      4    11     18      4    10     18
  Low dependence (φ = 0.10)            12    21     31     12    21     31
  Low dependence (φ = 0.20)            22    33     46     22    33     46
  Low dependence (φ = 0.30)            35    49     64     36    50     65
  Medium dependence (φ = 0.40)         54    70     87     55    72     90
  Medium dependence (φ = 0.50)         79    98    118     83   103    124
  Medium dependence (φ = 0.60)        116   138    161    126   150    176
  High dependence (φ = 0.70)          174   199    226    199   228    258
  High dependence (φ = 0.80)          274   304    335    336   372    409

Footnote
3) While the upper bound of φ is in theory one, for some combinations of A, C and k it may be significantly less than this. This can lead to results which are out of bounds, such as producing negative values for probabilities where the bracketed numerator in (3) assumes a negative value.

  High dependence (φ = 0.84)          338   370    403    429   468    508

Implementing the normal form for the prior is eased by expressing each term as a function of the previous one, using the relationship

    ln f(N | μ, σ²) = ln f(N − 1 | μ, σ²) + (μ − N + ½) / σ²

A diagnostic statistic of interest concerns the compatibility between the prior and posterior distributions. In Bayesian analysis it is common (and safe) to use an uninformative prior, as this will be compatible with any posterior distribution which is likely to occur in practice. However, if an informative prior is used, as here, it is useful to be aware of the extent of the compatibility. For large values of A, B and C, an approximate method of doing this is to divide the difference between the means of the prior and posterior distributions by the standard deviation of the difference, the latter being the square root of the sum of their variances. The ratio is approximately normally distributed with zero mean and unit variance. Formally,

    z = (μposterior − μprior) / √(σ²posterior + σ²prior)

For the data given above, z took values from −0.83 where φ = 0.0, through +0.11 for φ = 0.2 and +1.60 for φ = 0.4, up to +10.50 for φ = 0.8. The values varied very little from table to table.

5 Conclusion: The method outlined in section 3 and illustrated in section 4 can be applied in practice to capture-recapture data. There are two elements of uncertainty which must be addressed to apply it successfully. The first element concerns the form of the prior and the parameters used for it. The central limit theorem shows that the distributions of sums and means of samples tend to the normal distribution as sample size increases. This is true for all distributions, subject only to the observations being stochastically independent; in this sense the normal is the limiting case of all other distributions. It is a common choice for the prior in Bayesian analyses and can be used here. MYEs can be used for the mean, and any reasonable value for the variance, as the results are not sensitive to this at low levels of dependency. The second element concerns the likelihood, where an assumption must be made about the degree of stochastic dependency between the two samples. Section 4 suggests that the results are sensitive to this, and it will be important to have a rationale for how the question is dealt with. It is not necessary to assume a single value for the correlation coefficient. We might use the knowledge we have about the correlation to assume that it is not less than 0.10 but that it is not greater than 0.3. A distribution over the remaining range would then express both what we do know about φ and what we do not know. This could be implemented in the WinBUGS software by expressing φ as following a beta distribution with the bulk of the probability mass concentrated in the interval from 0.1 to 0.3. In practice, however, a close approximation to this could be obtained by taking a weighted average of the relevant rows in Table 2. If, for example, we are content to assume that φ equals 0.3, 0.2 and 0.1 with probabilities 0.3, 0.4 and 0.3 respectively, then the percentage points (rounded to the nearest integer) would be 23, 34 and 47. In summary, subject to a satisfactory decision about the degree of dependency between the samples, this method gives technically very defensible results for the confidence intervals of population estimates based on capture-recapture data.
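The weighted-average approximation can be verified directly; a minimal Python sketch (illustrative only) using the σ² = 450 rows of Table 2:

```python
# Weighted average of the Table 2 (variance 450) percentile rows for
# phi = 0.3, 0.2 and 0.1, with weights 0.3, 0.4 and 0.3 respectively.
rows = {0.3: (35, 49, 64), 0.2: (22, 33, 46), 0.1: (12, 21, 31)}
weights = {0.3: 0.3, 0.2: 0.4, 0.1: 0.3}

# One weighted average per column (2.5 point, median, 97.5 point).
points = [round(sum(weights[phi] * rows[phi][i] for phi in rows))
          for i in range(3)]
print(points)  # [23, 34, 47]
```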

Appendix A: Comparison of the method with the Chapman estimator and variance

The Chapman method assumes that the two samples are stochastically independent. The estimate of the population and its approximate variance are given by

    N̂ = (A + B + 1)(A + C + 1) / (A + 1) − 1

and

    var(N̂) ≈ (A + B + 1)(A + C + 1) B C / [(A + 1)²(A + 2)]

In the case of the above example, this gives an estimate of 530 with a variance of 14.7 which, assuming a normal approximation, gives upper and lower cut-offs of approximately 522 and 537. The Complete independence rows of the three tables in section 4 give corresponding values of 523, 530 and 537. Thus, while the estimate is the same, the above method offers slightly smaller confidence intervals than the Chapman method.
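The Chapman figures can be reproduced in a few lines of Python (illustrative; the function name is ours, and the counts are those of section 4a):

```python
import math

def chapman(A, B, C):
    """Chapman estimator and approximate variance.
    A = cases in both samples, B = sample 1 only, C = sample 2 only."""
    n_hat = (A + B + 1) * (A + C + 1) / (A + 1) - 1
    var = (A + B + 1) * (A + C + 1) * B * C / ((A + 1) ** 2 * (A + 2))
    return n_hat, var

n_hat, var = chapman(388, 56, 75)
lo = n_hat - 1.96 * math.sqrt(var)   # lower 95% cut-off
hi = n_hat + 1.96 * math.sqrt(var)   # upper 95% cut-off
print(round(n_hat), round(var, 1))   # 530 14.7
```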

Appendix B: Code used for the calculations in section 4

* The first step calculates and stores the product terms which are proportional to the posterior distribution, and the proportionality constant. The second step calculates the posterior distribution and the three cut-offs;

data products (keep = k loglk logpr) const (keep = const);
  N1 = 56; N2 = 75; N12 = 388; * Numbers of cases in sample 1 only, sample 2 only, and both samples;
  ncyc = 5000;                 * The maximum number of cycles;
  crit = -10000;               * The criterion below which no further cycles are run;
  const = 0;                   * The constant of proportionality;
  sum = 0;                     * The sum of the prior distribution;
  phi = 0.20;                  * The dichotomous correlation;
  R1 = N1 / N12;
  R2 = N2 / N12;
  t1 = (1 + R1) * (1 + R2);
  t2 = sqrt(4 * R1 * R2 + phi**2 * (R1 - R2)**2);
  p12 = (1 / t1) * (1 - (phi**2 * (R1 + R2) + phi * t2) / (2 * (1 - phi**2)));
  p1 = p12 * R1;
  p2 = p12 * R2;
  pk = 1 - p1 - p2 - p12;      * The probability parameters;
  k = 0;                       * k is the number of cases not in either sample;
label1:
  N = k + N1 + N2 + N12;
  if k = 0 then do;
    loglk = lgamma(1 + N) - lgamma(1 + N1) - lgamma(1 + N12) - lgamma(1 + N2)
          + N1 * log(p1) + N12 * log(p12) + N2 * log(p2);
    * For the normal prior we have;
    mu = 550; var = 450;
    z = (N - mu)**2 / var;
    logpr = -(log(2 * 3.1415926 * var) + z) / 2;
    /* For the Poisson prior we have;
    lamda = 550;
    logpr = -lamda + N * log(lamda) - lgamma(N + 1); */
  end;
  else do;
    loglk = loglk + log(N / k) + log(pk);
    logpr = logpr + (mu - N + 0.5) / var;
    * For the Poisson prior, logpr = logpr + log(lamda / N);
  end;
  const = const + exp(loglk + logpr);
  sum = sum + exp(logpr);
  output products;
  k = k + 1;
  if k le ncyc and loglk gt crit then goto label1;
  output const;
  const = const * 10 ** 20;
  put " Constant of proportionality * 10 ** 20 = " const;
  put " Number of cycles run = " k;
  acc = int(-log(max(sum, 2 - sum) - 1) / log(10));
  put " Sum of prior probabilities is " sum " and equals one to " acc " places of decimals ";
run;

data _null_;
  set const;
  call symput('const', const);
  put " const = " const;
run;

data _null_;
  retain s0 s1 s2 0 p975 p500 p025;
  const = symget('const');
  set products end = eof;
  term = exp(loglk + logpr) / const;
  if s0 lt 0.025 and s0 + term ge 0.025 then p025 = max(k - 1, 0);
    * do not include the next term, as it would take you over the boundary;
  if s0 lt 0.5 and s0 + term ge 0.5 then do;
    if s0 + term / 2 gt 0.5 then p500 = k - 1;
    else p500 = k;
  end;
  if s0 lt 0.975 and s0 + term ge 0.975 then p975 = k;
    * do include the next term, so that it does take you over the boundary;
  s0 = s0 + term;
  s1 = s1 + term * k;
  s2 = s2 + term * k * k;
  if eof then do;
    CI = p975 - p025;
    acc = int(-log(max(s0, 2 - s0) - 1) / log(10));
    put " Percentage points of the number of cases in neither sample:";
    put " 2.5 point = " p025 " median = " p500 " 97.5 point = " p975;
    put " Sum of posterior probabilities is " s0 " and equals one to " acc " places of decimals ";
    * At the end, s0 is the sum of the whole posterior distribution, which should equal one; acc measures how close it gets;
    z = (388 + 56 + 75 + s1 - 550) / sqrt(550 + s2 - s1**2);
    put " Measure of the consistency between the prior and the posterior = " z;
  end;
run;
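For readers without SAS, the two steps above can be approximated in Python. This is an illustrative port under the Poisson prior of section 4a with φ = 0, not the original code; in particular, its percentile conventions at the boundaries may differ from the SAS step by one unit.

```python
import math

# Posterior over k (cases in neither sample): multinomial likelihood built
# by the recurrence L(k) = L(k-1) * (N/k) * pk from section 3a, multiplied
# by a Poisson(550) prior on N (section 4a).
A, B, C = 388, 56, 75    # both samples, sample 1 only, sample 2 only
lam = 550.0              # Poisson prior parameter
phi = 0.0                # complete-independence row of Table 1

rB, rC = B / A, C / A
t2 = math.sqrt(4 * rB * rC + phi**2 * (rB - rC)**2)
pA = (1 - (phi**2 * (rB + rC) + phi * t2) / (2 * (1 - phi**2))) \
     / ((1 + rB) * (1 + rC))
pB, pC = pA * rB, pA * rC
pk = 1 - pA - pB - pC

terms = []
for k in range(200):
    N = A + B + C + k
    if k == 0:
        loglk = (math.lgamma(1 + N) - math.lgamma(1 + A) - math.lgamma(1 + B)
                 - math.lgamma(1 + C)
                 + A * math.log(pA) + B * math.log(pB) + C * math.log(pC))
        logpr = -lam + N * math.log(lam) - math.lgamma(N + 1)
    else:
        loglk += math.log(N / k) + math.log(pk)   # likelihood recurrence
        logpr += math.log(lam / N)                # Poisson prior recurrence
    terms.append(math.exp(loglk + logpr))

total = sum(terms)
post = [t / total for t in terms]   # normalised posterior over k

def percentile(p):
    """Smallest k whose cumulative posterior probability reaches p."""
    s = 0.0
    for k, prob in enumerate(post):
        s += prob
        if s >= p:
            return k

lo, med, hi = percentile(0.025), percentile(0.5), percentile(0.975)

# Prior/posterior compatibility diagnostic z: difference of the means
# divided by the square root of the sum of the variances.
s1 = sum(k * p for k, p in enumerate(post))
s2 = sum(k * k * p for k, p in enumerate(post))
z = (A + B + C + s1 - lam) / math.sqrt(lam + s2 - s1**2)
print(lo, med, hi, round(z, 2))
```

With these inputs the printed percentiles should match the complete-independence row of Table 1 to within the boundary convention, and z should be close to the −0.83 quoted in the text.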