Regression Analysis of Probability-Linked Data

Official Statistics Research Series, Vol 4, 2009
ISSN ; ISBN (Online)

Ray Chambers
Centre for Statistical and Survey Methodology, University of Wollongong

This report was commissioned by Official Statistics Research, through Statistics New Zealand. The opinions, findings, recommendations and conclusions expressed in this report are those of the author(s), do not necessarily represent Statistics New Zealand and should not be reported as those of Statistics New Zealand. The department takes no responsibility for any omissions or errors in the information contained here.

Abstract

Data obtained after probability linkage of administrative registers will typically include errors, because some linked records contain data items that are sourced from different individuals. Such errors can induce bias in standard statistical analyses if ignored. In this report we describe some approaches to eliminating this bias in the case of linear regression analysis and, more generally, when inference is based on an estimating equation, with an emphasis on logistic regression. Simulation results that illustrate the gains from allowing for linkage error in linear and logistic regression analysis are presented, as are extensions of the approach to situations where a sample is linked to a register and where the linked registers are of unequal size.

Keywords
Record matching, linkage errors, linear regression, logistic regression, estimating equations, measurement error.

Reproduction of material
Material in this report may be reproduced and published, provided that it does not purport to be published under government authority and that acknowledgement is made of this source.

Citation
Chambers, R. (2009). Regression analysis of probability-linked data, Official Statistics Research Series, 4. Available from

Published by
Statistics New Zealand Tatauranga Aotearoa
Wellington, New Zealand

ISSN (Online)
ISBN (Online)

Acknowledgements

The theory set out in this paper was not developed in a vacuum. It has benefited considerably from advice and critical input from Walt Davis of Statistics New Zealand, Milorad Kovacevic of Statistics Canada and Glenys Bishop and James Chipperfield of the Australian Bureau of Statistics. My thanks go out to all of them for their encouragement. Also, I would like to acknowledge the input of the referee who provided me with the details of the Neter et al. (1965) reference. This is a well-written paper that nicely summarises many of the statistical issues that I have attempted to tackle in this report.

Contents

1 Introduction
    Background and assumptions
    Research questions
2 Linear regression using linked data
    Bias-corrected OLS inference
    Efficient linear estimation using linked data
    Maximum likelihood using linked data
    A fixed population approach
3 Using estimating functions with probability-linked data
    Correcting estimating functions for linkage error
    Application to linear and logistic regression
    Variance estimation when linkage probabilities are estimated
    Maximum likelihood logistic regression with linked data
4 Simulation analysis
    Simulation of linear regression with linked data
    Simulation of logistic regression based on linked data
5 Regression analysis under sample to register linkage
6 Regression analysis under nested linkage
    Using estimating functions with nested linked data
    Fitting linear and logistic models to nested linked data
    Reversing the nesting
7 Conclusions and further research
References
Appendix 1 Approximating the V matrix
Appendix 2 R code for linear model fitting and simulation
    R functions for linear regression analysis
    R code for linear model simulations
        Simulation of known lambda case
        Simulation of estimated lambda case
Appendix 3 R code for logistic model fitting and simulation
    R functions for logistic regression analysis
    R code for logistic model simulations
        Simulation of known lambda case
        Simulation of estimated lambda case

List of tables

Table 1 Options for G (θ) in logistic regression
Table 2 Specification of Ĝ and θû in (50) for the linear case
Table 3 Simulation results for the linear model
Table 4 Simulation results for slope estimators under the logistic model

List of figures

Figure 1 Boxplots of percentage relative errors generated by different estimators in linear model simulations
Figure 2 Boxplots of percentage relative errors generated by different slope estimators in logistic model simulations

1 Introduction

In their seminal paper on the topic, Fellegi and Sunter (1969) defined record linkage as "a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events...". Record linkage allows data for a single individual to be compiled from different data sources, enabling more powerful and effective analyses to be carried out than would otherwise be the case. In particular, datasets created by linking individual records constitute a critical resource for research in health, epidemiology, economics, demography, sociology and many other scientific areas. National statistical agencies increasingly rely on linking surveys to administrative registers to provide more accurate measurement and to reduce respondent burden. Frequently, one or more datasets (whether all administrative data or a mix of administrative and survey data) are linked to answer a broader array of research questions than can be addressed through any of the datasets individually.

Linked longitudinal datasets are particularly useful in health related research. These are datasets created by matching individual health and health-related records from a variety of sources over a period of time. For example, a longitudinal dataset created by linking hospital admission and general practitioner records to private health insurance expenditure records for individuals in a particular social and/or demographic group could be used to build models for how changes in that group's health expenditure influence subsequent uptake of medical services. This type of linkage is able to bring together a much better picture of the driving factors behind many public health issues. Thus, using data obtained from linking physician billing claims held by the Ontario Health Insurance Plan with data for consenting Ontario respondents to the 1994/95 Canadian National Population Health Survey, Iron, Manuel and Williams (2003) report on an analysis of the relationship between utilization and costs of physician services and incidence of self-reported chronic conditions for residents of Ontario province in Canada.

Data linkage is not confined to the health sciences. In a review commissioned by the UK Department for Trade and Industry, Chesher and Nesheim (2006) describe the extensive use of data linkage in economic research, particularly in the United States. Statistics New Zealand has recently developed a linked longitudinal employer-employee dataset based on linking administrative data held on the NZ Inland Revenue Department's tax system and Statistics New Zealand's list of NZ businesses. This dataset allows the analysis of job and worker flows, employment tenure, multiple job holding and business demography. Similarly, the Census Data Enhancement Initiative of the Australian Bureau of Statistics (ABS) aims to create a Statistical Longitudinal Census Dataset that integrates census data from the same individuals over a number of censuses, with the objective of building a research resource for longitudinal analysis of the Australian population. In the UK, the Interdepartmental Migration and Population Task Force set up by the Office for National Statistics has recently recommended the use of record linkage to improve migration and population statistics in the UK. The aim in this case is to link administrative, health register, school enrollment and university student data with incoming passenger survey and labour force survey data to create an integrated longitudinal data set that will allow in-depth analysis of the UK migrant experience.

The process used to link datasets often involves a probabilistic matching of records from one dataset to another. In most linkage operations, matching variables present in both datasets are used to maximise the probability that the values of the variables making up the linked record are the correct measurements for the population unit corresponding to that record. However, when analysis is undertaken using the resulting linked data, the errors inherent in this type of record matching are typically ignored. This is unfortunate since these errors introduce bias and additional variability into standard statistical estimation techniques. This poses a significant barrier to policy-relevant research using probabilistically linked data.

Statistical methods for linking datasets are now well established (Herzog, Scheuren and Winkler, 2007), with recent statistical research in this area mainly focused on the confidentiality issues that arise as a consequence of linkage. See Sibthorpe, Kliewer and Smith (1995) and Trutwein, Holman and Rosman (2005) for an Australian health data perspective, and Mackie and Bradburn (2000) for contributions to a workshop on confidentiality and linkage sponsored by the US Committee on National Statistics and the Institute of Medicine. In contrast, aside from the notable contributions of Neter et al. (1965), Scheuren and Winkler (1993) and Lahiri and Larsen (2005), there has been comparatively little methodological research carried out on the impact of linkage errors on analysis of linked data.

Linkage errors are the errors caused by incorrectly linking different population units as well as the errors caused by not linking the same population units in the datasets that are linked. These errors are a particular type of measurement error, and can lead to biased inference unless appropriate steps are taken to control and/or adjust for this bias.

In this report we develop a methodological framework that can be used to provide appropriate modifications to standard statistical analysis methods to ensure that they remain unbiased when used with probabilistically linked data. The framework is based on modelling the relationship between the probabilistically linked data and the true data that would be obtained if error free linkage were possible. Inference then proceeds on the basis of a combined model defined by the integration of this linkage error model with the statistical model for the true data values that is of primary interest.

Our assumptions about the data linkage situation and a description of a simple model for linkage errors are set out in the following sub-section. In section 2 we apply these ideas to fitting a simple linear regression model to linked data from two registers that each cover the same population. In section 3 this theory is generalised to where the statistical model of interest is fitted via the solution of an estimating equation, with application to logistic regression serving to motivate our approach. Simulation results for both linear and logistic regression are described in section 4 and illustrate the potential gains from the modified analytic methods that we propose. In section 5 we extend our framework to the important case of linking a survey to a register, while in section 6 we look at another important extension, where the registers that are linked are nested, in the sense that the population making up one of the linked registers is a subgroup of the population making up the other. Section 7 concludes the report with a short discussion of avenues for further research.

1.1 Background and assumptions

In what follows we assume the existence of a population of $N$ units, indexed by $i = 1, \ldots, N$, such that, for each unit in this population, it is possible to measure the values of a scalar random variable $Y$ and a vector random variable $X$. We are interested in modelling the relationship between $Y$ and $X$ in this population, and in particular we seek to fit a model of the form $E(Y \mid X) = g(X; \beta)$ for the regression of $Y$ on $X$. Here $g$ corresponds to a known functional form while the parameter $\beta$ is unknown and needs to be estimated. This is usually straightforward if we have the values of $Y$ and $X$ for a random sample of units from this population. Unfortunately, we do not have such a sample. Instead, we have access to two registers that separately contain the population values of $Y$ and $X$. We shall refer to these as the Y-register and X-register respectively from now on. For the time being we also assume that both registers refer to the same population and have no duplicates, so each is made up of $N$ records.
If each unit in the population has a unique identifier, and this identifier is also stored on both registers, we can use it to link records from the two registers, and then estimate $\beta$ using the $Y$ and $X$ values associated with the $N$ linked records. Unfortunately, such a unique identifier does not exist. Instead, some form of probability-linking algorithm is used to associate (i.e. link) records on the X-register with records on the Y-register. This algorithm makes it possible (at least conceptually) to link every record on the X-register with a record on the Y-register. That is, linkage is complete and one to one between the Y and X-registers. Clearly, the data set constructed by this process (the linked data) can contain linkage errors, i.e. records where the values of $Y$ and $X$ actually come from different population units.

Although it may be theoretically possible for any two records on the Y and X-registers to be linked, most reasonable probability linking algorithms will only attempt to link records that are similar in some sense. Consequently, we shall assume that the linked records can be partitioned into $Q$ distinct blocks such that there is no possibility that linked records in different blocks contain data for the same population unit. We model this situation by assuming that the different blocks correspond to different values of a categorical population variable $Z$ that can be derived from the information on either register, and which is defined in such a way that if a record on one register does not have the same value of $Z$ as the record on the other register, then it is reasonable to assume that these two records cannot correspond to the same unit in the original population. Conversely, the fact that a Y-register record and an X-register record have the same value for $Z$ does not guarantee that they correspond to the same unit, and so linkage errors can still occur within a block. We refer to $Z$ as a blocking variable, and those population units with the same value of $Z$ as being in the same block. Note that errors in measurement of $Z$ can lead to the same population unit having one value of $Z$ on the Y-register and another on the X-register, which invalidates the assumption of no linkage errors when Y and X-register records have different values of $Z$. Consequently, we shall assume that $Z$ is measured without error on both the Y-register and the X-register. With this set up, data linkage errors only occur among records in both registers in the same block.
This property of the blocking variable $Z$ indicates a subtle but key difference between the use of the blocking concept in our development and its use in data linkage methodology. In the latter case, blocking variables define stages (or "passes") in the linkage process, where at any particular stage matching is carried out with respect to a particular blocking variable. That is, only those remaining unmatched records at this stage with the same value for this blocking variable are considered as potential matches. However, once all matches at a particular stage of the process are declared, all remaining unmatched records are then considered as candidates for matching at the next stage using another blocking variable. Consequently it is quite possible that links can be created between Y and X-register records that have different values for any particular blocking variable. In our case the blocking variable $Z$ is an ex-post construct. It defines a partition of the declared links into groups such that all linkage errors are isolated within the groups; there are no errors that cross group boundaries. Clearly $Z$ can be defined in terms of the blocking variables used in creating the links, but there is no fundamental requirement for this. The main requirement is that $Z$ partitions both the Y and X-registers so that all (or virtually all) linkage errors are confined to the groups of records defined by the distinct values of this variable.

Without loss of generality, we denote the $Q$ distinct values taken by $Z$ by $1, 2, \ldots, Q$. Let block $q$ correspond to the $M_q$ population units with $Z = q$, so $N = \sum_q M_q$. Since $Z$ is measured without error in both registers, and linkage is complete, the number of records in block $q$ in each register is the same, i.e. $M_q$. Let $i$ index the records in the linked data set. Again, without loss of generality we assume that this index is the same as the one used to index the X-register, i.e. the linkage process associates a record from the Y-register, with its associated Y-value, with each record on the X-register. In block $q$ we then have $M_q$ linked data pairs $(y_i^*, x_i)$, where $y_i^*$ denotes the Y-value from block $q$ on the Y-register that is matched to $x_i$. More accurately, the record with $Y = y_i^*$ in block $q$ on the Y-register is matched to the record with $X = x_i$ in block $q$ on the X-register.

We use $y_q^*$ to denote the vector of order $M_q$ defined by the linked values $y_i^*$ in block $q$ and $X_q$ as the matrix with rows defined by the values $x_i$ in the same block. Also, let $y_q$ denote the unknown vector of order $M_q$, with entries indexed as in the X-register, that corresponds to the true Y values associated with $X_q$. Since one and only one of the $M_q$ records in block $q$ on the Y-register can be matched to each distinct record in the corresponding block on the X-register, we model randomness in the outcome of the linkage process via the identity

$$y_q^* = A_q y_q \tag{1}$$

where $A_q = [a_{ij}^{q}]$ is an unknown random permutation matrix of order $M_q$. Note that the entries $a_{ij}^{q}$ of $A_q$ are either zero or one, with a value of one occurring just once in each row and column. Also, since we are assuming that linkage errors are confined to blocks, it is natural to impose the condition that $A_{q_1}$ and $A_{q_2}$ are independently distributed when $q_1 \neq q_2$.

Clearly inference based on linked data will involve assumptions about the distribution of the $A_q$. In this report we assume that linkage is non-informative at each level of $Z$, i.e. the distribution of $A_q$ is independent of $y_q$ given $X_q$. Let

$$E(A_q \mid X_q) = E_q. \tag{2}$$

Given the care that typically goes into the construction of a linked data set, it seems reasonable that a declared link is more likely to be correct than incorrect. Although the probability that such a link is correct will typically vary between the records that make up the linked dataset, as a first approximation we assume that the probability of correct linkage is the same for all records in a block. We also assume that it is equally likely that any two Y-records in the same block that are not linked to a particular X-record in that block could in fact be the correct link for this record. A simple way of characterising both of these assumptions is via an exchangeable linkage errors model, where, for each value of $q$ and for $i \neq j$,

$$\Pr(\text{correct linkage}) = \Pr(a_{ii}^{q} = 1) = \lambda_q \tag{3}$$

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^{q} = 1) = \gamma_q. \tag{4}$$

Given (3) and (4) hold, it follows that (2) is then of the form

$$E_q = (\lambda_q - \gamma_q) I_q + \gamma_q 1_q 1_q^\top \tag{5}$$

where $I_q$ is the identity matrix of order $M_q$ and $1_q$ denotes a vector of ones of length $M_q$. Since $1_q^\top A_q = 1_q^\top$ and $A_q 1_q = 1_q$ we have $1_q^\top E_q = 1_q^\top$ and $E_q 1_q = 1_q$, which means that (5) implies

$$\lambda_q + (M_q - 1)\gamma_q = 1. \tag{6}$$

In other words, we just need to specify $\lambda_q$ in order to completely specify the first order properties of the linkage mechanism under the model (5). This will be particularly useful later since estimation of $\lambda_q$ requires only that we know whether a defined link is correct or incorrect, and not the identity of the correct link.

The model specified by (3) and (4) represents what is probably the simplest way of characterising the behaviour of a probability-based linkage process, and will form the basis for the theory developed in this report. It was originally suggested by Neter et al. (1965) in a groundbreaking paper that investigated its use in assessing the impact of linkage error on response error analysis, where alternative data sources were linked to respondent records in order to assess the extent of response error in these records. As these authors note, and as we shall see in the next section, the impact of linkage error defined by (3) and (4) is to attenuate the relationship between the study variable (in their case the difference between the survey value and the linked alternative value) and explanatory covariates.

Depending on the available information from the operation of the linkage process, more sophisticated models for linkage error can be formulated. For example, it may be the case that the Y and X-registers are ordered so that only nearby records in the linked data can possibly correspond to the correct link. This can be modelled by replacing (4) by

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^{q} = 1) = \begin{cases} \dfrac{1 - \lambda_q}{2m} & \text{if } |j - i| \le m \\ 0 & \text{otherwise} \end{cases}$$

with appropriate modifications for values of $i$ close to either the beginning or the end of the X-register. Here $2m$ denotes the number of nearest neighbours to $y_i^*$ in the linked data set that can actually contain the correct value $y_i$. Another extension is where there exists another variable on the X-register, say $W$, with values $w_i$ that vary within a block, such that the probability of a correct link depends on these values.
For example, we could have

$$\Pr(\text{correct linkage}) = \Pr(a_{ii}^{q} = 1) = p(w_i, w_i; \lambda)$$

and, for $i \neq j$,

$$\Pr(\text{incorrect linkage}) = \Pr(a_{ij}^{q} = 1) = p(w_i, w_j; \lambda)$$

where $p(w_i, w_j; \lambda)$ is a function that (i) takes values in the interval $[0,1]$; (ii) is maximised when $w_i = w_j$; and (iii) satisfies $\sum_{j=1}^{M_q} p(w_i, w_j; \lambda) = 1$. An obvious candidate function in this case is where $p(w_i, w_j; \lambda)$ is proportional to $\exp(-\lambda |w_i - w_j|)$. Note however that if $W$ is categorical and available on both registers then by including it in the definition of the blocking variable $Z$ we recover the situation implicit in the exchangeable linkage errors model, where all linked records in the same block have the same probability of being incorrectly linked.

1.2 Research questions

Given the preceding development, there are a number of questions that immediately arise.

1. What are the properties of the estimator of $\beta$ based on the linked data that assumes all linkages are correct?
2. Are there more efficient ways of estimating $\beta$ using the linked data?

The methodological framework described in the previous sub-section was based on a number of strong assumptions about the linkage process that will typically be violated. As a consequence, we can ask further questions.

3. How do we need to modify our inference when linkage is incomplete (i.e. there are unlinked records in one or both of the Y and X-registers)?
4. What happens when one or both of the Y and X-registers are based on sample survey data? How do we integrate sample selection and linkage in inference?
5. We have assumed that all components of $X$ are on one register. What happens if some components are actually held on the Y-register? More generally, what happens if components of $X$ are held on different registers and these are linked either prior to the linkage to the Y-register or simultaneously with the linkage to the Y-register?

In the rest of this report we develop some theory that may help in answering these questions.
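
As a concrete illustration of the exchangeable linkage errors model in (3) to (6), the following Python sketch builds the matrix $E_q$ for one block from $\lambda_q$ and $M_q$, and estimates $\lambda_q$ as the proportion of audited links found to be correct. It is only a sketch under the assumptions of section 1.1, and is not taken from the R code in the appendices; the function names `exchangeable_E` and `estimate_lambda` and the audit data are hypothetical.

```python
def exchangeable_E(lam, M):
    """Return E_q = (lam - gam) I + gam 1 1' as a list of rows, using
    gam = (1 - lam) / (M - 1) from constraint (6). Requires M >= 2."""
    if M < 2:
        raise ValueError("a block needs at least two records")
    gam = (1.0 - lam) / (M - 1)
    return [[lam if i == j else gam for j in range(M)] for i in range(M)]


def estimate_lambda(audit_flags):
    """Estimate lam for a block as the share of audited links that are correct.
    audit_flags is an iterable of booleans (True = declared link was correct)."""
    flags = list(audit_flags)
    return sum(1 for f in flags if f) / len(flags)


# Hypothetical block of 5 records with a 90% chance of correct linkage.
E = exchangeable_E(0.9, 5)
row_sums = [sum(row) for row in E]
col_sums = [sum(E[i][j] for i in range(5)) for j in range(5)]

# Hypothetical audit sample: three of four checked links were correct.
lam_hat = estimate_lambda([True, True, False, True])
```

Every row and column of `E` sums to one, which is the property $E_q 1_q = 1_q$ and $1_q^\top E_q = 1_q^\top$ used to derive (6).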

2 Linear regression using linked data

In this section we consider the situation where the widely used linear regression model is the focus of inference. That is, the population values of $Y$ and $X$ in each block (i.e. those associated with population units with the same value of $Z$) satisfy

$$E_X(y_q) = X_q \beta = f_q \tag{7}$$

$$\mathrm{Var}_X(y_q) = \sigma^2 I_q \tag{8}$$

where we use a subscript of $X$ to denote conditioning on the value $X_q$ of the explanatory variables in block $q$. Note that in addition to the regression parameter $\beta$ in (7), which is the target of inference, (8) now includes an unknown scale parameter $\sigma^2$. Given the $y_q$ and $X_q$, the optimal estimator of $\beta$ is then its Ordinary Least Squares (OLS) estimator

$$\hat\beta = \Big(\sum_q X_q^\top X_q\Big)^{-1} \sum_q X_q^\top y_q. \tag{9}$$

Unfortunately, unless the linkage is perfect, (9) cannot be calculated. Instead, what is usually done is to substitute the linked data values $y_q^*$ for $y_q$ in this expression, which leads to the naive linked data OLS estimator

$$\hat\beta^* = \Big(\sum_q X_q^\top X_q\Big)^{-1} \sum_q X_q^\top y_q^*. \tag{10}$$

2.1 Bias-corrected OLS inference

Under the linkage error model (1) it is easy to see that (10) is actually

$$\hat\beta^* = \Big(\sum_q X_q^\top X_q\Big)^{-1} \sum_q X_q^\top A_q y_q.$$

Under non-informative linkage

$$E_X(A_q y_q) = E_X(A_q) E_X(y_q) = E_q f_q$$

so

$$E_X(\hat\beta^*) = \Big(\sum_q X_q^\top X_q\Big)^{-1} \Big(\sum_q X_q^\top E_q X_q\Big) \beta = D\beta. \tag{11}$$

That is, the naive OLS estimator (10) based on the linked data set is biased. Provided $E_q$ is known and the inverse of the matrix $D$ in (11) exists, an unbiased estimator of $\beta$ in this situation is

$$\hat\beta_R = D^{-1}\hat\beta^* = \Big\{\Big(\sum_q X_q^\top X_q\Big)^{-1} \sum_q X_q^\top E_q X_q\Big\}^{-1} \hat\beta^*$$

which, since $\sum_q X_q^\top E_q X_q$ is then of full rank, reduces to

$$\hat\beta_R = \Big(\sum_q X_q^\top E_q X_q\Big)^{-1} \sum_q X_q^\top y_q^*. \tag{12}$$

Note that the subscript of $R$ used to denote the estimator defined by (12) serves as a reminder that this estimator is based on a ratio-type correction for the bias in the naive estimator (10).
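
The attenuation in (11) and the ratio-type correction in (12) can be seen in a small simulation. The Python sketch below is a minimal illustration under simplifying assumptions that are not part of the report: a single block, a scalar covariate with no intercept, a known $\lambda$, and a mislinking mechanism (a random cyclic shift among flagged records) that only approximates the exchangeable model. All function names and parameter values are hypothetical.

```python
import random


def naive_and_ratio(x, y_star, lam):
    """Scalar, no-intercept, single-block versions of (10) and (12)."""
    M = len(x)
    gam = (1.0 - lam) / (M - 1)                   # from (6)
    sxx = sum(v * v for v in x)                   # X'X
    sx = sum(x)
    x_e_x = (lam - gam) * sxx + gam * sx * sx     # X'EX with E as in (5)
    sxy = sum(a * b for a, b in zip(x, y_star))   # X'y*
    return sxy / sxx, sxy / x_e_x


def mislink(y, lam, rng):
    """Return linked values y*. Each record is flagged as wrongly linked with
    probability 1 - lam; flagged records exchange values by a random cyclic
    shift, so none of them keeps its own value when two or more are flagged."""
    wrong = [i for i in range(len(y)) if rng.random() > lam]
    rng.shuffle(wrong)
    y_star = list(y)
    if len(wrong) > 1:
        for k, i in enumerate(wrong):
            y_star[i] = y[wrong[(k + 1) % len(wrong)]]
    return y_star


def simulate(beta=2.0, sigma=1.0, lam=0.7, M=200, reps=200, seed=1):
    """Average the naive and ratio-corrected estimates over repeated linkages."""
    rng = random.Random(seed)
    x = [(2.0 * i - (M - 1)) / (M - 1) for i in range(M)]   # centred grid on [-1, 1]
    tot_naive = tot_ratio = 0.0
    for _ in range(reps):
        y = [beta * xi + rng.gauss(0.0, sigma) for xi in x]
        b_naive, b_ratio = naive_and_ratio(x, mislink(y, lam, rng), lam)
        tot_naive += b_naive
        tot_ratio += b_ratio
    return tot_naive / reps, tot_ratio / reps


mean_naive, mean_ratio = simulate()
print(mean_naive, mean_ratio)
```

With a centred covariate the factor $D$ is close to $\lambda$, so the naive average should sit well below the true slope while the ratio-corrected average should be close to it.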

We use an iterated expectation argument to obtain the variance of $\hat\beta_R$. To start, observe that

$$\mathrm{Var}_X(\hat\beta_R) = D^{-1}\, \mathrm{Var}_X(\hat\beta^*)\, (D^{-1})^\top$$

where

$$\mathrm{Var}_X(\hat\beta^*) = E_X\{\mathrm{Var}_{AX}(\hat\beta^*)\} + \mathrm{Var}_X\{E_{AX}(\hat\beta^*)\}.$$

Here a subscript of $AX$ denotes conditioning on both $A_q$ and $X_q$, so

$$E_{AX}(\hat\beta^*) = \Big(\sum_q X_q^\top X_q\Big)^{-1} \Big(\sum_q X_q^\top A_q X_q\Big)\beta$$

and

$$\mathrm{Var}_{AX}(\hat\beta^*) = \sigma^2 \Big(\sum_q X_q^\top X_q\Big)^{-1} \Big(\sum_q X_q^\top A_q A_q^\top X_q\Big) \Big(\sum_q X_q^\top X_q\Big)^{-1} = \sigma^2 \Big(\sum_q X_q^\top X_q\Big)^{-1}$$

since $A_q A_q^\top = I_q$. Put $V_q = \mathrm{Var}_X(A_q X_q \beta) = \mathrm{Var}_X(A_q f_q)$. It follows that

$$\mathrm{Var}_X(\hat\beta_R) = D^{-1} \Big(\sum_q X_q^\top X_q\Big)^{-1} \Big\{\sum_q X_q^\top (\sigma^2 I_q + V_q) X_q\Big\} \Big(\sum_q X_q^\top X_q\Big)^{-1} (D^{-1})^\top = \Big(\sum_q X_q^\top E_q X_q\Big)^{-1} \Big\{\sum_q X_q^\top (\sigma^2 I_q + V_q) X_q\Big\} \Big(\sum_q X_q^\top E_q^\top X_q\Big)^{-1}. \tag{13}$$

An estimator of (13) is then

$$\hat V_X(\hat\beta_R) = \Big(\sum_q X_q^\top E_q X_q\Big)^{-1} \Big\{\sum_q X_q^\top (\hat\sigma^2 I_q + \hat V_q) X_q\Big\} \Big(\sum_q X_q^\top E_q^\top X_q\Big)^{-1} \tag{14}$$

where $\hat\sigma^2$ and $\hat V_q$ are estimates of $\sigma^2$ and $V_q$ respectively.

In order to define these estimates, we note that after some simplification, and using the fact that $A_q^\top A_q = I_q$,

$$E_X\Big\{\sum_q (y_q^* - f_q)^\top (y_q^* - f_q)\Big\} = E_X\Big\{\sum_q \big(y_q^\top y_q - y_q^\top f_q - f_q^\top y_q + f_q^\top f_q - 2 f_q^\top (A_q - I_q) y_q\big)\Big\} = N\sigma^2 - 2\sum_q f_q^\top (E_q - I_q) f_q.$$

It follows that when $f_q$ and $E_q$ are known,

$$\hat\sigma^2 = N^{-1} \sum_q \big\{(y_q^* - f_q)^\top (y_q^* - f_q) - 2 f_q^\top (I_q - E_q) f_q\big\} \tag{15}$$

is an unbiased method of moments estimator of $\sigma^2$ under the linkage errors model (1) and the linear model specified by (7) and (8). Note that (15) can take negative values. In practice, we replace $f_q$ by $\hat f_q = X_q \hat\beta_R$ in (15) to then obtain a consistent estimator of $\sigma^2$.

Development of an expression for $\hat V_q$ is somewhat more complicated. In Appendix 1 we show that a large $M_q$ approximation to $V_q$ given a simple second-order extension of the exchangeable linkage errors model defined by (3) and (4) is

$$V_q \approx \mathrm{diag}\Big[(1 - \lambda_q)\big\{\lambda_q (f_i - \bar f_q)^2 + \bar f_q^{(2)} - \bar f_q^{\,2}\big\}\Big] \tag{16}$$

where $f_q = (f_i)$ and $\bar f_q$, $\bar f_q^{(2)}$ denote the block averages of the components of $f_q$ and their squares respectively. In order to calculate $\hat V_q$ we replace these components in (16) by their estimated values.

The approach to linear regression estimation using probability-linked data described above is in the spirit of Scheuren and Winkler (1993), where it is suggested that one corrects the naive estimator using an estimate of its bias under an appropriate model for the linkage error process. In our case the ratio-type adjustment we use for this purpose depends on knowing (or having good estimates of) the parameters (i.e. the $\lambda_q$) that characterise this process. As noted earlier, all that is required to estimate these parameters is access to a random audit sample of the linked records in each block, where the only thing we need to know is whether the declared links are correct or not. This could also be done by comparison with the output from a "gold standard" (e.g. clerical) linkage operation carried out on this sample of records.

2.2 Efficient linear estimation using linked data

An alternative approach to fitting a linear model using the probability-linked data is based on directly modelling the regression relationship between the linked values $y_q^*$ and the values in $X_q$. Since $y_q^* = A_q y_q$, and $A_q$ and $y_q$ are independently distributed given $X_q$, it follows that

$$E_X(y_q^*) = E_X(A_q) E_X(y_q) = E_q X_q \beta = H_q \beta. \tag{17}$$

That is, the $y_q^*$ also follow a linear model with regression coefficient $\beta$ but with a modified set of explanatory variables $H_q$ in block $q$. Lahiri and Larsen (2005) note this relationship and suggest estimation of $\beta$ using the OLS estimator for this situation,

$$\hat\beta_A = \Big(\sum_q H_q^\top H_q\Big)^{-1} \sum_q H_q^\top y_q^* = \Big(\sum_q X_q^\top E_q^\top E_q X_q\Big)^{-1} \sum_q X_q^\top E_q^\top y_q^*. \tag{18}$$

However, the optimality of this estimator depends on the regression errors under (17) being homoskedastic.
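It is easy to see that this condition generally does not hold, since implicit in the development leading to (13) is the fact that

$$\mathrm{Var}_X(y_q^*) = \sigma^2 I_q + V_q = \Sigma_q \tag{19}$$

which implies that the variances of the regression errors defined by the linked data vary between blocks. The Best Linear Unbiased Estimator (BLUE) for $\beta$ given these data is

$$\hat\beta_C = \Big(\sum_q H_q^\top \Sigma_q^{-1} H_q\Big)^{-1} \sum_q H_q^\top \Sigma_q^{-1} y_q^* = \Big(\sum_q X_q^\top E_q^\top \Sigma_q^{-1} E_q X_q\Big)^{-1} \sum_q X_q^\top E_q^\top \Sigma_q^{-1} y_q^*. \tag{20}$$

Note that (20) depends on $\Sigma_q$, and hence on $\sigma^2$ and $\beta$. Its empirical (EBLUE) version is defined by substituting estimates for these parameters and iterating, using the estimate (15) for $\sigma^2$ developed in the previous sub-section, combined with the estimate of $\beta$ defined by (20).

Standard plug-in sandwich-type estimators of the variances of (18) and (20) are easily developed using the estimates $\hat\sigma^2$ and $\hat V_q$ developed in the previous sub-section. These are

$$\hat V_X(\hat\beta_A) = \Big(\sum_q X_q^\top E_q^\top E_q X_q\Big)^{-1} \Big\{\sum_q X_q^\top E_q^\top (\hat\sigma^2 I_q + \hat V_q) E_q X_q\Big\} \Big(\sum_q X_q^\top E_q^\top E_q X_q\Big)^{-1} \tag{21}$$

in the case of (18) and

$$\hat V_X(\hat\beta_C) = \Big\{\sum_q X_q^\top E_q^\top (\hat\sigma^2 I_q + \hat V_q)^{-1} E_q X_q\Big\}^{-1} \tag{22}$$

in the case of (20). Note that such plug-in estimators ignore the contribution to the variance associated with estimation of the linkage model parameters and hence may be biased low. This issue is discussed further later in the report, in the section on variance estimation when linkage probabilities are estimated.

2.3 Maximum likelihood using linked data

An alternative approach to constructing an efficient estimator of $\beta$ given the linked data is to use the Missing Information Principle or MIP (Orchard and Woodbury, 1972) to derive the maximum likelihood estimator of this parameter given the linked data. In order to do so, we extend the linear model (7) and (8) to include an assumption of normality. That is, given $X_q$, we assume that

$$y_q \sim N(f_q, \sigma^2 I_q).$$

When the $y_q$ are known, the score function for $\beta$ and $\sigma^2$ has components

$$sc_1 = \frac{1}{\sigma^2} \sum_q X_q^\top (y_q - f_q) \tag{23}$$

and

$$sc_2 = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_q (y_q - f_q)^\top (y_q - f_q). \tag{24}$$

In order to apply the MIP, we replace (23) and (24) by their conditional expectations given $y_q^*$ and $X_q$. Using an iterated expectations argument again, we see that

$$\mathrm{Cov}_X(y_q, y_q^*) = \sigma^2 E_X(A_q^\top) + \mathrm{Cov}_X(f_q, A_q f_q) = \sigma^2 E_q^\top.$$

Combining this result with (17) and (19), it follows that

$$\begin{pmatrix} y_q \\ y_q^* \end{pmatrix} \Big|\, X_q \sim N\left( \begin{pmatrix} f_q \\ E_q f_q \end{pmatrix}, \begin{pmatrix} \sigma^2 I_q & \sigma^2 E_q^\top \\ \sigma^2 E_q & \Sigma_q \end{pmatrix} \right)$$

and so

$$E_X(y_q \mid y_q^*) = f_q + \sigma^2 E_q^\top \Sigma_q^{-1} (y_q^* - E_q f_q) = \hat y_q$$

and

$$\mathrm{Var}_X(y_q \mid y_q^*) = \sigma^2 \big(I_q - \sigma^2 E_q^\top \Sigma_q^{-1} E_q\big).$$

We therefore replace (23) by

$$sc_1^* = \frac{1}{\sigma^2} \sum_q X_q^\top (\hat y_q - f_q) = \sum_q X_q^\top E_q^\top \Sigma_q^{-1} (y_q^* - E_q f_q) \tag{25}$$

and, since $y_q^\top y_q = y_q^{*\top} y_q^*$, we replace (24) by

$$sc_2^* = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_q \big\{y_q^{*\top} y_q^* - 2 f_q^\top \hat y_q + f_q^\top f_q\big\} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_q \big\{(y_q^* - f_q)^\top (y_q^* - f_q) - 2 f_q^\top (\hat y_q - y_q^*)\big\}. \tag{26}$$

The MLEs for $\beta$ and $\sigma^2$ are defined by setting (25) and (26) to zero and solving for these parameters. Since $\hat y_q$ is a function of $\beta$ and $\sigma^2$ this needs to be done numerically. Note that the solution to setting (25) to zero is the BLUE (20). Since the MLE for $\sigma^2$ obtained by setting (26) to zero is not the same as the method of moments estimator (15), the MLE and the EBLUE for $\beta$ will not be the same. However, they are typically very close.

In order to estimate the variances and covariances of these MLEs, we calculate the matrix-valued observed information function corresponding to the MIP-based score function for these parameters and invert it. This can be done by either numerically differentiating (25) and (26), or by using the MIP information identity. This identity states that the information function for $\beta$ and $\sigma^2$ given the linked data is the conditional expectation of the "$y_q$ known" information function given the linked data minus the conditional variance of the "$y_q$ known" score function given the linked data. Denoting conditioning on the linked data $(y_q^*;\ q = 1, 2, \ldots, Q)$ by a superscript of $*$, the information function generated by these data is

$$info^* = \begin{pmatrix} E_X^*(info_{11}) & E_X^*(info_{12}) \\ E_X^*(info_{21}) & E_X^*(info_{22}) \end{pmatrix} - \begin{pmatrix} \mathrm{Var}_X^*(sc_1) & \mathrm{Cov}_X^*(sc_1, sc_2) \\ \mathrm{Cov}_X^*(sc_2, sc_1) & \mathrm{Var}_X^*(sc_2) \end{pmatrix} \tag{27}$$

where

$$E_X^*(info_{11}) = \frac{1}{\sigma^2} \sum_q X_q^\top X_q$$

$$E_X^*(info_{12}) = \{E_X^*(info_{21})\}^\top = \frac{1}{\sigma^4} \sum_q X_q^\top (\hat y_q - f_q)$$

$$E_X^*(info_{22}) = -\frac{N}{2\sigma^4} + \frac{1}{\sigma^6} \sum_q \big\{(y_q^* - f_q)^\top (y_q^* - f_q) - 2 f_q^\top (\hat y_q - y_q^*)\big\}$$

$$\mathrm{Var}_X^*(sc_1) = \frac{1}{\sigma^4} \sum_q X_q^\top \mathrm{Var}_X(y_q \mid y_q^*) X_q = \frac{1}{\sigma^2} \sum_q X_q^\top \big(I_q - \sigma^2 E_q^\top \Sigma_q^{-1} E_q\big) X_q$$

and

$$\mathrm{Cov}_X^*(sc_1, sc_2) = \frac{1}{2\sigma^6} \sum_q \mathrm{Cov}_X\big\{X_q^\top (y_q - f_q),\ (y_q - f_q)^\top (y_q - f_q) \mid y_q^*\big\} = \frac{1}{2\sigma^6} \sum_q \mathrm{Cov}_X\big\{X_q^\top y_q,\ -2 f_q^\top y_q \mid y_q^*\big\} = -\frac{1}{\sigma^6} \sum_q X_q^\top \mathrm{Var}_X(y_q \mid y_q^*) f_q = -\frac{1}{\sigma^4} \sum_q X_q^\top \big(I_q - \sigma^2 E_q^\top \Sigma_q^{-1} E_q\big) f_q.$$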

17 Var X ( sc 2 )= 1 Var 4σ 8 X y f ( ) { ( y f )y { y = 1 Var 4σ 8 X y y y f f y + f f = 1 Var σ 8 X f { y y = 1 f σ 6 ( I E Σ 1 E )f. he observed information for β and σ 2 is the value of info at the values of the MLEs for these parameters. he inverse of this matrix is then used as an estimate of the variance/covariance matrix of these estimators. Note that the value of the matrix at the MLEs for β and σ 2 ( sc 1 ) Cov X ( sc 1, sc 2 ) ( ) Var X ( ) Var X Cov X sc 2, sc sc 2 is a measure of the information loss caused by incorrect linkage. 2.4 A fixed population approach Suppose that we have perfectly linked data. he efficient estimator of the regression parameter β is then the y known OLS estimator ( ) 1 B = X X ( X y ). (28) So far, our emphasis has been on estimation of β. However, it is legitimate to also consider prediction of B given the fixed finite population of Y and X-values that define the Y and X-registers. In this context, we denote conditioning on these values (i.e. on the values of y and X ) by a subscript of YX and look for a predictor ˆB of B that satisfies (over repeated applications of the probability linkage process) E YX ( ˆB )= B. (29) Note that none of ˆβ R, ˆβ A and ˆβ C satisfy (29) since we have E YX ( ˆβR )= ( X E X ) 1 ( X E y ) B E YX ( ˆβA )= ( X E E X ) 1 ( X E E y ) B and ( ) 1 E YX ( ˆβC )= X E Σ 1 E X ( X E Σ 1 E y ) B. In order to derive a predictor that satisfies (29), consider the class of linear predictors of B that can be written in the form ˆB = ( X X ) 1 X ( K y ). If K E = I it is straightforward to see that E YX ( ) 1 ( ˆB )= X X ( X K E y )= B.

If $E_q$ is of full rank (as is the case with (5) when $\lambda_q > \gamma_q$), then an obvious choice is $K_q = E_q^{-1}$. More generally, Kovacevic (personal communication, 2008) has suggested that one put $K_q = (E_q^T E_q)^{-1}E_q^T$, leading to the predictor

$$\hat\beta_B = \Big(\sum_q X_q^T X_q\Big)^{-1}\sum_q X_q^T (E_q^T E_q)^{-1}E_q^T y_q^*. \tag{30}$$

Since (30) is linear in the $y_q^*$, variance estimation for this predictor using a plug-in sandwich-based approach follows directly. The resulting variance estimator is

$$\hat V_X(\hat\beta_B) = \Big(\sum_q X_q^T X_q\Big)^{-1}\Big\{\sum_q X_q^T (E_q^T E_q)^{-1}E_q^T(\hat\sigma^2 I_q + \hat V_q)E_q(E_q^T E_q)^{-1}X_q\Big\}\Big(\sum_q X_q^T X_q\Big)^{-1}. \tag{31}$$
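The condition $K_q E_q = I_q$ can be checked numerically. The sketch below is illustrative only: it uses a single block, takes $E_q$ to have the exchangeable form (diagonal $\lambda_q$, off-diagonal $\gamma_q = (1-\lambda_q)/(M_q-1)$), and compares the expectation over the linkage process of naive OLS on the linked data with that of the predictor (30). The function names, block size, value of $\lambda_q$ and the simulated $X_q$ and $y_q$ are assumptions made for this example and do not come from the report.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Expected linkage matrix for one block: lam on the diagonal,
    gamma = (1 - lam) / (M - 1) everywhere else."""
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def expected_naive_ols(X, y, E):
    """Linkage expectation of OLS applied to y* = A y: (X'X)^{-1} X' E y."""
    return np.linalg.solve(X.T @ X, X.T @ (E @ y))

def expected_predictor_B(X, y, E):
    """Linkage expectation of predictor (30), with K = (E'E)^{-1} E'."""
    K = np.linalg.solve(E.T @ E, E.T)
    return np.linalg.solve(X.T @ X, X.T @ (K @ (E @ y)))

# Hypothetical single block; all values are chosen only for illustration.
rng = np.random.default_rng(0)
M, lam = 50, 0.8
X = np.column_stack([np.ones(M), rng.normal(size=M)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=M)

E = exchangeable_E(M, lam)
B = np.linalg.solve(X.T @ X, X.T @ y)   # OLS on correctly linked data, as in (28)
naive = expected_naive_ols(X, y, E)     # biased for B when lam < 1
pred = expected_predictor_B(X, y, E)    # equals B because K E = I

print("B               :", B)
print("E[naive OLS]    :", naive)
print("E[predictor(30)]:", pred)
```

Because the comparison is made on expectations rather than on simulated permutations, the agreement between `pred` and `B` is exact up to rounding, while the slope component of `naive` is shrunk by the factor $\lambda_q - \gamma_q$.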

3 Using estimating functions with probability-linked data

In this section we extend the ideas developed for linear regression analysis in the previous section to the case where the regression model of interest is fitted via the solution of an estimating equation. In particular, we assume that this model is characterised by a p-dimensional parameter $\theta$, which is then estimated by solving

$$H(\theta) = 0$$

where $H(\theta)$ is a p-dimensional unbiased estimating function for $\theta$, i.e. a function of the data that satisfies $E_X\{H(\theta_0)\} = 0$, where $\theta_0$ is the true value of $\theta$. Let $\partial_\theta$ denote the partial differentiation operator with respect to the components of $\theta$. The resulting estimator $\hat\theta$ can then be shown to be approximately unbiased for $\theta_0$ since, under appropriate smoothness conditions,

$$0 = H(\hat\theta) \approx H(\theta_0) + (\partial_\theta H_0)(\hat\theta - \theta_0).$$

Here $\partial_\theta H_0$ is the $p \times p$ matrix of first order partial derivatives of $H(\theta)$ with respect to the components of $\theta$, evaluated at $\theta_0$. Since $H(\theta)$ is an unbiased estimating function, it immediately follows that

$$E_X(\hat\theta - \theta_0) \approx -(\partial_\theta H_0)^{-1}E_X\{H(\theta_0)\} = 0$$

provided $\partial_\theta H_0$ is of full rank, and so $\hat\theta$ is approximately unbiased for $\theta_0$. Furthermore, we then also have

$$\mathrm{Var}_X(\hat\theta) \approx (\partial_\theta H_0)^{-1}\,\mathrm{Var}_X\{H(\theta_0)\}\,\{(\partial_\theta H_0)^{-1}\}^T \tag{32}$$

leading to the usual sandwich-type estimator of this variance

$$\hat V_X(\hat\theta) = \{(\partial_\theta H_0)^{-1}\}_{\theta_0 = \hat\theta}\;\hat V_X\{H(\theta_0)\}\;\{(\partial_\theta H_0)^{-1}\}^T_{\theta_0 = \hat\theta} \tag{33}$$

where $\hat V_X\{H(\theta_0)\}$ is an estimate of $\mathrm{Var}_X\{H(\theta_0)\}$. Typically, it is a plug-in estimate, i.e. $\mathrm{Var}_X\{H(\theta_0)\}$ evaluated at $\theta_0 = \hat\theta$.

3.1 Correcting estimating functions for linkage error

We now turn our attention to the situation where a regression model is fitted using an estimating function and data that have been linked using a probability-based method. In particular, we shall concern ourselves with situations where $H(\theta)$ is of the form

$$H(\theta) = \sum_{i=1}^N G_i(\theta)\{y_i - f_i(\theta)\} \tag{34}$$

where $f_i(\theta_0) = E_X(y_i)$ and $G_i(\theta)$ is a vector of order p which is a function of $\theta$ and $X_i$ but not of $y_i$. Clearly (34) defines an unbiased estimating function for $\theta_0$, which we can write in blocked form as

$$H(\theta) = \sum_q G_q(\theta)\{y_q - f_q(\theta)\} \tag{35}$$

where $G_q(\theta)$ is the $p \times M_q$ matrix with columns defined by the vectors $G_i(\theta)$ associated with the population units making up block q, and $f_q(\theta)$ is the vector of order $M_q$ defined by their corresponding values of $f_i(\theta)$.

Now consider the situation described in section 1.1 where instead of $y_q$, we have access to a probability-linked version of this vector, $y_q^* = A_q y_q$. Here $A_q$ is a random permutation matrix of order $M_q$ distributed independently of $y_q$ given the values in $X_q$ (i.e. linkage is non-informative given the values of the explanatory variables), with values of $A_q$ distributed independently between blocks and where $E_X(A_q) = E_q$. Let $H^*(\theta)$ denote the value of (35) when we use $y_q^*$ instead of $y_q$. That is, our naive estimator $\hat\theta^*$ of $\theta_0$ that assumes no linkage errors satisfies

$$H^*(\hat\theta^*) = \sum_q G_q(\hat\theta^*)\{y_q^* - f_q(\hat\theta^*)\} = 0. \tag{36}$$

Clearly, since

$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0)(E_q - I_q)f_q(\theta_0) \neq 0$$

we see that $H^*(\theta)$ is biased if linkage is not perfect, and so the resulting estimator $\hat\theta^*$ is also biased in this case. Given the value of $E_q$, we can correct for this bias, replacing the estimating function $H^*(\theta)$ by its bias-corrected version

$$H^{adj}(\theta) = H^*(\theta) - \sum_q G_q(\theta)(E_q - I_q)f_q(\theta) = \sum_q G_q(\theta)\{y_q^* - E_q f_q(\theta)\}. \tag{37}$$

Our bias-adjusted estimator of $\theta$ based on the linked data is then $\hat\theta^{adj}$, where $H^{adj}(\hat\theta^{adj}) = 0$.

The general results for inference based on unbiased estimating functions clearly apply to $H^{adj}(\theta)$ defined by (37). It immediately follows that the large sample variance of $\hat\theta^{adj}$ is given by (32) with $H^{adj}(\theta)$ substituted for $H(\theta)$. That is,

$$\mathrm{Var}_X(\hat\theta^{adj}) \approx \{\partial_\theta H^{adj}(\theta_0)\}^{-1}\,\mathrm{Var}_X\{H^{adj}(\theta_0)\}\,\big[\{\partial_\theta H^{adj}(\theta_0)\}^{-1}\big]^T \tag{38}$$

with plug-in sandwich-type estimator, see (33), of the form

$$\hat V_X(\hat\theta^{adj}) = \{\partial_\theta H^{adj}(\hat\theta^{adj})\}^{-1}\,\hat V_X\{H^{adj}(\hat\theta^{adj})\}\,\big[\{\partial_\theta H^{adj}(\hat\theta^{adj})\}^{-1}\big]^T \tag{39}$$

where $\partial_\theta H^{adj}(\hat\theta^{adj}) = \partial_\theta H^{adj}(\theta)\big|_{\theta = \hat\theta^{adj}}$.

In order to define $\hat V_X\{H^{adj}(\hat\theta^{adj})\}$ in (39) we put $\mathrm{Var}_X(y_q) = \Omega_q(\theta_0)$ and observe that then

$$\begin{aligned}
\mathrm{Var}_X(y_q^*) &= E_X\{\mathrm{Var}_X(A_q y_q \mid A_q)\} + \mathrm{Var}_X\{E_X(A_q y_q \mid A_q)\} \\
&= E_X\{A_q\,\mathrm{Var}_X(y_q)\,A_q^T\} + \mathrm{Var}_X\{A_q f_q(\theta_0)\} \\
&= E_X\{A_q\,\Omega_q(\theta_0)\,A_q^T\} + V_q(\theta_0) = \Sigma_q(\theta_0)
\end{aligned} \tag{40}$$

so

$$\mathrm{Var}_X\{H^{adj}(\theta_0)\} = \sum_q G_q(\theta_0)\,\mathrm{Var}_X(y_q^*)\,G_q^T(\theta_0) = \sum_q G_q(\theta_0)\,\Sigma_q(\theta_0)\,G_q^T(\theta_0)$$

and hence

$$\hat V_X\{H^{adj}(\hat\theta^{adj})\} = \sum_q G_q(\hat\theta^{adj})\,\Sigma_q(\hat\theta^{adj})\,G_q^T(\hat\theta^{adj}). \tag{41}$$

In order to compute (41) we need to estimate the covariance matrix $\Sigma_q(\theta)$ specified by (40). In turn, this requires that we estimate both $V_q(\theta_0)$, which can be approximated via (16) after replacing $f_i$ by $f_i(\hat\theta^{adj})$, and $E_X\{A_q\,\Omega_q(\theta_0)\,A_q^T\}$, which will depend on the particular model that we assume for the $y_q$.

Next, in order to define the matrix of partial derivatives $\partial_\theta H^{adj}(\hat\theta^{adj})$ in (39) we note that although in theory

$$\partial_\theta H^{adj} = \sum_q \partial_\theta\big[G_q(\theta)\{y_q^* - E_q f_q(\theta)\}\big]$$

it is often the case that $G_q(\theta)$ varies little as $\theta$ changes. Consequently, we approximate this derivative by $\partial_\theta H^{adj} \approx -\sum_q G_q(\theta)E_q\,\partial_\theta f_q(\theta)$. That is, we put

$$\partial_\theta H^{adj}(\hat\theta^{adj}) = -\sum_q G_q(\hat\theta^{adj})\,E_q\,\partial_\theta f_q(\hat\theta^{adj}) \tag{42}$$

where $\partial_\theta f_q(\hat\theta^{adj}) = \partial_\theta f_q(\theta)\big|_{\theta = \hat\theta^{adj}}$. The final variance estimator for $\hat\theta^{adj}$ is then obtained by substituting (41) and (42) into (39).

3.2 Application to linear and logistic regression

Although we have already developed the theory for linear regression in section 2, it is interesting to see how the results obtained there can be obtained as special cases of the general estimating equation theory set out in the previous sub-section. In particular, the Lahiri-Larsen estimator (18) and the BLUE (20) can be obtained from (37) by setting $\theta \equiv \beta$ and $f_q(\beta) = X_q\beta$ (so $\partial_\beta f_q(\beta) = X_q$) with $G_q = X_q^T E_q^T$ in the case of (18) and $G_q = X_q^T E_q^T\Sigma_q^{-1}$ in the case of (20). As far as the predictor (30) of $B$ is concerned, we note that it can be expressed as the solution to

$$\sum_q X_q^T (E_q^T E_q)^{-1}E_q^T\big(y_q^* - E_q X_q\hat\beta\big) = 0.$$

It follows that in this case $G_q = X_q^T (E_q^T E_q)^{-1}E_q^T$, which leads to $\partial_\beta H^{adj}(\hat\beta_B) = -\sum_q X_q^T X_q$.
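To make the link between the general sandwich formula (39) and the linear-model variance expressions concrete, the following sketch evaluates (39) for a single block using (41) and (42) with $f_q(\beta) = X_q\beta$. It is a hedged illustration rather than the report's implementation: the function name `sandwich_variance` is hypothetical, and the matrix standing in for $\Sigma_q = \sigma^2 I_q + V_q$ is an arbitrary positive definite matrix, not one computed from (16).

```python
import numpy as np

def sandwich_variance(G, E, X, Sigma):
    """Single-block version of (39) in the linear case:
    dH = -G E X as in (42), Var{H} = G Sigma G' as in (41)."""
    dH = -G @ E @ X
    dH_inv = np.linalg.inv(dH)
    return dH_inv @ (G @ Sigma @ G.T) @ dH_inv.T

# Hypothetical inputs, chosen only for illustration.
rng = np.random.default_rng(1)
M, lam, sigma2 = 40, 0.85, 1.5
gamma = (1.0 - lam) / (M - 1)
E = (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))   # exchangeable E_q
X = np.column_stack([np.ones(M), rng.normal(size=M)])

# Stand-in for Sigma_q = sigma^2 I + V_q (V_q here is an arbitrary PSD matrix).
R = rng.normal(size=(M, 3))
Sigma = sigma2 * np.eye(M) + 0.1 * (R @ R.T)
Sigma_inv = np.linalg.inv(Sigma)

G_A = X.T @ E.T                # choice of G_q giving the Lahiri-Larsen estimator (18)
G_C = X.T @ E.T @ Sigma_inv    # choice of G_q giving the BLUE (20)

V_A = sandwich_variance(G_A, E, X, Sigma)
V_C = sandwich_variance(G_C, E, X, Sigma)

print("V_A diagonal:", np.diag(V_A))
print("V_C diagonal:", np.diag(V_C))
```

With $G_q = X_q^T E_q^T\Sigma_q^{-1}$ the three factors of the sandwich collapse to $(X_q^T E_q^T\Sigma_q^{-1}E_q X_q)^{-1}$, which has the same form as (22), and the difference `V_A - V_C` is non-negative definite, reflecting the efficiency of the BLUE.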

In contrast, the ratio-adjusted estimator (12) cannot be expressed as the solution of an estimating equation of the form $\sum_q G_q(y_q^* - E_q X_q\beta) = 0$, being instead the solution to the alternative ratio-type estimating equation

$$H_R(\beta) = \sum_q X_q^T\big(y_q^* - X_q D\beta\big) = 0 \tag{43}$$

where $D = \big(\sum_q X_q^T X_q\big)^{-1}\big(\sum_q X_q^T E_q X_q\big)$. As a consequence, the results in the previous subsection do not apply to it directly. However, it is not difficult to show that $H_R(\beta)$ also defines an unbiased estimating function under the assumed linear model, since

$$E_X\Big\{\sum_q X_q^T(y_q^* - X_q D\beta)\Big\} = \Big\{\sum_q X_q^T(E_q X_q - X_q D)\Big\}\beta = \Big\{\sum_q X_q^T E_q X_q - \Big(\sum_q X_q^T X_q\Big)D\Big\}\beta = 0.$$

The linearisation argument that was earlier used to define an estimator of variance in the standard estimating function approach also applies to (12) when it is written as the solution to (43). In particular, we have

$$\partial_\beta H_R(\beta) = -\Big(\sum_q X_q^T X_q\Big)D = -\sum_q X_q^T E_q X_q \tag{44}$$

and

$$\mathrm{Var}_X\{H_R(\beta_0)\} = \sum_q X_q^T\Sigma_q X_q. \tag{45}$$

When (44) and (45) are substituted in (38) we obtain the variance expression (13), leading to the same plug-in estimator of variance as specified by (39).

The case where the regression model of interest corresponds to linear logistic regression is of special interest. Here $f_q(\theta) = \{f_i(\theta);\ i \in q\}$, where

$$f_i(\theta) = \frac{\exp(x_i^T\theta)}{1 + \exp(x_i^T\theta)}. \tag{46}$$

It follows that

$$\partial_\theta f_q(\theta) = D_q(\theta)X_q \tag{47}$$

where $D_q(\theta) = \mathrm{diag}\big[f_i(\theta)\{1 - f_i(\theta)\};\ i \in q\big]$.

The standard maximum likelihood estimating function (i.e. the score function) for the logistic regression model puts $G_q(\theta) = X_q^T$ in (35). However, this is not the only choice for this matrix when we estimate $\theta$ via the adjusted estimating equation (37). In particular we can also use the expressions for $G_q(\theta)$ that lead to the linear regression estimators (18), (20) and (30) introduced in section 2. We summarise these options in Table 1. Here option M defines the estimating equation for the MLE under perfect linkage, option A leads to the Lahiri-Larsen estimator (18) under a linear model and option B leads to the predictor (30) of the finite population regression vector (28) under the same model. In contrast, option C in Table 1 defines the second order efficient version of (35), which in the logistic case is given by

$$G_q^{opt}(\theta) = \big\{\partial_\theta E_X(y_q^*)\big\}^T\,\mathrm{Var}_X^{-1}(y_q^*) = \big\{E_q\,\partial_\theta f_q(\theta)\big\}^T\Sigma_q^{-1}(\theta) = X_q^T D_q(\theta)E_q^T\Sigma_q^{-1}(\theta). \tag{48}$$
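As an illustration of how the adjusted estimating equation (37) might be solved for the logistic model, the sketch below uses option M ($G_q(\theta) = X_q^T$) with a single block and Newton steps based on the derivative (42), which for this choice of $G_q$ is $-X_q^T E_q D_q(\theta)X_q$. Everything specific here is an assumption made for the example: the function names, the exchangeable form of $E_q$, the design, the parameter values, and the use of a noise-free pseudo-response $y_q^* = E_q f_q(\theta_0)$ (the expected value of the linked response) so that the result is deterministic.

```python
import numpy as np

def expit(z):
    """Logistic function f = exp(z) / (1 + exp(z)), as in (46)."""
    return 1.0 / (1.0 + np.exp(-z))

def solve_adjusted_logistic(X, y_star, E, n_iter=50, tol=1e-10):
    """Solve X'{y* - E f(theta)} = 0 by Newton steps.
    The Jacobian of the left-hand side is -X' E D(theta) X, cf. (42) and (47)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        f = expit(X @ theta)
        H = X.T @ (y_star - E @ f)                     # adjusted estimating function
        J = X.T @ E @ (X * (f * (1.0 - f))[:, None])   # X' E D X
        step = np.linalg.solve(J, H)                   # theta_new = theta + J^{-1} H
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Hypothetical single block; all values are chosen only for illustration.
M, lam = 200, 0.9
gamma = (1.0 - lam) / (M - 1)
E = (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))
X = np.column_stack([np.ones(M), np.linspace(-2.0, 2.0, M)])
theta_true = np.array([-0.5, 1.0])

# Noise-free pseudo-response: the expectation of the linked y* given X.
y_star = E @ expit(X @ theta_true)

theta_adj = solve_adjusted_logistic(X, y_star, E)            # uses E_q
theta_naive = solve_adjusted_logistic(X, y_star, np.eye(M))  # ignores linkage error

print("true    :", theta_true)
print("adjusted:", theta_adj)
print("naive   :", theta_naive)
```

Passing the identity matrix in place of $E_q$ reproduces the naive estimating equation (36), so the same function shows both the bias of the unadjusted fit and its removal by the adjustment. With real binary responses the adjusted solution would only be approximately unbiased, and the other options in Table 1 would require $G_q(\theta)$ to be recomputed at each step.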

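It is easy to see that the corresponding optimal version of $G_q(\theta)$ in the linear case is $G_q = X_q^T E_q^T\Sigma_q^{-1}$ and leads to the BLUE (20). It should be noted, however, that option B in Table 1 does not have the same finite population interpretation for logistic regression as it has in the linear regression context. In particular, it is not clear whether use of option B leads to a predictor of the estimator of $\theta$ defined by the correctly linked data. Further research is necessary in this area.

Table 1 Options for $G_q(\theta)$ in logistic regression

| Option | $G_q(\theta)$ |
|---|---|
| M | $X_q^T$ |
| A | $X_q^T E_q^T$ |
| C | $X_q^T D_q(\theta)E_q^T\Sigma_q^{-1}(\theta)$ |
| B | $X_q^T (E_q^T E_q)^{-1}E_q^T$ |

For each of the options set out in Table 1, variance estimation for the solution to the adjusted estimating equation defined by (37) uses the plug-in sandwich estimator (39), with $\partial_\theta H^{adj}(\hat\theta^{adj})$ defined by (42) and with $\hat V_X\{H^{adj}(\hat\theta^{adj})\}$ given by (41). In order to compute the latter expression, we observe that under the logistic model $\Omega_q(\theta_0)$ is $D_q(\theta_0)$ and so

$$E_X\{A_q\,\Omega_q(\theta_0)\,A_q^T\} = E_X\{A_q D_q(\theta_0)A_q^T\}.$$

Now

$$A_q D_q(\theta)A_q^T = \mathrm{diag}\Big[\sum_{j=1}^{M_q} f_j(\theta)\{1 - f_j(\theta)\}a_{ij}^q\Big]$$

so

$$E_X\{A_q D_q(\theta)A_q^T\} = \mathrm{diag}\Big[\sum_{j=1}^{M_q} f_j(\theta)\{1 - f_j(\theta)\}e_{ij}^q\Big]$$

where $A_q = [a_{ij}^q]$ and $E_q = [e_{ij}^q]$.

3.3 Variance estimation when linkage probabilities are estimated

The development so far has assumed that the matrix of expected values $E_q$ for the stochastic linkage matrix $A_q$ is known. If this matrix is specified using the exchangeable errors model (5) then this is equivalent to assuming that the probabilities $\lambda_q$ of correct linkage are known. This is highly unlikely to be the case, and these probabilities will usually be estimated in some way. The extra uncertainty arising from this estimation then needs to be accounted for when carrying out variance estimation for the estimators of $\theta$ that use $E_q$ to correct for bias induced by linkage errors.

Let $\lambda$ denote the vector defined by the block-specific values of $\lambda_q$. The estimating function (37) then needs to be replaced by

$$H^{adj}(\theta, \lambda) = \sum_q G_q(\theta)\{y_q^* - E_q(\lambda_q)f_q(\theta)\} = \sum_q U_q(\theta, \lambda_q)$$

which is now considered to be a function of both $\theta$ and $\lambda$, allowing us to develop a first order Taylor series approximation of the form

$$0 = H^{adj}(\hat\theta, \hat\lambda) \approx H^{adj}(\theta_0, \lambda_0) + \partial_\theta H^{adj}(\theta_0, \lambda_0)(\hat\theta - \theta_0) + \partial_\lambda H^{adj}(\theta_0, \lambda_0)(\hat\lambda - \lambda_0)$$

or

$$\hat\theta \approx \theta_0 - (\partial_\theta H_0)^{-1}\{H_0 + (\partial_\lambda H_0)(\hat\lambda - \lambda_0)\}$$

where $H_0 = H^{adj}(\theta_0, \lambda_0)$. Here $\theta_0$ and $\lambda_0$ denote the true values of these parameters with $\hat\theta$ and $\hat\lambda$ their corresponding estimators. It immediately follows that we can approximate the variance of $\hat\theta$ by

$$\begin{aligned}
\mathrm{Var}_X(\hat\theta) &\approx (\partial_\theta H_0)^{-1}\,\mathrm{Var}_X\{H_0 + (\partial_\lambda H_0)(\hat\lambda - \lambda_0)\}\,\{(\partial_\theta H_0)^{-1}\}^T \\
&= (\partial_\theta H_0)^{-1}\big\{\mathrm{Var}_X(H_0) + (\partial_\lambda H_0)\mathrm{Var}_X(\hat\lambda)(\partial_\lambda H_0)^T\big\}\{(\partial_\theta H_0)^{-1}\}^T \\
&= \Big\{\sum_q \partial_\theta U_q(\theta_0, \lambda_{q0})\Big\}^{-1}\Big\{\sum_q(\Psi_{1q} + \Psi_{2q})\Big\}\Big[\Big\{\sum_q \partial_\theta U_q(\theta_0, \lambda_{q0})\Big\}^{-1}\Big]^T
\end{aligned}$$

where

$$\Psi_{1q} = \mathrm{Var}_X\{U_q(\theta_0, \lambda_{q0})\} = G_q(\theta_0)\,\mathrm{Var}_X(y_q^*)\,G_q^T(\theta_0) = G_{q0}\Sigma_{q0}G_{q0}^T$$

$$\Psi_{2q} = (\partial_\lambda U_{q0})\,\mathrm{Var}_X(\hat\lambda_q)\,(\partial_\lambda U_{q0})^T$$

and $U_{q0} = U_q(\theta_0, \lambda_{q0})$. Note that we have also assumed that the distribution of $\hat\lambda$ is (at least approximately) independent of the distribution of $H_0$. To proceed further, we require an expression for

$$\partial_\lambda U_{q0} = \partial_\lambda\big[G_{q0}\{y_q^* - E_q(\lambda_q)f_{q0}\}\big] = -G_{q0}\,\partial_\lambda\{E_q(\lambda_q)\}\,f_{q0}$$

where $f_{q0} = f_q(\theta_0)$. Under the exchangeable model (5) for linkage errors

$$E_q(\lambda_q) = \Big(\lambda_q - \frac{1 - \lambda_q}{M_q - 1}\Big)I_q + \frac{1 - \lambda_q}{M_q - 1}1_q 1_q^T = (M_q - 1)^{-1}\big\{(\lambda_q M_q - 1)I_q + (1 - \lambda_q)1_q 1_q^T\big\}$$

so

$$\partial_\lambda\{E_q(\lambda_q)\} = (M_q - 1)^{-1}\big(M_q I_q - 1_q 1_q^T\big)$$

and hence

$$\partial_\lambda U_{q0} = -(M_q - 1)^{-1}G_{q0}\big(M_q I_q - 1_q 1_q^T\big)f_{q0}.$$

That is, we have

$$\mathrm{Var}_X(\hat\theta) \approx \Big(\sum_q \partial_\theta U_{q0}\Big)^{-1}\Big\{\sum_q G_{q0}(\Sigma_{q0} + \Delta_{q0})G_{q0}^T\Big\}\Big\{\Big(\sum_q \partial_\theta U_{q0}\Big)^{-1}\Big\}^T \tag{49}$$

where

$$\begin{aligned}
\Delta_{q0} &= (M_q - 1)^{-2}\,\mathrm{Var}_X(\hat\lambda_q)\big(M_q I_q - 1_q 1_q^T\big)f_{q0}f_{q0}^T\big(M_q I_q - 1_q 1_q^T\big) \\
&= M_q^2(M_q - 1)^{-2}\,\mathrm{Var}_X(\hat\lambda_q)\big(f_{q0} - 1_q\bar f_{q0}\big)\big(f_{q0} - 1_q\bar f_{q0}\big)^T
\end{aligned}$$

and $\bar f_{q0} = M_q^{-1}1_q^T f_{q0}$ is the block mean of $f_{q0}$. If the estimates of the linkage probabilities are obtained by checking a random audit sample of $m_q$ linked records in each block, then $\mathrm{Var}_X(\hat\lambda_q) = m_q^{-1}\lambda_{q0}(1 - \lambda_{q0})$.

Variance estimation based on (49) then follows in the usual way, by plugging in estimates for unknown quantities. That is, our estimator of $\mathrm{Var}_X(\hat\theta)$ is

$$\hat V_X(\hat\theta) = \Big(\sum_q \partial_\theta\hat U_q\Big)^{-1}\Big\{\sum_q \hat G_q(\hat\Sigma_q + \hat\Delta_q)\hat G_q^T\Big\}\Big\{\Big(\sum_q \partial_\theta\hat U_q\Big)^{-1}\Big\}^T \tag{50}$$

where a hat denotes a plug-in estimate. Table 2 shows the specification of the components of (50) for the important special case of linear regression and the linkage bias corrected estimators described in section 2.

Table 2 Specification of $\hat G_q$ and $\partial_\theta\hat U_q$ in (50) for the linear case

| Estimator | $\hat G_q$ | $\partial_\theta\hat U_q$ |
|---|---|---|
| (12) | $X_q^T$ | $-X_q^T\hat E_q X_q$ |
| (18) | $X_q^T\hat E_q^T$ | $-\hat G_q\hat E_q X_q$ |
| (20) | $X_q^T\hat E_q^T(\hat\sigma^2 I_q + \hat V_q)^{-1}$ | $-\hat G_q\hat E_q X_q$ |
| (30) | $X_q^T(\hat E_q^T\hat E_q)^{-1}\hat E_q^T$ | $-\hat G_q\hat E_q X_q$ |

3.4 Maximum likelihood logistic regression with linked data

Finally, we explore maximum likelihood estimation of a logistic model based on application of the MIP in the situation when the data are linked. If perfectly linked data were available (i.e. $y_q$ and $X_q$) the MLE for $\theta$ satisfies

$$sc(\theta) = \sum_q X_q^T\{y_q - f_q(\theta)\} = 0.$$

Applying the MIP, the MLE for this parameter given the linked data therefore satisfies

$$sc^*(\theta) = \sum_q\big\{E_X(X_q^T y_q \mid y_q^*) - X_q^T f_q(\theta)\big\} = 0. \tag{51}$$

Implementing this approach requires that we know, or can approximate, the conditional expectation $E_X(X_q^T y_q \mid y_q^*)$. Typically, $X_q$ will contain an intercept that we can assume corresponds to the first column $X_{q1}$ of $X_q$. In this case it is clear that

$$E_X(X_{q1}^T y_q \mid y_q^*) = E_X(1_q^T y_q \mid y_q^*) = 1_q^T y_q^*$$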

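so we only need to approximate $E_X(\tilde X_q^T y_q \mid y_q^*)$, where $\tilde X_q = [X_{q2} \cdots X_{qp}]$ denotes the remaining $p - 1$ columns of $X_q$. We conjecture that a reasonable approximation to this conditional expectation is

$$E_X(\tilde X_q^T y_q \mid y_q^*) \approx E_X(\tilde X_q^T y_q \mid \tilde X_q^T y_q^*, \bar y_q^*). \tag{52}$$

Provided $M_q$ is large enough, the joint distribution of $\tilde X_q^T y_q$, $\tilde X_q^T y_q^*$ and $\bar y_q^*$ given $X_q$ can be approximated by a multivariate normal distribution, with

$$E_X(\tilde X_q^T y_q) = \tilde X_q^T f_q(\theta), \qquad E_X(\tilde X_q^T y_q^*) = \tilde X_q^T E_q f_q(\theta), \qquad E_X(\bar y_q^*) = \bar f_q(\theta)$$

$$\mathrm{Var}_X(\tilde X_q^T y_q) = \tilde X_q^T D_q(\theta)\tilde X_q, \qquad \mathrm{Var}_X(\tilde X_q^T y_q^*) = \tilde X_q^T\Sigma_q(\theta)\tilde X_q, \qquad \mathrm{Var}_X(\bar y_q^*) = M_q^{-2}1_q^T D_q(\theta)1_q$$

$$\mathrm{Cov}_X(\tilde X_q^T y_q, \tilde X_q^T y_q^*) = \tilde X_q^T D_q(\theta)E_q^T\tilde X_q$$

$$\mathrm{Cov}_X(\tilde X_q^T y_q, \bar y_q^*) = M_q^{-1}\tilde X_q^T D_q(\theta)1_q$$

$$\mathrm{Cov}_X(\tilde X_q^T y_q^*, \bar y_q^*) = M_q^{-1}\tilde X_q^T E_q D_q(\theta)1_q.$$

Since $\bar y_q = \bar y_q^*$ it immediately follows that

$$E_X(\tilde X_q^T y_q \mid \tilde X_q^T y_q^*, \bar y_q^*) = \tilde X_q^T f_q(\theta) + \Gamma_q(\theta)\begin{pmatrix}\tilde X_q^T\{y_q^* - E_q f_q(\theta)\} \\ \bar y_q^* - \bar f_q(\theta)\end{pmatrix} \tag{53}$$

where

$$\Gamma_q(\theta) = \begin{pmatrix}\tilde X_q^T D_q(\theta)E_q^T\tilde X_q & M_q^{-1}\tilde X_q^T D_q(\theta)1_q\end{pmatrix}\begin{pmatrix}\tilde X_q^T\Sigma_q(\theta)\tilde X_q & M_q^{-1}\tilde X_q^T E_q D_q(\theta)1_q \\ M_q^{-1}1_q^T D_q(\theta)E_q^T\tilde X_q & M_q^{-2}1_q^T D_q(\theta)1_q\end{pmatrix}^{-1} = \begin{pmatrix}\Gamma_{1q}(\theta) & \Gamma_{2q}(\theta)\end{pmatrix}.$$

Here $\Gamma_{1q}(\theta)$ is of order $(p-1) \times (p-1)$ and $\Gamma_{2q}(\theta)$ is of order $(p-1) \times 1$. Substituting (53) into the estimating equation (51) for the MLE then leads to the approximate ML estimating equation

$$sc^*(\theta) = \sum_q\begin{pmatrix}1_q^T\{y_q^* - f_q(\theta)\} \\ \Gamma_q(\theta)\begin{pmatrix}\tilde X_q^T\{y_q^* - E_q f_q(\theta)\} \\ \bar y_q^* - \bar f_q(\theta)\end{pmatrix}\end{pmatrix} = 0.$$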


More information

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

More information

Imputing Missing Data using SAS

Imputing Missing Data using SAS ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Andrew Gelman Guido Imbens 2 Aug 2014 Abstract It is common in regression discontinuity analysis to control for high order

More information

Notes on Applied Linear Regression

Notes on Applied Linear Regression Notes on Applied Linear Regression Jamie DeCoster Department of Social Psychology Free University Amsterdam Van der Boechorststraat 1 1081 BT Amsterdam The Netherlands phone: +31 (0)20 444-8935 email:

More information

Centre for Central Banking Studies

Centre for Central Banking Studies Centre for Central Banking Studies Technical Handbook No. 4 Applied Bayesian econometrics for central bankers Andrew Blake and Haroon Mumtaz CCBS Technical Handbook No. 4 Applied Bayesian econometrics

More information

Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes

Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Yong Bao a, Aman Ullah b, Yun Wang c, and Jun Yu d a Purdue University, IN, USA b University of California, Riverside, CA, USA

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Health Policy and Administration PhD Track in Health Services and Policy Research

Health Policy and Administration PhD Track in Health Services and Policy Research Health Policy and Administration PhD Track in Health Services and Policy INTRODUCTION The Health Policy and Administration (HPA) Division of the UIC School of Public Health offers a PhD track in Health

More information

Introduction to Engineering System Dynamics

Introduction to Engineering System Dynamics CHAPTER 0 Introduction to Engineering System Dynamics 0.1 INTRODUCTION The objective of an engineering analysis of a dynamic system is prediction of its behaviour or performance. Real dynamic systems are

More information

Methodological aspects of small area estimation from the National Electronic Health Records Survey (NEHRS).

Methodological aspects of small area estimation from the National Electronic Health Records Survey (NEHRS). Methodological aspects of small area estimation from the National Electronic Health Records Survey (NEHRS. Vladislav Beresovsky National Center for Health Statistics 3311 Toledo Road Hyattsville, MD 078

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

Chapter 3: The Multiple Linear Regression Model

Chapter 3: The Multiple Linear Regression Model Chapter 3: The Multiple Linear Regression Model Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans November 23, 2013 Christophe Hurlin (University of Orléans) Advanced Econometrics

More information

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014. University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR) 2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

More information

How To Model The Fate Of An Animal

How To Model The Fate Of An Animal Models Where the Fate of Every Individual is Known This class of models is important because they provide a theory for estimation of survival probability and other parameters from radio-tagged animals.

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

More information

Multiple Choice Models II

Multiple Choice Models II Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

More information

The Gravity Model: Derivation and Calibration

The Gravity Model: Derivation and Calibration The Gravity Model: Derivation and Calibration Philip A. Viton October 28, 2014 Philip A. Viton CRP/CE 5700 () Gravity Model October 28, 2014 1 / 66 Introduction We turn now to the Gravity Model of trip

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Sections 2.11 and 5.8

Sections 2.11 and 5.8 Sections 211 and 58 Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/25 Gesell data Let X be the age in in months a child speaks his/her first word and

More information

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

More information

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU PITFALLS IN TIME SERIES ANALYSIS Cliff Hurvich Stern School, NYU The t -Test If x 1,..., x n are independent and identically distributed with mean 0, and n is not too small, then t = x 0 s n has a standard

More information

BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

More information

1 Short Introduction to Time Series

1 Short Introduction to Time Series ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

More information

Clustering in the Linear Model

Clustering in the Linear Model Short Guides to Microeconometrics Fall 2014 Kurt Schmidheiny Universität Basel Clustering in the Linear Model 2 1 Introduction Clustering in the Linear Model This handout extends the handout on The Multiple

More information

On the Efficiency of Competitive Stock Markets Where Traders Have Diverse Information

On the Efficiency of Competitive Stock Markets Where Traders Have Diverse Information Finance 400 A. Penati - G. Pennacchi Notes on On the Efficiency of Competitive Stock Markets Where Traders Have Diverse Information by Sanford Grossman This model shows how the heterogeneous information

More information

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In

More information

Simultaneous or Sequential? Search Strategies in the U.S. Auto. Insurance Industry

Simultaneous or Sequential? Search Strategies in the U.S. Auto. Insurance Industry Simultaneous or Sequential? Search Strategies in the U.S. Auto Insurance Industry Elisabeth Honka 1 University of Texas at Dallas Pradeep Chintagunta 2 University of Chicago Booth School of Business September

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate

More information

The Best of Both Worlds:

The Best of Both Worlds: The Best of Both Worlds: A Hybrid Approach to Calculating Value at Risk Jacob Boudoukh 1, Matthew Richardson and Robert F. Whitelaw Stern School of Business, NYU The hybrid approach combines the two most

More information

University of Lille I PC first year list of exercises n 7. Review

University of Lille I PC first year list of exercises n 7. Review University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients

More information

Solution of Linear Systems

Solution of Linear Systems Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

More information

You Are What You Bet: Eliciting Risk Attitudes from Horse Races

You Are What You Bet: Eliciting Risk Attitudes from Horse Races You Are What You Bet: Eliciting Risk Attitudes from Horse Races Pierre-André Chiappori, Amit Gandhi, Bernard Salanié and Francois Salanié March 14, 2008 What Do We Know About Risk Preferences? Not that

More information

Remote data access and the risk of disclosure from linear regression

Remote data access and the risk of disclosure from linear regression Statistics & Operations Research Transactions SORT Special issue: Privacy in statistical databases, 20, 7-24 ISSN: 696-228 eissn: 203-8830 www.idescat.cat/sort/ Statistics & Operations Research c Institut

More information

A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models

A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models Grace Y. Yi 13, JNK Rao 2 and Haocheng Li 1 1. University of Waterloo, Waterloo, Canada

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Table 1 and the Standard Deviation of a Model

Table 1 and the Standard Deviation of a Model DEPARTMENT OF ACTUARIAL STUDIES RESEARCH PAPER SERIES Forecasting General Insurance Liabilities by Piet de Jong piet.dejong@mq.edu.au Research Paper No. 2004/03 February 2004 Division of Economic and Financial

More information

How To Find Out What Search Strategy Is Used In The U.S. Auto Insurance Industry

How To Find Out What Search Strategy Is Used In The U.S. Auto Insurance Industry Simultaneous or Sequential? Search Strategies in the U.S. Auto Insurance Industry Current version: April 2014 Elisabeth Honka Pradeep Chintagunta Abstract We show that the search method consumers use when

More information