# An extension of the factoring likelihood approach for non-monotone missing data


Jae Kwang Kim and Dong Wan Shin

January 14, 2010

ABSTRACT. We address the problem of parameter estimation in multivariate distributions under ignorable non-monotone missing data. The factoring likelihood method for monotone missing data, termed by Rubin (1974), is extended to the more general case of non-monotone missing data. The proposed method is equivalent to the Newton-Raphson method applied to the observed likelihood, but it avoids the burden of computing the first and second partial derivatives of the observed likelihood. Instead, the maximum likelihood estimates and their information matrices are computed separately for each partition of the data set and then combined naturally by the generalized least squares method. A numerical example is presented to illustrate the method, and a Monte Carlo experiment compares the proposed method with the EM method.

KEY WORDS: EM algorithm; Gauss-Newton method; Generalized least squares; Maximum likelihood estimator; Missing at random.

Department of Statistics, Iowa State University, Ames, IA 50014, U.S.A.

Department of Statistics, Ewha University, Seoul, Korea.

## 1 Introduction

Missing data are quite common in practice. Statistical analysis of data with missing values is an important practical problem because missingness often cannot simply be ignored: if the responding part of the sample is systematically different from the nonresponding part, estimates based only on the respondents suffer nonresponse bias. We may also lose information contained in the partially observed data. Little and Rubin (2002) and Molenberghs and Kenward (2007) provide comprehensive overviews of the missing data problem.

We consider statistical inference for data with missing values using the maximum likelihood method. Specifically, we propose a computational tool for obtaining the maximum likelihood estimator (MLE) under multivariate missing data. To explain the basic idea, we use an example of bivariate normal data; an extension to general multivariate data is proposed later in Section 2.3. Let $y_i = (y_{1i}, y_{2i})'$ be a bivariate normal random vector distributed as

$$
\begin{pmatrix} y_{1i} \\ y_{2i} \end{pmatrix}
\overset{iid}{\sim}
N\left[ \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \right],
\tag{1}
$$

where iid abbreviates "independently and identically distributed." Note that five parameters, $\mu_1, \mu_2, \sigma_{11}, \sigma_{12}$, and $\sigma_{22}$, are needed to identify the bivariate normal distribution. We assume that the observations are missing at random (MAR), as defined in Rubin (1976), so that the relevant likelihood is the observed likelihood, that is, the marginal likelihood of the observed data. Under MAR, we can ignore the response mechanism when estimating the population parameters.
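As a concrete illustration of the MAR assumption in model (1), the following minimal sketch generates bivariate normal data in which $y_2$ is missing with a probability depending only on the always-observed $y_1$. The parameter values and the logistic response model are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
y = rng.multivariate_normal(mu, Sigma, size=n)

# Response probability for y2 depends only on y1, so the mechanism
# is MAR (but not MCAR) and is ignorable for likelihood inference.
p_resp = 1.0 / (1.0 + np.exp(-y[:, 0]))
r2 = rng.uniform(size=n) < p_resp
y2_obs = np.where(r2, y[:, 1], np.nan)
```

Since the response indicator `r2` is a function of $y_1$ and independent noise only, the observed likelihood can be maximized without modeling the response mechanism.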

If the missing data pattern is monotone, in the sense that the set of respondents for one variable is a subset of the respondent set for the other variable, the observed likelihood can be factored into the marginal likelihood for one variable and the conditional likelihood for the second variable given the first, so that the maximum likelihood estimates can be computed separately from each likelihood. For example, assume that $y_1$ is fully observed with $n$ observations and $y_2$ is observed for $r$ observations. Anderson (1957) first considered maximum likelihood parameter estimation under this setup by using an alternative representation of the bivariate normal distribution,

$$
y_{1i} \overset{iid}{\sim} N(\mu_1, \sigma_{11}), \qquad
y_{2i} \mid y_{1i} \sim N(\beta_{20\cdot1} + \beta_{21\cdot1}\, y_{1i},\; \sigma_{22\cdot1}),
\tag{2}
$$

where $\beta_{20\cdot1} = \mu_2 - \beta_{21\cdot1}\mu_1$, $\beta_{21\cdot1} = \sigma_{11}^{-1}\sigma_{12}$, and $\sigma_{22\cdot1} = \sigma_{22} - \beta_{21\cdot1}^2\sigma_{11}$. The observed likelihood is then written as the product of the marginal likelihood of the fully observed variable $y_1$ and the conditional likelihood of $y_2$ given $y_1$. Thus, the parameters $\mu_1$ and $\sigma_{11}$ of the marginal distribution of $y_1$ can be estimated from the $n$ observations, and the regression parameters $\beta_{20\cdot1}$, $\beta_{21\cdot1}$, and $\sigma_{22\cdot1}$ can be estimated from the conditional distribution using the $r$ complete observations.

The factoring likelihood (FL) method, termed by Rubin (1974), expresses the observed likelihood as a product of the marginal likelihood and the conditional likelihood so that the maximum likelihood estimates can be obtained separately from each likelihood. Note that the FL approach consists of two steps: in the first step the likelihood is factored, and in the second step the MLE for each factor is computed separately. In many cases the MLE's are easily computed in the FL approach because the marginal and the conditional likelihoods have known forms, so the known solutions of the corresponding likelihood equations can be used directly.

For monotone missing data, the MLE's for the conditional distribution are independent of those for the marginal distribution. This is because the two sets of parameters, the parameters of the marginal likelihood and those of the conditional likelihood, are orthogonal (Cox and Reid, 1987), and as a result the MLE's for the conditional likelihood are not affected by the MLE's for the marginal likelihood. Rubin (1974) recommended the FL approach as a general framework for the analysis of missing data with a monotone missing pattern. The main advantage of the FL approach is its computational simplicity.

Under non-monotone missing data patterns, however, the FL approach is not directly applicable. The EM algorithm, proposed by Dempster et al. (1977), successfully provides MLE's under a general missing pattern; it avoids evaluating the observed likelihood function and works only with the complete-data likelihood. Despite its popularity, the EM algorithm has several shortcomings. First, the computation is iterative and convergence is notoriously slow (Liu and Rubin, 1994). Second, the covariance matrix of the estimated parameters is not provided directly (Louis, 1982; Meng and Rubin, 1991). The focus of this paper is an alternative method that resolves these two issues at the same time.

In this paper, we consider an extension of the FL method to non-monotone missing data. To apply the FL method to non-monotone missing data, in addition to the two steps of the original FL approach, we need a third step that combines the separate MLE's computed from each likelihood to produce

the final MLE's. The proposed method turns out to be essentially the same as direct maximum likelihood estimation using the Newton-Raphson algorithm, which converges much faster than the EM algorithm. Furthermore, the proposed method directly provides the asymptotic variance-covariance matrix of the MLE's as a by-product of the computation. Using this variance-covariance expression, the asymptotic variances are compared with those of estimators obtained by ignoring part of the partially observed data. A related work is Chen et al. (2008), who compared variances of estimators for regression models with missing responses and covariates.

The proposed method is an extension of the preliminary work of Kim (2004), who considered the case of bivariate missing data. In Section 2, the results of Kim (2004) are reviewed and extended to a more general class of multivariate missing data. Efficiency comparisons based on the asymptotic variance-covariance matrix obtained from the proposed method are discussed in Section 3. The proposed method is applied to a categorical data example in Section 4. Results from a limited simulation study are presented in Section 5. Concluding remarks are made in Section 6.

## 2 Proposed method

The proposed method can be described in the following three steps:

[Step 1] Partition the original sample into several disjoint sets according to the missing pattern.

[Step 2] Compute the MLE's of the identified parameters separately in each partition of the sample.

[Step 3] Combine the estimators into a set of final estimates using a generalized least squares (GLS) form.

Kim (2004) discusses these procedures in detail for the bivariate case. We review the results of Kim (2004) for the bivariate normal case in Section 2.1. In Section 2.2, we consider a general class of bivariate distributions. In Section 2.3, the proposed method is extended to multivariate distributions.

### 2.1 Bivariate normal case

To simplify the presentation, we describe the proposed method in the bivariate normal setup with a non-monotone missing pattern. The joint distribution of $y = (y_1, y_2)'$ is parameterized by the five parameters using model (1) or (2). For the convenience of the factoring method described in Section 1, we use the parametrization in (2) and let $\theta = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11})'$.

In Step 1, we partition the sample into several disjoint sets according to the pattern of missingness. In the case of a non-monotone missing pattern with two variables, we have $3 = 2^2 - 1$ types of respondents that contain information about the parameters. The first set $H$ has both $y_1$ and $y_2$ observed, the second set $K$ has $y_1$ observed but $y_2$ missing, and the third set $L$ has $y_2$ observed but $y_1$ missing; see Table 1. Let $n_H$, $n_K$, $n_L$ be the sample sizes of the sets $H$, $K$, $L$, respectively. The cases with both $y_1$ and $y_2$ missing can be safely removed from the sample.

Table 1. An illustration of the missing data structure under the bivariate normal distribution

| Set | $y_1$ | $y_2$ | Sample size | Estimable parameters |
|-----|-------|-------|-------------|----------------------|
| $H$ | Observed | Observed | $n_H$ | $\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}$ |
| $K$ | Observed | Missing | $n_K$ | $\mu_1, \sigma_{11}$ |
| $L$ | Missing | Observed | $n_L$ | $\mu_2, \sigma_{22}$ |

In Step 2, we obtain the parameter estimates in each set. For set $H$, we have the five parameters $\eta_H = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11})'$ of the conditional distribution of $y_2$ given $y_1$ and the marginal distribution of $y_1$, with MLE's $\hat\eta_H = (\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H})'$. For set $K$, the MLE's $\hat\eta_K = (\hat\mu_{1,K}, \hat\sigma_{11,K})'$ are obtained for $\eta_K = (\mu_1, \sigma_{11})'$, the parameters of the marginal distribution of $y_1$. For set $L$, the MLE's $\hat\eta_L = (\hat\mu_{2,L}, \hat\sigma_{22,L})'$ are obtained for $\eta_L = (\mu_2, \sigma_{22})'$, where $\mu_2 = \beta_{20\cdot1} + \beta_{21\cdot1}\mu_1$ and $\sigma_{22} = \sigma_{22\cdot1} + \beta_{21\cdot1}^2\sigma_{11}$.

In Step 3, we use the GLS method to combine the three estimators $\hat\eta_H, \hat\eta_K, \hat\eta_L$ into a final estimator of the parameter $\theta$. Let $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$, that is,

$$
\hat\eta = \left(\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H}, \hat\mu_{1,K}, \hat\sigma_{11,K}, \hat\mu_{2,L}, \hat\sigma_{22,L}\right)'.
\tag{3}
$$

The expected value of this estimator is

$$
\eta(\theta) = \left(\beta_{20\cdot1},\, \beta_{21\cdot1},\, \sigma_{22\cdot1},\, \mu_1,\, \sigma_{11},\, \mu_1,\, \sigma_{11},\, \beta_{20\cdot1} + \beta_{21\cdot1}\mu_1,\, \sigma_{22\cdot1} + \beta_{21\cdot1}^2\sigma_{11}\right)'
\tag{4}
$$

and the asymptotic covariance matrix is

$$
V = \mathrm{diag}\left\{ \frac{\Sigma_{22\cdot1}}{n_H},\, \frac{2\sigma_{22\cdot1}^2}{n_H},\, \frac{\sigma_{11}}{n_H},\, \frac{2\sigma_{11}^2}{n_H},\, \frac{\sigma_{11}}{n_K},\, \frac{2\sigma_{11}^2}{n_K},\, \frac{\sigma_{22}}{n_L},\, \frac{2\sigma_{22}^2}{n_L} \right\},
\tag{5}
$$

where

$$
\Sigma_{22\cdot1} = \begin{pmatrix}
1 + \sigma_{11}^{-1}\mu_1^2 & -\sigma_{11}^{-1}\mu_1 \\
-\sigma_{11}^{-1}\mu_1 & \sigma_{11}^{-1}
\end{pmatrix}\sigma_{22\cdot1}.
$$

Note that

$$
\Sigma_{22\cdot1} = \left\{E\left[(1, y_1)'(1, y_1)\right]\right\}^{-1}\sigma_{22\cdot1}
= \begin{pmatrix} 1 & \mu_1 \\ \mu_1 & \sigma_{11} + \mu_1^2 \end{pmatrix}^{-1}\sigma_{22\cdot1}.
$$

The derivation of the asymptotic covariance matrix of the first five estimates in (3) is straightforward and can be found, for example, in Little and Rubin (2002). We have the block-diagonal structure of $V$ in (5) because $\hat\mu_{1,K}$ and $\hat\sigma_{11,K}$ are independent due to normality, and observations in different sets are independent due to the iid assumption.

Note that the nine elements of $\eta(\theta)$ are related to each other because they are all functions of the five elements of the vector $\theta$. The information contained in the extra four equations has not yet been utilized in constructing the estimators $\hat\eta_H, \hat\eta_K, \hat\eta_L$. This information can be employed to construct a fully efficient estimator of $\theta$ by combining $\hat\eta_H, \hat\eta_K, \hat\eta_L$ through a generalized least squares (GLS) regression of $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$ on $\theta$:

$$
\hat\eta - \eta(\hat\theta_S) = (\partial\eta/\partial\theta')(\theta - \hat\theta_S) + \text{error},
$$

where $\hat\theta_S$ is an initial estimator. The expected value and variance of $\hat\eta$ in (4) and (5) can be viewed as a nonlinear model in the five parameters of $\theta$. Using a Taylor series expansion of the nonlinear model, one step of the Gauss-Newton method can be formulated as

$$
e_\eta = X\left(\theta - \hat\theta_S\right) + u,
\tag{6}
$$

where $e_\eta = \hat\eta - \eta(\hat\theta_S)$, $\eta(\hat\theta_S)$ is the vector (4) evaluated at $\hat\theta_S$,

$$
X = \frac{\partial\eta}{\partial\theta'} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
1 & \mu_1 & 0 & \beta_{21\cdot1} & 0 \\
0 & 2\beta_{21\cdot1}\sigma_{11} & 1 & 0 & \beta_{21\cdot1}^2
\end{pmatrix},
\tag{7}
$$

and, approximately, $u \sim (0, V)$.

Table 2. Summary for the bivariate normal case

| $y_1$ | $y_2$ | Data set | Size | Estimable parameters | Asymptotic variance |
|---|---|---|---|---|---|
| O | M | $K$ | $n_K$ | $\eta_K = \theta_1$ | $W_K = \mathrm{diag}(\sigma_{11}, 2\sigma_{11}^2)$ |
| O | O | $H$ | $n_H$ | $\eta_H = (\theta_1', \theta_2')'$ | $W_H = \mathrm{diag}(W_K, \Sigma_{22\cdot1}, 2\sigma_{22\cdot1}^2)$ |
| M | O | $L$ | $n_L$ | $\eta_L = (\mu_2, \sigma_{22})'$ | $W_L = \mathrm{diag}(\sigma_{22}, 2\sigma_{22}^2)$ |

O: observed; M: missing; $\theta_1 = (\mu_1, \sigma_{11})'$, $\theta_2 = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1})'$, $\eta = (\eta_H', \eta_K', \eta_L')'$, $\theta = \eta_H$, $V = \mathrm{diag}(W_H/n_H, W_K/n_K, W_L/n_L)$, $\Sigma_{22\cdot1} = \{E[(1,y_1)'(1,y_1)]\}^{-1}\sigma_{22\cdot1}$, $X = \partial\eta/\partial\theta'$, $\mu_2 = \beta_{20\cdot1} + \beta_{21\cdot1}\mu_1$, $\sigma_{22} = \sigma_{22\cdot1} + \beta_{21\cdot1}^2\sigma_{11}$, and $V$ is the covariance matrix defined in (5).

The Gauss-Newton method for the estimation of nonlinear models can be found in Seber and Wild (1989). The relations among the quantities $\eta$, $\theta$, $X$, and $V$ are summarized in Table 2.

A simple initial estimator is the weighted average of the available estimators from the data sets, defined as

$$
\hat\theta_S = \left(\hat\beta_{20\cdot1,H},\, \hat\beta_{21\cdot1,H},\, \hat\sigma_{22\cdot1,H},\, \hat\mu_{1,HK},\, \hat\sigma_{11,HK}\right)',
\tag{8}
$$

where $\hat\mu_{1,HK} = (1-p_K)\hat\mu_{1,H} + p_K\hat\mu_{1,K}$, $\hat\sigma_{11,HK} = (1-p_K)\hat\sigma_{11,H} + p_K\hat\sigma_{11,K}$, and $p_K = n_K/(n_H + n_K)$. This initial value is a $\sqrt{n}$-consistent estimator of $\theta$ and guarantees the consistency of the one-step estimator. The procedure can be carried out iteratively until convergence. Given the current value $\hat\theta_{(t)}$, the Gauss-Newton update is

$$
\hat\theta_{(t+1)} = \hat\theta_{(t)} + \left(X_{(t)}'\hat V_{(t)}^{-1} X_{(t)}\right)^{-1} X_{(t)}'\hat V_{(t)}^{-1}\left\{\hat\eta - \eta(\hat\theta_{(t)})\right\},
\tag{9}
$$

where $X_{(t)}$ and $\hat V_{(t)}$ are the matrices $X$ in (7) and $V$ in (5), respectively, evaluated at the current value $\hat\theta_{(t)}$. The covariance matrix of the estimator in (9) can be estimated by

$$
C = \left(X_{(t)}'\hat V_{(t)}^{-1} X_{(t)}\right)^{-1}
\tag{10}
$$

when the iteration is stopped at the $t$-th iteration.
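The update (9) is a generic GLS step and is easy to code for any choice of $\eta(\theta)$, $X(\theta)$, and $V(\theta)$. Below is a minimal sketch (the function name is ours) of one Gauss-Newton iteration; when $\eta$ is exactly linear in $\theta$, a single step reproduces the closed-form GLS estimator, which the usage at the end checks.

```python
import numpy as np

def gauss_newton_step(theta, eta_hat, eta_fn, X_fn, V_fn):
    """One Gauss-Newton / GLS update as in equation (9).

    theta   : current parameter value, shape (p,)
    eta_hat : stacked estimates from the data partitions, shape (m,)
    eta_fn  : theta -> eta(theta), shape (m,)
    X_fn    : theta -> Jacobian d eta / d theta', shape (m, p)
    V_fn    : theta -> covariance matrix of eta_hat, shape (m, m)
    Returns the updated theta and the variance estimate of eq. (10).
    """
    X = X_fn(theta)
    Vinv = np.linalg.inv(V_fn(theta))
    A = X.T @ Vinv @ X                       # information-type matrix
    b = X.T @ Vinv @ (eta_hat - eta_fn(theta))
    step = np.linalg.solve(A, b)
    C = np.linalg.inv(A)                     # equation (10)
    return theta + step, C

# Check on a linear model eta(theta) = A0 @ theta: one step from any
# starting value gives the closed-form GLS estimator.
A0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
V0 = np.diag([1.0, 2.0, 0.5])
eta_hat = np.array([1.0, 2.0, 5.2])
theta1, C = gauss_newton_step(
    np.zeros(2), eta_hat,
    eta_fn=lambda t: A0 @ t, X_fn=lambda t: A0, V_fn=lambda t: V0)
gls = np.linalg.solve(A0.T @ np.linalg.inv(V0) @ A0,
                      A0.T @ np.linalg.inv(V0) @ eta_hat)
```

For the nonlinear map (4), the same function is simply called repeatedly until the step size is negligible.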

**Remark 1** (Bivariate normal monotone case). In the case of monotone missingness, where the sample consists of $H$ and $K$ only, the iteration (9) produces the estimator obtained by the factoring likelihood method, given in Little and Rubin (2002). To see this, note that (9) reduces to

$$
\hat\theta_{(t+1)} = \hat\theta_{(t)} + \left(X_{HK}'\hat V_{HK(t)}^{-1} X_{HK}\right)^{-1} X_{HK}'\hat V_{HK(t)}^{-1}\left\{\hat\eta_{HK} - \eta(\hat\theta_{(t)})\right\},
\tag{11}
$$

where

$$
X_{HK} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix},
\qquad
\hat V_{HK(t)} = \mathrm{diag}\left\{\frac{\hat\Sigma_{22\cdot1(t)}}{n_H}, \frac{2\hat\sigma_{22\cdot1(t)}^2}{n_H}, \frac{\hat\sigma_{11(t)}}{n_H}, \frac{2\hat\sigma_{11(t)}^2}{n_H}, \frac{\hat\sigma_{11(t)}}{n_K}, \frac{2\hat\sigma_{11(t)}^2}{n_K}\right\},
$$

and $\hat\eta_{HK} = (\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H}, \hat\mu_{1,K}, \hat\sigma_{11,K})'$. Starting from the initial estimator $\hat\theta_{(0)} = (\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H})'$, the estimator constructed from data set $H$, one application of (9) leads to the one-step estimate

$$
\hat\theta_{(1)} = \left(\hat\beta_{20\cdot1,H},\, \hat\beta_{21\cdot1,H},\, \hat\sigma_{22\cdot1,H},\, \hat\mu_{1,HK},\, \hat\sigma_{11,HK}\right)',
\tag{12}
$$

where $\hat\mu_{1,HK}$ and $\hat\sigma_{11,HK}$ are defined after (8). This is the same as the MLE of Anderson (1957) from the original factoring likelihood method. When only $y_2$ is subject to missingness, the estimated regression coefficient for the regression of $y_2$ on $y_1$ using the set $H$ only is fully efficient, but the estimated regression coefficient for the regression of $y_1$ on $y_2$ based on the set $H$ only is not fully efficient.
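In the monotone case the one-step estimate (12) coincides with Anderson's factored MLE, which can be computed directly without any iteration. A minimal sketch (the function name is ours; variances use the MLE divisor $n$ rather than $n-1$):

```python
import numpy as np

def anderson_factored_mle(y1, y2):
    """Factored MLE for monotone bivariate normal data.

    y1 : all n observations of y1 (fully observed)
    y2 : observations of y2 for the first r units; the remaining
         n - r units have y2 missing (monotone pattern).
    Returns (mu1, mu2, s11, s12, s22) on the moment scale.
    """
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    r = len(y2)
    # Marginal parameters of y1 use all n observations.
    mu1 = y1.mean()
    s11 = ((y1 - mu1) ** 2).mean()            # MLE: divisor n
    # Regression of y2 on y1 uses the r complete pairs.
    y1r = y1[:r]
    b21 = np.cov(y1r, y2, bias=True)[0, 1] / y1r.var()
    b20 = y2.mean() - b21 * y1r.mean()
    s221 = ((y2 - b20 - b21 * y1r) ** 2).mean()
    # Back-transform to the moment parametrization.
    mu2 = b20 + b21 * mu1
    s12 = b21 * s11
    s22 = s221 + b21 ** 2 * s11
    return mu1, mu2, s11, s12, s22
```

For example, with `y1 = [1, 2, 3, 4]` and `y2 = [2, 4, 6]` observed for the first three units, the fitted regression is exact (`y2 = 2 * y1`), so the factored MLE gives `mu1 = 2.5`, `mu2 = 5`, `s11 = 1.25`, `s12 = 2.5`, and `s22 = 5`.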

### 2.2 General bivariate case

We now consider the general bivariate case, in which the joint distribution is not necessarily normal. Assume that the joint distribution of $y = (y_1, y_2)'$ is parameterized by $\theta$. For set $H$, let $\eta_H = \eta_H(\theta)$ be a parametrization of the joint distribution of $(y_1, y_2)$ such that the information matrix, say $I_H(\eta_H)$, is easy to compute. One such parametrization is $\eta_H = (\eta_{H1}', \eta_{H2}')'$, where $\eta_{H1}$ is the parameter vector of the conditional distribution of $y_1$ given $y_2$ and $\eta_{H2}$ is the parameter vector of the marginal distribution of $y_2$. Since the parameters of the conditional distribution are orthogonal to those of the marginal distribution, the parametrization $\eta_H = (\eta_{H1}', \eta_{H2}')'$ yields a block-diagonal $I_H(\eta_H)$. For set $K$, let $\eta_K = \eta_K(\theta)$ be a parametrization of the marginal distribution of $y_1$ such that the information matrix, say $I_K(\eta_K)$, is easy to compute. Define $\eta_L$ similarly. The parametrization for set $H$ need not be the same as that for set $K$ or set $L$, which provides flexibility in choosing the parametrizations. A separate orthogonal parametrization in each set leads to computational advantages over the direct maximum likelihood method.

Let $\hat\eta_H$ be the MLE of $\eta_H$ constructed using sample set $H$; define $\hat\eta_K$ and $\hat\eta_L$ similarly. Let $V_H$, $V_K$, and $V_L$ be the estimated covariance matrices of the MLE's $\hat\eta_H$, $\hat\eta_K$, and $\hat\eta_L$, respectively, and let $V = \mathrm{diag}(V_H, V_K, V_L)$. Note that $V^{-1} = \mathrm{diag}\{I_H(\eta_H), I_K(\eta_K), I_L(\eta_L)\}$. Because $V$ is a function of $\theta$, we write $V = V(\theta)$. Also define

$$
X(\theta) = \left[\frac{\partial\eta_H'}{\partial\theta},\; \frac{\partial\eta_K'}{\partial\theta},\; \frac{\partial\eta_L'}{\partial\theta}\right]'
$$

and $\eta(\theta) = (\eta_H(\theta)', \eta_K(\theta)', \eta_L(\theta)')'$. Note that $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$ is

different from $\eta(\hat\theta)$ in general. Using the above notation, the maximum likelihood estimator can be computed iteratively from (9) with $X_{(t)} = X(\hat\theta_{(t)})$ and $\hat V_{(t)} = V(\hat\theta_{(t)})$.

We now show that the procedure in (9) produces a fully efficient estimator of $\theta$, in the sense that it is equivalent to the Newton-Raphson procedure for ML estimation based on the observed likelihood. Let the score function of the observed likelihood be $S_{obs}(y; \theta) = \partial\log l_{obs}(\theta)/\partial\theta$, where

$$
l_{obs}(\theta) = \prod_i \int f(y_i; \theta)\, dy_{i(mis)},
$$

with $y_{i(mis)}$ defined to be the missing part of $y_i$, is the observed likelihood function of the parameter $\theta$. The Newton-Raphson method for maximum likelihood estimation can be defined as

$$
\hat\theta_{(t+1)} = \hat\theta_{(t)} + \left[I_{obs}(\hat\theta_{(t)})\right]^{-1} S_{obs}(y; \hat\theta_{(t)}),
\tag{13}
$$

where $I_{obs}(\theta) = E\left[-\partial^2\log l_{obs}(\theta)/\partial\theta\,\partial\theta'\right]$ is the expected information matrix of $\theta$. We show in Theorem 1 below that the iterations (9) and (13) are identical in the sense that, starting from $\hat\theta_{(t)}$, the two iterations produce the same value of $\hat\theta_{(t+1)}$ for all $t$. Therefore, our procedure gives a fully efficient estimator of $\theta$. The equivalence between the Gauss-Newton estimator and the maximum likelihood estimator will be established in a more general multivariate situation in Section 2.3. Note that evaluation of the likelihood $l_{obs}(\theta)$, and hence direct computation of the MLE through the scoring method (13), is not trivial

due to the complexity of evaluating the integral in the observed likelihood. Our procedure, on the other hand, is easy to implement because it involves evaluating only the likelihoods of the observed parts. The following theorem establishes the equivalence between the Gauss-Newton estimator in (9) and the maximum likelihood estimator in (13).

**Theorem 1** The Gauss-Newton estimator (9) is equivalent to the maximum likelihood estimator (13) in the sense that, starting from $\hat\theta_{(t)}$, (9) and (13) give the same value of $\hat\theta_{(t+1)}$.

Proof. See Appendix A.

Note that, because of the nature of the Newton-Raphson algorithm, the iteration (9) converges much faster than the usual EM algorithm. Moreover, our procedure directly produces a simple estimator $C$ of the variance of the MLE $\hat\theta$, while the EM algorithm does not give a direct estimate of this variance.

**Remark 2** (One-step estimator). Given a suitable choice of the initial estimator $\hat\theta_S$, the one-step estimator

$$
\hat\theta = \hat\theta_S + \left(X_S'\hat V_S^{-1} X_S\right)^{-1} X_S'\hat V_S^{-1} e_\eta
\tag{14}
$$

can be a very good approximation to the maximum likelihood estimator, where $X_S$ and $\hat V_S$ are $X$ and $V$, respectively, evaluated at the initial estimator $\hat\theta_S$. The one-step Newton-Raphson estimator (13) using a $\sqrt{n}$-consistent initial estimate is asymptotically equivalent to the MLE (Lehmann, 1983, Theorem 3.1). By Theorem 1, the one-step Gauss-Newton estimator (14) is therefore also asymptotically equivalent to the MLE.

### 2.3 General multivariate case

One advantage of the GLS (Gauss-Newton) procedure (9) is that it extends easily to the general multivariate case with $p$ variables, $y = (y_1, \ldots, y_p)'$. Any missing data set can be partitioned into, say, $H_1, \ldots, H_q$, mutually disjoint and exhaustive data sets such that, within each data set $H_j$, $j = 1, \ldots, q$, all elements share the same missing pattern. Therefore, each set $H_j$ can be regarded as a complete data set when only its observed variables are considered.

Let $\theta$ be a parameter vector by which the joint distribution of $y$ is fully indexed and identified. We choose a parameter vector $\eta_j$ for the joint distribution of the observed variables in $H_j$ such that this joint distribution is identified and the information matrix, say $I_j(\theta)$, is easy to compute. Let $\hat\eta_j$ be the MLE of $\eta_j$ computed from the data set $H_j$, which is easy to compute because $H_j$ is complete. We have $V_j = \mathrm{var}(\hat\eta_j) = I_j^{-1}(\theta)$. Let $\eta = (\eta_1', \ldots, \eta_q')'$ and $\hat\eta = (\hat\eta_1', \ldots, \hat\eta_q')'$; then $V = \mathrm{var}(\hat\eta) = \mathrm{diag}(I_1^{-1}(\theta), \ldots, I_q^{-1}(\theta))$. Letting $X = \partial\eta/\partial\theta'$, and given an initial consistent estimator $\hat\theta_S$, the iteration

$$
\hat\theta = \hat\theta_S + \left(X'V^{-1}X\right)^{-1} X'V^{-1}\left(\hat\eta - \eta(\hat\theta_S)\right)
\tag{15}
$$

defines a one-step Gauss-Newton procedure for ML estimation. The following theorem establishes the equivalence of the proposed estimator and the MLE; the proof is a straightforward extension of that of Theorem 1 and is skipped for brevity.

**Theorem 2** The one-step estimator (15) is equivalent to the maximum likelihood estimator.
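The partition into $H_1, \ldots, H_q$ can be computed by grouping rows on their observed-variable mask. A minimal sketch (the function name is ours), using `NaN` to mark missing cells:

```python
import numpy as np
from collections import defaultdict

def partition_by_pattern(y):
    """Group row indices of y (n x p, NaN = missing) by missing pattern.

    Returns a dict mapping each observed-mask tuple (True = observed)
    to the list of row indices sharing that pattern; rows with nothing
    observed are dropped, as they carry no information.
    """
    sets = defaultdict(list)
    for i, row in enumerate(np.asarray(y, float)):
        mask = tuple(~np.isnan(row))
        if any(mask):
            sets[mask].append(i)
    return dict(sets)

nan = float("nan")
y = [[1.0, 2.0], [3.0, nan], [nan, 4.0], [5.0, 6.0], [nan, nan]]
parts = partition_by_pattern(y)
# parts == {(True, True): [0, 3], (True, False): [1], (False, True): [2]}
```

In the bivariate notation of Section 2.1, the three keys correspond to the sets $H$, $K$, and $L$, and the fully missing row is discarded.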

To implement our procedure (15), we need to specify $(\theta, \eta, X, V)$ as well as an initial estimator $\hat\theta_S$. The specification of $(\theta, \eta, X, V)$ depends on the data distribution and the missing pattern. In the following remarks, explicit expressions for $(\theta, \eta, X, V)$ are given for some important cases. These remarks demonstrate that our procedure (15) can be easily implemented for multivariate normal data, and hence for multiple regression, with any missing pattern.

**Remark 3** (3-variate general non-monotone missing case). We give a detailed implementation of the Gauss-Newton procedure (15) for the 3-variate case. Assume that $y = (y_1, y_2, y_3)'$ is jointly normal, $N_3(\mu, \Sigma)$, with $\mu = (\mu_1, \mu_2, \mu_3)'$ and $\Sigma = (\sigma_{ij})$. All $7 = 2^3 - 1$ possible missing patterns are displayed in Table 3, together with expressions for $\theta$, $\eta$, $X$, and $V$. The parameters $\theta_1, \theta_2, \theta_3$ correspond to the parameters of the distribution of $y_1$, the conditional distribution of $y_2$ given $y_1$, and the conditional distribution of $y_3$ given $(y_1, y_2)$, respectively. The parameter $\theta = (\theta_1', \theta_2', \theta_3')'$ fully parameterizes the joint distribution of $(y_1, y_2, y_3)$. The conditional parameters can be written in the following regression equations:

$$
\begin{aligned}
y_1 &= \mu_1 + e_1, & e_1 &\sim N(0, \sigma_{11}), \\
y_2 &= \beta_{20\cdot1} + \beta_{21\cdot1}\, y_1 + e_{2\cdot1}, & e_{2\cdot1} &\sim N(0, \sigma_{22\cdot1}), \\
y_3 &= \beta_{30\cdot12} + \beta_{31\cdot12}\, y_1 + \beta_{32\cdot12}\, y_2 + e_{3\cdot12}, & e_{3\cdot12} &\sim N(0, \sigma_{33\cdot12}), \\
y_3 &= \beta_{30\cdot1} + \beta_{31\cdot1}\, y_1 + e_{3\cdot1}, & e_{3\cdot1} &\sim N(0, \sigma_{33\cdot1}), \\
y_3 &= \beta_{30\cdot2} + \beta_{31\cdot2}\, y_2 + e_{3\cdot2}, & e_{3\cdot2} &\sim N(0, \sigma_{33\cdot2}),
\end{aligned}
$$

in which the regression errors are independent of the regressors.

The construction of the estimator $\hat\eta_j$ from data set $H_j$ is obvious from the definition of the parameter $\eta_j$. For example, $\hat\eta_1 = (\hat\mu_{1,1}, \hat\sigma_{11,1})'$ is estimated from $H_1$; $\hat\eta_2 = (\hat\theta_{1,2}', \hat\theta_{2,2}')'$ is estimated from data set $H_2$, where $\hat\theta_{1,2} = (\hat\mu_{1,2}, \hat\sigma_{11,2})'$ is constructed from variable $y_1$ and $\hat\theta_{2,2} = (\hat\beta_{20\cdot1,2}, \hat\beta_{21\cdot1,2}, \hat\sigma_{22\cdot1,2})'$ from the regression of $y_2$ on $y_1$; $\hat\eta_7 = (\hat\mu_{2,7}, \hat\sigma_{22,7}, \hat\beta_{30\cdot2,7}, \hat\beta_{31\cdot2,7}, \hat\sigma_{33\cdot2,7})'$ is estimated from $H_7$, where $(\hat\mu_{2,7}, \hat\sigma_{22,7})$ is constructed from variable $y_2$ and $(\hat\beta_{30\cdot2,7}, \hat\beta_{31\cdot2,7}, \hat\sigma_{33\cdot2,7})$ from the regression of $y_3$ on $y_2$. We then have $\hat\eta = (\hat\eta_7', \ldots, \hat\eta_1')'$. A simple initial estimator $\hat\theta_S = (\hat\theta_{1S}', \hat\theta_{2S}', \hat\theta_{3S}')'$ can be constructed by averaging the available estimators from the data sets $H_1, \ldots, H_7$:

$$
\begin{aligned}
\hat\theta_{1S} &= (n_1\hat\theta_{1,1} + n_2\hat\theta_{1,2} + n_3\hat\theta_{1,3} + n_6\hat\theta_{1,6})/(n_1 + n_2 + n_3 + n_6), \\
\hat\theta_{2S} &= (n_2\hat\theta_{2,2} + n_3\hat\theta_{2,3})/(n_2 + n_3), \\
\hat\theta_{3S} &= \hat\theta_{3,3}.
\end{aligned}
$$

For the evaluation of $\eta$ and $X = \partial\eta/\partial\theta'$, we need expressions for $\eta$ in terms of $\theta$ and their derivatives; this issue is addressed in Remark 4 below. We then have all the materials needed to implement (15). Observe that some elements of $\eta_4, \ldots, \eta_7$ are nonlinear functions of $\theta$; therefore the $X$ matrix has elements other than 0 and 1, as occurred in the last two rows of $X$ in (7). For a monotone missing pattern, the sets $H_4, \ldots, H_7$ are empty and the $X$ matrix consists only of elements 0 and 1, as in Remark 1.

**Remark 4** (Evaluation of $\eta$ and $X$). Consider the general $p$-dimensional normal case $N_p(\mu, \Sigma)$, $\mu = (\mu_1, \ldots, \mu_p)'$, $\Sigma = (\sigma_{ij})$. As shown in Remark 3, the elements of $\eta$ take one of three forms: $\{\theta_j,\, j = 1, \ldots, p\}$; elements of $\{\mu, \Sigma\}$; or parameters, say $(\beta_{j0\cdot J}, \beta_{jJ\cdot J}, \sigma_{jj\cdot J})$, of the regression of $y_j$ on a vector $y_J$, a subvector of $(y_1, y_2, \ldots, y_{j-1})'$, such that $y_j = \beta_{j0\cdot J} + \beta_{jJ\cdot J}'\, y_J + e_{j\cdot J}$, $j = 2, \ldots, p$. In order to compute $\eta$ and $X = \partial\eta/\partial\theta'$, we need expressions

Table 3. The 3-dimensional normal case with non-monotone missing data

| $y_1$ | $y_2$ | $y_3$ | Data set | Size | Estimable parameters | Asymptotic variance |
|---|---|---|---|---|---|---|
| O | M | M | $H_1$ | $n_1$ | $\eta_1 = \theta_1$ | $W_1 = \mathrm{diag}(\sigma_{11}, 2\sigma_{11}^2)$ |
| O | O | M | $H_2$ | $n_2$ | $\eta_2 = (\theta_1', \theta_2')'$ | $W_2 = \mathrm{diag}(W_1, \Sigma_{22\cdot1}, 2\sigma_{22\cdot1}^2)$ |
| O | O | O | $H_3$ | $n_3$ | $\eta_3 = (\theta_1', \theta_2', \theta_3')'$ | $W_3 = \mathrm{diag}(W_2, \Sigma_{33\cdot12}, 2\sigma_{33\cdot12}^2)$ |
| M | M | O | $H_4$ | $n_4$ | $\eta_4 = (\mu_3, \sigma_{33})'$ | $W_4 = \mathrm{diag}(\sigma_{33}, 2\sigma_{33}^2)$ |
| M | O | M | $H_5$ | $n_5$ | $\eta_5 = (\mu_2, \sigma_{22})'$ | $W_5 = \mathrm{diag}(\sigma_{22}, 2\sigma_{22}^2)$ |
| O | M | O | $H_6$ | $n_6$ | $\eta_6 = (\theta_1', \beta_{30\cdot1}, \beta_{31\cdot1}, \sigma_{33\cdot1})'$ | $W_6 = \mathrm{diag}(W_1, \Sigma_{33\cdot1}, 2\sigma_{33\cdot1}^2)$ |
| M | O | O | $H_7$ | $n_7$ | $\eta_7 = (\mu_2, \sigma_{22}, \beta_{30\cdot2}, \beta_{31\cdot2}, \sigma_{33\cdot2})'$ | $W_7 = \mathrm{diag}(\sigma_{22}, 2\sigma_{22}^2, \Sigma_{33\cdot2}, 2\sigma_{33\cdot2}^2)$ |

Here $\theta_1 = (\mu_1, \sigma_{11})'$, $\theta_2 = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1})'$, $\theta_3 = (\beta_{30\cdot12}, \beta_{31\cdot12}, \beta_{32\cdot12}, \sigma_{33\cdot12})'$; $\mu = (\mu_1, \mu_2, \mu_3)'$ and $\Sigma = (\sigma_{ij})$ are computed from $(\theta_1, \theta_2, \theta_3)$ using the recursion in Remark 4 below; $\beta_{30\cdot1} = \mu_3 - \beta_{31\cdot1}\mu_1$, $\beta_{31\cdot1} = \sigma_{13}/\sigma_{11}$, $\sigma_{33\cdot1} = \sigma_{33} - \beta_{31\cdot1}^2\sigma_{11}$; $\beta_{30\cdot2} = \mu_3 - \beta_{31\cdot2}\mu_2$, $\beta_{31\cdot2} = \sigma_{23}/\sigma_{22}$, $\sigma_{33\cdot2} = \sigma_{33} - \beta_{31\cdot2}^2\sigma_{22}$; $\Sigma_{22\cdot1} = \{E[(1,y_1)'(1,y_1)]\}^{-1}\sigma_{22\cdot1}$, $\Sigma_{33\cdot12} = \{E[(1,y_1,y_2)'(1,y_1,y_2)]\}^{-1}\sigma_{33\cdot12}$, $\Sigma_{33\cdot1} = \{E[(1,y_1)'(1,y_1)]\}^{-1}\sigma_{33\cdot1}$, $\Sigma_{33\cdot2} = \{E[(1,y_2)'(1,y_2)]\}^{-1}\sigma_{33\cdot2}$; $\eta = (\eta_7', \eta_6', \ldots, \eta_1')'$, $\theta = \eta_3$, $V = \mathrm{diag}(W_7/n_7, W_6/n_6, \ldots, W_1/n_1)$, $X = \partial\eta/\partial\theta'$.

for the parameters $\mu$, $\Sigma$, and $(\beta_{j0\cdot J}, \beta_{jJ\cdot J}, \sigma_{jj\cdot J})$ in terms of the conditional parameters $\theta = (\theta_1', \ldots, \theta_p')'$. Using the regression representation

$$
y_{j+1} = \beta_{(j+1)0\cdot12\ldots j} + \beta_{(j+1)1\cdot12\ldots j}\, y_1 + \cdots + \beta_{(j+1)j\cdot12\ldots j}\, y_j + e_{(j+1)\cdot12\ldots j},
$$

we get, for $j = 0, 1, \ldots, p-1$,

$$
\mu_{j+1} = E(y_{j+1}) = \beta_{(j+1)0\cdot12\ldots j} + \beta_{(j+1)1\cdot12\ldots j}\,\mu_1 + \cdots + \beta_{(j+1)j\cdot12\ldots j}\,\mu_j,
$$

$$
\sigma_{i,j+1} = \mathrm{cov}(y_i, y_{j+1}) = \beta_{(j+1)1\cdot12\ldots j}\,\sigma_{i1} + \cdots + \beta_{(j+1)j\cdot12\ldots j}\,\sigma_{ij}, \qquad i = 1, 2, \ldots, j,
$$

and

$$
\sigma_{j+1,j+1} = \mathrm{var}(y_{j+1}) = \sigma_{(j+1)(j+1)\cdot12\ldots j} + \sum_{k=1}^{j}\sum_{l=1}^{j} \beta_{(j+1)k\cdot12\ldots j}\,\beta_{(j+1)l\cdot12\ldots j}\,\sigma_{kl}.
$$

Note that, in these three equations, $(\mu_{j+1}, \sigma_{1(j+1)}, \sigma_{2(j+1)}, \ldots, \sigma_{(j+1)(j+1)})$ is expressed in terms of the conditional parameter $\theta_{j+1} = (\beta_{(j+1)0\cdot12\ldots j}, \beta_{(j+1)1\cdot12\ldots j}, \ldots, \beta_{(j+1)j\cdot12\ldots j}, \sigma_{(j+1)(j+1)\cdot12\ldots j})'$ and the marginal parameters $(\mu_j, \sigma_{1j}, \ldots, \sigma_{jj})$, the latter of which is a function of $\theta_j, \theta_{j-1}, \ldots, \theta_1$. Therefore, recursive evaluation of these three equations for $j = 0, 1, \ldots, p-1$, with initial values $\beta_{10} = \mu_1$ and $\sigma_{11}$, gives the required expressions for the marginal parameters $(\mu_j, \sigma_{1j}, \ldots, \sigma_{(j-1)j}, \sigma_{jj})$, $j = 1, \ldots, p$, in terms of the conditional parameters $\theta_j, \theta_{j-1}, \ldots, \theta_1$.
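The recursion above is straightforward to implement. In the following minimal sketch (function names are ours), `theta[j]` holds the intercept, the $j$ slopes, and the conditional variance of the regression of $y_{j+1}$ on $(y_1, \ldots, y_j)$; a round trip from a known $(\mu, \Sigma)$ through the conditional parametrization and back checks the algebra.

```python
import numpy as np

def moments_from_conditionals(theta):
    """Recursively recover (mu, Sigma) from the conditional regression
    parameters, as in Remark 4.

    theta[j] = (intercept, slope_1, ..., slope_j, cond_variance) for
    the regression of y_{j+1} on (y_1, ..., y_j); theta[0] = (mu_1, s_11).
    """
    p = len(theta)
    mu = np.zeros(p)
    Sigma = np.zeros((p, p))
    for j in range(p):
        b0, b, s = theta[j][0], np.asarray(theta[j][1:-1]), theta[j][-1]
        mu[j] = b0 + b @ mu[:j]                   # mu_{j+1}
        Sigma[:j, j] = Sigma[:j, :j] @ b          # sigma_{i,j+1}
        Sigma[j, :j] = Sigma[:j, j]
        Sigma[j, j] = s + b @ Sigma[:j, :j] @ b   # sigma_{j+1,j+1}
    return mu, Sigma

def conditionals_from_moments(mu, Sigma):
    """Inverse map: conditional regression parameters from (mu, Sigma)."""
    p = len(mu)
    theta = []
    for j in range(p):
        SJJ, sJ = Sigma[:j, :j], Sigma[:j, j]
        b = np.linalg.solve(SJJ, sJ) if j else np.array([])
        b0 = mu[j] - b @ mu[:j]
        s = Sigma[j, j] - b @ sJ
        theta.append(np.concatenate([[b0], b, [s]]))
    return theta
```

For example, starting from any positive-definite $\Sigma$, `conditionals_from_moments` followed by `moments_from_conditionals` reproduces $(\mu, \Sigma)$ exactly.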

18 θ j, θ j 1,..., θ 1. Therefore, recursive evaluation of these three equations for j = 0, 1,..., p 1 with initial values β 10 = µ 1 and σ 11 = σ 11 gives the required expression for the marginal parameters µ j, σ 1j,..., σ j 1)j, σ jj ), j = 1,..., p in terms of the conditional parameters θ j, θ j 1,..., θ 1. 1, Partial derivatives are recursively computed as follows: for j = 0, 1,..., p j l=1 β j+1)l 12...j µ l /θ t if t = 1,..., j, µ j+1 / θ t = 1, µ 1,..., µ j, 0) if t = j if t = j + 2, j + 3,..., p, σ i,j+1 / θ t = β j+1) j σ i1 / θ t β j+1)j 12...j σ ij / θ t, i = 1, 2,..., j, t = 1, 2,..., j, σ j+1,j+1 / θ t = σ j+1)j+1) 12...j + j k=1 j l=1 β j+1)k 12...jβ j+1)l 12...j σ kl / θ t, t = 1, 2,..., j, σ i,j+1 / θ j+1 = 0, σ i1,..., σ ij, 0), i = 1,..., j, σ j+1,j+1 / θ j+1 = 0, 2 j k=1 β j+1)k 12...jσ k1, 2 j k=1 β j+1)k 12...jσ k2,,..., 2 j k=1 β j+1)k 12...jσ kj, 1), σ i,j+1 / θ t = 0, i = 1, 2,..., j + 1, t = j + 2, j + 3,..., p. Given these expressions for µ, Σ and their partial derivatives in terms of θ, it is straightforward to compute β j0.j, β jj.j, σ jj.j) and the corresponding derivatives because the regression parameters are simple functions of µ and Σ. Remark 5 - A 4-variate non-monotone case. In Table 4, expressions for θ, η, X and V are given for a 4-dimensional case with a specific missing pattern. 17

Table 4. A 4-dimensional normal case with non-monotone missing data

| $y_1$ | $y_2$ | $y_3$ | $y_4$ | Data set | Size | Estimable parameters | Asymptotic variance |
|---|---|---|---|---|---|---|---|
| O | M | M | M | $H_1$ | $n_1$ | $\eta_1 = (\mu_1, \sigma_{11})'$ | $W_1 = \mathrm{diag}(\sigma_{11}, 2\sigma_{11}^2)$ |
| M | O | M | M | $H_2$ | $n_2$ | $\eta_2 = (\mu_2, \sigma_{22})'$ | $W_2 = \mathrm{diag}(\sigma_{22}, 2\sigma_{22}^2)$ |
| M | M | O | M | $H_3$ | $n_3$ | $\eta_3 = (\mu_3, \sigma_{33})'$ | $W_3 = \mathrm{diag}(\sigma_{33}, 2\sigma_{33}^2)$ |
| M | M | M | O | $H_4$ | $n_4$ | $\eta_4 = (\mu_4, \sigma_{44})'$ | $W_4 = \mathrm{diag}(\sigma_{44}, 2\sigma_{44}^2)$ |
| O | O | O | O | $H_5$ | $n_5$ | $\eta_5 = (\theta_1', \theta_2', \theta_3', \theta_4')'$ | $W_5$, see below |

Here $\theta_1 = (\mu_1, \sigma_{11})'$, $\theta_2 = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1})'$, $\theta_3 = (\beta_{30\cdot12}, \beta_{31\cdot12}, \beta_{32\cdot12}, \sigma_{33\cdot12})'$, $\theta_4 = (\beta_{40\cdot123}, \beta_{41\cdot123}, \beta_{42\cdot123}, \beta_{43\cdot123}, \sigma_{44\cdot123})'$; $\mu = (\mu_1, \ldots, \mu_4)'$ and $\Sigma = (\sigma_{ij})$ are computed from $(\theta_1, \ldots, \theta_4)$ using the recursion in Remark 4; $W_5 = \mathrm{diag}(\sigma_{11}, 2\sigma_{11}^2, \Sigma_{22\cdot1}, 2\sigma_{22\cdot1}^2, \Sigma_{33\cdot12}, 2\sigma_{33\cdot12}^2, \Sigma_{44\cdot123}, 2\sigma_{44\cdot123}^2)$; $\Sigma_{22\cdot1} = \{E[(1,y_1)'(1,y_1)]\}^{-1}\sigma_{22\cdot1}$, $\Sigma_{33\cdot12} = \{E[(1,y_1,y_2)'(1,y_1,y_2)]\}^{-1}\sigma_{33\cdot12}$, $\Sigma_{44\cdot123} = \{E[(1,y_1,y_2,y_3)'(1,y_1,y_2,y_3)]\}^{-1}\sigma_{44\cdot123}$; $\eta = (\eta_5', \eta_4', \ldots, \eta_1')'$, $\theta = \eta_5$, $V = \mathrm{diag}(W_5/n_5, W_4/n_4, \ldots, W_1/n_1)$, $X = \partial\eta/\partial\theta'$.

## 3 Efficiency comparison

We compare the efficiencies of estimators constructed from different combinations of the data sets $H$, $K$, $L$ in the bivariate normal case. Under the non-monotone missing pattern, we can compute the following four types of estimates:

1. $\hat\theta_H$: the maximum likelihood estimator using the samples in set $H$;
2. $\hat\theta_{HK}$: the maximum likelihood estimator using the samples in $H \cup K$;
3. $\hat\theta_{HL}$: the maximum likelihood estimator using the samples in $H \cup L$;
4. $\hat\theta_{HKL}$: the maximum likelihood estimator using the whole sample.

By Theorem 1, the Gauss-Newton estimator (9) is asymptotically equal to $\hat\theta_{HKL}$. Write $X' = [X_H', X_K', X_L']$, where $X_H$ is the $5 \times 5$ submatrix of $X$ formed by the rows corresponding to $\hat\eta_H$, $X_K$ is the $2 \times 5$ submatrix corresponding to $\hat\eta_K$, and $X_L$ is the $2 \times 5$ submatrix corresponding to $\hat\eta_L$. Similarly, we decompose $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$ and $V = \mathrm{diag}\{V_H, V_K, V_L\}$.

Using the arguments in the proof of Theorem 1, the asymptotic variance of $\hat\theta_H$ is $(X_H' V_H^{-1} X_H)^{-1}$. Similarly, we have

$$
\mathrm{Var}(\hat\theta_{HL}) = \left(X_H' V_H^{-1} X_H + X_L' V_L^{-1} X_L\right)^{-1},
$$

$$
\mathrm{Var}(\hat\theta_{HK}) = \left(X_H' V_H^{-1} X_H + X_K' V_K^{-1} X_K\right)^{-1},
$$

and

$$
\mathrm{Var}(\hat\theta_{HKL}) = \left(X_H' V_H^{-1} X_H + X_K' V_K^{-1} X_K + X_L' V_L^{-1} X_L\right)^{-1}.
$$

Using matrix identities such as

$$
\left(X_H' V_H^{-1} X_H + X_L' V_L^{-1} X_L\right)^{-1}
= \left(X_H' V_H^{-1} X_H\right)^{-1} - \left(X_H' V_H^{-1} X_H\right)^{-1} X_L'
\left[V_L + X_L \left(X_H' V_H^{-1} X_H\right)^{-1} X_L'\right]^{-1} X_L \left(X_H' V_H^{-1} X_H\right)^{-1},
$$

we can derive explicit expressions for the variances of the estimators.

For the estimates of the slope parameter, the asymptotic variances are

$$
\mathrm{Var}(\hat\beta_{21\cdot1,HK}) = \frac{\sigma_{22\cdot1}}{\sigma_{11}\, n_H} = \mathrm{Var}(\hat\beta_{21\cdot1,H}),
$$

$$
\mathrm{Var}(\hat\beta_{21\cdot1,HL}) = \frac{\sigma_{22\cdot1}}{\sigma_{11}\, n_H}\left\{1 - 2p_L\rho^2\left(1 - \rho^2\right)\right\},
$$

$$
\mathrm{Var}(\hat\beta_{21\cdot1,HKL}) = \frac{\sigma_{22\cdot1}}{\sigma_{11}\, n_H}\left\{1 - \frac{2p_L\rho^2(1 - \rho^2)}{1 - p_Lp_K\rho^4}\right\},
$$

where $\rho^2 = \sigma_{12}^2/(\sigma_{11}\sigma_{22})$, $p_K = n_K/(n_H + n_K)$, and $p_L = n_L/(n_H + n_L)$. See Appendix B for the derivations of these and the other variances below. Thus, we have

$$
\mathrm{Var}(\hat\beta_{21\cdot1,H}) = \mathrm{Var}(\hat\beta_{21\cdot1,HK}) \ge \mathrm{Var}(\hat\beta_{21\cdot1,HL}) \ge \mathrm{Var}(\hat\beta_{21\cdot1,HKL}).
\tag{16}
$$

Here strict inequalities generally hold, except in trivial special cases. Note that the asymptotic variance of $\hat\beta_{21\cdot1,HK}$ is the same as the variance of $\hat\beta_{21\cdot1,H}$,

which implies that there is no gain in efficiency from adding set $K$ (missing $y_2$) to $H$. On the other hand, comparing $\mathrm{Var}(\hat\beta_{21\cdot1,HL})$ with $\mathrm{Var}(\hat\beta_{21\cdot1,H})$, we observe an efficiency gain from adding set $L$ (missing $y_1$) to $H$. This analysis is similar to the results of Little (1992), who summarized statistical results for regression with missing X's, whose data sets are $H$ and $L$. Little (1992) did not include the case of missing $y_2$'s because the data set $K$, with missing $y_2$, contains no additional information for estimating the regression parameters. It is interesting to observe that even though adding $K$ (the data set with missing $y_2$) to $H$ does not improve the efficiency of the regression parameter estimate, i.e., $\mathrm{Var}(\hat\beta_{21\cdot1,H}) = \mathrm{Var}(\hat\beta_{21\cdot1,HK})$, adding $K$ to $(H, L)$ does improve the efficiency, i.e., $\mathrm{Var}(\hat\beta_{21\cdot1,HL}) > \mathrm{Var}(\hat\beta_{21\cdot1,HKL})$.

Using these expressions, we can quantify the variance reduction of $\hat\beta_{21\cdot1,HKL}$ over $\hat\beta_{21\cdot1,HK}$: the relative efficiency of $\hat\beta_{21\cdot1,HKL}$ to $\hat\beta_{21\cdot1,HK}$ is larger for larger values of $p_K$, $p_L$, or $\rho$. If $p_L = p_K = 0.5$ and $\rho = 0.5$, the relative efficiency is approximately 1.24; if $p_L = p_K = 0.9$ and $\rho = 0.9$, it is approximately 2.45. For the other parameters of the conditional distribution, relationships similar to (16) hold.

For the marginal parameters, we have

$$
\mathrm{Var}(\hat\mu_{1,HK}) = \frac{\sigma_{11}}{n_H}(1 - p_K), \qquad
\mathrm{Var}(\hat\mu_{1,HL}) = \frac{\sigma_{11}}{n_H}\left(1 - p_L\rho^2\right),
$$

$$
\mathrm{Var}(\hat\mu_{1,HKL}) = \frac{\sigma_{11}}{n_H}\left\{(1 - p_K) - \frac{p_L\rho^2(1 - p_K)^2}{1 - p_Lp_K\rho^2}\right\},
$$

and
$$ Var(\hat\sigma_{11,HK}) = \frac{2\sigma_{11}^2}{n_H}\left(1-p_K\right), \qquad Var(\hat\sigma_{11,HL}) = \frac{2\sigma_{11}^2}{n_H}\left(1-p_L\rho^4\right), $$
$$ Var(\hat\sigma_{11,HKL}) = \frac{2\sigma_{11}^2}{n_H}\left\{\left(1-p_K\right) - \frac{p_L\rho^4\left(1-p_K\right)^2}{1-p_Lp_K\rho^4}\right\}. $$
Note that the efficiency of the marginal parameters $(\mu_1, \sigma_{11})$ of $y_1$ improves if additional data for $y_2$ with $y_1$ missing are provided. In particular, if $n_K = n_L$, then
$$ Var(\hat\mu_{1,H}) \ge Var(\hat\mu_{1,HL}) \ge Var(\hat\mu_{1,HK}) \ge Var(\hat\mu_{1,HKL}) $$
and
$$ Var(\hat\sigma_{11,H}) \ge Var(\hat\sigma_{11,HL}) \ge Var(\hat\sigma_{11,HK}) \ge Var(\hat\sigma_{11,HKL}). $$

4 A Numerical Example

For a numerical example, we consider the data set adapted from Bishop, Fienberg and Holland (1975, Table 1.4-2). Table 5 gives the data for a $2^3$ table of three categorical variables ($Y_1$ = Clinic, $Y_2$ = Parental care, $Y_3$ = Survival) with one supplemental margin for $Y_2$ and $Y_3$ and another supplemental margin for $Y_1$ and $Y_3$. In this setup, the $Y_i$ are all dichotomous, taking the value 0 or 1, and 8 parameters can be defined as $\pi_{ijk} = Pr(Y_1 = i, Y_2 = j, Y_3 = k)$, $i = 0, 1$; $j = 0, 1$; $k = 0, 1$. For the orthogonal parametrization, we use
$$ \eta_H = \left(\pi_{1|11}, \pi_{1|10}, \pi_{1|01}, \pi_{1|00}, \pi_{+1|1}, \pi_{+1|0}, \pi_{++1}\right)', $$

Table 5. A $2^3$ table with supplemental margins

Set   y1   y2   y3   Count
H      1    1    1     293
H      1    1    0       4
H      1    0    1     176
H      1    0    0       3
H      0    1    1      23
H      0    1    0       2
H      0    0    1     197
H      0    0    0      17
K      1         1     100
K      1         0       5
K      0         1      82
K      0         0       6
L           1    1      90
L           1    0       5
L           0    1     150
L           0    0      10

where $\pi_{i|jk} = Pr(y_1 = i \mid y_2 = j, y_3 = k)$, $\pi_{+j|k} = Pr(y_2 = j \mid y_3 = k)$, and $\pi_{++k} = Pr(y_3 = k)$. We also set $\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \theta_6, \theta_7)' = \eta_H$. Note that the validity of the proposed method does not depend on the choice of the parametrization; a suitable parametrization simply makes the computation of the information matrix simple.

From the data in Table 5, we can obtain 13 observations for 7 parameters. The observation vector can be written $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$, where
$$ \hat\eta_H = \left(293/316,\, 4/6,\, 176/373,\, 3/20,\, 316/689,\, 6/26,\, 689/715\right)', $$
$$ \hat\eta_K = \left(\hat\pi_{1|+1,K}, \hat\pi_{1|+0,K}, \hat\pi_{++1,K}\right)' = \left(100/182,\, 5/11,\, 182/193\right)', $$
$$ \hat\eta_L = \left(\hat\pi_{+1|1,L}, \hat\pi_{+1|0,L}, \hat\pi_{++1,L}\right)' = \left(90/240,\, 5/15,\, 240/255\right)', $$
with expectations $\eta(\theta) = (\eta_H', \eta_K', \eta_L')'$, where $\eta_H = \theta$,
$$ \eta_K = \left(\pi_{1|11}\pi_{+1|1} + \pi_{1|01}\pi_{+0|1},\; \pi_{1|10}\pi_{+1|0} + \pi_{1|00}\pi_{+0|0},\; \pi_{++1}\right)' = \left(\theta_1\theta_5 + \theta_3(1-\theta_5),\; \theta_2\theta_6 + \theta_4(1-\theta_6),\; \theta_7\right)', $$
and $\eta_L = \left(\pi_{+1|1}, \pi_{+1|0}, \pi_{++1}\right)' = \left(\theta_5, \theta_6, \theta_7\right)'$, and the variance-covariance matrix
$$ V = \mathrm{diag}\left\{V_H/n_H,\, V_K/n_K,\, V_L/n_L\right\}, $$
where
$$ V_H = \mathrm{diag}\left\{\theta_1(1-\theta_1),\, \theta_2(1-\theta_2),\, \ldots,\, \theta_7(1-\theta_7)\right\}, $$
$$ V_K = \mathrm{diag}\left\{\pi_{1|+1}\left(1-\pi_{1|+1}\right),\, \pi_{1|+0}\left(1-\pi_{1|+0}\right),\, \pi_{++1}\left(1-\pi_{++1}\right)\right\}, $$
$$ V_L = \mathrm{diag}\left\{\theta_5(1-\theta_5),\, \theta_6(1-\theta_6),\, \theta_7(1-\theta_7)\right\}. $$
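To make this setup concrete, the sketch below (our illustration, not the authors' code) assembles $\hat\eta$, $\eta(\theta)$, its Jacobian, and the block-diagonal $V$ as displayed above, and then performs one Gauss-Newton/GLS step $\hat\theta^1 = \hat\theta_S + (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}\{\hat\eta - \eta(\hat\theta_S)\}$. Because the simplified diagonal $V$ is used exactly as displayed, individual components need not reproduce the published one-step estimates to the last digit:

```python
import numpy as np

# One-step Gauss-Newton (GLS) update for the categorical example.
# Illustrative sketch; V follows the simplified diagonal form given in the text.
n_H, n_K, n_L = 715, 193, 255

# Observation vector eta-hat = (eta_H', eta_K', eta_L')'
eta_hat = np.array([293/316, 4/6, 176/373, 3/20, 316/689, 6/26, 689/715,  # H
                    100/182, 5/11, 182/193,                               # K
                    90/240, 5/15, 240/255])                               # L

# Initial estimator theta_S (pooled estimates for the last three components)
theta0 = np.array([293/316, 4/6, 176/373, 3/20, 406/929, 11/41, 1111/1163])

def eta(t):
    """Model expectation eta(theta) = (eta_H', eta_K', eta_L')'."""
    t1, t2, t3, t4, t5, t6, t7 = t
    return np.concatenate([t,
                           [t1*t5 + t3*(1 - t5), t2*t6 + t4*(1 - t6), t7],
                           [t5, t6, t7]])

def jac(t):
    """Jacobian X = d eta / d theta' (13 x 7)."""
    t1, t2, t3, t4, t5, t6, t7 = t
    XK = np.array([[t5, 0, 1 - t5, 0, t1 - t3, 0, 0],
                   [0, t6, 0, 1 - t6, 0, t2 - t4, 0],
                   [0, 0, 0, 0, 0, 0, 1.0]])
    XL = np.zeros((3, 7))
    XL[0, 4] = XL[1, 5] = XL[2, 6] = 1.0
    return np.vstack([np.eye(7), XK, XL])

e = eta(theta0)
n = np.repeat([n_H, n_K, n_L], [7, 3, 3])
Vinv = np.diag(n / (e * (1 - e)))   # inverse of diag{V_H/n_H, V_K/n_K, V_L/n_L}
X = jac(theta0)
A = X.T @ Vinv @ X
theta1 = theta0 + np.linalg.solve(A, X.T @ Vinv @ (eta_hat - e))
```

Because $y_3$ is fully observed, the seventh component decouples and the one-step update leaves it at the pooled proportion $1111/1163 \approx 0.955$.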

The Gauss-Newton method as in (9) can be used to solve this nonlinear model, where the initial estimator of $\theta$ is
$$ \hat\theta_S = \left(293/316,\, 4/6,\, 176/373,\, 3/20,\, 406/929,\, 11/41,\, 1111/1163\right)' $$
and the X matrix is
$$ X = \frac{\partial\eta(\theta)}{\partial\theta'} = \begin{pmatrix} I_7 \\ X_K \\ X_L \end{pmatrix}, \qquad
X_K = \begin{pmatrix} \theta_5 & 0 & 1-\theta_5 & 0 & \theta_1-\theta_3 & 0 & 0 \\ 0 & \theta_6 & 0 & 1-\theta_6 & 0 & \theta_2-\theta_4 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad
X_L = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}. $$
The resulting one-step estimates are
$$ \hat\theta^1 = \left(0.923,\, 0.678,\, 0.454,\, 0.168,\, 0.426,\, 0.272,\, 0.955\right)'. $$
Standard errors of the estimated values are computed from (10); the corresponding standard errors of the initial parameter estimates $\hat\theta_S$ are computed for comparison. Note that there is no efficiency gain for $\hat\pi_{++1}$ because $y_3$ is fully observed throughout the sample.

5 Simulation Study

To test our theory with finite sample sizes, we perform a limited simulation. For the simulated data set, we generate B = 10,000 samples of size n from

Table 6. Monte Carlo variances of the point estimators under two different estimation schemes, based on 10,000 Monte Carlo samples. (For each parameter $\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}$ and each sample size $n = 100, 500$, the table reports the EM-estimation variance, the GN-estimation variance, and the true variance.)

the population given by $x_i \stackrel{iid}{\sim} N(1,1)$ and $y_{1i} \stackrel{iid}{\sim} N(0,1)$, with $y_{2i} = y_{1i} + e_i$, where the $e_i$ are independent normal errors with mean zero. We use two levels of sample sizes, $n = 100$ and $n = 500$. Variable $x$ is always observed, and variables $y_1$ and $y_2$ are subject to missingness. The response probability for $y_1$ follows a logistic regression model such that logit$\{Pr(y_1 \text{ is observed})\} = x$, the response probability for $y_2$ follows a logistic regression model such that logit$\{Pr(y_2 \text{ is observed})\} = 0.7x$, and the two response indicators are independent. The one-step Gauss-Newton (GN) estimation and the EM estimation are compared. The estimates from the EM algorithm are computed after 10 iterations, with the same initial values as for the one-step GN estimator.
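The data-generating and response mechanisms above can be mimicked as in the following sketch. This is our illustration only: the error variance of $e$ and the absence of intercepts in the response models are assumptions where the source is not explicit.

```python
import numpy as np

# Sketch of the simulation design: bivariate (y1, y2) with covariate x and
# independent logistic response mechanisms. Unit error variance is assumed.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(1.0, 1.0, size=n)          # covariate, always observed
y1 = rng.normal(0.0, 1.0, size=n)
y2 = y1 + rng.normal(0.0, 1.0, size=n)    # assumed unit error variance

p1 = 1.0 / (1.0 + np.exp(-x))             # logit{Pr(y1 observed)} = x
p2 = 1.0 / (1.0 + np.exp(-0.7 * x))       # logit{Pr(y2 observed)} = 0.7x
r1 = rng.random(n) < p1                   # independent response indicators
r2 = rng.random(n) < p2

y1_obs = np.where(r1, y1, np.nan)         # observed data with missing values
y2_obs = np.where(r2, y2, np.nan)
```

Since the response probabilities depend only on the fully observed $x$, the mechanism is missing at random, as required by the proposed method.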

Table 7. Monte Carlo results for the estimated variance of the proposed method, based on 10,000 Monte Carlo samples. (For each parameter $\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}$ and each sample size $n = 100, 500$, the table reports the mean of the estimated variances, the relative bias, and the t-statistic.)

The means and variances of the point estimators and the mean of the estimated variances are calculated. Because the point estimators are all unbiased in the simulation, their simulation means are not listed here. Table 6 displays the Monte Carlo variances of the point estimators for each estimation method. The theoretical asymptotic variances of the MLE are also computed and presented in the last column of Table 6. The simulation results in Table 6 are generally consistent with the theoretical results. The simulation variances are slightly larger than the theoretical variances because the estimators were not computed until convergence.

Table 7 displays the mean, relative bias, and t-statistic of the estimated variance of the one-step GN method. The relative bias is the Monte Carlo bias of the variance estimator divided by the Monte Carlo variance, where the variance is given in Table 6. The t-statistic for testing the hypothesis of zero bias is the Monte Carlo estimated bias divided by the Monte Carlo standard error of the estimated bias. The t-values as well as the values

of the relative biases indicate that the estimated variances of our estimators, computed using (10), are close to their theoretical values.

The simulation results in Table 6 show that the two procedures have similar performance in terms of point estimation. The efficiency is slightly better for the one-step GN method because the EM algorithm was terminated after 10 iterations. The efficiency improvement is larger for the variance parameters than for the mean parameters, which suggests that the EM algorithm converges faster for the mean parameters than for the variance parameters. Also, as can be seen in Table 7, the one-step GN method provides consistent variance estimates for all parameters. The performance is better for the larger sample size, as the consistency of the variance estimator is justified by asymptotic theory.

6 Concluding remarks

We have proposed a Gauss-Newton algorithm for obtaining the maximum likelihood estimator under a general non-monotone missing data scenario. The proposed method is algebraically equivalent to the Newton-Raphson method but avoids the burden of computing the partial derivatives of the observed likelihood. Instead, the MLEs computed separately from the full likelihood and the marginal likelihoods of the partitions of the data are combined in a natural way. The combination takes the form of GLS estimation and thus can be easily implemented using existing software. The estimated covariance matrix is computed automatically and shows good finite-sample performance in the simulation study. The proposed method is not restricted to the multivariate normal distribution. It can be applied to any parametric multivariate

distribution as long as the computations for the marginal likelihoods and the full likelihood are easier than that for the observed likelihood.

The proposed method assumes an ignorable response mechanism. A more realistic situation would be the case where the probability of $y_2$ being missing depends on the value of $y_1$. In this case, the assumption of missing at random no longer holds, and we have to take the response mechanism into account. Further investigation in this direction is a topic for future research.

Appendix

A. Proof of Theorem 1

Note that the observed log-likelihood can be written as a sum of the log-likelihoods over the sets:
$$ \log l_{obs}(\theta) = \log l_H(\theta) + \log l_K(\theta) + \log l_L(\theta), \qquad (A.1) $$
where $l_H = \prod_{i \in H} f(y_i; \theta)$ is the likelihood function defined on set H, and $l_K$ and $l_L$ are defined similarly. Under MAR, $l_H$ is the likelihood for the joint distribution of $y_1$ and $y_2$, $l_K$ is the likelihood for the marginal distribution of $y_1$, and $l_L$ is the likelihood for the marginal distribution of $y_2$. By (A.1), the score function for the likelihood can be written as
$$ S_{obs}(y;\theta) = S_H(y;\theta) + S_K(y;\theta) + S_L(y;\theta) \qquad (A.2) $$
and the expected information matrix also satisfies the additive decomposition
$$ I_{obs}(\theta) = I_H(\theta) + I_K(\theta) + I_L(\theta), \qquad (A.3) $$
where $I_H(\theta) = E\left[-\partial^2 \log l_H(\theta)/\partial\theta\,\partial\theta'\right]$, and $I_K(\theta)$ and $I_L(\theta)$ are defined similarly.

The equation in (A.3) can be written as
$$ I_{obs}(\theta) = \left(\frac{\partial\eta_H}{\partial\theta'}\right)'I_H(\eta_H)\left(\frac{\partial\eta_H}{\partial\theta'}\right) + \left(\frac{\partial\eta_K}{\partial\theta'}\right)'I_K(\eta_K)\left(\frac{\partial\eta_K}{\partial\theta'}\right) + \left(\frac{\partial\eta_L}{\partial\theta'}\right)'I_L(\eta_L)\left(\frac{\partial\eta_L}{\partial\theta'}\right) = X'\hat V^{-1}X, \qquad (A.4) $$
where $X' = \left(\partial\eta_H'/\partial\theta,\, \partial\eta_K'/\partial\theta,\, \partial\eta_L'/\partial\theta\right)$ and $\hat V^{-1} = \mathrm{diag}\left\{I_H(\eta_H), I_K(\eta_K), I_L(\eta_L)\right\}$.

Now, consider the score function in (A.2). Using the chain rule, the score function can be written as
$$ S_{obs}(y;\theta) = \left(\frac{\partial\eta_H}{\partial\theta'}\right)'S_H(y;\eta_H) + \left(\frac{\partial\eta_K}{\partial\theta'}\right)'S_K(y;\eta_K) + \left(\frac{\partial\eta_L}{\partial\theta'}\right)'S_L(y;\eta_L). \qquad (A.5) $$
Let $\hat\eta_H$ be the MLE maximizing $l_H$. Taking a Taylor expansion of $S_H(y;\eta_H)$ about $\hat\eta_H$ leads to
$$ S_H(y;\eta_H) \doteq S_H(y;\hat\eta_H) - I_H(\hat\eta_H)\left(\eta_H - \hat\eta_H\right), $$
where $I_H(\eta_H) = -\partial^2 \log l_H(\eta_H)/\partial\eta_H\,\partial\eta_H'$. Using $S_H(y;\hat\eta_H) = 0$ and the convergence of the observed information matrix to the expected information matrix, we have
$$ S_H(y;\eta_H) \doteq I_H(\hat\eta_H)\left(\hat\eta_H - \eta_H\right). $$
Similar results hold for the sets K and L. Thus, (A.5) becomes
$$ S_{obs}(y;\theta) \doteq \left(\frac{\partial\eta_H}{\partial\theta'}\right)'I_H(\hat\eta_H)\left(\hat\eta_H - \eta_H\right) + \left(\frac{\partial\eta_K}{\partial\theta'}\right)'I_K(\hat\eta_K)\left(\hat\eta_K - \eta_K\right) + \left(\frac{\partial\eta_L}{\partial\theta'}\right)'I_L(\hat\eta_L)\left(\hat\eta_L - \eta_L\right) = X'\hat V^{-1}\left(\hat\eta - \eta\right). \qquad (A.6) $$
Therefore, inserting (A.4) and (A.6) into (13), we have (9).

B. Computations for the variance formulas

Using
$$ \left(X_H'\hat V_H^{-1}X_H + X_K'\hat V_K^{-1}X_K\right)^{-1} = \left(X_H'\hat V_H^{-1}X_H\right)^{-1} - \left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_K'\left[\hat V_K + X_K\left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_K'\right]^{-1}X_K\left(X_H'\hat V_H^{-1}X_H\right)^{-1}, $$
it can easily be shown that $(X_H'\hat V_H^{-1}X_H + X_K'\hat V_K^{-1}X_K)^{-1}$ is diagonal, with the entries for the conditional parameters equal to those of $(X_H'\hat V_H^{-1}X_H)^{-1}$ and the entries for $(\mu_1, \sigma_{11})$ equal to $\sigma_{11}/(n_H+n_K)$ and $2\sigma_{11}^2/(n_H+n_K)$.

Now, to use the analogous formula
$$ \left(X_H'\hat V_H^{-1}X_H + X_L'\hat V_L^{-1}X_L\right)^{-1} = \left(X_H'\hat V_H^{-1}X_H\right)^{-1} - \left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_L'\left[\hat V_L + X_L\left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_L'\right]^{-1}X_L\left(X_H'\hat V_H^{-1}X_H\right)^{-1}, $$
note that
$$ \hat V_L + X_L\left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_L' = \mathrm{diag}\left\{\frac{\sigma_{22}}{n_{HL}}, \frac{2\sigma_{22}^2}{n_{HL}}\right\}, $$
where $n_{HL}^{-1} = n_H^{-1} + n_L^{-1}$. Evaluating $X_L(X_H'\hat V_H^{-1}X_H)^{-1}$, whose nonzero entries involve $\sigma_{12}$, $\beta_{21.1}\sigma_{22.1}$, $\mu_1$, and $\sigma_{12}^2$ scaled by $n_H^{-1}$, and substituting into the formula above yields $(X_H'\hat V_H^{-1}X_H + X_L'\hat V_L^{-1}X_L)^{-1}$ explicitly,

which gives the variances of $\hat\theta_{HL}$.

To compute the variances of $\hat\theta_{HKL}$, use
$$ \left(X_{HK}'\hat V_{HK}^{-1}X_{HK} + X_L'\hat V_L^{-1}X_L\right)^{-1} = \left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1} - \left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1}X_L'\left[\hat V_L + X_L\left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1}X_L'\right]^{-1}X_L\left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1}, $$
where
$$ X_{HK}'\hat V_{HK}^{-1}X_{HK} = X_H'\hat V_H^{-1}X_H + X_K'\hat V_K^{-1}X_K. $$
Writing
$$ D \equiv \hat V_L + X_L\left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1}X_L' = \mathrm{diag}\left\{\frac{\sigma_{22}}{n_{HL}}\left(1-p_Lp_K\rho^2\right), \frac{2\sigma_{22}^2}{n_{HL}}\left(1-p_Lp_K\rho^4\right)\right\}, $$
where $n_{HL}^{-1} = n_H^{-1} + n_L^{-1}$, $p_L = n_L/(n_H+n_L)$, and $p_K = n_K/(n_H+n_K)$, and evaluating $X_L(X_{HK}'\hat V_{HK}^{-1}X_{HK})^{-1}$, whose nonzero entries involve $\sigma_{12}(1-p_K)$, $\beta_{21.1}\sigma_{22.1}$, $\mu_1$, and $2\sigma_{22}(1-p_K)$ scaled by $n_H^{-1}$, the variance of $\hat\theta_{HKL}$ can be obtained from
$$ \left(X_{HK}'\hat V_{HK}^{-1}X_{HK} + X_L'\hat V_L^{-1}X_L\right)^{-1} = \left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1} - \left[X_L\left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1}\right]'D^{-1}\left[X_L\left(X_{HK}'\hat V_{HK}^{-1}X_{HK}\right)^{-1}\right]. $$
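As a sanity check on these derivations (our illustration, not part of the paper), the sketch below verifies the matrix-inverse identity numerically with random matrices and confirms that the slope-variance expressions of Section 3 satisfy the ordering in (16):

```python
import numpy as np

# 1) Numerical check of the identity
#    (A + X' W^{-1} X)^{-1} = A^{-1} - A^{-1} X' (W + X A^{-1} X')^{-1} X A^{-1}.
rng = np.random.default_rng(2)
p, q = 5, 2
M = rng.normal(size=(p + 2, p))
A = M.T @ M                                  # SPD; plays the role of X_H' V_H^{-1} X_H
W = np.diag(rng.uniform(0.5, 2.0, size=q))   # plays the role of V_L
Xl = rng.normal(size=(q, p))                 # plays the role of X_L

lhs = np.linalg.inv(A + Xl.T @ np.linalg.inv(W) @ Xl)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ Xl.T @ np.linalg.inv(W + Xl @ Ainv @ Xl.T) @ Xl @ Ainv
assert np.allclose(lhs, rhs)

# 2) Ordering (16) of the slope variances, evaluating the closed-form expressions.
def slope_vars(sigma11, sigma22, sigma12, n_H, p_K, p_L):
    rho2 = sigma12**2 / (sigma11 * sigma22)
    base = (sigma22 - sigma12**2 / sigma11) / (sigma11 * n_H)  # Var under H (= HK)
    v_HL = base * (1 - p_L * rho2 * (1 - rho2))
    v_HKL = base * (1 - p_L * rho2 * (1 - rho2) / (1 - p_L * p_K * rho2**2))
    return base, base, v_HL, v_HKL

v_H, v_HK, v_HL, v_HKL = slope_vars(1.0, 1.0, 0.5, 100, 0.5, 0.5)
assert v_H == v_HK >= v_HL >= v_HKL
```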

References

Anderson, T.W. (1957). Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association 52.

Chen, Q., Ibrahim, J.G., Chen, M.-H., and Senchaudhuri, P. (2008). Theory and inference for regression models with missing responses and covariates. Journal of Multivariate Analysis 99.

Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). Journal of the Royal Statistical Society: Series B 49.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society: Series B 39.

Kim, J.K. (2004). Extension of factoring likelihood approach to non-monotone missing data. Journal of the Korean Statistical Society 33.

Lehmann, E.L. (1983). Theory of Point Estimation. Wiley, New York.

Little, R.J.A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association 77.

Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association 87.

Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley, New York.

Liu, C. and Rubin, D.B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81.

Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B 44.

Meng, X.-L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association 86.

Molenberghs, G. and Kenward, M. (2007). Missing Data in Clinical Studies. Wiley, New York.

Rubin, D.B. (1974). Characterizing the estimation of parameters in incomplete data problems. Journal of the American Statistical Association 69.

Rubin, D.B. (1976). Inference and missing data. Biometrika 63.

Seber, G.A.F. and Wild, C.J. (1989). Nonlinear Regression. Wiley, New York.


JSS Journal of Statistical Software August 26, Volume 16, Issue 8. http://www.jstatsoft.org/ Some Algorithms for the Conditional Mean Vector and Covariance Matrix John F. Monahan North Carolina State University

### C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)}

C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} 1. EES 800: Econometrics I Simple linear regression and correlation analysis. Specification and estimation of a regression model. Interpretation of regression

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Random Vectors and the Variance Covariance Matrix

Random Vectors and the Variance Covariance Matrix Definition 1. A random vector X is a vector (X 1, X 2,..., X p ) of jointly distributed random variables. As is customary in linear algebra, we will write

### The Characteristic Polynomial

Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### Multiple Testing. Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf. Abstract

Multiple Testing Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf Abstract Multiple testing refers to any instance that involves the simultaneous testing of more than one hypothesis. If decisions about

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### A General Approach to Variance Estimation under Imputation for Missing Survey Data

A General Approach to Variance Estimation under Imputation for Missing Survey Data J.N.K. Rao Carleton University Ottawa, Canada 1 2 1 Joint work with J.K. Kim at Iowa State University. 2 Workshop on Survey

### The Method of Least Squares

Hervé Abdi 1 1 Introduction The least square methods (LSM) is probably the most popular technique in statistics. This is due to several factors. First, most common estimators can be casted within this

### NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES Kivan Kaivanipour A thesis submitted for the degree of Master of Science in Engineering Physics Department

### Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

### Testing Linearity against Nonlinearity and Detecting Common Nonlinear Components for Industrial Production of Sweden and Finland

Testing Linearity against Nonlinearity and Detecting Common Nonlinear Components for Industrial Production of Sweden and Finland Feng Li Supervisor: Changli He Master thesis in statistics School of Economics

### Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

### How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power

How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power Linda K. Muthén Muthén & Muthén 11965 Venice Blvd., Suite 407 Los Angeles, CA 90066 Telephone: (310) 391-9971 Fax: (310) 391-8971

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Row Echelon Form and Reduced Row Echelon Form

These notes closely follow the presentation of the material given in David C Lay s textbook Linear Algebra and its Applications (3rd edition) These notes are intended primarily for in-class presentation

### Regression Modeling Strategies

Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

### Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes