Multivariate Logistic Regression Analysis of Complex Survey Data with Application to BRFSS Data
|
|
|
- Caroline Jennings
- 10 years ago
- Views:
Transcription
1 Journal of Data Science 10(2012), Multivariate Logistic Regression Analysis of Complex Survey Data with Application to BRFSS Data Minggen Lu and Wei Yang University of Nevada, Reno Abstract: Multiple binary outcomes that measure the presence or absence of medical conditions occur frequently in public health survey research. The multiple possibly correlated binary outcomes may compose of a syndrome or a group of related diseases. It is often of scientific interest to model the interrelationships not only between outcome and risk factors, but also between different outcomes. Applied and practical methods dealing with multiple outcomes from complex designed surveys are lacking. We propose a multivariate approach based on the generalized estimating equation (GEE) methodology to simultaneously conduct survey logistic regressions for each binary outcome in a single analysis. The approach has the following attractive features: 1) It enables modeling the complete information from multiple outcomes in a single analysis; 2) it permits to test the correlations between multiple binary outcomes; 3) it allows of discerning the outcome-specific effect and the overall risk factor effect; and 4) it provides the measurement of difference of the association between risk factors and multiple outcomes. The proposed method is applied to a study on risk factors for heart attack and stroke in 2009 U.S. nationwide Behavioral Risk Factor Surveillance System (BRFSS) data. Key words: Behavior Risk Factor Surveillance System (BRFSS), cardiovascular disease, generalized estimating equation (GEE), heart attack, odds ratio (OR), stroke. 1. Introduction Key public health data such as BRFSS are often obtained through complex design, which involves stratification, clustering, multistage sampling, and unequal probability of selection of participants and responding rates. In order to make valid inference for the interested population where samples originated, appropriate statistical methods are required to analyze the complex survey data. Corresponding author.
2 158 Minggen Lu and Wei Yang For binary outcomes that measure the presence or absence of certain medical conditions, e.g. with or without a cardiovascular disease (CVD), survey logistic regression model is the standard approach to analyze the relationship between the binary dependent variable and a set of explanatory variables by incorporating the sample design information, including stratification, clustering, and unequal weighting (Greenlund et al., 2004; Kim and Beckles, 2004). If multiple binary outcomes are assessed on the same individual, for example, several CVD related health conditions, including stroke and heart attack, they are subject to share the same characteristics to that individual and are likely to exhibit the correlations within the subject (Fitzmaurice et al., 1995; France et al., 2009; Sergeev and Carpenter, 2010). The standard survey logistic regression method is inapplicable for these situations. In this manuscript, we propose a marginal multivariate approach using GEE techniques (Liang and Zeger, 1986; Prentice, 1988; Lipsitz et al., 1991) to model the relationship of multiple responses with explanatory variables, and the association between pairs of responses in a single analysis by incorporating the sampling design adjustment for complex survey data. The first-order generalized estimating equations (Liang and Zeger, 1986) are implemented to describe the effects of explanatory variables on each binary response, while the odds ratio estimated by modified second-order estimating equations (Lipsitz et al., 1991) is applied to characterize the degree of association between binary responses. Horton and Fitzmaurice (2004) proposed independence estimating equations for purposes of estimation and made an adjustment to standard errors of estimators to account for the correlation among the outcomes. Our method explicitly models the dependence among the outcomes, while Horton and Fitzmaurice (2004) regarded it as a nuisance feature of data, assuming working independence among outcomes. Modeling the dependence has the potential to yield more efficient estimators. The proposed method has several advantages: First, it captures the complete information about multiple outcomes in a single regression model; second, it allows the assessment of correlation between outcomes by taking into account the covariance structure of responses; third, it provides the test of outcome-specific effects and overall risk factor effect; fourth, it supplies the measurement of difference of associations between risk factors and multiple outcomes; finally, the degree of dependence between the responses from the same individual is measured by the odds ratio rather than the correlation, since the odds ratio is easy to interpret and provides a natural way for modeling the within-subject dependence for binary responses. This paper is organized as follows. Section 2 describes the GEE method for complex survey sample with multiple correlated binary responses. Also, the comparison between univariate and multivariate estimations is discussed. In Section
3 Multivariate Logistic Regression for Complex Survey 159 3, the proposed method is applied to BFRSS data. Finally, Section 4 concludes with a discussion. 2. Methods 2.1 Complex Survey Data In many epidemiological studies the source data arise from complex survey sample. There are three main features that need to be accounted in the analysis: (i) stratification, (ii) clustering, and (iii) sampling weight. For example, the BFRSS used a complex survey design with stratification, multi-stage clustering, and unequal sampling weights. It is very common in survey study to divide the population into distinct subpopulations, referred to as strata. Within each stratum, a separate sample is selected from the sampling units independently. The variance of the estimate will decrease if the sampling units within each stratum are homogeneous. Failure to account for the stratification in the analysis will result in overestimation of the standard error, and hence too wide confidence interval. The second common feature in complex survey data is called clustering. In clustering, the total population is divided into some groups (or clusters) and a sample of the groups is selected. In multistage clustering, the clusters selected at first stage are called primary sampling units or PSUs. Further sample selection occurs within PSUs and so on. In general, failure to account for the clustering in the analysis may lead to underestimation of variabilities. Finally, unequal selection in each PSU occurs in many epidemiological surveys. The sampling weight is used as the measure of how many units in the population which the sampled PSU represents. The unequal selection probabilities must be taken into account in analysis to reduce the bias of the estimate and the underestimation of variabilities. 2.2 GEE Approach for Complex Survey Data In this section we consider extensions of the population-averaged marginal GEE approach for complex survey data. Assume the population is divided into H distinct strata. In each stratum h, the sample is consisted of n h clusters and each cluster is comprised of m hi units, h = 1,, H, i = 1,, n h. Let Y = (Y 1,, Y T ) T and X = (X 1,, X T ) T denote the T 1 vector of binary responses and the T p matrix of covariate for th unit, respectively, j = 1,, m hi. The sampling weight for the th unit is denoted by ω. Suppose that covariates X t are associated with each marginal observation Y t, t = 1 T. Let K be the sample size. We wish to estimate a logistic regression
4 160 Minggen Lu and Wei Yang model in the marginal means logit Pr(Y t = 1) = logitπ t = X T t β, for t = 1,, T, where β is a p-dimensional vector of regression coefficients. It is assumed that the individuals Y are independent, but the marginal observations Y 1,, Y T may be correlated, h = 1,, H, i = 1,, n h, and j = 1,, m hi. We take into account the correlation among the multiple responses by applying the first-order generalized estimating equations proposed by Liang and Zeger (1986) to estimate regression coefficients β and standard errors se(β). The efficiency is expected to increase by utilization of the covariance structure of responses. We refer to R(α) as a working correlation matrix of Y 1,, Y T, where α is a vector which fully characterizes R(α). Then the first estimating equations can be written as U 1 (α, β) = H n h m hi h=1 i=1 j=1 ω D T V 1 S = 0, (1) where D i = π / β = A X, V = A 1/2 R(α)A1/2, A = diag{π t (1 π t )} and S = Y π. These generalized estimating equations yield consistent estimators of regression parameters β under the correct specification of the form π. The covariance matrix of U 1 (α, β), which is the negative expected value of U 1 (β)/ β, can be written as I 1 = H n h m hi h=1 i=1 j=1 ω D T V 1 D. The correlation α is usually estimated by a T (T 1)/2 vector of empirical correlations with elements r st = (Y s π s )(Y t π t ) πs (1 π s )π t (1 π t ), and then use a second set of equations similar to (1). In this manuscript, we apply the odds ratio method proposed by Lipsitz et al. (1991) to measure the association between pairs of binary responses. Let Z = {Z st } be a T (T 1) random vector, where Z st = I[Y s = 1, Y t = 1] = Y s Y t, and I[ ] is an indicator function. The joint probability of success for Y s and Y t is π st = E(Z st ) = Pr[Y s = 1, Y t = 1].
5 Multivariate Logistic Regression for Complex Survey 161 Now the odds ratio, τ st, between the binary responses Y s and Y t can be written as a function π s, π t, and π st τ st = π st(1 π s π t + π st ) (π s π st )(π t π st ). Solving for π st in terms of odds ratios τ st and the two marginal probabilities π s and π s leads to f st [fst 2 4τ st(τ st 1)π s π t ] 1/2, if τ π st = st 1, 2(τ st 1) π s π t, if τ st = 1, where f st = 1 (1 τ st )(π s + π t ). Now we model the logarithm of odds ratio as the linear combination of α log τ st = e T st α st, 1 s < t T. Note that π st = E(Z st ) is a function of β through marginal means π s and π ft and α through odds ratios τ st. The first derivative of π = E(Z ) with respect to parameter α can be expressed as where π st α = π s + π t A 1/2 st η st 2(τ st 1) f st A 1/2 st 2(τ st 1) 2 η st = f st (π s + π t ) (4τ st 2)π s π t, A st = f st 4τ st (τ st 1)π s π t. Then the second set of estimating equations is U 2 (α, β) = H n h m hi h=1 i=1 j=1 τ st α, ω C T W 1 Q = 0, (2) where C = π st α, W = diag{π st (1 π st )}, Q = Z θ, and θ is the model for E(Z ). Similarly, the covariance matrix of U 2 (α, β) can be written as H n h m hi I 2 = ω C T W 1 C. h=1 i=1 j=1
6 162 Minggen Lu and Wei Yang Let (ˆα, ˆβ) be solutions of (1) and (2). By Taylor expansion, the valid estimators of variances of ˆα and ˆβ that take into accounts the stratification, clustering, and unequal selection probability are provided by where, ˆV 1 = ˆV 2 = H h=1 H h=1 m hi B hi = j=1 m hi A hi = j=1 Ĉov( ˆβ) = Î 1 1 ˆV 1 Î 1 Ĉov(ˆα) = Î 1 2 n h n h 1 n h n h 1 n h 1, (3) ˆV 2 Î2 1, (4) (B hi B hi )(B hi B hi ) T, i=1 n h (A hi Āhi)(A hi Āhi) T, i=1 ω X Â1/2 T ˆR 1 Â 1/2 X, ω Ĉ T Ŵ 1 Ĉ, n h Ā hi = 1/n h A hi, i=1 n h Bhi = 1/n h B hi. The estimates Î1, Î2, Â, Ĉ, Ŵ and ˆR are evaluated at (ˆα, ˆβ). Note that (3) and (4) are similar to those advocated by Liang and Zeger (1986) and Lipsitz et al. (1991), except that (3) and (4) account for unequal selection probability and use V 1 and V 2, the pooled within-stratum estimators of U 1 and U 2. These variance estimators are robust in the sense of being consistent even if the working covariances are misspecified. 2.3 Computational Implementation To compute the estimators (ˆα, ˆβ) of (1) and (2), a Fisher-scoring type iterative algorithm can be applied with starting values (α 0, β 0 ) for (α, β). Let the mth step estimate be (α (m), β (m) ). The (m + 1)th step estimates are β (m+1) = β (m) α (m+1) = α (m) h,i,j D (m)t h,i,j C (m)t V (m) 1 W (m) 1 D (m) C (m) 1 i=1 h,i,j D (m)t 1 h,i,j C (m)t V (m) 1 (Y π (m) ) W (m) 1 (U θ (m) ),,
7 Multivariate Logistic Regression for Complex Survey 163 where D (m), C(m), V (m), W (m), π(m) and θ (m) are evaluated at (α(m), β (m) ). The iteration converges at (m + 1)th step if β (m+1) β (m) β (m) < ε = 10 8 and α(m+1) α (m) α (m) < ε = Denote D = (D1 T,, DT K )T, C = (C1 T,, CT K )T, S = (S1 T,, ST K )T and Q = (Q T 1,, QT K )T. Let V and W be block diagonal matrices with V and W as the diagonal elements, respectively. Then the iterative procedure can be written as β (m+1) = (D T V 1 D) 1 D T V 1 (Dβ (m) S), α (m+1) = (C T W 1 C) 1 C T W 1 (Cα (m) Q). Define the modified outcomes Z 1 = Dβ S and Z 2 = Cα Q. Then the iterative procedure for calculating ˆβ and ˆα is equivalent to performing iteratively reweighted regression of Z 1 on D with weight V 1 and regression of Z 2 on C with weight W 1. The algorithm was implemented in R Comparison between Univariate and Multivariate Estimates For correlated binary responses arising from complex survey sample, the standard survey logistic regression is inapplicable. To address this issue, one approach (Greenlund et al., 2005; Hayes et al., 2006) is to pool multiple binary outcomes into a single categorical outcome and then regress the new summary response on predictors. The main limitation in this approach is that the relationship between specific outcome and predicators can be masked by combining the indicators of multiple processes. Another approach (Koziol-McLai et al., 2001; Cumyn et al., 2009; Bitton et al., 2010) is to fit separate logistic models for each outcome. Although the effects of covariates on each outcome can be discerned, this approach is inability to test the difference of risk factor effects on multiple outcomes and the association between multiple responses. To account for the possible paired correlations, there are three main approaches. The first approach (Horton and Fitzmaurice, 1994) is to assume that the observations are independent for the purposes of estimation, but make a suitable adjustment of standard errors for the possible correlation among the observations. The estimator of β is called independence estimating equation (IEE) estimator. Note that the IEE estimator is equivalent to the univariate estimator with a robust variance correction. Liang and Zeger (1986) showed that the IEE estimator is consistent for β and asymptotically normally distributed. The IEE estimator is preferable, since it is easily implemented in existing statistical packages. The second approach is to devise a generalized estimating equation
8 164 Minggen Lu and Wei Yang proposed by Liang and Zeger (1986) and Prentice (1988), which generalize the independent estimating equation to incorporate a working correlation, or some other association such as odds ratio. Under the assumption of correct specification of mean functions, the GEE estimator is consistent and asymptotically normal. The third approach is to completely specify the joint distribution of the observations, by introducing the extra parameters for the association among the observations; see, for example, Rosner (1984), Prentice (1986), Liang and Zeger (1989), and Lipsitz et al. (1990) among others. The asymptotic relative efficiency of the IEE estimator and the GEE estimator has been discussed extensively (Emrich and Piedmonte, 1992; Sharples and Breslow, 1992; Lee et al., 1993; Mancl and Leroux, 1996). McDonald (1993) considered bivariate data with possibly different covariates for each binary observation and recommended the IEE estimator for practical purpose whenever the association between the paired observations is a nuisance. Zhao et al. (1992) suggested that incorrectly assuming independence can lead to important losses of efficiency when the correlation between responses is high. Fitzmaurice (1995) demonstrated that the asymptotic relative efficiency depends on not only the strength of correlation between the paired responses, but also the covariate distribution, such as between-cluster and within-cluster covariate designs; that is, a covariate is constant or varies within the cluster. In this manuscript, we adopt the within-cluster covariate design, in which a within-cluster indicator variable is used to fit a survey logistic regression of multiple responses simultaneously. We assume the mean structure is correctly specified, but allow the correlation among the responses to be misspecified, including by incorrectly assuming independence. Following the approach of Fitzmaurice (1995), we can show that when the responses are strongly correlated and the covariate distribution is within cluster, the IEE estimator which assumes the independence can lead to considerable loss of efficiency, compared to the GEE estimator for complex survey sample. Therefore, for correlated binary responses arising from complex survey sample, we adopt the GEE estimator which is relatively straightforward, compared to the full likelihood estimator, and is more efficient than the IEE estimator for within-cluster covariate design. 3. Data Analysis: Application to BRFSS Data 3.1 Sample Data The BRFSS is a state-based, random-digit telephone survey of the U.S. noninstitutionalized, civilian population. Self-reported data from 353,280 people aged over 18 years who participated in the 2009 BRFSS were collected. The response rate, based on the Council of American Survey Research Organization (CASRO)
9 Multivariate Logistic Regression for Complex Survey 165 guidelines (White, 1984), was 52.48% ranging from 37.90% to 66.85% among states and territories for the 2009 BRFSS. Relative to other surveys, data from BRFSS have acceptable reliability and validity (Bowlin et al., 1993; Nelson et al., 2001). BRFSS data analysis involves the data weighting process which attempts to remove the bias in the sample probability of selection due to nonresponse and noncoverage errors, as well as to adjust variables of age, race, and gender between the sample and the entire population. Weight factors included are the number of residential telephones in a household, the number of adults in a household, and geographic or density stratification. Information on quality assurance and other aspects of this survey is available online at There are two dependent binary variables, heart attack and stroke which were collected as Have you ever told by a healthcare professional that you had a heart attack (myocardial infarction)? and Have you ever told by a healthcare professional that you had a stroke?. 3.2 Multivariate Survey Logistic Regression Analysis We apply the proposed multivariate survey logistic regression method to U.S BRFSS data. In addition to using covariates, such as age, sex, education level, marital status, overweight or obesity, and annual income as independent variables in the model, we also include an indicator variable in the model to allow us to fit a single logistic regression model to the data of multiple responses simultaneously. To simplify the notations, we omit subscriptions for stratum and cluster. Assume I is the indicator variable, where I = 0 stands for stroke and I = 1 stands for heart attack. Gender is defined as 1 for males and 0 for females. Age is defined as 1 for those aged 55 or older and 0 otherwise. Race is defined as 1 for non-hispanic white and 0 otherwise. Education is defined as 1 for those attended some college or technical school or graduated from college and 0 otherwise. Overweight or obesity is set to be 1 for body mass index (BMI) greater than or equal to 25 and 0 for BMI less than 25. Marital status is defined as 1 for those are married or a member of unmarried couple, 2 for those who are divorced or widowed or separated, and 3 for those who are never married. Annual income is defined as 1 for those whose annual household income is less than $35,000, 2 for those whose income is between $35,000 and $75,000, 3 for those whose income is above $75,000. We use two dummy variables for martial status and annual income. Let X be a set of covariates defined above and β = (β 0,, β 10 ) T and γ = (γ 1,, γ 10 ) T be the corresponding regression parameters for X and IX, respectively. Then the bivariate survey logistic regression model in the matrix form is π log 1 π = β 0 + X T β + Iγ 0 + IX T γ,
10 166 Minggen Lu and Wei Yang log OR = α, where π is the probability of a positive response and OR represents the odds ratio between binary outcomes stroke and heart attack, conditional on covariates. The regression parameters for stroke are β 0,, β 10, while β 0 + γ 0,, β 10 + γ 10 are corresponding coefficients for heart attack. The parameters for this interaction model are easy to interpret. For the purpose of illustration, we only interpret risk factors age and gender. The other covariates can be interpreted similarly. The response-specific log odds for stroke and heart attack are β 0 and β 0 + γ 0, respectively. The parameters β 1 and β 1 + γ 1 are the average log odds ratios for stroke and heart attack, respectively, for those who are aged fifty five or older, compared to younger individuals. Likewise, β 2 and β 2 + γ 2 are the average log odds ratios for stroke and heart attack, respectively, for males compared to females. The differences of log odds ratios between heart attack and stroke for age and gender are measured by γ 1 and γ 2, respectively. The positivity of γ 1 indicates that the effect of age on heart attack is greater than that on stroke. Likewise, the positivity of γ 2 demonstrates that the odds ratio of heart attack is greater than that of stroke for the risk factor gender. If the effects of risk factors do not vary by multiple responses, the interaction terms may be removed, and the overall effect to responses can be estimated. For example, the model log π 1 π = β 0 + X T β assumes that the effects of covariates X do not vary by the responses. The heart attack and stroke odds ratio, exp(γ 0 ), measures the extent to which individuals are more likely to contract heart attack than stroke. A value greater than 1 indicates that heart attack is more likely to occur than stroke, conditional on covariates. On the other hand, the log odds ratio α can be used to compare the strength of association between two binary responses. A value greater than 0 suggests positive association between stroke and heart attack, after adjusting for covariates. Since exp(α) is estimated with covariates, it can be considered as adjusted for risk factors. The adjusted odds ratio is usually smaller than the corresponding crude odds ratio. 3.3 Results The results of estimated regression coefficients and standard errors from univariate survey logistic regression model, pooled survey logistic regression model, and bivariate survey logistic regression model are presented in Tables 1 and 2. The estimated odds ratios are based on the estimated coefficients reported in Tables 1 and 2. Define a summary heart disease response variable HD = 1 if the subject was diagnosed with stroke or heart attack or both and 0 otherwise for
11 Multivariate Logistic Regression for Complex Survey 167 Table 1: Results of univariate survey logistic regressions for BFRSS heart disease study. Stroke and heart attack are binary outcomes in models 1 and 2, respectively. In pooled model, the binary outcome is stroke or heart disease or both. * represents p-value < 0.05 Model 1 Model 2 Pooled Model Parameter Estimate ± SE Estimate ± SE Estimate ± SE Intercept ± ± ± Age 1.622± ± ± Gender 0.117± ± ± Race ± ± ± Education ± ± ± Widowed/divorced/separated 0.386± ± ± Never married ± ± ± Overweight/obesity 0.073± ± ± Smoking 0.362± ± ± Middle income ± ± ± High income ± ± ± Table 2: Results of bivariate survey logistic regression for BFRSS heart disease study. * represents p-value < ** represents the estimates of interactions of covariates and I, where I is defined as 0 for stroke and 1 for heart attack Stroke Heart Attack Interaction Parameter Estimate ± SE Estimate ± SE Estimate ± SE Intercept ± ± ±0.089 Age 1.628± ± ±0.056 Gender 0.120± ± ±0.046 Race ± ± ±0.057 Education ± ± ±0.047 Widowed/divorced/separated 0.384± ± ±0.048 Never married ± ± ±0.114 Overweight/obesity 0.070± ± ±0.048 Smoking 0.359± ± ±0.058 Middle income ± ± ±0.055 High income ± ± ±0.075 pooled survey logistic regression. The univariate survey logistic regression model and the pooled survey logistic regression model were fitted using SAS SURVEY- LOGISTC Procedure in SAS 9.2 software, while the bivariate survey logistic regression model (GEE model) was fitted in R using a Fisher-scoring type algorithm to estimate the regression coefficients and the corresponding odds ratios. The pooled survey logistic regression yielded that age (OR = 5.86, p <.0001), male (OR = 1.74, p <.0001), non-hispanic white (OR = 1.13, p = ),
12 168 Minggen Lu and Wei Yang widowed or divorced or separate (OR = 1.36, p <.0001), overweight or obesity (OR = 1.24, p <.0001), and smoking (OR = 1.44, p <.0001) are significant risk factors for heart disease, while high education level (OR = 0.87, p <.0001), never married (OR = 0.56, p <.0001), middle annual income (OR = 0.56, p <.0001), and high annual income (OR = 0.33, p <.0001) appeared to be significant beneficial factors for heart disease. The univariate survey logistic regression models generated similar results to those of the pooled model, except that neither gender (OR = 0.98, p = 0.753) nor overweight or obesity (OR = 1.08, p = 0.065) has significant effect on stroke. The bivariate model demonstrated similar estimates of the relationships between the covariates and binary outcomes (Table 2). Moreover, the bivariate model showed that the effects of age (OR = 5.94 vs. OR = 5.09), gender (OR = 2.23 vs. OR = 0.99), race (OR = 1.18 vs. OR = 1.13), and overweight or obesity (OR = 1.33 vs. OR = 1.07) on heart attack are significantly stronger than those on stroke, with p-values 0.006, <.0001, 0.002, and <.0001, respectively, while the models demonstrated that effects of widowed or divorced or separate (OR = 1.27 vs. OR = 1.47) and never married (OR = 0.49 vs. OR = 0.68) have significantly smaller effects on heart attack than those on stroke, with p-values and 0.004, respectively. The bivariate model did not reject the possibility that there are no differences of odds ratios between heart attack and stroke in education level (OR = 0.86 vs. OR = 0.91), smoking (OR = 1.40 vs. OR = 1.43), middle annual income (OR = 0.57 vs. OR = 0.52), and high annual income (OR = 0.33 vs. OR = 0.32), with p-values 0.086, 0.721, 0.062, and 0.414, respectively. The GEE model estimated the odds ratio for within cluster (within-subject) dependence, i.e., exp(α), to be 7.53, p <.0001, after adjusting for covariates. 4. Discussion A health survey is often conducted to obtain information about the prevalence of diseases and unhealthy behaviors, exposures to potential risk factors, and cost and utilization of health-care services of a population. Binary outcomes that measure the presence or absence of certain medical conditions are common in survey research. Many survey studies measure a vector of health conditions to make an overall assessment about an individual health status. It is often an important task for health researchers to exam the relationship between multiple health conditions and predicators, including behaviors and socioeconomic measures in complex surveys. Since multiple binary outcomes are obtained from the same individual, they are likely to be correlated within the subject. Ignoring the interrelations among the multiple outcomes essentially will lead to inefficient estimates in statistical analysis. It is neither scientifically appropriate nor statistically efficient to fit separate logistic models for each binary outcome. Therefore,
13 Multivariate Logistic Regression for Complex Survey 169 the standard univariate survey logistic regression method is inapplicable for complex survey data with multiple binary outcomes. Most of the statistical approaches that simultaneously model the correlated binary response variables can be grouped into two groups: population-averaged marginal modeling using generalized estimating equations and cluster-specific hierarchical modeling using generalized linear mixed models (Stiratelli et al., 1984; Breslow and Calyton, 1993; Fahrmeir and Tutz, 1994; Das et al., 2004; Molenberghs and Verbeke, 2005). There are several advantages to utilize the population averaged marginal modeling approach. From a public health perspective, the GEE model provides population averaged estimates of the relationships between risk factors and clinic outcomes. Furthermore, the GEE approach is robust to model misspecification in terms of the covariance structure among the multiple responses. Moreover, compared with the cluster-specific modeling approach involving intensive multidimensional integral computation, the GEE estimation algorithm is relatively computationally efficient for large data sets (in our data, n = 353, 280) and widely implemented in standard statistical software, such as SAS, SPSS, Splus/R, STATA, and SUDAAN. Different from previous approaches when dealing with correlated binary responses arising from complex survey sample, e.g. combining multiple binary outcomes into a single categorical or using separated logistic models, our approach utilized GEE techniques to model the relationships between multiple clinical outcomes and risk factors, and the degree of dependence between the outcomes simultaneously. The results from BRFSS sample data can not only provide detailed odds ratios for the predicting variables to heart attack and stroke, either from univariate or bivariate survey logistic models, but also estimate the odds ratio for within cluster (within-subject) dependence (e.g. exp(α), to be 7.53, p <.0001). Our multivariate approach has added following advantages to analyze complex survey data. First, we may gain more precision by utilizing all information about multiple outcomes in a single analysis, instead of modeling separate logistic regression for each binary outcome; second, simultaneous modeling of all the outcomes allows not only to provide the test of outcome-specific effects and the overall risk factor effect, but also to characterize the association between the pairs of outcomes by taking into account the covariance structure of responses; finally, it permits the assessment of differences of associations between risk factors and multiple outcomes. This paper has illustrated the use of multivariate logistic interaction model as a flexible method for analysis of complex survey data involving multiple binary outcomes. The parameters in this interaction model have simple interpretations. The effects of covariates on specific outcome can be expressed as odds ratios, which are preferable for binary outcomes. The multivariate logistic regression interaction model has the advantage of allowing of testing whether the
14 170 Minggen Lu and Wei Yang effects of risk factors vary by responses. When the outcome-risk factor interaction term is present, separate outcome specific risk factor estimates can be computed. The difference of log odds ratio between two outcomes for specific risk factor can also be estimated and may have important interpretations. For example, the stronger association between sex and heart attack was observed than that between sex and stroke in BFRSS study. The multivariate regression approach proposed in this paper can be generalized to more general settings where multiple outcomes are measured from the same subject and the distributions of outcomes are exponential family. In fact, it is applicable to continuous, categorical, and count response variables by changing the link function of expected value of the outcome. For continuous and count variables, the link function would be identity function and logarithm function, respectively. It would also be interesting to adopt cluster specific hierarchical modeling using generalized linear mixed models (GLMM) to multistage stratified cluster sampling, especially in the situation that the estimation of effects of within-cluster covariates is of interest, and study the differences in conceptualization and interpretation between the population-averaged model and the cluster specific model for complex survey data. Acknowledgements This study was supported by University of Nevada, Reno, Junior Faculty Research Grant for first author. References Bitton, A., Zaslavsky, A. M. and Ayanian, J. Z. (2010). Health risks, chronic diseases, and access to care among US pacific islanders. Journal of General Internal Medicine 25, Bowlin, S. J., Morrill, B. D., Nafziger, A. N., Jenkins, P. L., Lewis, C. and Pearson, T. A. (1993). Validity of cardiovascular disease risk factors assessed by telephone survey: the behavioral risk factor survey. Journal of Clinical Epidemiology 46, Breslow, N. E. and Calyton, D. G. (1993). Approximate inference in generalized linear mixed model. Journal of the American Statistical Association 88, Cumyn, L., French, L. and Hechtman, L. (2009). Comorbidity in adults with attention-deficit hyperactivity disorder. Canadian Journal of Psychiatry 54,
15 Multivariate Logistic Regression for Complex Survey 171 Das, A., Poole, W. K. and Bada, H. S. (2004). A repeated measure approach for simultaneous modeling of multiple neurobehavioral outcomes in newborn exposed to cocaine in utero. American Journal of Epidemiology 159, Emrich, L. J. and Piedmonte, M. R. (1992). On some small sample properties of generalized estimating equations for multivariate dichotomous outcomes. Journal of Statistical Computation and Simulation 41, Fahrmeir, L and Tutz, G. (2001). Multivariate Statistical Modeling Based on Generalized Linear Models, 2nd edition. Springer, New York. Fitzmaurice, G. M. (1995). A caveat concerning independence estimating equations with multivariate binary data. Biometrics 51, Fitzmaurice, G. M., Laird, N. M., Zahner, G. E. P. and Daskalakis, C. (1995). Bivariate logistic regression analysis of childhood psychopathology ratings using multiple informants. American Journal of Epidemiology 142, France, E. and Velanovich, V. (2009). The relative influence of surgical disease and co-morbidities on patient responses to a generic health-related qualityof-life instrument. American Surgeon 75, Greenlund, K. J., Zheng, Z. J., Keenan, N. L., Giles, W. H., Casper, M. L., Mensah, G. A. and Croft, J. B. (2004). Trends in self-reported multiple cardiovascular disease risk factors among adults in the United States, Archives of Internal Medicine 164, Greenlund, K. J., Denny, C. H., Mokdad, A. H., Watkins, N., Croft, J. B. and Mensah, G. A. (2005). Using behavioral risk factor surveillance data for heart disease and stroke prevention programs. American Journal of Preventive Medicine 29, Hayes, D. K., Denny, C. H., Keenan, N. L., Croft, J. B., Sundaram, A. A. and Greenlund, K. J. (2006). Racial/ethnic and socioeconomic differences in multiple risk factors for heart disease and stroke in women: behavioral risk factor surveillance system. Journal of Women s Health 15, Horton, J. P. and Fitzmaurice, G. M. (2004). Regression analysis of multiple source and multiple informant data from complex survey samples. Statistics in Medicine 23,
16 172 Minggen Lu and Wei Yang Kim, C. and Beckles, G. L. (2004). Cardiovascular disease risk reduction in the behavioral risk factor surveillance system. American Journal of Preventive Medicine 27, 1-7. Koziol-McLai, J., Coates, C. J. and Lowenstein, S. R. (2001). Predictive validity of a screen for partner violence against women. American Journal of Preventive Medicine 21, Lee, A. J., Scott, A. J. and Soo, S. C. (1993). Comparing Liang-Zeger estimates with maximum likelihood in bivariate logistic regression. Journal of Statistical Computation and Simulation 44, Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, Liang, K. Y. and Zeger, S. L. (1989). for multivariate binary time series. Association 84, A class of logistic regression models Journal of the American Statistical Lipsitz, S. R., Laird, N. M. and Harrington, D. P. (1990). Maximum likelihood regression methods for paired binary data. Statistics in Medicine 9, Lipsitz, S. R., Laird, N. M. and Harrington, D. P. (1991). Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78, Mancl, L. A. and Leroux, B. G. (1996). Efficiency of regression estimates for clustered data. Biometrics 52, McDonald, B. W. (1993). Estimating logistic regression parameters for bivariate binary data. Journal of the Royal Statistical Society, Series B 55, Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer, Verknpfüngen. Nelson, D. E., Holtzman, D., Bolen, J., Stanwyck, C. A. and Mack, K. A. (2001). Reliability and validity of measures from the Behavioral Risk Factor Surveillance System (BRFSS). International Journal of Public Health 46, S3-S42. Prentice, R. L. (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of American Statistical Association 81,
17 Multivariate Logistic Regression for Complex Survey 173 Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44, Rosner, B. (1984). Multivariate methods in opthalmology with application to other paired-data situations. Biometrics 40, Sergeev, A. V. and Carpenter, D. O. (2010). Residential proximity to environmental sources of persistent organic pollutants and first-time hospitalizations for myocardial infarction with comorbid diabetes mellitus: a 12-year population-based study. International Journal of Occupational and Environmental Health 23, Sharples, K. and Breslow, N. E. (1992). Regression analysis of correlated binary data: some small sample results for estimating equations. Journal of the Statistical Computation and Simulation 42, Stiratelli, R., Laird, N. M. and Ware, J. H. (1984). Random effects models for serial observations with binary response. Biometrics 40, White, A. A. (1984). Response rate calculation in RDD telephone health surveys: current practices. In Proceedings of the Section on Survey Research Methods, American Statistical Association, Washington, District of Columbia. Zhao, L. P., Prentice, R. L. and Self, S. G. (1992). Multivariate mean parameter estimation by using a partly exponential model. Journal of the Royal Statistical Society, Series B 54, Received July 27, 2011; accepted October 4, Minggen Lu School of Community Health Sciences University of Nevada, Reno Reno, Nevada, USA [email protected] Wei Yang School of Community Health Sciences University of Nevada, Reno Reno, Nevada, USA [email protected]
Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses
Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC 27709 Abstract
Logistic (RLOGIST) Example #3
Logistic (RLOGIST) Example #3 SUDAAN Statements and Results Illustrated PREDMARG (predicted marginal proportion) CONDMARG (conditional marginal proportion) PRED_EFF pairwise comparison COND_EFF pairwise
Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models
Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear
STATISTICAL MODELING OF LONGITUDINAL SURVEY DATA WITH BINARY OUTCOMES
STATISTICAL MODELING OF LONGITUDINAL SURVEY DATA WITH BINARY OUTCOMES A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the Degree of Doctor
Chapter 19 Statistical analysis of survey data. Abstract
Chapter 9 Statistical analysis of survey data James R. Chromy Research Triangle Institute Research Triangle Park, North Carolina, USA Savitri Abeyasekera The University of Reading Reading, UK Abstract
Chapter 11 Introduction to Survey Sampling and Analysis Procedures
Chapter 11 Introduction to Survey Sampling and Analysis Procedures Chapter Table of Contents OVERVIEW...149 SurveySampling...150 SurveyDataAnalysis...151 DESIGN INFORMATION FOR SURVEY PROCEDURES...152
EXPANDING THE EVIDENCE BASE IN OUTCOMES RESEARCH: USING LINKED ELECTRONIC MEDICAL RECORDS (EMR) AND CLAIMS DATA
EXPANDING THE EVIDENCE BASE IN OUTCOMES RESEARCH: USING LINKED ELECTRONIC MEDICAL RECORDS (EMR) AND CLAIMS DATA A CASE STUDY EXAMINING RISK FACTORS AND COSTS OF UNCONTROLLED HYPERTENSION ISPOR 2013 WORKSHOP
Statistics 305: Introduction to Biostatistical Methods for Health Sciences
Statistics 305: Introduction to Biostatistical Methods for Health Sciences Modelling the Log Odds Logistic Regression (Chap 20) Instructor: Liangliang Wang Statistics and Actuarial Science, Simon Fraser
Comparison of Variance Estimates in a National Health Survey
Comparison of Variance Estimates in a National Health Survey Karen E. Davis 1 and Van L. Parsons 2 1 Agency for Healthcare Research and Quality, 540 Gaither Road, Rockville, MD 20850 2 National Center
The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data ABSTRACT INTRODUCTION SURVEY DESIGN 101 WHY STRATIFY?
The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health, ABSTRACT
Statistical Rules of Thumb
Statistical Rules of Thumb Second Edition Gerald van Belle University of Washington Department of Biostatistics and Department of Environmental and Occupational Health Sciences Seattle, WA WILEY AJOHN
A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models
A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models Grace Y. Yi 13, JNK Rao 2 and Haocheng Li 1 1. University of Waterloo, Waterloo, Canada
Logistic (RLOGIST) Example #7
Logistic (RLOGIST) Example #7 SUDAAN Statements and Results Illustrated EFFECTS UNITS option EXP option SUBPOPX REFLEVEL Input Data Set(s): SAMADULTED.SAS7bdat Example Using 2006 NHIS data, determine for
A Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
Department of Epidemiology and Public Health Miller School of Medicine University of Miami
Department of Epidemiology and Public Health Miller School of Medicine University of Miami BST 630 (3 Credit Hours) Longitudinal and Multilevel Data Wednesday-Friday 9:00 10:15PM Course Location: CRB 995
It is important to bear in mind that one of the first three subscripts is redundant since k = i -j +3.
IDENTIFICATION AND ESTIMATION OF AGE, PERIOD AND COHORT EFFECTS IN THE ANALYSIS OF DISCRETE ARCHIVAL DATA Stephen E. Fienberg, University of Minnesota William M. Mason, University of Michigan 1. INTRODUCTION
Multivariate Logistic Regression
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Handling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
Mortality Assessment Technology: A New Tool for Life Insurance Underwriting
Mortality Assessment Technology: A New Tool for Life Insurance Underwriting Guizhou Hu, MD, PhD BioSignia, Inc, Durham, North Carolina Abstract The ability to more accurately predict chronic disease morbidity
Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)
Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared
Missing data and net survival analysis Bernard Rachet
Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.
Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study Prepared by: Centers for Disease Control and Prevention National
Multinomial Logistic Regression
Multinomial Logistic Regression Dr. Jon Starkweather and Dr. Amanda Kay Moske Multinomial logistic regression is used to predict categorical placement in or the probability of category membership on a
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
NHS Diabetes Prevention Programme (NHS DPP) Non-diabetic hyperglycaemia. Produced by: National Cardiovascular Intelligence Network (NCVIN)
NHS Diabetes Prevention Programme (NHS DPP) Non-diabetic hyperglycaemia Produced by: National Cardiovascular Intelligence Network (NCVIN) Date: August 2015 About Public Health England Public Health England
Categorical Data Analysis
Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods
Dealing with Missing Data
Dealing with Missing Data Roch Giorgi email: [email protected] UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January
Logistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
Logistic (RLOGIST) Example #1
Logistic (RLOGIST) Example #1 SUDAAN Statements and Results Illustrated EFFECTS RFORMAT, RLABEL REFLEVEL EXP option on MODEL statement Hosmer-Lemeshow Test Input Data Set(s): BRFWGT.SAS7bdat Example Using
Survey Data Analysis in Stata
Survey Data Analysis in Stata Jeff Pitblado Associate Director, Statistical Software StataCorp LP Stata Conference DC 2009 J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44 Outline 1 Types of
Multilevel Modeling of Complex Survey Data
Multilevel Modeling of Complex Survey Data Sophia Rabe-Hesketh, University of California, Berkeley and Institute of Education, University of London Joint work with Anders Skrondal, London School of Economics
An Analysis of the Health Insurance Coverage of Young Adults
Gius, International Journal of Applied Economics, 7(1), March 2010, 1-17 1 An Analysis of the Health Insurance Coverage of Young Adults Mark P. Gius Quinnipiac University Abstract The purpose of the present
FACILITATOR/MENTOR GUIDE
FACILITATOR/MENTOR GUIDE Descriptive analysis variables table shells hypotheses Measures of association methods design justify analytic assess calculate analysis problem stratify confounding statistical
Comparison of Imputation Methods in the Survey of Income and Program Participation
Comparison of Imputation Methods in the Survey of Income and Program Participation Sarah McMillan U.S. Census Bureau, 4600 Silver Hill Rd, Washington, DC 20233 Any views expressed are those of the author
Poisson Models for Count Data
Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the
Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics
Models for Longitudinal and Clustered Data
Models for Longitudinal and Clustered Data Germán Rodríguez December 9, 2008, revised December 6, 2012 1 Introduction The most important assumption we have made in this course is that the observations
Connecticut Diabetes Statistics
Connecticut Diabetes Statistics What is Diabetes? State Public Health Actions (1305, SHAPE) Grant March 2015 Page 1 of 16 Diabetes is a disease in which blood glucose levels are above normal. Blood glucose
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
New SAS Procedures for Analysis of Sample Survey Data
New SAS Procedures for Analysis of Sample Survey Data Anthony An and Donna Watts, SAS Institute Inc, Cary, NC Abstract Researchers use sample surveys to obtain information on a wide variety of issues Many
Linear Mixed-Effects Modeling in SPSS: An Introduction to the MIXED Procedure
Technical report Linear Mixed-Effects Modeling in SPSS: An Introduction to the MIXED Procedure Table of contents Introduction................................................................ 1 Data preparation
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
A Population Based Risk Algorithm for the Development of Type 2 Diabetes: in the United States
A Population Based Risk Algorithm for the Development of Type 2 Diabetes: Validation of the Diabetes Population Risk Tool (DPoRT) in the United States Christopher Tait PhD Student Canadian Society for
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected]
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected] IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way
LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
Sampling Error Estimation in Design-Based Analysis of the PSID Data
Technical Series Paper #11-05 Sampling Error Estimation in Design-Based Analysis of the PSID Data Steven G. Heeringa, Patricia A. Berglund, Azam Khan Survey Research Center, Institute for Social Research
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Life Table Analysis using Weighted Survey Data
Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional
Chapter 29 The GENMOD Procedure. Chapter Table of Contents
Chapter 29 The GENMOD Procedure Chapter Table of Contents OVERVIEW...1365 WhatisaGeneralizedLinearModel?...1366 ExamplesofGeneralizedLinearModels...1367 TheGENMODProcedure...1368 GETTING STARTED...1370
INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES
INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-110012 1. INTRODUCTION A sample survey is a process for collecting
Introduction to mixed model and missing data issues in longitudinal studies
Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models
An extension of the factoring likelihood approach for non-monotone missing data
An extension of the factoring likelihood approach for non-monotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions
Does Financial Sophistication Matter in Retirement Preparedness of U.S Households? Evidence from the 2010 Survey of Consumer Finances
Does Financial Sophistication Matter in Retirement Preparedness of U.S Households? Evidence from the 2010 Survey of Consumer Finances Kyoung Tae Kim, The Ohio State University 1 Sherman D. Hanna, The Ohio
An Application of the G-formula to Asbestos and Lung Cancer. Stephen R. Cole. Epidemiology, UNC Chapel Hill. Slides: www.unc.
An Application of the G-formula to Asbestos and Lung Cancer Stephen R. Cole Epidemiology, UNC Chapel Hill Slides: www.unc.edu/~colesr/ 1 Acknowledgements Collaboration with David B. Richardson, Haitao
HLM software has been one of the leading statistical packages for hierarchical
Introductory Guide to HLM With HLM 7 Software 3 G. David Garson HLM software has been one of the leading statistical packages for hierarchical linear modeling due to the pioneering work of Stephen Raudenbush
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
The Probit Link Function in Generalized Linear Models for Data Mining Applications
Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications
Standard errors of marginal effects in the heteroskedastic probit model
Standard errors of marginal effects in the heteroskedastic probit model Thomas Cornelißen Discussion Paper No. 320 August 2005 ISSN: 0949 9962 Abstract In non-linear regression models, such as the heteroskedastic
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
Introduction to Longitudinal Data Analysis
Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction
Workshop on Using the National Survey of Children s s Health Dataset: Practical Applications
Workshop on Using the National Survey of Children s s Health Dataset: Practical Applications Julian Luke Stephen Blumberg Centers for Disease Control and Prevention National Center for Health Statistics
Technical report. in SPSS AN INTRODUCTION TO THE MIXED PROCEDURE
Linear mixedeffects modeling in SPSS AN INTRODUCTION TO THE MIXED PROCEDURE Table of contents Introduction................................................................3 Data preparation for MIXED...................................................3
Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88)
Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88) Introduction The National Educational Longitudinal Survey (NELS:88) followed students from 8 th grade in 1988 to 10 th grade in
Curriculum - Doctor of Philosophy
Curriculum - Doctor of Philosophy CORE COURSES Pharm 545-546.Pharmacoeconomics, Healthcare Systems Review. (3, 3) Exploration of the cultural foundations of pharmacy. Development of the present state of
Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD
Tips for surviving the analysis of survival data Philip Twumasi-Ankrah, PhD Big picture In medical research and many other areas of research, we often confront continuous, ordinal or dichotomous outcomes
A Bayesian hierarchical surrogate outcome model for multiple sclerosis
A Bayesian hierarchical surrogate outcome model for multiple sclerosis 3 rd Annual ASA New Jersey Chapter / Bayer Statistics Workshop David Ohlssen (Novartis), Luca Pozzi and Heinz Schmidli (Novartis)
Assignments Analysis of Longitudinal data: a multilevel approach
Assignments Analysis of Longitudinal data: a multilevel approach Frans E.S. Tan Department of Methodology and Statistics University of Maastricht The Netherlands Maastricht, Jan 2007 Correspondence: Frans
15 Ordinal longitudinal data analysis
15 Ordinal longitudinal data analysis Jeroen K. Vermunt and Jacques A. Hagenaars Tilburg University Introduction Growth data and longitudinal data in general are often of an ordinal nature. For example,
GLM, insurance pricing & big data: paying attention to convergence issues.
GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - [email protected] Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.
I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of
Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America.
Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America Abstract Complex sample survey designs deviate from simple random sampling,
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
Approaches for Analyzing Survey Data: a Discussion
Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
ABSTRACT INTRODUCTION STUDY DESCRIPTION
ABSTRACT Paper 1675-2014 Validating Self-Reported Survey Measures Using SAS Sarah A. Lyons MS, Kimberly A. Kaphingst ScD, Melody S. Goodman PhD Washington University School of Medicine Researchers often
Number 384 + June 28, 2007. Abstract. Methods. Introduction
Number 384 + June 28, 2007 Drug Use and Sexual Behaviors Reported by Adults: United States, 1999 2002 Cheryl D. Fryar, M.S.P.H.; Rosemarie Hirsch, M.D., M.P.H.; Kathryn S. Porter, M.D., M.S.; Benny Kottiri,
How to set the main menu of STATA to default factory settings standards
University of Pretoria Data analysis for evaluation studies Examples in STATA version 11 List of data sets b1.dta (To be created by students in class) fp1.xls (To be provided to students) fp1.txt (To be
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
