Stata: Bivariate Statistics
Topics: Chi-square test, t-test, Pearson's R correlation coefficient


There are three situations during survey data analysis in which bivariate statistics are commonly used.

1. Compare two groups

First, bivariate statistics are used to compare two study groups to see if they are similar: for example, to compare two groups at baseline before an intervention is implemented, or to compare participants who are lost to follow-up with those who remained in the study. When comparing groups, we want to provide strong evidence of any group differences, so we use a conservative threshold of p<0.05 to determine statistical significance.

In this course, we are learning to analyze research questions with binary outcomes. Bivariate statistics can be used to summarize and compare characteristics across groups. For example, were there differences in the socio-demographic characteristics of women who did and did not experience intimate partner violence in the last 12 months?

2. Identify covariates for general explanatory model

When a characteristic like age is different in people who did and did not experience the outcome, we say that the characteristic is associated with the outcome. This is because the characteristic helps to explain variance in the outcome.

In cross-sectional data analysis, we cannot draw causal conclusions. We are not talking about causal mechanisms that predict the outcome. Although a woman's age group might be associated with whether or not she experienced intimate partner violence in the last 12 months, the biological process of aging does not cause her partner to act violently toward her. Rather, we are saying that a characteristic (like older age) tends to be present when the outcome is present.

When we are developing a general explanatory model, that is, when the research question is "Which factors are associated with [the outcome]?", we use bivariate statistics to identify potential covariates that are worth testing in a multivariable model. If a variable is independently associated with the outcome, it might continue to explain the outcome once other factors are taken into account. In this case, when bivariate statistics are used to filter potential covariates for multivariate analysis, we use a generous threshold of p<0.1 to determine statistical significance, to ensure that we do not drop any potentially useful variables from the analysis.

Note that the statistical test used to compare two groups (usually the chi-square test in analyses leading to logistic regression) is the same test, with the same output, that we use here to filter variables. The only difference is the purpose of the test, and therefore our interpretation of its results.

3. Chi-square test

The chi-square test is a common bivariate statistic used to test whether the distribution of a categorical variable is statistically different in two or more groups. The chi-square test gives a yes/no answer: a p-value less than the threshold means yes, there are differences between the groups. In a manuscript, if you see a p-value next to a categorical variable (with data summarized as percentages), it usually comes from a chi-square test.

Source: Manzi, A., et al. (2014) BMC Pregnancy and Childbirth

The chi-square p-value is easy to interpret once you have set a threshold for statistical significance: either the distributions are the same, or they are not. The chi-square test is a global statistic; it tells you whether there are any differences across cells, but it does not tell you which cell(s) are different. You can often tell which cells are different qualitatively based on the percentages, though additional or different testing might be performed to isolate whether certain cells are statistically different from the rest.

You should not use the chi-square test if one or more cells in the cross-tabulation has fewer than five observations, though this is rare in survey data analysis when tens of thousands of respondents are interviewed. If a response category has fewer than five observations, we should combine it with another category.

The chi-square test is simple to implement in Stata. In fact, we have been doing it all along! Each time we use the tabulate command with survey data (by starting with svy:), we produce a Pearson's chi-square F-statistic and p-value.
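As a minimal sketch of the tabulate step described above, the following uses Stata's nhanes2 example dataset as a stand-in, because the course data are not shown here. The dataset and the variables agegrp, diabetes, psu, strata, and finalwgt are assumptions for illustration, not the course's variables.

```stata
* Illustrative only: nhanes2 stands in for the course survey data
webuse nhanes2, clear

* Declare the sample design: PSU, probability weight, and strata
svyset psu [pweight = finalwgt], strata(strata)

* Two-way table with row percentages; the pearson option requests the
* design-based Pearson chi-square F-statistic and its p-value
svy: tabulate agegrp diabetes, row pearson
```

The p-value reported beneath the table is the one compared against the chosen threshold (p<0.05 or p<0.1, depending on the purpose of the test).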

4. T-test

A t-test is used to test whether the mean of a continuous variable is statistically different across groups; a p-value less than the threshold means yes, there are differences. Do NOT use a t-test when the distribution of the outcome within groups is not normal, or when the variance is not the same across groups. In these situations, consider transforming the variable (we do not discuss this further in this course), or categorize the continuous values and test the result as a categorical variable.

With survey data, you can produce t-test statistics for a continuous variable across two or more groups by specifying a linear regression and testing for differences in the outcome across group categories.
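The regression approach described above might look like the following sketch. It again assumes Stata's nhanes2 example dataset as a stand-in; bpsystol, sex, and race are that dataset's variables, not the course's.

```stata
* Illustrative only: nhanes2 stands in for the course survey data
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Two groups: the t statistic on the group coefficient tests whether
* mean systolic blood pressure differs between the two sex categories
svy: regress bpsystol i.sex

* More than two groups: fit the model, then jointly test all group
* coefficients with an adjusted Wald test
svy: regress bpsystol i.race
testparm i.race
```

With two groups, the p-value for the group coefficient plays the role of the t-test p-value; with three or more groups, the testparm p-value answers the global question of whether any group means differ.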

5. Test for collinearity between two covariates

Before fitting any kind of multivariate model, whether a general explanatory model or a hypothesis test model, you should test for collinearity. Collinearity occurs when two covariates in a multivariable model are highly related; usually this is because the two variables represent the same thing (the same concept, or events that happen simultaneously). For example, in a society where husbands and wives tend to have the same level of education, wife's education and husband's education represent the same construct within households. Wife's education might do a good job of explaining variance in the outcome, leaving little variance to be explained by husband's education. As a result, the model becomes unstable.

To produce parsimonious (efficient) multivariable models, and to prevent strange, unstable results, we test for strong associations among covariates and remove any collinear covariates from the analysis. The Pearson's R correlation coefficient is used to identify binary, ordinal, and continuous covariates that are correlated. Correlations of |r| > 0.5 are often considered collinear in the social sciences. When two or more covariates are found to be collinear, we keep the one variable that is most strongly associated with the outcome, unless there is a conceptual reason to keep one over the other.

For nominal variables (variables with non-ordered categories), say marriage type, you cannot use the Pearson's R correlation coefficient. If you want to be rigorous, you might test one or more binary definitions of the variable, for example, married (yes/no) or separated (yes/no), rather than a four-category definition of marital status. In practice, you might only do this step if you were concerned about collinearity for conceptual reasons.

6. Pearson's R correlation coefficient

The reason we only use the Pearson's R correlation coefficient for binary, ordinal, and continuous data is that it is a measure of the strength of linear association between two variables. The Pearson's R correlation answers the question: How strongly are two variables associated, on a scale of zero to an absolute value of one?

The Pearson's R correlation statistic is related to linear regression; it tries to draw a line of best fit through the data of two variables. The strength of association is measured on a scale of -1 to +1, where 0 indicates no linear association (as the value of one variable increases, the other varies at random). As r approaches +1, it denotes a positive association (as the value of one variable increases, the other increases). As r approaches -1, it denotes a negative association (as the value of one variable increases, the other decreases).
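For reference, the standard unweighted sample definition of the coefficient can be written as follows, where x_i and y_i are the paired observations and the barred symbols are their sample means. This notation is introduced here for illustration; a survey-adjusted estimate additionally applies the sampling weights.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to sit above or below their means together, and negative when they move in opposite directions; the denominator rescales the result so that it always lies between -1 and +1.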

The command used to perform correlation analysis with survey data does not come installed with Stata, so we have to use the findit command to find and install it on our computer. We only need to install the .ado command files once, after which the command is integrated into Stata.

The command is corr_svy. Since the command is not part of the normal Stata package, we have to manually specify all aspects of the sample design, including pweight(), psu(), and strata(). We can also include a subpop() statement, if applicable.

If we include two variables in the corr_svy statement, Stata will produce the Pearson's R correlation statistic for that one pair. If we list multiple variables, Stata will produce the Pearson's R correlation statistics for all pair combinations.

Note that the output shows a number of correlations equal to 1. We can ignore these. The correlation equals 1 when the variable listed on the x axis is the same as the one on the y axis; a variable is perfectly correlated with itself.
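A hedged sketch of the installation and call is shown below. The exact option syntax of the user-written corr_svy command is an assumption based on the description above (in particular, the weight is written here in Stata's usual bracketed form), so check help corr_svy after installing. The nhanes2 dataset and its variables are stand-ins for the course data.

```stata
* One-time step: search for the user-written package, then follow the
* install link shown in the results window
findit corr_svy

* Illustrative only: nhanes2 stands in for the course survey data
webuse nhanes2, clear

* ASSUMED syntax: design elements are passed explicitly to the command,
* as the text describes; verify against -help corr_svy-
corr_svy age height weight [pweight = finalwgt], psu(psu) strata(strata)
```

Listing three variables should produce a 3 x 3 correlation matrix with ones on the diagonal, which can be ignored as noted above.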

7. Bivariate statistics in an analysis workflow

Let us briefly review how to use bivariate statistics in an analysis workflow. Let us say that our study population is women who answered questions about domestic violence in the Rwanda 2010 Demographic and Health Survey. The outcome of our analysis is binary: either a woman experienced intimate partner violence in the last 12 months, or she did not.

Guided by our conceptual framework, which draws on a literature review, common sense, and our own experiences, we generated 20 variables that might be associated with intimate partner violence. We categorize all variables, and then use chi-square statistics to test whether each covariate is associated with the outcome. We summarize the findings for all variables, including those that are not statistically significant, in Table 2.

Table 2. Bivariate statistics

In any presentation of these results, we can talk about differences between women who did and did not experience intimate partner violence in the last 12 months based on statistical significance of the chi-square statistic at p<0.05 [black]. Using the same output, we decide to advance all variables that are associated at p<0.1 to the next stage in the analysis [black and red]. In most analyses, we find several variables that are not independently associated with the outcome, so we do not advance them in the analysis workflow.

Pearson's R Correlation Coefficients

With the covariates that remain, we use the Pearson's R test for collinearity to ensure that each variable in the analysis represents a unique concept, and that our multivariate model will be stable. We use the corr_svy statement to test for collinearity among all covariate pairs, and remove any collinear covariates from the analysis.

Now we are ready to move forward with multivariate modeling.
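The screening step in this workflow (advance covariates with p<0.1) could be automated along the following lines. This is a sketch under stated assumptions, not the course's procedure: it uses Stata's nhanes2 example dataset, treats diabetes as a stand-in binary outcome and sex, race, and agegrp as stand-in candidate covariates, and reads the p-value from the e(p_Pear) result stored by svy: tabulate.

```stata
* Illustrative only: nhanes2 stands in for the course survey data
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Test each candidate covariate against the binary outcome and flag
* those that pass the generous p<0.1 screening threshold
foreach v of varlist sex race agegrp {
    quietly svy: tabulate `v' diabetes, pearson
    local p = e(p_Pear)   // p-value of the design-based Pearson F test
    if `p' < 0.10 {
        display "`v': p = " %6.4f `p' "  -> advance to next stage"
    }
    else {
        display "`v': p = " %6.4f `p' "  -> do not advance"
    }
}
```

The full tables should still be inspected and reported, since Table 2 summarizes all variables, including those that are not statistically significant.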