Stata: Bivariate Statistics. Topics: Chi-square test, t-test, Pearson's R correlation coefficient


There are three situations during survey data analysis in which bivariate statistics are commonly used.

1. Compare two groups

First, bivariate statistics are used to compare two study groups to see if they are similar: for example, to compare two groups at baseline before an intervention is implemented, or to compare participants who were lost to follow-up with those who remained in the study. When comparing groups, we want to provide strong evidence of any group differences, so we use a conservative threshold of p<0.05 to determine statistical significance.

In this course, we are learning to analyze research questions with binary outcomes. Bivariate statistics can be used to summarize and compare characteristics across groups. For example, were there differences in the sociodemographic characteristics of women who did and did not experience intimate partner violence in the last 12 months?

2. Identify covariates for a general explanatory model

When a characteristic like age differs between people who did and did not experience the outcome, we say that the characteristic is associated with the outcome. This is because the characteristic helps to explain variance in the outcome. In cross-sectional data analysis, we cannot draw causal conclusions.

We are not describing causal mechanisms that predict the outcome. Although a woman's age group might be associated with whether or not she experienced intimate partner violence in the last 12 months, the biological process of aging does not cause her partner to act violently toward her. Rather, we are saying that a characteristic (like older age) tends to be present when the outcome is present.

When we are developing a general explanatory model, where the research question is "Which factors are associated with [the outcome]?", we use bivariate statistics to identify potential covariates that are worth testing in a multivariable model. If a variable is independently associated with the outcome, it might continue to explain the outcome once other factors are taken into account. When bivariate statistics are used for the purpose of filtering potential covariates for multivariable analysis, we use a generous threshold of p<0.1 to determine statistical significance, to ensure that we do not drop any potentially useful variables from the analysis.

Note that the statistical test used to compare two groups (for binary outcomes, usually the chi-square test) is the same test and output that we use here to filter variables. The only difference is the purpose of the test, and therefore our interpretation of its results.

3. Chi-square test

The chi-square test is a common bivariate statistic used to test whether the distribution of a categorical variable is statistically different across two or more groups. The chi-square test gives a yes/no answer: a p-value less than the threshold means that, yes, there are differences between the groups. In a manuscript, if you see a p-value next to a categorical variable (with data summarized as percentages), it is usually from a chi-square test.

[Example table from a published manuscript. Source: Manzi, A., et al. (2014), BMC Pregnancy and Childbirth]

The chi-square p-value is easy to interpret once you have set a threshold for statistical significance: either the distributions are, or are not, the same. The chi-square test is a global statistic; it tells you if there are any differences across cells, but it does not tell you which cell(s) are different. You can often tell which cells differ qualitatively based on the percentages, though additional or different testing might be performed to isolate whether certain cells are statistically different from the rest.

You should not use the chi-square test if one or more cells in the cross tabulation has fewer than five observations, though this is rare in survey data analysis, where tens of thousands of respondents are interviewed. If a response category has fewer than five observations, combine it with another category.

The chi-square test is simple to implement in Stata. In fact, we have been doing it all along! Each time we use the tabulate command with survey data (by starting with svy:), we produce a design-corrected Pearson chi-square statistic (reported as an F statistic) and its p-value.
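For example, here is a minimal sketch of that command, using hypothetical DHS-style names: ipv12m for the binary outcome, agegrp for a covariate, and v005, v021, and v022 for the weight, cluster, and strata variables.

    * Declare the survey design (DHS weights are stored multiplied by 1,000,000).
    gen wt = v005/1000000
    svyset v021 [pweight=wt], strata(v022)

    * svy: tabulate reports a design-based Pearson chi-square statistic,
    * converted to an F statistic, along with its p-value.
    svy: tabulate agegrp ipv12m, row percent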

4. T-test

A t-test is used to test whether the distribution of a continuous variable is statistically different across groups; a p-value less than the threshold means that, yes, there are differences. Do NOT use a t-test when the distribution of the outcome within groups is not normal, or when the variance is not the same across groups. In these situations, consider transforming the variable (we do not discuss this further in this course), or categorize the continuous values and test the result as a categorical variable.

You can produce t-test statistics for a continuous variable across two or more groups with survey data by specifying a linear regression and testing for differences in the outcome across group categories.
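A minimal sketch, assuming the svyset declaration above, the same hypothetical variable names, and a hypothetical continuous variable age:

    * The t statistic and p-value on the group indicator test whether
    * mean age differs between the two outcome groups.
    svy: regress age i.ipv12m

    * With more than two groups, test the group indicators jointly.
    svy: regress age i.agegrp
    testparm i.agegrp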

5. Test for collinearity between two covariates

Before fitting any kind of multivariable model, whether a general explanatory model or a hypothesis-testing model, you should test for collinearity. Collinearity occurs when two covariates in a multivariable model are highly related; usually this is because the two variables represent the same thing (the same concept, or things that happen simultaneously). For example, in a society where husbands and wives tend to have the same level of education, the wife's education and the husband's education represent the same construct within households. The wife's education might do a good job of explaining variance in the outcome, leaving little leftover variance to be explained by the husband's education. As a result, the model becomes unstable. To produce parsimonious (efficient) multivariable models, and to prevent strange, unstable results, we test for strong associations among covariates and remove any collinear covariates from the analysis.

Pearson's R correlation coefficient is used to identify binary, ordinal, and continuous covariates that are correlated. Correlations of r>0.5 are often considered collinear in the social sciences. When two or more covariates are found to be collinear, we keep the variable that is most strongly associated with the outcome, unless there is a conceptual reason to keep one over the other.

For nominal variables (variables with non-ordered categories), such as marriage type, you cannot use Pearson's R correlation coefficient. If you want to be rigorous, you might test one or more binary definitions of the variable, for example married (yes/no) or separated (yes/no), rather than a four-category definition of marital status, as sketched below. In practice, you might only do this step if you were concerned about collinearity for conceptual reasons.
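Here is a sketch of that recoding, assuming a hypothetical nominal variable marstat in which code 1 means currently married and code 3 means separated:

    * Create binary indicators from the nominal variable so that
    * Pearson's R can be used in the collinearity check.
    gen married   = (marstat == 1) if !missing(marstat)
    gen separated = (marstat == 3) if !missing(marstat)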

6. Pearson's R correlation coefficient

The reason we only use Pearson's R correlation coefficient for binary, ordinal, and continuous data is that it is a measure of the strength of linear association between two variables. Pearson's R answers the question: how strongly are two variables associated, on a scale from zero to one in absolute value? The Pearson's R statistic is related to linear regression; it tries to draw a line of best fit through the data of the two variables. The strength of association is measured on a scale of -1 to +1, where 0 indicates no association (as the value of one variable increases, the other is random). As r approaches +1, it denotes a positive association (as the value of one variable increases, the other increases). And as r approaches -1, it denotes a negative association (as the value of one variable increases, the other decreases).

The command used to perform correlation analysis with survey data does not come installed with Stata, so we use the findit command to find and install it. We only need to install the .ado command files once, after which the command is integrated into Stata. The command is corr_svy.

Since the command is not part of the standard Stata package, we have to manually specify all aspects of the sample design, including pweight(), psu(), and strata(). We can also include a subpop() statement, if applicable. If we include two variables in the corr_svy statement, Stata produces the Pearson's R correlation statistic for that one pair. If we list multiple variables, Stata produces Pearson's R correlation statistics for all pairwise combinations.

Note that the output shows a number of correlations equal to 1. We can ignore these. A correlation equals 1 whenever the same variable appears in both the row and the column of the correlation matrix; a variable is perfectly correlated with itself.
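A minimal sketch with hypothetical covariate names (educ_wife, educ_husb, agegrp, wealth); the design options shown follow the description above, so confirm the exact syntax with help corr_svy after installing:

    findit corr_svy    // locate and install the .ado files (one time only)

    * Pearson's R for one pair of covariates, specifying the design by hand.
    corr_svy educ_wife educ_husb [pweight=wt], psu(v021) strata(v022)

    * Listing more variables produces the full matrix of pairwise correlations.
    corr_svy educ_wife educ_husb agegrp wealth [pweight=wt], psu(v021) strata(v022)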

7. Bivariate statistics in an analysis workflow

[Table 2. Bivariate statistics]

Let us briefly review how to use bivariate statistics in an analysis workflow. Let us say that our study population is women who answered questions about domestic violence in the Rwanda 2010 Demographic and Health Survey. The outcome of our analysis is binary: either a woman experienced intimate partner violence in the last 12 months, or she did not. Based on our conceptual framework, we generated 20 variables that might be associated with intimate partner violence, drawing on a literature review, common sense, and our own experiences.

We categorize all variables, and then use chi-square statistics to test whether each covariate is associated with the outcome (a sketch of this screening step appears at the end of this section). We summarize the findings for all variables, including those that are not statistically significant, in Table 2. In any presentation of these results, we can talk about differences between women who did and did not experience intimate partner violence in the last 12 months based on statistical significance of the chi-square statistic at p<0.05 [black in Table 2]. Using the same output, we advance all variables that are associated at p<0.1 to the next stage in the analysis [black and red in Table 2]. In most analyses, we find several variables that are not independently associated with the outcome, so we do not advance them in the analysis workflow.

[Pearson's R correlation coefficients]

With the covariates that remain, we use the Pearson's R test for collinearity to ensure that each variable in the analysis represents a unique concept, so that our multivariable model will be stable. We use the corr_svy command to test for collinearity among all covariate pairs, and remove any collinear covariates from the analysis. Now we are ready to move forward with multivariable modeling.
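As a sketch of the chi-square screening step described above, with a few hypothetical covariate names standing in for the 20 candidate variables:

    * Cross-tabulate each candidate covariate against the outcome and
    * record the design-based chi-square p-value for Table 2.
    foreach v of varlist agegrp educ wealth residence {
        svy: tabulate `v' ipv12m, row percent
    }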