Institut für Soziologie Eberhard Karls Universität Tübingen www.maartenbuis.nl

Similar documents

The Stata Journal. Editor Nicholas J. Cox Department of Geography Durham University South Road Durham City DH1 3LE UK

Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, Last revised February 21, 2015

Direct and indirect effects in a logit model

Multinomial and Ordinal Logistic Regression

Development of the nomolog program and its evolution

From this it is not clear what sort of variable that insure is so list the first 10 observations.

Discussion Section 4 ECON 139/ Summer Term II

MULTIPLE REGRESSION EXAMPLE

LOGIT AND PROBIT ANALYSIS

ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

Standard errors of marginal effects in the heteroskedastic probit model

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.

Row vs. Column Percents. tab PRAYER DEGREE, row col

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

Handling missing data in Stata a whirlwind tour

Interaction effects and group comparisons Richard Williams, University of Notre Dame, Last revised February 20, 2015

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

From the help desk: hurdle models

How to set the main menu of STATA to default factory settings standards

Competing-risks regression

III. INTRODUCTION TO LOGISTIC REGRESSION. a) Example: APACHE II Score and Mortality in Sepsis

Module 14: Missing Data Stata Practical

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

VI. Introduction to Logistic Regression

Sample Size Calculation for Longitudinal Studies

Stata Walkthrough 4: Regression, Prediction, and Forecasting

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Introduction. Hypothesis Testing. Hypothesis Testing. Significance Testing

An assessment of consumer willingness to pay for Renewable Energy Sources use in Italy: a payment card approach.

Ordinal Regression. Chapter

Introduction to structural equation modeling using the sem command

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Nested Logit. Brad Jones 1. April 30, University of California, Davis. 1 Department of Political Science. POL 213: Research Methods

Nonlinear Regression Functions. SW Ch 8 1/54/

Department of Economics Session 2012/2013. EC352 Econometric Methods. Solutions to Exercises from Week (0.052)

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Introduction. Survival Analysis. Censoring. Plan of Talk

Multiple Choice Models II

Regression 3: Logistic Regression

Using Stata for Categorical Data Analysis

IMPACT EVALUATION: INSTRUMENTAL VARIABLE METHOD

A Simple Feasible Alternative Procedure to Estimate Models with High-Dimensional Fixed Effects

Multicollinearity Richard Williams, University of Notre Dame, Last revised January 13, 2015

Linear Regression Models with Logarithmic Transformations

25 Working with categorical data and factor variables

It is important to bear in mind that one of the first three subscripts is redundant since k = i -j +3.

Multivariate Logistic Regression

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Gender Effects in the Alaska Juvenile Justice System

Lecture 15. Endogeneity & Instrumental Variable Estimation

Propensity Score Matching Regression Discontinuity Limited Dependent Variables

Using Stata 9 & Higher for OLS Regression Richard Williams, University of Notre Dame, Last revised January 8, 2015

Modeling Health Care Costs and Counts

is paramount in advancing any economy. For developed countries such as

Estimating the random coefficients logit model of demand using aggregate data

Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation

LOGISTIC REGRESSION ANALYSIS

13. Poisson Regression Analysis

11. Analysis of Case-control Studies Logistic Regression

CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén Table Of Contents

Logistic Regression. BUS 735: Business Decision Making and Research

Logit Models for Binary Data

Moderator and Mediator Analysis

Rockefeller College University at Albany

Using Heterogeneous Choice Models. To Compare Logit and Probit Coefficients Across Groups

Two Correlated Proportions (McNemar Test)

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:

MODEL I: DRINK REGRESSED ON GPA & MALE, WITHOUT CENTERING

Analyses on Hurricane Archival Data June 17, 2014

Interaction effects between continuous variables (Optional)

xtmixed & denominator degrees of freedom: myth or magic

The first three steps in a logistic regression analysis with examples in IBM SPSS. Steve Simon P.Mean Consulting

VI. The Investigation of the Determinants of Bicycling in Colorado

10 Dichotomous or binary responses

From the help desk: Swamy s random-coefficients model

Nonlinear relationships Richard Williams, University of Notre Dame, Last revised February 20, 2015

Using Stata for One Sample Tests

Chapter 5 Analysis of variance SPSS Analysis of variance

From the help desk: Bootstrapped standard errors

A Basic Introduction to Missing Data

Correlation and Regression

Poisson Models for Count Data

Linear Models in STATA and ANOVA

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Title. Syntax. stata.com. fp Fractional polynomial regression. Estimation

A Guide to Imputing Missing Data with Stata Revision: 1.4

Basic Statistical and Modeling Procedures Using SAS

Regression with a Binary Dependent Variable

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Wooldridge, Introductory Econometrics, 4th ed. Chapter 7: Multiple regression analysis with qualitative information: Binary (or dummy) variables

Education and Wage Differential by Race: Convergence or Divergence? *

Nominal and ordinal logistic regression

Survey Data Analysis in Stata

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

Finding Supporters. Political Predictive Analytics Using Logistic Regression. Multivariate Solutions

Transcription:

from Indirect Extracting from Institut für Soziologie Eberhard Karls Universität Tübingen www.maartenbuis.nl

from Indirect What is the effect of x on y? Which effect do I choose: average marginal or marginal for someone with average values for the predictors or odds ratios, or...? Is the effect of x on y in group a the same as the effect of x on y in group b? How to interpret interaction : marginal effect for interaction or ratio of odds ratios? Is such a across groups even identified? How much of the effect of x on y can I explain with variable z? How can I get indirect/mediator?

from This introduction Indirect Quick review of What is an effect? What variables should we control for? What is a model?

from What is an effect? Indirect Almost always a means. Say we have data on the income of a number of males a number of comparable females. The the mean income of males females gives us the effect of gender on income. This comparison can take the form of a difference: women earn on average x euros/yen/pounds/dollars less than men, or it can take the form of a ratio: women earn on average y% less than men.

from OK, but what about continuous variables? Indirect Say we want to know the effect of age on income. Still a groups, each 1 year apart. Easiest solution is to constrain all these to be the same. The default for difference in linear regression. The default for ratio in regression with the log link-function.

from What variables do we need to control for? Indirect Confounding variables

from What variables do we need to control for? degree of urbanization Indirect a b number of storks c human vertility

from What variables do we need to control for? Indirect Confounding variables Not intervening variables

from What variables do we need to control for? offspring s education Indirect a b father s occupation c offspring s occupation

from What variables do we need to control for? Indirect Confounding variables Not intervening variables Not idiosyncratic error/rom noise/ luck Many exist to model a probability, an odds, a rate, or a hazard rate. These concepts are defined by what we consider to be idiosyncratic error/rom noise/ luck. In these the dependent variable is defined by what we choose not to control for.

from Non-linear Indirect f (E(y)) = β 0 + β 1 x 1 + β 2 x 2 + f () is the link function, e.g. logit link: log( u 1 u ) probit link: Φ(u) log link: log(u) an important characteristic of functions is that f (E[y]) E[f (y)] Many exist to accommodate known bounds in the dependent variable, e.g. probability [0,1], odds, rate, hazard rate 0. in terms of ratios

from Adjusted predictions (1) Indirect Say we want to know what the effect of having a college-degree on the probability of never being married, while controlling for age whether or not the respondent lives in the South of USA. We do a logitstic regression: sysuse nlsw88, clear logit union collgrad age south An effect is a means, so why not get a predicted probability for a typical respondent with a college-degree without a college-degree?

from Adjusted predictions (2) Indirect We could fix age south at the mean than predict the probability for the two groups:

from Adjusted predictions (2) Indirect. margin, at(collgrad=(0 1) (mean) south age) noatlegend Adjusted predictions Number of obs = 2246 Model VCE : OIM Expression : Pr(never_married), predict() Delta-method Margin Std. Err. z P> z [95% Conf. Interval] _at 1.0874183.0068627 12.74 0.000.0739676.100869 2.149139.0154553 9.65 0.000.1188472.1794308

from Adjusted predictions (2) Indirect We could fix age south at the mean than predict the probability for the two groups: Alternatively, We could predict the probabilities for all individuals, than compute the mean probabilities within each group:

from Adjusted predictions (2) Indirect. margin, at(collgrad=(0 1)) noatlegend Predictive margins Number of obs = 2246 Model VCE : OIM Expression : Pr(never_married), predict() Delta-method Margin Std. Err. z P> z [95% Conf. Interval] _at 1.0893587.0068779 12.99 0.000.0758783.102839 2.1517616.0154767 9.81 0.000.1214279.1820954

from Adjusted predictions (2) Indirect We could fix age south at the mean than predict the probability for the two groups: Alternatively, We could predict the probabilities for all individuals, than compute the mean probabilities within each group: Why are these not the same?

from Adjusted predictions (3) Indirect The predicted probability is Λ(xb), where exp(u) 1+exp(u) Λ(u) = xb = β 0 + β 1 x 1 + β 2 x 2 + The first method consists of computing Λ(E[xb]). The second method consists of computing E[Λ(xb)].

from Adjusted predictions (3) probability (p) 0.5 1 Indirect p xb 2 0 2 4 linear predictor (xb)

from OK, but what about the effect of continuous variables? Indirect We want to summarize by how much the probability of being unmarried decreases when one gets a year older. This is a rate of change, or first derivative. In this context often called marginal effect. Problem: the relationship between age the probability is, so there are many marginal

from Effect of age Indirect Pr(never married).4.3.2.1 0 20 30 40 50 60 70 age

from Effect of age (2) Indirect. margins, dydx(age) noatlegend /// > at((mean) collgrad south age=(35 45)) Conditional marginal Number of obs = 2246 Model VCE : OIM Expression : Pr(never_married), predict() dy/dx w.r.t. : age age Delta-method dy/dx Std. Err. z P> z [95% Conf. Interval] _at 1 -.0086154.0032458-2.65 0.008 -.014977 -.0022538 2 -.0046781.0008927-5.24 0.000 -.0064277 -.0029285

from OK, but what about the effect of continuous variables? Indirect We want to summarize by how much the probability of being unmarried decreases when one gets a year older. This is a rate of change, or first derivative. In this context often called marginal effect. Problem: the relationship between age the probability is, so there are many marginal Problem: We get different when first fix the explanatory variables than compute the marginal effect or first compute the marginal for each individual than average.

from Effect of age (3) Indirect. margins, dydx(age) noatlegend /// > at((mean) collgrad south age=(35 45)) Conditional marginal Number of obs = 2246 Model VCE : OIM Expression : Pr(never_married), predict() dy/dx w.r.t. : age age Delta-method dy/dx Std. Err. z P> z [95% Conf. Interval] _at 1 -.0086154.0032458-2.65 0.008 -.014977 -.0022538 2 -.0046781.0008927-5.24 0.000 -.0064277 -.0029285

from Too many Indirect So, what is the true effect? In a strick sense none of them, but they are all valid approximations There is an alternative that is not an approximation when the link function contains a logarithm. In that case the effect in terms of ratios is assumed to be constant.

from Effect of age (4).6 Indirect Odds(never married).4.2 0 20 30 40 50 60 70 age

from Effect of age (5) Indirect Odds(never married) log scale 1.5.1.05.01 20 30 40 50 60 70 age

from Effect of age (3) Indirect. gen byte baseline = 1. gen c_age = age - 30. logit never_married south c_age collgrad baseline, nolog nocons or Logistic regression Number of obs = 2246 Wald chi2(4) = 940.91 Log likelihood = -736.65888 Prob > chi2 = 0.0000 never_marr ~ d Odds Ratio Std. Err. z P> z [95% Conf. Interval] south.8552155.1219781-1.10 0.273.6466511 1.131048 c_age.9272802.0217551-3.22 0.001.8856065.970915 collgrad 1.829794.270653 4.08 0.000 1.369295 2.445159 baseline.2041465.0461335-7.03 0.000.1310948.3179058

from Marginal of interaction Indirect An interaction between two variables is included by creating a new variable that is the product of the two. In linear regression we can interpret the multiplicative term as how much the effect of variable 1 changes for a unit change in variable 2 ( vice versa). Ai Norton (2003) pointed out that this does not work for marginal in. The aim is find out how much the effect of x 1 changes for a unit change in x 2 i.e. the cross partial derivative with respect to x 1 x 2. These can be computed in Stata by inteff inteff3.

from Ratio effect of interaction Indirect The easier solution is to interpret interaction in terms of ratio. The interaction effect can now be interpreted as the ratio by which the effect of x 2 changes for a unit change in x 1

from Indirect. sysuse nlsw88, clear (NLSW, 1988 extract). gen byte high_occ = occupation < 3 if occupation <. (9 missing values generated). gen byte black = race == 2 if race <.. drop if race == 3 (26 observations deleted). gen byte baseline = 1. logit high_occ black##collgrad baseline, or nocons nolog Logistic regression Number of obs = 2211 Wald chi2(4) = 504.62 Log likelihood = -1199.4399 Prob > chi2 = 0.0000 high_occ Odds Ratio Std. Err. z P> z [95% Conf. Interval] 1.black.4194072.0655069-5.56 0.000.3088072.5696188 1.collgrad 2.465411.293568 7.58 0.000 1.952238 3.113478 black# collgrad 1 1 1.479715.4132536 1.40 0.161.8559637 2.558003 baseline.3220524.0215596-16.93 0.000.2824512.3672059

from Indirect. margins, over(black collgrad) expression(exp(xb())) post Predictive margins Number of obs = 2211 Model VCE : OIM Expression : exp(xb()) over : black collgrad Delta-method Margin Std. Err. z P> z [95% Conf. Interval] black# collgrad 0 0.3220524.0215596 14.94 0.000.2797964.3643084 0 1.7939914.078188 10.15 0.000.6407457.9472371 1 0.1350711.0190606 7.09 0.000.097713.1724292 1 1.4927536.1032487 4.77 0.000.29039.6951173

from Indirect. lincom 0.black#1.collgrad - 0.black#0.collgrad ( 1) - 0bn.black#0bn.collgrad + 0bn.black#1.collgrad = 0 Coef. Std. Err. z P> z [95% Conf. Interval] (1).471939.081106 5.82 0.000.3129742.6309038. lincom 1.black#1.collgrad - 1.black#0.collgrad ( 1) - 1.black#0bn.collgrad + 1.black#1.collgrad = 0 Coef. Std. Err. z P> z [95% Conf. Interval] (1).3576825.1049933 3.41 0.001.1518994.5634656

from Latent variable interpretation of logistic regression Indirect Assume that there is some latent propensity of success y Someone gets a succusses if y > 0 otherwise a failure. y = β 0 + β 1 x 1 + β 2 x 2 + + ε the scale of y is fixed by fixing the stard deviation of ε to a fixed number π 3. If we compare across groups or we have to assume that the residual variance is equal otherwise the scale of the dependent variable will differ.

from Scenarios Indirect One way to get an idea about the size of this problem is to estimate various scenarios. The idea is that the heteroscedasticity comes from a (composite) unobserved variable, to make assumptions regarding the size of the effect of this variable, its distribution, how the effect changes when the observed variable of interest changes. The effect of the observed variables can than be estimated by integrating the likelihood function over this unobserved variable, which can be done by maximum simulated likelihood. This is implemented in Stata in scenreg.

from Indirect Probability odds interpretation of logistic regression The problem is that the scale of y is not defined We can solve that by interpreting the in terms of probabilities or odds, as these have a known scale. This does not do away with all arbitrariness: the probability is defined in terms of what variables we chose to designate idiosyncratic error/luck i.e. which variables we choose not to control for. Comparison of groups (interaction ) can be solved this way, but comparisons of with different explanatory variables (indirect ) are still problematic.

from Problem with naïve method Indirect. drop _all. set obs 60000 obs was 0, now 60000. gen z = ceil(_n / 20000) - 1. bys z: gen x = ceil(_n / 10000) - 1. tab x z z x 0 1 2 Total 0 10,000 10,000 10,000 30,000 1 10,000 10,000 10,000 30,000 Total 20,000 20,000 20,000 60,000. set seed 12345. gen y = runiform() < invlogit(-4 + 4*x + 2*z)

from Problem with naïve method Indirect. qui logit y x z. est store direct. local direct = _b[x].. qui logit y x. est store total. local total _b[x].. est tab direct total Variable direct total x 4.0391332 2.6256242 z 2.026339 _cons -4.0452305-1.3123133. di "naive indirect effect = " `total - `direct naive indirect effect = -1.413509

from Indirect ldecomp solution: Say we want to find the indirect effect of college education through occupation on union membership. Estimate a logistic regression with all variables. Predict the log odds for each respondent transform these to proportions. Compute the average proportion for college-graduates non-college-graduates, transform back to log odds: the difference between these is the total effect. Compute the average proportion for college graduates, assuming they have the distribution of occupation of the non-college-graduates. The only difference between the college graduates the counterfactual group is the distribution of occupation, so this difference represents the indirect effect. The distribution of occupation remains constant when comparing the counterfactual group with the

from Indirect Example. sysuse nlsw88, clear (NLSW, 1988 extract). gen byte high = occupation < 3 if occupation <. (9 missing values generated). gen byte middle = occupation >= 3 & occupation < 7 if occupation <. (9 missing values generated). ldecomp union south, direct(collgrad) indirect(high middle) at(south 0) or (running _ldecomp on estimation sample) Bootstrap replications (50) 1 2 3 4 5... 50 Bootstrap results Number of obs = 1869 Replications = 50 Observed Bootstrap Normal-based Odds Ratio Std. Err. z P> z [95% Conf. Interval] 1/0 total 1.657501.1867359 4.49 0.000 1.329096 2.06705 indirect1.8958344.0491377-2.01 0.045.8045225.9975101 direct1 1.850231.2249036 5.06 0.000 1.458004 2.347974 indirect2.8872166.0493551-2.15 0.031.7955693.9894213 direct2 1.868203.2311406 5.05 0.000 1.465921 2.380881 in equation i/j (comparing groups i j) let the fist subscript of Odds be the distribution of the the indirect variable let the second subscript of Odds be the conditional probabilities Method 1: Indirect effect = Odds_ij/Odds_jj Direct effect = Odds_ii/Odds_ij Method 2: Indirect effect = Odds_ii/Odds_ji Direct effect = Odds_ji/Odds_jj value labels 0 not college grad 1 college grad

from Indirect There are two things that make more difficult than The dependent variable is related to the independent variables via a function The dependent variable is not directly observed, but a function of our model Often we can prevent the problem by using ratio instead of difference Sometimes we can bypass this problem by using a linear model as an approximation. Sometimes we will just have to use more complicated methods (ldecomp, scenreg, inteff, inteff3)