Efficient and Practical Econometric Methods for the SLID, NLSCY, NPHS

Philip Merrigan, ESG-UQAM, CIRPÉE
Using Big Data to Study Development and Social Change, Concordia University, November 2013

Intro
Survey of Labour and Income Dynamics (SLID)
National Longitudinal Survey of Children and Youth (NLSCY)
National Population Health Survey (NPHS)
All three are short panels (3 to 8 periods) and are available only in Statistics Canada Research Data Centres (RDCs).

Data Formatting
SLID: the SLIDRETS RDC software will sample the SLID for you, cross-sectionally and longitudinally.
NPHS: each wave includes the data for all waves, with one record per individual. It must be transformed into panel data form, with one record per individual per wave.
NLSCY: one data set per wave, with different variable names in each wave. It must be transformed as for the NPHS. (Programs to perform this task will soon be available at CIQSS.)
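The reshaping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names (health_w1, income_w1, ...), the "_w" wave suffix and the helper wide_to_long are hypothetical and are not taken from the NPHS or NLSCY files. For the NLSCY, the wave-specific variable names would first have to be mapped to common names.

```python
# Hypothetical wide-format records: one row per person, one column per wave.
# The column names are made up for illustration only.
wide_rows = [
    {"id": 1, "health_w1": 3, "health_w2": 4, "income_w1": 40.0, "income_w2": 42.5},
    {"id": 2, "health_w1": 2, "health_w2": None, "income_w1": 55.0, "income_w2": None},
]


def wide_to_long(rows, id_key, stubs, waves, sep="_w"):
    """Return one record per individual per wave.

    A wave is dropped for an individual when every variable is missing in it
    (for example, after attrition).
    """
    long_rows = []
    for row in rows:
        for wave in waves:
            values = {stub: row.get(f"{stub}{sep}{wave}") for stub in stubs}
            if all(v is None for v in values.values()):
                continue  # person not observed in this wave
            long_rows.append({id_key: row[id_key], "wave": wave, **values})
    return long_rows


panel = wide_to_long(wide_rows, "id", stubs=["health", "income"], waves=[1, 2])
for record in panel:
    print(record)
```

The result has one record per individual per wave, which is the layout the panel estimators discussed below expect.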

Non-Linear Models for Longitudinal Data
Mixed models for longitudinal data are difficult to work with when data sets are large and specifications are rich. Advances in computing power have greatly increased the speed at which the parameters of these models can be estimated, and the major statistical software packages now include mixed model procedures for non-linear models.

STATA has for many years offered efficient routines for the panel probit, logit and Poisson models, and has recently added ordered logit and ordered probit procedures. These routines allow for individual heterogeneity in the intercept.

GLLAMM, a STATA package written by S. Rabe-Hesketh, A. Skrondal and A. Pickles, fits the same models as the built-in STATA routines and adds several others, including survival models and the multinomial logit. GLLAMM allows random slopes as well as random intercepts, which greatly increases the number of numerical computations required for estimation.

Because these panels are short, fixed effects implemented with a dummy variable for each individual generally produce inconsistent estimates in non-linear models (probit, logit, Poisson, multinomial logit, etc.). However, in most cases with non-experimental data, spurious correlation is a major problem, and introducing individual fixed effects is crucial to making a causal interpretation of the estimates credible.

Non-linear models
For the binary logit, the conditional logit of Chamberlain (STATA: xtlogit, fe) handles fixed effects at the individual level. Unfortunately, average marginal effects cannot be recovered, because the method provides no estimate of the distribution of the fixed effects. If estimation must condition on fixed effects and marginal effects are needed to achieve the goals of the paper (as is generally the case), Wooldridge (2011) provides an easily implemented procedure for including fixed effects in the estimation.

Simple method to include fixed individual effects in Non-linear models
Say y*(it) is the latent dependent variable, x(it) is a regressor, c(i) is the individual effect, and u(it) is the error term. Hence,
y*(it) = b(0) + b(1)*x(it) + c(i) + u(it), i = 1, ..., N, t = 1, ..., T.
Wooldridge's suggestion is to write
c(i) = a(0) + g(1)*mx(i) + v(i),
where mx(i) is the mean of x(it) over the T periods.

Substituting, we get
y*(it) = c(0) + b(1)*x(it) + g(1)*mx(i) + v(i) + u(it), i = 1, ..., N, t = 1, ..., T,
where c(0) = b(0) + a(0). We assume that all the correlation between x(it) and c(i) is captured by mx(i), which we include in the regression. Because most non-linear models have a linear index, or several linear indexes (multinomial models, for example), this «trick» can be used in a large number of cases.
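Constructing mx(i) is a data-preparation step that comes before the random effects estimation. A minimal Python sketch follows, assuming a long-format panel with hypothetical field names (id, wave, x). The helper add_individual_means and the m_ prefix are illustrative only. With an unbalanced panel, this sketch averages over the waves actually observed for each individual, which is an assumption and not something the slides specify.

```python
from collections import defaultdict

# Hypothetical long-format panel: one record per individual per wave.
panel = [
    {"id": 1, "wave": 1, "x": 2.0},
    {"id": 1, "wave": 2, "x": 4.0},
    {"id": 2, "wave": 1, "x": 1.0},
    {"id": 2, "wave": 2, "x": 1.0},
    {"id": 2, "wave": 3, "x": 4.0},
]


def add_individual_means(rows, id_key, regressors):
    """Append mx(i), the within-individual mean of each time-varying regressor."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        counts[row[id_key]] += 1
        for name in regressors:
            sums[row[id_key]][name] += row[name]
    out = []
    for row in rows:
        new_row = dict(row)
        for name in regressors:
            # Same value on every record of individual i
            new_row[f"m_{name}"] = sums[row[id_key]][name] / counts[row[id_key]]
        out.append(new_row)
    return out


augmented = add_individual_means(panel, "id", ["x"])
for record in augmented:
    print(record)
```

The augmented records can then be passed to a random effects routine with both x and m_x as regressors.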

A simple test of individual heterogeneity is the null hypothesis that the coefficients on the means are equal to 0, which in most cases will be rejected. If the model has several time-varying regressors, there will be a large number of coefficients to estimate, even more so if one includes interaction terms for the means. But with advances in computing power, all of this becomes feasible.

Caveat: this method adds credibility to a causal interpretation for variables that change value from one period to the next, but does little to address spurious correlation for variables that are constant over the sample period, although they must be included in the model. It is very useful when one is trying to estimate the effects of income on discrete outcomes such as good health or child development outcomes. The idea is to obtain credible estimates without using instrumental variable procedures.

Unfortunately, the STATA xt commands with the «re» (random effects) option do not provide average marginal effects; they provide marginal effects at a specific value of v(i), namely 0. However, the «pa» option of the xt commands, with the «exchangeable» correlation structure for the error term, gives estimates very similar to those from the «re» option and also provides estimates of the average marginal effects with standard errors.

One must be careful, when estimating marginal effects, not to confuse effects computed at particular values of v(i) with average marginal effects. (A good discussion is in Rabe-Hesketh, S. and Skrondal, A. (2012). Multilevel and Longitudinal Modeling Using Stata, Third Edition. College Station, TX: Stata Press.)
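The distinction can be illustrated numerically for a probit with a random intercept. The sketch below assumes v(i) is normal with mean 0 and standard deviation sigma_v, and that u(it) is standard normal. The parameter values are hypothetical. Under these assumptions, the effect with v(i) integrated out has a closed form, which the last function checks by brute-force averaging over v(i).

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def me_at_v(b0, b1, x, v=0.0):
    """Marginal effect of x on P(y=1 | x, v) for a probit with index b0 + b1*x + v."""
    return b1 * norm_pdf(b0 + b1 * x + v)


def me_population_averaged(b0, b1, x, sigma_v):
    """Marginal effect of x on P(y=1 | x), with v ~ N(0, sigma_v^2) integrated out."""
    scale = math.sqrt(1.0 + sigma_v ** 2)
    return (b1 / scale) * norm_pdf((b0 + b1 * x) / scale)


def me_integrated_numerically(b0, b1, x, sigma_v, n=4001, width=8.0):
    """Check: average me_at_v over the distribution of v on a fine grid."""
    lo = -width * sigma_v
    step = 2.0 * width * sigma_v / (n - 1)
    total = 0.0
    for k in range(n):
        v = lo + k * step
        weight = norm_pdf(v / sigma_v) / sigma_v * step
        total += me_at_v(b0, b1, x, v) * weight
    return total


# Hypothetical parameter values, for illustration only.
b0, b1, x, sigma_v = 0.0, 0.5, 0.0, 1.0
at_zero = me_at_v(b0, b1, x)
averaged = me_population_averaged(b0, b1, x, sigma_v)
check = me_integrated_numerically(b0, b1, x, sigma_v)
print(at_zero, averaged, check)
```

With these values the effect evaluated at v(i) = 0 is noticeably larger than the effect averaged over v(i), which is why the two should not be reported interchangeably. An average marginal effect for the sample would additionally average over the observed x(it).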

For models with random slopes as well as random intercepts, GLLAMM provides estimates of marginal probabilities as well as predicted probabilities for particular values of the individual heterogeneity term. However, it has no procedure for estimating marginal effects. Because predicted probabilities are available, it is quite simple to program estimates of marginal effects. Standard errors of the marginal effects will nevertheless take a long time to compute if one uses bootstrap methods, even with very powerful computers.
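One way to program marginal effects from predicted probabilities is by finite differences. The Python sketch below is not GLLAMM code: predicted_probability is a hypothetical stand-in for whatever predicted-probability routine the fitted model provides, and its coefficients and the sample values are made up. A bootstrap standard error would repeat the whole estimation and this calculation on each resample, which is where the computing time goes.

```python
import math


def predicted_probability(x, income):
    """Hypothetical stand-in for a fitted model's predicted probability."""
    index = -1.0 + 0.8 * x + 0.3 * income
    return 1.0 / (1.0 + math.exp(-index))


def average_marginal_effect(predict, rows, name, h=1e-5):
    """Average central-difference derivative of predict with respect to one regressor."""
    total = 0.0
    for row in rows:
        up = dict(row)
        up[name] += h
        down = dict(row)
        down[name] -= h
        total += (predict(**up) - predict(**down)) / (2.0 * h)
    return total / len(rows)


# Made-up observations, for illustration only.
sample = [
    {"x": 0.0, "income": 1.0},
    {"x": 1.0, "income": 2.0},
    {"x": 2.0, "income": 0.5},
]
ame_income = average_marginal_effect(predicted_probability, sample, "income")
print(ame_income)
```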

Longitudinal estimation with bootstrap weights
Statistics Canada data sets provide bootstrap weights, which permit inference that takes into account the complex sampling design of these surveys. With longitudinal data, weights are in general provided only for individuals who are sampled in every wave. For the NLSCY and NPHS, this means much smaller data sets if one uses all 8 waves (although they remain large).
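The mechanics of inference with bootstrap weights can be sketched as follows for a weighted mean: the estimate is recomputed once per set of bootstrap weights, and the spread of those replicate estimates gives the variance. The data and weights below are made up. The variance formula used here (mean squared deviation of the replicate estimates around the full-sample estimate) is a common convention and is an assumption in this sketch; the survey documentation defines the exact formula to use.

```python
import math


def weighted_mean(values, weights):
    """Survey-weighted mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_weight_se(values, main_weights, replicate_weights):
    """Point estimate and standard error from replicate (bootstrap) weights.

    The point estimate uses the main survey weight. The variance is the mean
    squared deviation of the replicate estimates around that point estimate.
    """
    theta = weighted_mean(values, main_weights)
    replicates = [weighted_mean(values, w) for w in replicate_weights]
    variance = sum((r - theta) ** 2 for r in replicates) / len(replicates)
    return theta, math.sqrt(variance)


# Made-up binary outcome, main weights and three sets of bootstrap weights.
values = [1.0, 0.0, 1.0, 0.0]
main_weights = [10.0, 20.0, 30.0, 40.0]
replicate_weights = [
    [20.0, 20.0, 20.0, 40.0],
    [0.0, 30.0, 30.0, 40.0],
    [10.0, 10.0, 40.0, 40.0],
]
theta, se = bootstrap_weight_se(values, main_weights, replicate_weights)
print(theta, se)
```

For a regression model the same loop applies, with the model re-estimated under each set of bootstrap weights, so the cost grows with the number of replicates.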

Missing values
With high attrition, the issue of missing values becomes more important. SAS and STATA now offer superb procedures (PROC MI and mi) to perform multiple imputation for almost any type of continuous or discrete missing value, dependent or explanatory, using Bayesian Markov chain methods. Again, when data sets are large, this involves a very large number of computations. For example, STATA suggests 100 burn-in iterations.

Suggestions: first, run the regressions without the observations with missing values, or with dummy variables indicating missing values. Once a preferred specification is found, re-run it with 10 imputed data sets to measure the sensitivity of the results to missing values. The same approach applies to inference with bootstrap «weights».
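Once the model has been re-run on each imputed data set, the results are usually combined with Rubin's rules. The Python sketch below shows that combination step for a single coefficient. The estimates and variances are made up, and five imputations are used only to keep the example short.

```python
def combine_imputations(estimates, variances):
    """Combine one coefficient across imputed data sets using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    within = sum(variances) / m                     # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between      # total variance
    return q_bar, total ** 0.5


# Made-up coefficient estimates and squared standard errors from 5 imputations.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.01, 0.01, 0.01, 0.01, 0.01]
pooled, pooled_se = combine_imputations(estimates, variances)
print(pooled, pooled_se)
```

Comparing the pooled estimate with the estimate from the regression without imputation gives a direct measure of the sensitivity of the results to missing values.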

Conclusion
Computing time is an important challenge for Big Data, in particular for non-linear panel data models. The most recent generation of computers, with parallel processing, permits the estimation of complex models in a reasonable amount of time. Adding fixed individual effects to models that seek to identify causal links is problematic. For cases where spurious correlation is an evident problem (for example, the impact of income on health or child development outcomes), we detailed a fixed effect procedure that can be easily implemented. At a minimum, multiple imputation should be used to address missing value problems.