1. Economics 245A: Cluster Sampling & Matching

Cluster sampling arises in a number of contexts. For example, consider a study of retirement saving. Retirement saving for employees within a firm is likely to be correlated, because of common features of the firm (such as the type of retirement plan) or because of common (often unobserved) characteristics of employees within a firm. Each firm represents a group, or cluster, and we may sample several workers from each of a large number of firms. Other examples are a study of teenage peer effects, in which we have a few teenagers from each of a large number of neighborhoods or high schools (the neighborhoods or schools are the clusters), and a study of siblings in a large sample of families (families are the clusters). The key is that we sample a large number of clusters, and each cluster consists of a small number of observations relative to the overall sample size. We allow the units within a cluster to be correlated, but we assume independence across clusters.

2. Matched Pairs

Let us begin with a study of siblings in a large sample of families. The idea is to use siblings to control for unobserved family background. Our thought experiment is to have two identical individuals, for whom we vary one exogenous effect. We attempt to approximate two identical individuals by studying siblings. For each family $i$ there are two siblings,
\[
y_{i1} = x_{i1}\beta + f_i + u_{i1}, \qquad y_{i2} = x_{i2}\beta + f_i + u_{i2},
\]
where the equations are for siblings 1 and 2 and $f_i$ is an unobserved family effect. The strict exogeneity assumption now implies that the error $u_{is}$ in each sibling's equation is uncorrelated with the explanatory variables in both equations. For example, let $y$ be log(wage) and let $x$ contain years of schooling. Then we must assume that a sibling's schooling has no effect on one's own wage once we control for own schooling, the family effect, and other observed covariates.

If $f_i$ is assumed to be uncorrelated with $x_{i1}$ and $x_{i2}$, then random effects analysis can be used.
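Whether the family effect is correlated with the regressors matters a great deal in practice. The simulation below is a minimal sketch and not part of the original notes: it assumes a scalar regressor, a true coefficient of 0.10, normal errors, and schooling that loads on the family effect. The function names (`ols_slope`, `simulate_siblings`) and all parameter values are hypothetical. It compares pooled OLS on the stacked sibling data with a regression of the sibling difference in $y$ on the sibling difference in $x$, which removes the family effect.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_siblings(n_families=5000, beta=0.10, seed=0):
    """Return (pooled OLS slope, sibling-difference slope).

    Assumed design (illustrative only): the family effect f raises both
    schooling and wages, so f is correlated with x.
    """
    rng = random.Random(seed)
    x_all, y_all, dx, dy = [], [], [], []
    for _ in range(n_families):
        f = rng.gauss(0.0, 1.0)                # unobserved family effect
        x1 = 12.0 + f + rng.gauss(0.0, 1.0)    # schooling, sibling 1
        x2 = 12.0 + f + rng.gauss(0.0, 1.0)    # schooling, sibling 2
        y1 = beta * x1 + f + rng.gauss(0.0, 0.5)
        y2 = beta * x2 + f + rng.gauss(0.0, 0.5)
        x_all += [x1, x2]
        y_all += [y1, y2]
        dx.append(x1 - x2)                     # differencing removes f
        dy.append(y1 - y2)
    return ols_slope(x_all, y_all), ols_slope(dx, dy)


pooled, differenced = simulate_siblings()
print(f"pooled OLS slope:         {pooled:.3f}")
print(f"sibling-difference slope: {differenced:.3f}")
```

In this design the pooled slope absorbs part of the family effect and lands well above 0.10, while the differenced slope stays close to it. The differenced regression includes an intercept only for convenience; the difference equation itself has none.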

More commonly, $f_i$ is assumed to be correlated with $x_{i1}$ and $x_{i2}$, in which case differencing across siblings to remove $f_i$ is the appropriate strategy. Under this strategy, $x$ cannot contain common observable family background variables, as these are indistinguishable from $f_i$. Standard IV estimators can be applied directly to the difference equation
\[
y_{i1} - y_{i2} = (x_{i1} - x_{i2})\beta + (u_{i1} - u_{i2}).
\]

3. General Cluster Samples

Matched pairs are a special case of a cluster sample. As noted above, observations within a cluster are thought to be correlated because of an unobserved cluster effect. Suppose we model the retirement saving of individual $m$ in cluster (firm) $g$ as
\[
y_{gm} = \alpha + x_g\beta + z_{gm}\gamma + v_{gm},
\]
where $x_g$ are explanatory variables that vary only at the firm level (i.e., firm characteristics) and $z_{gm}$ are explanatory variables that vary within (and across) firms (that is, they vary at the employee level). There are $G$ clusters and $M_g$ observations within cluster $g$ (so different numbers of employees may be sampled from each firm).

3.1. Cluster Intercept. A simple starting point, which is surprisingly flexible, is to let $x_g$ consist only of a constant term, so that each firm has its own mean level of saving:
\[
y_{gm} = c_g + z_{gm}\gamma + v_{gm}.
\]
A (standard) strict exogeneity assumption requires that the error $v_{gm}$ be uncorrelated with the explanatory variables $z_{gm'}$ for all individuals $m'$ from cluster $g$. That is, the error for one employee in a firm must be uncorrelated with $z$ for all other employees within the same firm. The cluster effect $c_g$ usually renders this assumption plausible. If we assume that $c_g$ is uncorrelated with $z_{gm}$ (that is, the differences in average retirement saving across firms are not related to the characteristics of the employees within firms), then pooled OLS is consistent. If we allow for correlation between $c_g$ and $z_{gm}$, then we demean within clusters to remove the cluster effect and use pooled OLS (or IV) on the demeaned data.

3.2. General Cluster: Large Group Asymptotics. Logic: from a population of clusters, we randomly draw $G$ clusters, where cluster $g$ has $M_g$ observations. $G$ should be sufficiently large relative to $M_g$ that we can allow for unrestricted correlation within clusters.
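One way to make "unrestricted correlation within cluster" concrete is a sandwich variance whose middle term is summed cluster by cluster, so that no structure is imposed on the errors inside a cluster. The sketch below is an illustration under stated assumptions rather than the notes' own implementation: a single individual-level regressor $z$ plus an intercept, a simulated design with $G = 500$ and $M_g = 5$, no finite-sample correction, and hypothetical helper names (`inv2`, `matmul2`, `pooled_ols_cluster_se`).

```python
import math
import random


def inv2(a):
    """Inverse of a 2x2 matrix stored as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def pooled_ols_cluster_se(clusters):
    """Pooled OLS of y on (1, z) with naive and cluster-robust SEs.

    clusters: list of clusters, each a list of (z, y) pairs.
    Returns (coef, naive_se, cluster_se), each a list [intercept, slope].
    """
    wtw = [[0.0, 0.0], [0.0, 0.0]]   # sum over g of W_g' W_g
    wty = [0.0, 0.0]                 # sum over g of W_g' y_g
    n = 0
    for obs in clusters:
        for z, y in obs:
            w = (1.0, z)
            for i in range(2):
                wty[i] += w[i] * y
                for j in range(2):
                    wtw[i][j] += w[i] * w[j]
            n += 1
    bread = inv2(wtw)
    coef = [sum(bread[i][j] * wty[j] for j in range(2)) for i in range(2)]

    meat = [[0.0, 0.0], [0.0, 0.0]]  # sum over g of W_g' v_g v_g' W_g
    ssr = 0.0
    for obs in clusters:
        score = [0.0, 0.0]           # W_g' v_g for this cluster
        for z, y in obs:
            resid = y - coef[0] - coef[1] * z
            ssr += resid ** 2
            score[0] += resid
            score[1] += z * resid
        for i in range(2):
            for j in range(2):
                meat[i][j] += score[i] * score[j]

    sandwich = matmul2(matmul2(bread, meat), bread)
    sigma2 = ssr / (n - 2)
    naive_se = [math.sqrt(sigma2 * bread[i][i]) for i in range(2)]
    cluster_se = [math.sqrt(sandwich[i][i]) for i in range(2)]
    return coef, naive_se, cluster_se


# Simulated data (assumed design): a cluster effect in the error and a
# regressor with a cluster-level component, both independent of each other.
rng = random.Random(1)
clusters = []
for _ in range(500):                       # G = 500 clusters
    c_g = rng.gauss(0.0, 1.0)              # cluster effect
    z_bar = rng.gauss(0.0, 1.0)            # cluster-level part of z
    obs = []
    for _ in range(5):                     # M_g = 5 observations each
        z = z_bar + rng.gauss(0.0, 1.0)
        y = 1.0 + 0.5 * z + c_g + rng.gauss(0.0, 1.0)
        obs.append((z, y))
    clusters.append(obs)

coef, naive_se, cluster_se = pooled_ols_cluster_se(clusters)
print("slope:", round(coef[1], 3))
print("naive SE:", round(naive_se[1], 4),
      "cluster-robust SE:", round(cluster_se[1], 4))
```

Because both the error and the regressor are correlated within clusters in this design, the cluster-robust standard error on the slope comes out larger than the conventional one, even though the point estimate is the same pooled OLS estimate.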

We first assume
\[
E(v_{gm} \mid x_g, z_{gm}) = 0, \qquad m = 1, \ldots, M_g, \quad g = 1, \ldots, G.
\]
We could replace this assumption with a weaker one, requiring only that the variables be uncorrelated with the error. Note also that this assumption is weaker than the one made above, in that we only require the error $v_{gm}$ to be unrelated to $z_{gm}$; hence the error for one employee may be correlated with the explanatory variables for other employees. Under this assumption the pooled OLS estimator is consistent if the number of groups grows ($G \to \infty$) while the group size remains constant ($M_g$ is fixed). The estimator is $\sqrt{G}$-asymptotically normal.

To construct a robust variance estimator, note that $v_{gm}$ is likely correlated across individuals within a cluster, and its variance may also vary across individuals (conditional heteroskedasticity), so we write the model at the group level. With $y_g = (y_{g1}, \ldots, y_{gM_g})'$,
\[
y_g = W_g\theta + v_g,
\]
where $W_g$ is the $M_g \times (1 + K + L)$ matrix of all regressors and $\theta$ stacks the coefficients $(\alpha, \beta', \gamma')'$. The robust standard errors are the square roots of the diagonal elements of
\[
\left( \sum_{g=1}^{G} W_g' W_g \right)^{-1}
\left( \sum_{g=1}^{G} W_g' \hat v_g \hat v_g' W_g \right)
\left( \sum_{g=1}^{G} W_g' W_g \right)^{-1},
\]
where $\hat v_g$ is the $M_g \times 1$ vector of residuals from the pooled OLS regression.

3.2.1. GLS. The pooled OLS estimator ignores the within-cluster correlation of $v_{gm}$. To take advantage of that correlation, we must strengthen the exogeneity assumption to
\[
E(v_{gm} \mid x_g, Z_g) = 0,
\]
where $Z_g$ is the $M_g \times L$ matrix of individual covariates for cluster $g$. Thus we return to the assumption under which the error for an individual is exogenous to the covariates of all other individuals in the cluster. With this assumption, we rewrite the error as
\[
v_{gm} = c_g + u_{gm}.
\]

(In statistics, this equation in combination with the original linear model specification is termed a hierarchical linear model.) The resulting covariance matrix for the error vector $v_g$ is the $M_g \times M_g$ matrix
\[
\operatorname{Var}(v_g) =
\begin{pmatrix}
\sigma_c^2 + \sigma_u^2 & \sigma_c^2 & \cdots & \sigma_c^2 \\
\sigma_c^2 & \sigma_c^2 + \sigma_u^2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \sigma_c^2 \\
\sigma_c^2 & \cdots & \sigma_c^2 & \sigma_c^2 + \sigma_u^2
\end{pmatrix}.
\]
While we typically assume that $\operatorname{Var}(v_g) = \operatorname{Var}(v_g \mid x_g, Z_g)$, so that we have conditional homoskedasticity, we can still gain efficiency by using GLS. We then estimate the model via GLS, using a consistent estimator of the covariance matrix.

3.3. Large Group Size Asymptotics. Logic: we stratify the population into $G$ groups and then sample randomly $M_g$ times from each group. For example, Card and Krueger have $G = 2$ states (NJ and PA), Bound has $G = 34$, and a study of all states would have $G = 50$.

To understand the pitfalls of applying standard analysis with small $G$, consider the case in which $x_g$ is scalar and $z_{gm}$ is not present:
\[
y_{gm} = \alpha + \beta x_g + c_g + u_{gm},
\]
where $c_g$ and $u_{gm}$ are independent of $x_g$ and $\{u_{gm}\}$ is iid with mean zero for each $g$. If $c_g$ is absent from the model, then pooled OLS is consistent and inference is straightforward. If $\operatorname{Var}(u_{gm})$ is constant across $g$, then standard OLS t-statistics are correct. If we allow for heteroskedasticity, then we simply use the Eicker-White correction (or feasible GLS, as we have multiple observations on each cluster).

With cluster effects, the analysis is quite different. Let $c_g \sim N(0, \sigma_c^2)$, which we assume to be independent of $\{u_{gm}\}$. The pooled OLS estimator $\hat\beta$ is identical to the regression of the group averages $\bar y_g$ on $1, x_g$ for $g = 1, \ldots, G$ (weighted by group size when the $M_g$ differ). This is sometimes referred to as the between-groups estimator. Conditional on $x_g$, $\hat\beta$ inherits its distribution from $\{\bar v_g\}$, the within-group averages of the composite errors $v_{gm} = c_g + u_{gm}$. Because $c_g$ is present, new observations within a group do not add information about $\beta$ beyond how they affect the group average $\bar y_g$.

If we add strong assumptions, we can solve the inference problem. Specifically, if we assume $u_{gm} \sim N(0, \sigma_u^2)$ and $M_g = M$ for all $g$, then $\bar v_g \sim N(0, \sigma_c^2 + \sigma_u^2 / M)$. Hence
\[
\bar y_g = \alpha + \beta x_g + \bar v_g
\]
satisfies the classical linear model assumptions, and we base inference on the $t_{G-2}$ distribution (note that $M_1 + \cdots + M_G - 2$ is not the correct number of degrees of freedom). If the common group size $M$ is large, then we can use a large-sample approximation to treat $\bar u_g$ as approximately normal. Further, even if group sizes differ, if $M_g$ is large for all groups, then
\[
\operatorname{Var}(\bar v_g) = \sigma_c^2 + \frac{\sigma_u^2}{M_g}
\]
will be dominated by the first term, and the approximation should work well (also if $\sigma_u^2$ is small). In essence, we are ignoring estimation error in $\bar y_g$ and analyzing the simple regression
\[
\mu_g = \alpha + \beta x_g + c_g,
\]
where $\mu_g$ denotes the population mean of group $g$ and we use $\bar y_g$ in place of $\mu_g$. This is very close to a standard check: estimate the model both with individual data and with cluster averages. With the cluster averages we lose efficiency, but we do not need to make standard errors robust to within-cluster correlation.

The main point is that the above regression allows for conservative inference, as long as cluster sizes are large and cluster effects are normal. For small $G$ and large $M_g$, inference will be very conservative if cluster effects are not present. While this may be desirable, it rules out some widely used tools for policy analysis. Return to our comparison of mean levels across two groups (perhaps the treated and the control). Under random sampling and normality, the t-statistic for the difference in means between the two groups usually has $M_1 + M_2 - 2$ degrees of freedom. With even moderate group sizes we can relax normality, allow for different group variances, and still conduct accurate inference. But in the above setup we cannot conduct a difference-in-means analysis, because $G = 2$ leaves $G - 2 = 0$ degrees of freedom. Such analysis was used to criticize Card and Krueger, because they failed to account for the state effect $c_g$ in the composite error term $v_{gm}$. But this is close to the common issue with difference-in-differences estimators, namely how to know whether the observed effect is entirely due to the policy change. Perhaps $c_g$ is part of the effect to be estimated. Consider the following example.
Over the summer a school district with two high schools, A and B, decides to provide computers to students at school B who have just finished their first year. The announcement is made just prior to the start of the school year, so students cannot switch high schools. The response is the change in a standardized test score given to these students. If the students are randomly sampled, then a comparison of means should be accurate. Of course there may be other confounding factors; for example, the average increase in test scores at school B might have been higher anyway.
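The small-$G$ inference problem of Section 3.3 can also be seen numerically. The sketch below is illustrative only and assumes equal group sizes ($G = 10$, $M = 200$), normal $c_g$ and $u_{gm}$, and hypothetical names (`simple_ols`, `b_pooled`, `b_between`). It runs the regression once on the individual data and once on the group averages, and reports the slope, the conventional standard error, and the residual degrees of freedom for each.

```python
import math
import random


def simple_ols(x, y):
    """Regress y on (1, x); return (slope, conventional SE, residual df)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    df = n - 2
    se = math.sqrt(ssr / df / sxx)
    return slope, se, df


rng = random.Random(2)
G, M = 10, 200                 # few groups, many observations per group
alpha, beta = 1.0, 0.5         # assumed true parameters

x_ind, y_ind = [], []          # individual-level data
x_grp, ybar_grp = [], []       # group-level data (averages)
for _ in range(G):
    x_g = rng.gauss(0.0, 1.0)  # group-level regressor
    c_g = rng.gauss(0.0, 1.0)  # cluster effect
    ys = [alpha + beta * x_g + c_g + rng.gauss(0.0, 1.0) for _ in range(M)]
    x_ind += [x_g] * M
    y_ind += ys
    x_grp.append(x_g)
    ybar_grp.append(sum(ys) / M)

b_pooled, se_pooled, df_pooled = simple_ols(x_ind, y_ind)
b_between, se_between, df_between = simple_ols(x_grp, ybar_grp)

print(f"pooled:  slope={b_pooled:.4f}  se={se_pooled:.4f}  df={df_pooled}")
print(f"between: slope={b_between:.4f}  se={se_between:.4f}  df={df_between}")
```

With equal group sizes the two slopes coincide up to rounding, but the pooled regression reports a much smaller standard error and $GM - 2$ degrees of freedom, while the between-groups regression has only $G - 2$. The second set of numbers is the one consistent with the $t_{G-2}$ argument above.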