1. Economics 245A: Cluster Sampling & Matching

Cluster sampling arises in a number of contexts. For example, consider a study of retirement saving. It is likely the case that retirement saving for employees within a firm will be correlated, because of common features of the firm (such as the type of retirement plan) or because of common (often unobserved) characteristics of employees within a firm. Each firm represents a group, or cluster, and we may sample several workers from each of a large number of firms. Other examples might be a study of teenage peer effects, in which we have a few teenagers from each of a large number of neighborhoods (the neighborhoods are the clusters) or high schools, or a study of siblings in a large sample of families (families are the clusters). The key is that we sample a large number of clusters and each cluster consists of a relatively small number of observations compared with the overall sample size. We allow the units within a cluster to be correlated, but we assume independence across clusters.

2. Matched Pairs

Let us begin with a study of siblings in a large sample of families. The idea is to use siblings to control for unobserved family background. Our thought experiment is to have two identical individuals, for whom we vary one exogenous effect. We attempt to capture our two identical individuals by studying siblings. For each family i there are two siblings:

y_{i1} = x_{i1}\beta + f_i + u_{i1}
y_{i2} = x_{i2}\beta + f_i + u_{i2},

where the equations are for siblings 1 and 2 and f_i is an unobserved family effect. The strict exogeneity assumption now implies that the error u_{is} in each sibling's equation is uncorrelated with the explanatory variables in both equations. For example, let y be log(wage) and let x contain years of schooling. Then we must assume that a sibling's schooling has no effect on wages once we control for own schooling, the family effect, and other observed covariates.
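To see why the family effect matters, here is a small simulation of the sibling model above (a sketch using numpy only; all parameter values are illustrative assumptions, not from the text). The family effect f_i is built to be positively correlated with schooling, so pooled OLS that ignores f_i is biased upward (around 1.5 under these parameter choices, versus a true beta of 1).

```python
import numpy as np

# Illustrative simulation of the sibling model y_is = x_is*beta + f_i + u_is.
# All parameter values are assumptions chosen for illustration.
rng = np.random.default_rng(0)
n_families, beta = 5000, 1.0

f = rng.normal(size=n_families)                    # unobserved family effect f_i
x = f[:, None] + rng.normal(size=(n_families, 2))  # sibling schooling, correlated with f_i
u = rng.normal(size=(n_families, 2))               # idiosyncratic errors u_is
y = beta * x + f[:, None] + u                      # y_is = x_is*beta + f_i + u_is

# Pooled OLS of y on (1, x), ignoring the family effect
X = np.column_stack([np.ones(2 * n_families), x.ravel()])
slope = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]
print(f"pooled OLS slope: {slope:.2f} (true beta = {beta})")
```

Here cov(x, f) = 1 and var(x) = 2, so the OLS slope converges to beta + 1/2 = 1.5 rather than 1.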
If f_i is assumed to be uncorrelated with x_{i1} and x_{i2}, then random effects analysis can be used.
More commonly, f_i is assumed to be correlated with x_{i1} and x_{i2}, in which case differencing across siblings to remove f_i is the appropriate strategy. Under this strategy, x cannot contain common observable family background variables, as these are indistinguishable from f_i. Standard OLS and IV estimators can be applied directly to the differenced equation

y_{i1} - y_{i2} = (x_{i1} - x_{i2})\beta + (u_{i1} - u_{i2}).

3. General Cluster Samples

Matched pairs are a special case of a cluster sample. As noted above, observations within a cluster are thought to be correlated due to an unobserved cluster effect. Suppose we model the retirement saving of individual m in cluster (firm) g as

y_{gm} = \alpha + x_g\beta + z_{gm}\gamma + v_{gm},

where x_g are explanatory variables that vary only at the firm level (i.e., firm characteristics), z_{gm} are explanatory variables that vary within (and across) firms (that is, they vary at the employee level), there are G clusters, and M_g observations within each cluster (so there are different numbers of employees sampled from each firm).

3.1. Cluster Intercept.

A simple starting point, which is surprisingly flexible, is to let x_g consist only of a constant term, so that each firm has its own mean level of saving:

y_{gm} = c_g + z_{gm}\gamma + v_{gm}.

A (standard) strict exogeneity assumption requires that the error v_{gm} be uncorrelated with the explanatory variables z_{gm} for all individuals from cluster g. That is, the error for one employee in a firm must be uncorrelated with z for all other employees within the same firm. The cluster effect c_g usually renders this assumption plausible. If we assume that c_g is uncorrelated with z_{gm} (that is, the differences in average retirement saving across firms are not related to the characteristics of the employees within firms), then pooled OLS is consistent. If we allow for correlation between c_g and z_{gm}, then we demean within clusters to remove the cluster effect and then use pooled OLS (or IV) on the demeaned data.

3.2. General Cluster: Large Group Asymptotics.
Logic: from a population of clusters, we randomly draw G clusters, where cluster g has M_g observations. It should be the case that G is sufficiently large relative to M_g that we can allow for unrestricted correlation within clusters.
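Allowing unrestricted within-cluster correlation in this way is what the cluster-robust ("sandwich") variance estimator for pooled OLS delivers. The following is a minimal numpy sketch on simulated data (all parameter values are assumptions for illustration):

```python
import numpy as np

# Pooled OLS with a cluster-robust sandwich variance estimator:
# (sum_g W_g'W_g)^{-1} (sum_g W_g' vhat_g vhat_g' W_g) (sum_g W_g'W_g)^{-1}.
# Simulated data; all parameter values are assumptions.
rng = np.random.default_rng(1)
G, beta = 200, 0.5
sizes = rng.integers(3, 8, size=G)            # unequal cluster sizes M_g
cluster = np.repeat(np.arange(G), sizes)      # cluster id for each observation
n = cluster.size

z = rng.normal(size=n)                        # individual-level covariate z_gm
c = rng.normal(size=G)[cluster]               # cluster effect c_g
v = c + rng.normal(size=n)                    # composite error v_gm = c_g + u_gm
y = beta * z + v

W = np.column_stack([np.ones(n), z])          # regressor matrix (constant, z)
delta = np.linalg.solve(W.T @ W, W.T @ y)     # pooled OLS coefficients
resid = y - W @ delta

bread = np.linalg.inv(W.T @ W)
meat = np.zeros((2, 2))
for g in range(G):                            # accumulate W_g' vhat_g vhat_g' W_g
    idx = cluster == g
    score = W[idx].T @ resid[idx]
    meat += np.outer(score, score)
V_cluster = bread @ meat @ bread
se = np.sqrt(V_cluster[1, 1])
print(f"slope: {delta[1]:.3f}, cluster-robust SE: {se:.3f}")
```

The residuals within each cluster are kept together, so arbitrary correlation and heteroskedasticity inside a cluster are permitted; only independence across the G clusters is used.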
We first assume

E(v_{gm} | x_g, z_{gm}) = 0,  m = 1, ..., M_g,  g = 1, ..., G.

Note, we could replace this assumption with a weaker assumption, requiring only that the variables be uncorrelated. Note also that this is a weaker assumption than made above, in that we only require the error v_{gm} to be uncorrelated with z_{gm}; hence the error for one employee may be correlated with the explanatory variables for other employees. Under this assumption the pooled OLS estimator is consistent as the number of groups grows (G → ∞) with the group sizes fixed (M_g fixed), and the estimator is \sqrt{G}-asymptotically normal. To construct a robust variance estimator, note that v_{gm} is likely correlated across individuals within a cluster and the variance may also vary across individuals (conditional heteroskedasticity), so we write the model at the group level: with y_g = (y_{g1}, ..., y_{gM_g})',

y_g = W_g\delta + v_g,

where W_g is the M_g \times (1 + K + L) matrix of all regressors. The robust standard errors are obtained from

(\sum_g W_g' W_g)^{-1} (\sum_g W_g' \hat{v}_g \hat{v}_g' W_g) (\sum_g W_g' W_g)^{-1},

where \hat{v}_g is the M_g \times 1 vector of residuals from the pooled OLS regression.

3.2.1. GLS.

The pooled OLS estimator ignores the within-cluster correlation of v_{gm}. To take advantage of it, we must strengthen the exogeneity assumption to

E(v_{gm} | x_g, Z_g) = 0,

where Z_g is the M_g \times L matrix of individual covariates for cluster g. Thus we return to the assumption under which the error for an individual is exogenous with respect to the covariates for all other individuals. With this assumption, we write the error as

v_{gm} = c_g + u_{gm}.
(In statistics, this equation in combination with the original linear model specification is termed a hierarchical linear model.) With Var(c_g) = \sigma_c^2 and Var(u_{gm}) = \sigma_u^2, the resulting covariance matrix of the error vector v_g is the M_g \times M_g matrix

Var(v_g) = \begin{pmatrix}
\sigma_c^2 + \sigma_u^2 & \sigma_c^2 & \cdots & \sigma_c^2 \\
\sigma_c^2 & \sigma_c^2 + \sigma_u^2 & \cdots & \sigma_c^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_c^2 & \sigma_c^2 & \cdots & \sigma_c^2 + \sigma_u^2
\end{pmatrix}.

While we typically assume that Var(v_g) = Var(v_g | x_g, Z_g), so that we have conditional homoskedasticity, we can still gain efficiency by using GLS. We then estimate the model via feasible GLS, using a consistent estimator of the covariance matrix.

3.3. Large Group Size Asymptotics.

Logic: we stratify the population into G groups and then sample randomly M_g times from each group. For example, Card and Krueger have G = 2 states (NJ and PA), Bound has G = 34, and using all states would give G = 50. To understand the pitfalls of applying standard analysis with small G, consider the case in which x_g is scalar and z_{gm} is not present:

y_{gm} = \alpha + \beta x_g + c_g + u_{gm},

where c_g and u_{gm} are independent of x_g and {u_{gm}} is iid with mean zero for each g. If c_g is absent from the model, then pooled OLS is consistent and inference is straightforward. If Var(u_{gm}) is constant across g, then standard OLS t-statistics are correct. If we allow for heteroskedasticity, then we simply use the Eicker-White correction (or feasible GLS, as we have multiple observations on each cluster). With cluster effects, the analysis is quite different. Let c_g ~ N(0, \sigma_c^2), which we assume to be independent of {u_{gm}}. The pooled OLS estimator \hat{\beta} is identical to the regression of \bar{y}_g on 1, x_g for g = 1, ..., G. (This is sometimes referred to as the between-groups estimator.) Conditional on x_g, \hat{\beta} inherits its distribution from {\bar{v}_g}, the within-group averages of the composite errors v_{gm} = c_g + u_{gm}. Because c_g is present, new observations do not add information about \beta beyond how they affect the group average \bar{y}_g. If we add strong assumptions, we can solve the inference problem.
Specifically, if we assume u_{gm} ~ N(0, \sigma_u^2) and M_g = M for all g, then \bar{v}_g ~ N(0, \sigma_c^2 + \sigma_u^2 / M). Hence

\bar{y}_g = \alpha + \beta x_g + \bar{v}_g
satisfies the classical linear model assumptions, and we conduct inference using the t_{G-2} distribution (note that M_1 + ... + M_G - 2 is not the correct number of degrees of freedom). If the common group size M is large, then we can use a large-sample approximation to treat \bar{u}_g as approximately normal. Further, even if group sizes differ, if M_g is large for all groups, then

Var(\bar{v}_g) = \sigma_c^2 + \sigma_u^2 / M_g

will be dominated by the first term, and the approximation should work well (also if \sigma_u^2 is small). In essence, we are ignoring estimation error in \bar{y}_g and analyzing the simple regression

\bar{y}_g = \alpha + \beta x_g + c_g,

where we use the sample average \bar{y}_g in place of the population group mean. This is very close to a standard check: estimate the model both with individual data and with cluster averages. With the cluster averages we lose efficiency, but we do not need to make standard errors robust to within-cluster correlation. The main point is that the above regression allows for conservative inference, as long as cluster sizes are large and cluster effects are normal. For small G and large M_g, inference will be very conservative if cluster effects are not present. While this may be desirable, it rules out some widely used tools for policy analysis. Return to our comparison of mean levels across two groups (perhaps the treated and the controls). Under random sampling and normality, the difference in means between the two groups usually has M_1 + M_2 - 2 degrees of freedom. With even moderate group sizes we can relax normality and allow for different group variances and still conduct accurate inference. But in the above setup, we cannot conduct difference-in-means analysis because G = 2. Such analysis was used to criticize Card and Krueger, because they failed to account for the state effect c_g in the composite error term v_{gm}. But this is close to the common issue with difference-in-differences estimators, namely how to know whether the observed effect is all due to the policy change. Perhaps c_g is part of the effect to be estimated. Consider the following example.
Over the summer, a school district with two high schools, A and B, decides to provide computers to students at school B who have just finished their first year. The announcement is made just prior to the start of the school year, so students cannot switch high schools. The response is the change in a standardized test score given to these students. If the students are randomly sampled, then a comparison of means should be accurate. Of course, there may be other confounding factors; say, the average increase in test scores at school B would have been higher anyway.
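To make the group-average strategy of Section 3.3 concrete, here is a minimal numpy sketch of the between-groups regression: collapse the data to cluster means \bar{y}_g, regress them on (1, x_g), and base inference on G - 2 degrees of freedom (simulated data; all parameter values are assumptions). Note that with only two clusters, as in the school example, G - 2 = 0 and this regression leaves no degrees of freedom for inference, which is exactly the problem discussed above.

```python
import numpy as np

# Between-groups regression: OLS of cluster means on (1, x_g), G - 2 dof.
# Simulated data; all parameter values are assumptions for illustration.
rng = np.random.default_rng(2)
G, M = 20, 50                                  # G clusters, M units per cluster
alpha, beta = 1.0, 2.0
x = rng.normal(size=G)                         # cluster-level covariate x_g
c = 0.5 * rng.normal(size=G)                   # cluster effects c_g
u = rng.normal(size=(G, M))                    # individual errors u_gm
y = alpha + beta * x[:, None] + c[:, None] + u # y_gm = alpha + beta*x_g + c_g + u_gm

ybar = y.mean(axis=1)                          # cluster averages y_bar_g
X = np.column_stack([np.ones(G), x])
coef, ss_res = np.linalg.lstsq(X, ybar, rcond=None)[:2]
s2 = ss_res[0] / (G - 2)                       # error variance on G - 2 dof
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
print(f"beta_hat = {coef[1]:.2f}, SE = {se:.3f} (compare t with {G - 2} dof)")
```

Because the regression uses only the G cluster averages, no within-cluster correction is needed; the cluster effect c_g is simply part of the group-level error.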