Panel Data Analysis in Stata

Anton Parlow
Lab session Econ710, UWM Econ Department
??/??/2010 (or in an S-Bahn in Berlin, you never know...)

Our plan

- Introduction to Panel data
- Fixed vs. random effects
- Testing for fixed effects
- Testing for random effects
- Fixed or random effect?
- Example: Gravity model
- Exercises

Introduction to Panel data

Panel data are cross-sectional data observed repeatedly over time, e.g. we observe the same households, firms or countries over several years. Panel data are also known as longitudinal data. In general:

$$y_{it} = \alpha + \beta x_{it} + \varepsilon_{it}$$

where $i = 1, \dots, N$ indexes the cross-sectional observations and $t = 1, \dots, T$ the years.

Panel data have the following advantages over pooled data (Baltagi 2004):
(1) they account for heterogeneity across individual units, which is assumed away in pooled data
(2) they can deal with time-invariant omitted variables, which remain a problem in pooled data
(3) they are less likely to have problems with autocorrelation and multicollinearity than time series data

Arellano (2003) emphasizes (1) as the main advantage of using panel data.

There are basically two types of panel models, the fixed effects and the random effects model. They differ in their assumptions about how the heterogeneity is captured and in their estimation techniques (fixed = OLS, random = GLS).

Fixed vs. random effects

The fixed effects model assumes that individual heterogeneity is captured by the intercept term. Every individual gets its own intercept $\alpha_i$ while the slope coefficients are the same for all individuals. The heterogeneity is allowed to be correlated with the regressors on the right-hand side. The fixed effects model is also known as the least squares dummy variable (LSDV) estimator because we effectively assign a dummy to every individual.

The random effects model assumes that the individual effects are captured by the intercept and a random component $\mu_i$. This random component is part of the error term and is not correlated with the regressors on the right-hand side. The intercept becomes $\alpha + \mu_i$, which is why some textbooks write that both models capture the heterogeneity through the intercept term.

The assumption of the random effects model that the individual effects are not correlated with the explanatory variables is a big one! But it allows us to estimate the effect of time-invariant variables, which cancel out in a fixed effects estimation.

Baltagi (and Hsiao) introduce both estimators as a one-way error component model. For both estimators the error term is $\varepsilon_{it} = \mu_i + v_{it}$, where $\mu_i$ captures the individual effect. In the fixed effects model $\mu_i$ is assumed to be a fixed parameter; in the random effects model it is stochastic and has its own distribution. In other words, in the fixed effects model the individual effects may be correlated with the regressors, whereas in the random effects model they are part of the error term and must be uncorrelated with the regressors.
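
The claim that time-invariant variables cancel out in a fixed effects estimation can be made explicit with the within transformation. The following is a sketch under stated assumptions: $z_i$ is a hypothetical time-invariant regressor with coefficient $\gamma$ (neither appears in the slides), and bars denote averages over $t$ for individual $i$.

```latex
% Model with a time-invariant regressor z_i (hypothetical, for illustration)
y_{it} = \alpha + \beta x_{it} + \gamma z_i + \mu_i + v_{it}

% Average over t = 1, ..., T for each individual i
\bar{y}_i = \alpha + \beta \bar{x}_i + \gamma z_i + \mu_i + \bar{v}_i

% Subtract the second line from the first (within transformation)
y_{it} - \bar{y}_i = \beta \, (x_{it} - \bar{x}_i) + (v_{it} - \bar{v}_i)
```

Everything that is constant over time ($\alpha$, $\gamma z_i$ and $\mu_i$) drops out, so $\beta$ can be estimated by OLS on the demeaned data, but $\gamma$ cannot be recovered.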

Fixed vs. random effects continued

The regression equation for the fixed effects model is

$$y_{it} = \alpha_i + \beta_1 X_{it} + \varepsilon_{it}$$

where $\varepsilon_{it} = \mu_i + v_{it}$ with $\mu_i = 0$, because the individual effect is already absorbed by the intercept $\alpha_i$. For the random effects model it is

$$y_{it} = \alpha + \beta_1 X_{it} + \varepsilon_{it}$$

where $\varepsilon_{it} = \mu_i + v_{it}$.

For the random effects model you need to use a GLS estimator, which is a weighted average of the between and within estimators. The weights tell you where the variation comes from, i.e. from within the individuals (over time) or from between the individuals. The LSDV estimator uses only the variation within the individuals. If you use only the variation between the groups you have the between estimator, which is still estimated by OLS. Let the random effects estimator be:

$$\hat{\beta}_{GLS} = \frac{W_{xy} + \Phi^2 B_{xy}}{W_{xx} + \Phi^2 B_{xx}}$$

where $W$ and $B$ denote the within and between variation and $\Phi^2$ is the weight on the between variation. The Stata output will tell you where the variation came from.
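
To see the within, between and random effects estimators side by side, here is a minimal sketch meant to be run as a do-file. The data are simulated, and all variable names and numbers are hypothetical rather than taken from the lab dataset. It also checks that `xtreg, fe` and an explicit dummy variable regression give the same slope.

```stata
* Simulated balanced panel (all names and numbers are hypothetical)
clear
set seed 12345
set obs 50                              // N = 50 units
generate id = _n
generate double mu = rnormal()          // individual effect
expand 5                                // T = 5 periods per unit
bysort id: generate year = 1987 + _n
generate double x = rnormal() + 0.5*mu  // regressor correlated with mu
generate double y = 1 + 2*x + mu + rnormal()
xtset id year

* Within and between variation of each variable
xtsum y x

* Within (fixed effects) estimator
quietly xtreg y x, fe
scalar b_fe = _b[x]

* LSDV: pooled OLS with one dummy per unit
quietly regress y x i.id
scalar b_lsdv = _b[x]
display "within slope = " b_fe "   LSDV slope = " b_lsdv

* Between estimator and random effects (GLS) estimator
xtreg y x, be
xtreg y x, re
```

The header of each `xtreg` output reports the within, between and overall R-squared, which is one place to read off where the variation comes from.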

Testing for fixed effects

Testing for fixed effects involves an F-test comparing the pooled OLS results with the results from the LSDV estimation. Pooled OLS is the restricted model, and if we reject $H_0$, fixed effects are present. The F-test has the following form:

$$F = \frac{(RSS - URSS)/(N-1)}{URSS/(NT-N-K)} \sim F_{N-1,\,N(T-1)-K}$$

where RSS is the residual sum of squares of the restricted (pooled) model, URSS that of the unrestricted (LSDV) model and $K$ the number of regressors. Don't worry, this test is part of the fixed effects output. It is still a nice exercise to do it by hand.
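
The F-statistic can be assembled from stored results. The sketch below uses simulated data with hypothetical names and numbers, not the lab dataset, and compares the hand-computed value with the one `xtreg, fe` stores in `e(F_f)`. It assumes a balanced panel and the default (non-robust) standard errors.

```stata
* Simulated balanced panel (all names and numbers are hypothetical)
clear
set seed 12345
set obs 50                              // N = 50 units
generate id = _n
generate double mu = rnormal()          // individual effect
expand 5                                // T = 5 periods per unit
bysort id: generate year = 1987 + _n
generate double x = rnormal() + 0.5*mu
generate double y = 1 + 2*x + mu + rnormal()
xtset id year

* Restricted model: pooled OLS
quietly regress y x
scalar rss_pooled = e(rss)

* Unrestricted model: fixed effects
quietly xtreg y x, fe
scalar rss_fe  = e(rss)
scalar n_units = e(N_g)                 // N
scalar n_obs   = e(N)                   // NT
scalar K       = 1                      // number of regressors in this sketch

* F = [(RSS - URSS)/(N-1)] / [URSS/(NT-N-K)]
scalar F_hand = ((rss_pooled - rss_fe)/(n_units - 1)) / (rss_fe/(n_obs - n_units - K))
display "F by hand = " F_hand "   F reported by xtreg = " e(F_f)
display "p-value   = " Ftail(n_units - 1, n_obs - n_units - K, F_hand)
```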

Testing for random effects

This involves a Lagrange multiplier test developed by Breusch and Pagan. Run after a random effects regression, it tests for the presence of random effects in the underlying pooled OLS. The null hypothesis is $H_0: \operatorname{var}(\mu) = 0$. Following Baltagi (2004), the test statistic is

$$\lambda_{LM} = \frac{NT}{2(T-1)} \left[ \frac{\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \varepsilon_{it} \right)^2}{\sum_{i=1}^{N} \sum_{t=1}^{T} \varepsilon_{it}^2} - 1 \right]^2 \sim \chi^2_1$$

where $\varepsilon_{it}$ are the pooled OLS residuals.

If we can reject the null, random effects are present (remember: with p < 0.05 you can reject the null at the 5% level). Does this mean random effects are more efficient than fixed effects if random effects are present? Not necessarily, but the Hausman specification test helps a bit to decide.
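
As an illustration of the Breusch-Pagan formula, the statistic can be computed step by step from pooled OLS residuals. This is a sketch on simulated data with hypothetical names and numbers; the value printed by `xttest0` after `xtreg, re` may differ in detail, since this block only follows the textbook formula for a balanced panel.

```stata
* Simulated balanced panel (all names and numbers are hypothetical)
clear
set seed 12345
set obs 50                              // N = 50 units
generate id = _n
generate double mu = rnormal()          // individual effect
expand 5                                // T = 5 periods per unit
bysort id: generate year = 1987 + _n
generate double x = rnormal() + 0.5*mu
generate double y = 1 + 2*x + mu + rnormal()

* Pooled OLS residuals
quietly regress y x
predict double ehat, residuals

* Numerator: squared sum of residuals per unit, counted once per unit
bysort id: egen double esum = total(ehat)
egen byte firstobs = tag(id)
generate double esum_sq = esum^2 if firstobs
quietly summarize esum_sq
scalar sum_group = r(sum)

* Denominator: sum of all squared residuals
generate double e_sq = ehat^2
quietly summarize e_sq
scalar sum_all = r(sum)

* LM = NT/(2(T-1)) * (numerator/denominator - 1)^2
scalar n_units   = 50
scalar n_periods = 5
scalar LM = (n_units*n_periods)/(2*(n_periods - 1)) * (sum_group/sum_all - 1)^2
display "LM = " LM "   p-value = " chi2tail(1, LM)
```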

Fixed or random effect?

The Hausman specification test is a very general test that can be used whenever two estimators are available for the same question. In our example these are the fixed and the random effects estimators. Under the null hypothesis both estimators are consistent, but the random effects estimator is more efficient (it uses fewer degrees of freedom). Under the alternative only the fixed effects estimator is consistent. If we reject the null we cannot use the random effects model.

The problem is that the Hausman test rejects the random effects model very often and does not work very well in small samples (Baum 2006). In the end it comes down to which model you think is more appropriate given your data and your question. In general the Hausman test looks like this (Hosny 2009):

$$H = \left[ \hat{\beta}_{FE} - \hat{\beta}_{RE} \right]' \left[ \operatorname{Var}(\hat{\beta}_{FE} - \hat{\beta}_{RE}) \right]^{-1} \left[ \hat{\beta}_{FE} - \hat{\beta}_{RE} \right] \sim \chi^2_{k-1}$$

where the degrees of freedom, written $k-1$ here, equal the number of coefficients being compared.

Example: Gravity model

Imagine someone gives you data on trade and conflicts between countries. Furthermore, he is very generous and also gives you GDP, per capita income and the actual distance between country pairs. You want to know whether conflicts affect trade negatively or, in other terms, whether trade somehow promotes peace. This is a big question and there is a big debate about it in political science.

Someone tells you that a gravity model is similar to the one in physics and that, using your variables, it could look like this:

$$\ln(\mathit{trade}_{ij}) = \beta_0 + \beta_1 \ln(\mathit{gdp}_i) + \beta_2 \ln(\mathit{gdp}_j) + \beta_3 \mathit{distance}_{ij} + \beta_4 \ln(\mathit{pci}_i) + \beta_5 \ln(\mathit{pci}_j) + \sum_{i=1}^{n} \gamma_i \mathit{at}_i + \sum_{j=1}^{n} \gamma_j \mathit{at}_j + \varepsilon_{ij}$$

where, who knows, maybe you should also add the country attributes $\mathit{at}_i$ and $\mathit{at}_j$. What would you use?

Now imagine there are many papers out there that just estimate pooled regression models in various forms. If we observe countries trading with each other over time, do you think these papers miss something by assuming that countries stay the same over time? You would likely say yes and want to use a panel estimation.

Example: Gravity model continued

Usually you have your cross-sections in annual form, meaning one data file for every year. If you want to use them together you have to merge them into one dataset. Before you can merge, every observation needs a unique identifier id, and every annual dataset also needs a variable indicating which year it is.

Example: You observe trade between the US and Germany over 2 years. This is the same trade relationship, so give it the id number 1 (id = 1) in both years. Imagine you observe them in 1988 and 1989. The year variable takes the value 1988 in 1988 (!) and of course 1989 in 1989 (!). The identifier variable allows you to follow this trade relationship over time in a panel.

Before you can merge datasets, they have to be sorted individually. Open every dataset and sort it, e.g. for your 1989 dataset use:

    sort id year

and do the same for every following year. Then open the first dataset and use the following command to merge another one to it:

    merge id year using <location and name of the other dataset>.dta

Sort again (and drop the _merge variable that Stata creates) before you merge the next one to it. Repeat until you have all years merged into one dataset (= your panel).
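
The merging steps above can be sketched with two tiny made-up annual files. The `trade` values and the temporary file names are hypothetical. The sketch uses the `merge 1:1` syntax of Stata 11 and later in place of the older `merge id year using` form.

```stata
* Two tiny annual cross-sections (values are made up for illustration)
clear
input id year trade
1 1988 10
2 1988 20
end
sort id year
tempfile y1988
save `y1988'

clear
input id year trade
1 1989 12
2 1989 22
end
sort id year
tempfile y1989
save `y1989'

* Open the first year and merge the second one to it
use `y1988', clear
merge 1:1 id year using `y1989'
drop _merge                    // must be dropped before the next merge
sort id year

* Declare the panel structure
xtset id year
list, sepby(id)
```

Because each annual file holds a different year, no (id, year) pair matches across files, so `append using` would stack the files equally well.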

Example: Gravity model continued

Tell Stata that you want to use the data as a panel:

    xtset id year

If Stata reports that your panel is strongly balanced, you don't have to worry about unbalanced panels.

Let us run a simple pooled OLS:

    reg ldyt conflict_ij lrgdp_i lrgdp_j lpci_i lpci_j ldist1 distance2

and compare it to a fixed effects estimation (= LSDV):

    xtreg ldyt conflict_ij lrgdp_i lrgdp_j lpci_i lpci_j ldist1 distance2, fe

Stata uses xt-commands for panel models, and the option fe tells Stata to estimate a fixed effects model.

Look at the attached output! At the bottom you see the F-test for pooled OLS vs. fixed effects. You should be able to reject the null, so we can conclude that fixed effects are present.

Example: Gravity model continued

Now let us estimate a random effects model:

    xtreg ldyt conflict_ij lrgdp_i lrgdp_j lpci_i lpci_j ldist1 distance2, re

Only the option changed, to re. If you don't specify an option, Stata assumes a random effects model anyway.

Let us test for random effects in the underlying pooled OLS using the random effects regression results. The Breusch-Pagan test has the following command:

    xttest0

You should be able to reject the null, i.e. you should find random effects!

Example: Gravity model continued

Finally, let us do a Hausman specification test of fixed against random effects. We have to use the estimates of both the fixed and the random effects model.

    xtreg ldyt conflict_ij lrgdp_i lrgdp_j lpci_i lpci_j ldist1 distance2, fe
    est store fixed

which saves the results in fixed,

    xtreg ldyt conflict_ij lrgdp_i lrgdp_j lpci_i lpci_j ldist1 distance2, re
    est store random

which saves the results in random, and finally

    hausman fixed random

where the second model is the one you think is more efficient. We should be able to reject the null and conclude that the random effects estimator is inconsistent, so the fixed effects model is the one to use.
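
The same sequence can be tried without the lab dataset. The sketch below simulates a panel in which the regressor is correlated with the individual effect, so the random effects assumption fails by construction. All names and numbers are hypothetical.

```stata
* Simulated balanced panel (all names and numbers are hypothetical)
clear
set seed 12345
set obs 50                              // N = 50 units
generate id = _n
generate double mu = rnormal()          // individual effect
expand 5                                // T = 5 periods per unit
bysort id: generate year = 1987 + _n
generate double x = rnormal() + 0.5*mu  // correlated with mu by construction
generate double y = 1 + 2*x + mu + rnormal()
xtset id year

* Store both sets of estimates, then compare them
quietly xtreg y x, fe
estimates store fixed
quietly xtreg y x, re
estimates store random

* Consistent estimator first, efficient estimator second
hausman fixed random
display "chi2 = " r(chi2) "   df = " r(df) "   p = " r(p)
```

A small p-value here would point towards the fixed effects model, which matches how the data were generated.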

Exercises

Estimate the pooled regression above. Estimate the fixed effects model again. Use the RSS from the pooled and the fixed effects regressions to compute the F-test by hand.

Hint: use the help for the xtreg command to figure out how to find the RSS from the fixed effects model. (Okay: use display e(rss) after the regression.)