Introduction to Panel Data Analysis

Transcription

1 Introduction to Panel Data Analysis Oliver Lipps / Ursina Kuhn Swiss Centre of Expertise in the Social Sciences (FORS) c/o University of Lausanne Lugano Summer School, August

2 Introduction panel data, data management 1 Introducing panel data (OL) 2 The SHP (OL) 3 Data Management with Stata (UK) Regressions with panel data: basic 4 Regression refresher (UK) 5 Fixed effects (FE) models (OL) 6 Introducing random effects (RE) models (OL) 7 Nonlinear regression (UK) Start: 8.3 Breaks: Lunch Breaks: End: 17.3 Additional topics 8 Level 1 and 2 growth models (OL) 9 Missing data (OL) 1 Dynamic models (UK)

3 1 Introducing panel data

4 Surveys over time: repeated cross-sections vs. panels Cross-Section: Survey conducted at several points in time ( rounds ) using different sample members in each round Panel: Survey conducted at several points in time ( waves ) using the same individuals over waves -> panel data mostly from panel surveys -> If from cross-sectional surveys: retrospective ( biographical ) questionnaire 1-4

5 Panel Surveys: to distinguish Length and sample size: Time Series: N small (mostly=1), T large (T ) time series models Panel Surveys: N large, T small (N ) social science panel surveys Sample General population: - rotating: only few (pre-defined number) waves per individual (in CH: SILC, LFS) - indefinitely long (in CH: SHP) Special population: - e.g., age/birth cohorts (in CH e.g.: TREE, SHARE, COCON) representative for population of special agegroup / birthyears 1-5

6 Panel surveys increasingly important Changing focus in social sciences Life course research: effects of events within individuals Large investments in social science household panels surveys, high data quality! Concern about causality in cross-sectional studies Analysis potential of panel data - close to experimental design: before and after studies - control unobserved individual characteristics (exogenous independent variables; implicit in regression analyses) 1-6

7 Identification of individual dynamics (poverty in SHP) 1% 96% 4% 89.5% 91.% 92.3% 91.2% 9.6% 5% 52% 49% 46% % 5% 48% 51% 54% 1.5% 9.% 7.7% 8.8% 9.4% poor not poor -> individual dynamics can only be measured with panel data! 1-7

8 Identification of age, time, and (birth) cohort effects Fundamental relationship: a it = t - c i Effects from formative years (childhood, youth) -> cohort effect (eg taste in music ) Time may affect behavior -> time effect (eg computer performance) Behavior may change with age -> age effect (eg health) In a cross-section, t is constant age and cohort collinear (only joint effect estimable) In a cohort study, cohort is constant age and time collinear (only joint effect estimable) In a panel, t varies, but A it, t, and c i collinear. only two of the three effects can be estimated we can use (t,c i ), (A it,c i ), or (A it,t), but not all three 1-8

9 Age, time, cohort effects: interpretation (no cohort effect) (aging effect) (age effect) (no aging effect) (no age effect) (aging and cohort effect) 1-9

10 Problems of panels Fieldwork / data quality related High costs (panel care, tracking households, incentives) Initial nonresponse (wave 1) and attrition (=drop-out of panel after wave 1) Finally: you design a panel for the next generation Modeling related Sometimes strong assumptions for applicability of appropriate models necessary (later) 1-1

11 Advantages of panel data Allow to study individual dynamics + Control for unobserved characteristics between individuals by repeated observation + Higher precision of change + Data can be pooled 1-11

12 2 Introducing the Swiss Household Panel (SHP)

13 Swiss Household Panel: history Primary goal: observe social change and changing life conditions in Switzerland Started as a common project of the Swiss National Science Foundation, Swiss Federal Statistical Office, University of Neuchâtel First wave in 1999, more than 5, households Refreshment sample in 24, more than 2,5 households, several new questions Since 28, integrated into FORS (Swiss Centre of Expertise in the Social Sciences ), c/o University of Lausanne 2-2

14 Disciplines working with SHP data 2-3

15 SHP sample and methods Representative of the Swiss residential population Each individual surveyed every year (Sept.-Jan.) All household members from 14 years on surveyed (proxy questionnaire if child or unable) Telephone interviews (central CATI), languages D/F/I Metadata: biographic, interviewers Paradata: call data (from address management) Following rules: OSM followed if moving, from 27 on all individuals All new household entrants surveyed 2-4

16 SHP sample size (households) and attrition Befragte Haushalte SHPI SHP II Cross-sectional and longitudinal (SHPI and SHP I+II combined) weights to account for nonresponse and attrition 2-5

17 SHP: Survey process and questionnaires Grid Questionnaire: Inventory and characteristics of hh-members Persons 18+ years «reference person» Persons 14+ years Persons 13- years + «unable to respond» Household Questionnaire: housing, finances, family roles, Individual Questionnaires: work, income, health, politics, leisure, satisfaction of life Individual Proxy Questionnaires: school, work, income, health, 2-6

18 SHP: Questionnaire Content Social structure: socio-demography, socio-ecomomy, work, education social stratification and social mobility Life events: marriages, births, deaths, deceases, accidents, conflicts with close persons, etc. life course Social participation: politics (attitudes, elections, party preferences and -choice), culture, social network, leisure social integration, political attitude and behavior Perception and values: trust, confidence, gender values and social capital Satisfaction and health: physical and mental health selfevaluation, chronic problems, different satisfaction issues quality of life 2-7

19 SHP Household: composition and housing «objective» elements Characteristics of household members (sex, age, civil status, education, occupation) Relationships between all household members «subjective» elements Satisfaction with house, noise, pollution, etc. Assessment of state of house Since when at this place Type of house Size, number of rooms of house State of house (heating, noise, pollution, etc.) Costs and subsidy, etc. External help for domestic work Child care Division of labor Who takes decisions, etc. 2-8

20 SHP Household: standard of living «objective» elements Activities and (durable) goods Reason why absence of goods (financial, other) «subjective» elements Satisfaction with financial situation Financial difficulties Debts (+reason) Total household income Taxes Social and private financial transfers 2-9

21 SHP Individual level: family «objective» elements Children out of the house Division of housework, care for dependents Disagreement about family problems etc. «subjective» elements Satisfaction with private situation Satisfaction with living alone or together Satisfaction with division of housework SHP Individual level: health, well-being «objective» elements Health problems Physical activities Doctor visits, hospitalization Improvement of health Long and short term handicaps «subjective» elements Subjective health state Satisfaction with health Satisfaction with life in general 2-1

22 Profession of parents Level of education of parents Nationality of parents Financial problems in childhood Social origin and education Education level Current training Language capabilities Leisure «objective» elements Activities: holidays, invitation of and meeting friends, reading, Internet use, restaurant, etc. «subjective» elements Satisfaction with leisure time and leisure activities Satisfaction with work-life balance 2-11

23 Individual level work «objective» elements Job sector Social stratification Private or public Position Working time, commuting time Size of company «subjective» elements Satisfaction with work (general, income, interests, working conditions, amount, atmosphere) Risk of unemployment Job security Chances to get promoted Income «objective» elements Total personal income Total personal income from work Social transfers received Private transfers received Other income «subjective» elements Satisfaction with financial situation Assessment whether financial situation improved or not Possible reactions on financial problems 2-12

24 Values and politics «objective» elements Right to vote Political activities Member of a political party «subjective» elements Satisfaction with democracy Confidence in federal government Political interest Left-right political positioning Opinions on political questions Participation, integration, social network «objective» elements Frequency of contacts Voluntary work outside the household Participation and membership in associations Belief and religious participation «subjective» elements Satisfaction with personal relationships Assessment of amount of practical help received from partner, parents, friends, etc. Assessment of amount of emotional help received from partner, parents, friends, etc. General trust in people 2-13

25 Biographical (retrospective) questionnaire N = 5 56 Written questionnaire in 21/22, sent to all individuals surveyed in 2, aged 14 or over Questions since birth about family, education, and professional biography: - with whom lived together - periods out of Switzerland - changes of civil status - learned professions - education - professional and non-professional biography - family life events (divorce-re-marriage of parents) 2-14

26 International Context SHP is part of the Cross National Equivalent File (CNEF): General population panel surveys from: USA (PSID since 198) D (SOEP, since 1984) UK (BHPS since 1991) Canada (SLID since 1993) CH (SHP since 1999) Australia (HILDA since 21) Korea (KLIPS since 27) More countries will join (Russia, South Africa, ) Each panel includes subset of all variables (variables from original files can be merged) Variables ex-post harmonized, names, categories Missing income variables are imputed Frick, Jenkins, Lillard, Lipps and Wooden (27): The Cross-National Equivalent File (CNEF) and its member country household panel studies. Journal of Applied Social Science Studies (Schmollers Jahrbuch) 2-15

27 SHP Questionnaire: Rotation Module Social network X X X X Religion X X X Social participation X X X X Politics X X X X Leisure X X X X Psychological Scales X X X 2-16

28 Outlook: new sample (LIVES) SHP III (213, based on individual register) Biographic questionnaire in 1. Wave SHP III (with NCCR LIVES) NCCR LIVES (Precarious) life course University of Lausanne and Geneva 15 research projects, 12 years Use of SHP 2-17

29 SHP structure of the data 2 yearly files (currently available: (+beta 211)) household Individual 5 unique files master person (mp) master household (mh) social origin (so) last job (lj) activity (employment) calendar (ca) Complementary files biographical questionnaire Interviewer data (2, and yearly since 23) Call data (since 25) CNEF SHP data variables 2-18

30 Documentation (Website: D/E/F) e.g.,: User Guide Questionnaires Variable Search (by variable name and topic) Construction of variables Syntax examples - Merge data files with SPSS, SAS, Stata - Documentation Data Management with SHP 2-19

31 SHP data delivery Data ready about 1 year after end of fieldwork downloadable from SHP-server: Signed contract with FORS Upon contract receipt, login and password sent by Data free of charge Users become member of SHP scientific network and document all publications based on SHP data Data on request: Imputed income Call data Interviewer matching ID Context data (special contract); data is matched at FORS More info: 2-2

32 3 Stata and panel data

33 Why Stata? Capabilities Data management Broad range of statistics Powerful for panel data! Many commands ready for analysis User-written extensions Beginners and experienced users For beginners: analysis through menus (point and click) Advanced users: good programmable capacities 3_2

34 Starting with Stata Basics Look at the data, variables Descriptive statistics Regression analysis Handout Stata basics Working with panel data Merge Creating «long files» Working with the long file Add information from other household members Handout data management of SHP with Stata (includes Syntax examples, exercises) 3_3

35 1. Merge: _merge variable Master file idpers p7c idpers p8c using file idpers p7c44 p8c _merge Merge variable 1 only in master file 2 only in user file 3 in both files 3_4

36 Merge: identifier Master file idpers p7c idpers p8c using file idpers p7c44 p8c _merge _5

37 Merge files: identifiers filename identifiers Individual master file shp_mp idpers, idhous$$, idfath, idmoth Individual annual files shp$$_p_user idpers, idint, idhous$$, idspou, refper$$ Additional ind. files (Social origin, last job, calendar, biographic) shp_so, shp_lj shp_ca, shp_* idpers Interviewer data shp$$_v_user idint Household annual files shp$$_h_user Biographic files idhous$$, refpers, idint, canton$$, (gdenr) idpers CNEF files shpequiv_$$$$ x1111ll (=idpers) 3_6

38 Stata merge command The merge command merge [type] [varlist] using filename [filename...] [, options] varlist filename identifier(s), e.g. idpers data set to be merged type 1:1 each observation has a unique identifier in both data sets 1:m, m:1 some observations have the same identifier in one data set 3_7

39 2 annual individual files Basic merge example I use shp8_p_user, clear merge 1:1 idpers using shp_p_user _merge Freq. Percent Cum , , , Total 16, _8

40 Basic merge example II annual individual file and individual master file use shp8_p_user, clear // opens the file (master) count //there are cases merge 1:1 idpers using shp_mp //identif. & using file tab _merge _merge Freq. Percent Cum , , Total 22, 1. drop if _merge==2 //if only ind. from 28 wanted drop _merge 3_9

41 Basic merge example III annual individual file and annual household file use shp8_p_user, clear //master file merge m:1 idhous8 using shp8_h_user /*identifier & using file */ _merge Freq. Percent Cum , Total 1, _1

42 More on merge Options of merge command keepusing (varlist): selection of variables from using file keep: selection of observations from master and/or using file for more options: type help merge Merge many files loops (see handout) Create partner files (see handout) 3_11

43 2. Wide and long format Wide format idpers i4empyn i5empyn i6empyn i7empyn Long format (person-period-file) idpers year iempyn _12

44 Use of long data format in stata All panel applications: xt commands descriptives panel data models fixed effects models, random effects, multilevel discrete time event-history analysis declare panel structure panel identifier, time identifier xtset idpers wave 3_13

45 Convert wide form to long form reshape long command in stata reshape long varlist, i(idpers) j(wave) But: stata does not automatically detect years in varname reshape long /// i (idpers) /// j(wave "99" "" "1" "2" "3" "4" /// "5" "6" "7" "8" ),atwl () 3_14

46 Create a long file with append 1. Modify datasets for each wave idpers i99wyn idpers wave iwyn temp1.dta 2. Stack data sets use temp1, clear forval y = 2/1 { append using temp`y' } idpers i99wyn idpers wave iwyn temp2.dta 3_15

47 Work with time lags If data in long format and defined as panel data (xtset) l. indicates time lag Example: social class of last job (see handout) 3_16

48 Missing data in the SHP Missing data in the SHP: negative values -1 does not know -2 no answer -3 inapplicable (question has not been asked) -8/-4 other missings Missing data in Stata:..a.b.c.d etc negative values are treated as real values missing data (..a.b etc) are defined as the highest possible values;. <.a <.b <.c <.d recode to missing or analyses only positive values e.g. sum i8empyn if i8empyn>= care with operator > e.g. count if i8empyn>1 counts also missing values write <. instead of!=. 3_17

49 Longitudinal data analysis with Stata xt commands descriptive statistics xtdescribe xtsum, xttab, xttrans regression analysis xtreg, xtgls, xtlogit, xtpoisson, xtcloglog xtmixed, xtmelogit, diagrams: xtline 3_18

50 Descriptive analysis Get to know the data Usually: similar findings to complicated models Visualisation Accessible results to a wider public Assumptions more explicit than in complicated models 3_19

51 Example: variability of party preferences Kuhn (29), Swiss Political Science Review 15(3): _2

52 Happiness with life 8.5 Example: becoming unemployed West East German-speaking French-speaking Employed Year before unemployed 1st year unemployed 2nd year unemployed 5. Employed Year before unemployed 1st year unemployed Germany, Switzerland, nd year unemployed Oesch and Lipps (212), European Sociological Review (online first) 3_21

53 Example: Income mobility Switzerland Low income 29 Middle income 29 High income 29 Total Low income % 4.8 % 3.1 % 1 % Middle income % 75.8 % 1.8 % 1 % High income % 34.4 % 61.1 % 1 % Germany Low income 29 Middle income 29 High income 29 Total Low income % 36.4 % 1.9 % 1 % Middle income % 78.4 % 9.2 % 1 % High income % 29.6 % 67.8 % 1 % Grabka and Kuhn (212), Swiss Journal of Sociology 38(2), _22

54 4 Linear regression (Refresher course) 4_1

55 Aim and content Refresher course on linear regression What is a regression? How do we obtain regression coefficients? How to interpret regression coefficients? Inference from sample to population of interest (significance tests) Assumptions of linear regression Consequences when assumptions are violated 4_2

56 What is a regression? A statistical method for studying the relationship between a single dependent variable and one or more independent variables. Y: dependent variable X: independent variable(s) Simplest form: bivariate linear regression linear relationship between a dependent and one independent variable for a given set of observations Example Does the wage level affect the number of hours worked? Gender discrimination in wages? Do children increase happiness? 4_3

57 Y yi ŷi Linear regression: fitting a line Y = x 1 unit X slope xi ei 4_4

58 yearly income from employment number of years spent in paid work scatter plot of observations 4_5

59 yearly income from employment a b 1 unit x number of years spent in paid work Regression line: ŷ i = a + bx i = *x i 4_6

60 yearly income from employment number of years spent in paid work Regression line: ŷ i = a + bx i = *xi Estimated regression equation: y i = a + bx i + e i 4_7

61 Components (linear) regression equation Estimated regression equation: y i = a + bx i + e i y dependent variable x independent variable(s) (predictor(s), regressor(s)) a intercept (predicted value of Y if x =) b regression coefficients (slope) measure of the effect of X on Y e part of y not explained by x (residual), due to -omitted variables - measurement errors - stochastic shock -disturbance 4_8

62 Scales of independent variables Independent variables Continuous variables: linear Binary variables (Dummy variables) (, 1) (e.g. female=1, male=) Ordinal or multivariate variables (n categories) Create n-1 dummy variables (base category) Examples: educational levels 1 low educational level 2 intermediate educational level 3 high educational level Include 2 dummy variables in regression model 4_9

63 Example: multivariate regression Including other covariates: Regression coefficients represent the portion of y explained by x that is not explained by the other x s Example: gender wage gap (sample: full-time employed, yearly salary between 2 and 2 CHF) Bivariate model y = a + b x + e i salary = female + e i Multivariate model b constant 45'369 y = a + b 1 x 1 + b 2 x 2 + b 3 x e i female -9'9 education (Ref: compulsory) secondary education 9'197 tertiary education 3'786 supervision 17'128 financial sector 15'592 number of years in paid work 729 4_1

64 Assumptions for OLS-estimations: coefficients Assumptions for OLS-estimation (necessary to calculate slope coefficients) 1) No perfect multicollinearity (None of the regressors can be written as a linear function of the other regressors) 2) E(e) = 3) None of the x is correlated with e; Cov(x,e) = ; (all x s are exogenous) If assumptions 1-3 hold: OLS is consistent (regression coefficients asymptotically unbiased) 4_11

65 Inference from linear regression I Inference from OLS-estimations if random sample But: OLS coefficients are estimations Estimated regression equation: y i = a + bx i + e i True regression equation: y i = α + βx i + ε i True coefficients (α, β) unknown, true «error term» unknown Distribution of coefficients (a, b) E( b) Var( b) ˆ 2 σ β E(β) 4_12

66 Inference from linear regression II Var ( b) ( i ) where 2 ( xi x) n p Variation of b (σ β2 ): decreases if n increases x are more spread out squared residuals decrease Distribution of b Student t-distribution Depends on n and number of x s = normal distribution if n large σ β E(β) 4_13

67 Inference from linear regression: testing whether b If β = (in population), there is no relationship between x and y test how likely it is, that β = H : Distribution if β = critical values for coefficients compare estimated coefficient with critical value if abs(b) >abs(critical value), b significant b stand b t value b b Critical value for standardized normal distribution and 95% confidence level: _14

68 Inference from linear regression: example yearly income from employment number of years spent in paid work Regression line: ŷ i = a + b *x i example: ŷ i = *x i 4_15

69 Inference from linear regression: example Sample n=53 Coef. st.e. t P> t [95% Conf. Interval] years work _cons R 2 :.11 Sample n=1787 Coef. St.e. t P> t [95% Conf. Interval] years work _cons R 2 :.159 4_16

70 4_17 Inference : assumptions Assumptions on error terms Independence of error terms, no autocorrelation: Cov (ε i, ε k ) = for all i,k, i k Constant error variance : Var(ε i )=σ 2 ε for all i; (Homoscedasticity) Preferentially: e is normally distributed Matrix of error terms ; n n n n k i

71 4_18 Autocorrelation Reason: Nested observations (e.g. households, schools, time, communities) standard errors underestimated OLS, adjust standard errors 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ ; n n n n k i ; n n n n k i autocorrelation no autocorrelation

72 4_19 Heteroskedasticity Variance is not consistent standard errors overestimated or underestimated OLS, adjust standard errors (White standard errors) Weighted least squares (WLS) σ σ 1 σ 5 σ 4 σ 3 σ 2 σ ; n n n n k i ; n n n n k i Homoskedasticity Heteroskedasticity

73 Summary: assumptions of OLS regression General Continuous dependent variable Random sample Coefficient estimation No perfect multicollinearity E(e) = No endogeneity Cov(x,e) = Omitted variables Measurement error in indep. variables Simultaneity Nonlinearity in parameters Inference No autocorrelation Cov (ei, ek)= Constant variance (no heterogeneity) Preferentially: residuals normally distributed Coefficients biased (inconsistent) Standard errors of coefficients biased 4_2

74 Endogeneity Traditional meaning Variable is determined within a model Econometrics Any situation where an explanatory variable is correlated with the residual If a variable is endogenous Care with interpretation: model cannot be interpreted as causal 4_21

75 Endogeneity: reasons Omitted variables Measurement error (in explanatory variables) Simultaneity Nonlinearity in parameters x contains lagged values of y (see later, dynamic models) 4_22

76 Endogeneity: consequences and detection Consequence of endogeneity ALL estimators may be biased! (exception: if a variable is completely exogenous controlling for the endogenous variable) Detection of endogeneity Difficult to detect and correct! Caution for causal interpretation Theory, literature (variable selection and interpretation)!!!! 4_23

77 Endogeneity: correction Test for nonlinear relationship and interactions Omitted variables Proxy variables, instrumental variables (2sls estimation) Panels: FE-models (within estimators), Difference-in-Difference models Propensity score analysis, Regression discontinuity, Heckman selection models, Simultaneity Structural equations modelling, Panel data for time ordering Theory, literature (variable selection and interpretation)!!!! 4_24

78 Example: Control of unobserved heterogeneity Example: effect of partnership on happiness happiness = e (partnership) We know (from literature, e.g.): happiness = f (attractiveness, leisure activities, health,...) [not measured!] Similarly: partnership = g (attractiveness, leisure activities, health,...) Cross-sectional data happiness = e(partnership) but we know that there are positive effects on partnership from income, fitness, attractiveness, health,... -> which part of effects are due to attractiveness, leisure activities, health,...? Panel data 1. Happiness (at times with a partner) 2. Happiness (at times without partner) of the same individuals Care: reversed causality, time-dependent unmeasured effects!) 4_25

79 Example: Regression analysis using Stata Sample: individuals in employment, 2 to 6 years Dependent variable: Working hours per week (paid work) Independent variables: hourly wage, number of children, married, sex, age Summary statistics sum workhours wage age8 married_nokid onekid twokids threepkids /// if workhours> & age8>=2 & age8<=6 Variable Obs Mean Std. Dev. Min Max workhours wage age married_no kids onekid twokids three+ kids _26

80 Example: check hourly wages graph box wage 4_ gross hourly wage Frequency gross hourly wage

81 Example: transform hourly wages Frequency Frequency gross hourly wage histogram wage, freq histogram lnwage, freq lnwage 4_28

82 reg workhours wagesd wagesq agerec agerecsq married_nokid /// onekid twokids threepkids if workhours> & age8>2 & age8<=6 Source SS df MS Number of obs = F( 8, 36) = Model Prob > F =. Residual R-squared = Adj R-squared =.3598 Total Root MSE = workhours Coef. Std. Err. t P> t [95% Conf. Interval] female lnwage lnwagesq onekid twokids threepkids married_no~d agerec _cons _29

83 Diagnostic plot: heteroskedasticity standardised residuals Fitted values 4_3

84 Diagnostic plot: normal distribution of residuals Density standardised residuals Standardized residuals normal scores 4_31

85 Regression with panel data: Data structure Wide data format Long data format (person-period-file, pooled data) idpers wage4 wave5 wave6 wage idpers year wage _32

86 OLS with pooled panel data: problems I OLS for cross-sectional analysis (one wave) no particular problem! OLS for pooled data (different years in one file) Problem: assumption of independent observations violated (autocorrelation) Possible correction: Correct for clustering in error terms (coefficients unaffected) But: OLS is not the best estimator for pooled data (not efficient) number of working hours OLS OLS, cluster in se per week b t b t lnwage (13.76) lnwage squared (-9.57) (-5.55) female (-15.34) (-11.4) 1 child.78 (2.5).78 (1.53) female*1 child (-2.43) (-14.) 2 children 1.74 (4.91) 1.74 (3.84) female*2 children (-33.4) (-16.36) 3+ children 2.85 (5.88) 2.85 (2.85) female*3+ children (-26.73) (-18.84) married, no child 2.4 (5.63) 2.4 (4.53) female*married, no child (-19.95) (-13.63) age -.6 (-6.89) -.6 (-4.39) 4_33

87 OLS with panel data: problems II OLS does not take advantage of panel structure Two different types of variation in panel data Variation within individuals Variation between individuals Control for unobservable variables (stable personal characteristics) Fixed Effects Models (only within variation) Random Effect Models (multilevel /random intercept / frailty for event history) 4_34

88 Comparisons of different regression models number of working OLS OLS, cluster Random effects Fixed effects hours per week b t b t b t b t lnwage (13.76) (14.2) 13.3 (1.26) lnwage squared (-9.57) (-5.55) (-12.3) (-9.91) female (-15.34) (-11.4) (-16.19) dropped 1 child.78 (2.5).78 (1.53) 1.33 (3.31).22 (.39) female*1 child (-2.43) (-14.) (-19.) -6.8 (-7.76) 2 children 1.74 (4.91) 1.74 (3.84) 1.84 (4.23).19 (.3) female*2 children (-33.4) (-16.36) (-23.3) (-8.17) 3+ children 2.85 (5.88) 2.85 (2.85) 2.55 (4.23) -.23 (.26) female*3+ children (-26.73) (-18.84) (-19.28) (-6.73) married, no child 2.4 (5.63) 2.4 (4.53) 1.79 (4.23).21 (.36) female*married, no chi (-19.95) (-13.63) (-11.71) -.54 (-.63) age -.6 (-6.89) -.6 (-4.39) -.3 (-2.42) constant R squared _35

89 5 Introducing Fixed Effects Models ( within effects)

90 Hypothesis: Example: BMI after stopping smoking BMI increases after stopping smoking Hypothetical data: Random sample of former smokers with year after stopping and BMI at that time for 3 individuals time bmi1 bmi2 bmi

91 BMI after stopping smoking: pooled OLS BMI years after stop smoke bmi Fitted values 5-3

92 Pooled regression. * pooled regression:. reg bmi time Source SS df MS Number of obs = F( 1, 16) =.26 Model Prob > F =.6149 Residual R-squared = Adj R-squared = Total Root MSE = bmi Coef. Std. Err. t P> t [95% Conf. Interval] time _cons No BMI increase over time 5-4

93 BMI after stopping smoking: individual data BMI years after stop smoke P1 P2 P3 OLS P1 OLS P2 OLS P3 OLSpooled Autocorrelat. with pooled regression-unobserved individual heterogeneity 5-5

94 Individual regressions. forval j=1/3 { /* loop over each individual*/ 2. reg bmi`j' time 3. } bmi1 Coef. Std. Err. t P> t [95% Conf. Interval] time _cons bmi2 Coef. Std. Err. t P> t [95% Conf. Interval] time _cons bmi3 Coef. Std. Err. t P> t [95% Conf. Interval] time _cons All individuals have significant BMI increase over time 5-6

95 Excursus: unobserved heterogeneity Omitted variables bias: Many individual characteristics are not observed e.g. enthusiasm, ability, willingness to take risks, our example: physical activities, calories intake, muscle mass, genes, cohort These have generally an effect on dependent variable, and are correlated with independent variables. Then regression coefficients will be biased! Note: these (formerly) unobserved measures are increasingly included in surveys 5-7

96 What about the between-effect? BMI years after stop smoke P1 P1 P2 P2 P3 P3 OLS P1 OLS P1 OLS P2 OLS OLS P2 P3 OLSpooled P3 5-8

97 Which models are appropriate to analyze the effects of time? Data transformation necessary 5-9

98 Panel Data and within- Regression (FE) 5-1

99 Error components in panel data models We separate the error components: e it = u i + ε u it, i = person-specific unobserved heterogeneity (level) = fixed effects (e.g., physical activities, calories intake, genetics, cohort) ε it = residual Model: bmi it 1 x it u i it Remember: Pooled OLS assumes that x is not correlated with both error components u i and ε it (omitted variable bias) 5-11

100 Fixed effects regression We can eliminated the fixed effects u i by estimating them as person specific dummies -> remains only within-variation Corresponds to de-meaning for each individual: bmi it 1 x it u i it (1) bmi i individual mean: (2) i 1 x i u i subtract (2) from (1): bmi it bmi i ( x x ) ( ) 1 it i it i -> Fixed (all time invariant) effects u i disappear, i.e. timeconstant unobserved heterogeneity is eliminated 5-12

101 De-meaned values with OLS regression de-meaned 25 BMI BMI de-meaned years after years stop after smoke stop smoke P1 P1 P2 P2 P3 P3 OLS P1 OLS OLS P2 P2 OLS OLS P3 P3 5-13

102 OLS of individually de-meaned Data We de-mean and regress the Data:. bysort id: egen bmi_m=mean(bmi). gen bmi_dem=bmi-bmi_m. bysort id: egen time_m=mean(time). gen time_dem=time-time_m. reg bmi_dem time_dem Source SS df MS Number of obs = F( 1, 16) = Model Prob > F =. Residual R-squared = Adj R-squared =.844 Total Root MSE = bmi_dem Coef. Std. Err. t P> t [95% Conf. Interval] time_dem _cons -2.12e

103 Direct modeling of fixed Effects in Stata xtreg bmi time, fe (calculates correct df; this causes higher Std. Err.). xtreg bmi time, fe Fixed-effects (within) regression Number of obs = 18 Group variable: id Number of groups = 3 R-sq: within =.8532 Obs per group: min = 6 between =.992 avg = 6 time since stop smoking explains parts of individual heterogeneity! max = 6 overall =.162 (from pooled OLS) F(1,14) = corr(u_i, Xb) = Prob > F = bmi Coef. Std. Err. t P> t [95% Conf. Interval] time _cons sigma_u sigma_e rho (fraction of variance due to u_i) F test that all u_i=: F(2, 14) = Prob > F =. 5-15

104 Alternative: OLS with individual dummies controlled. xi i.id, noomit. reg bmi time _I*, noconst Source SS df MS Number of obs = F( 4, 14) = Model Prob > F =. Residual R-squared =.9996! Adj R-squared =.9994 Total Root MSE = bmi Coef. Std. Err. t P> t [95% Conf. Interval] time _Iid_ _Iid_ _Iid_ useful for small N, the u i are estimated (only approximate) 5-16

105 FE estimation can solve the problem of unobserved heterogeneity But: Summary: Fixed Effects Estimation If number of groups large, many extra parameters Enough variance needed in data With FE-Regressions, estimation of time-constant covariates not possible. Are dropped from the model. But: possibility to use interactions (like male*nrchildren) What about comparing with people who never stopped smoking or who never smoked? (later) 5-17

106 No identification of time-invariant covariates z i Consider the model: y it = az i + bx it + u i + ε it (1) let be an arbitrary number; add and subtract z i on the rhs: y it = (az i + z i ) + bx it + (u i - z i ) + ε it and rewrite this as: y it = a*z i + bx it + u i *+ ε it with a* = a + and u i * = u i - z i (2) But (1) and (2) have exactly the same form so it is not clear if a or a* = a + is estimated -> separate effects of az i and u i cannot be distinguished without further assumptions (e.g., no correlation between z i and u i ) 5-18

107 Example: DiD (control group comparison) Hypothesis: financial support increases BMI of low-income women (Schmeisser 29)* (hypothetical) experiment: Survey: Sample of low income female patients of doctoral surgeries randomized into program social aid (5% in program and 5% not) 4 measurements of bmi (2 x before program start, 2 x after) *Expanding wallets and waistlines: the impact of family income on the bmi of women and men eligible for the earned income tax credit. Health Econ. 18:

108 Data: BMI-increase through social aid? bmi BMI of low income Women time Effects: causal effect: BMI increase due to more fast food (more money available) Time (age-) effect: BMI increases with age NO aid aid Start social aid Programm 5-2

109 Pooled Regression. reg bmi aid Source SS df MS Number of obs = F( 1, 14) =.89 Model Prob > F =.3627 Residual R-squared = Adj R-squared = -.77 Total Root MSE = bmi Coef. Std. Err. t P> t [95% Conf. Interval] aid _cons

110 Fixed effects xtreg bmi aid, fe. xtreg bmi aid, fe Fixed-effects (within) regression Number of obs = 16 Group variable: id Number of groups = 4 R-sq: within =.3596 Obs per group: min = 4 between =.54 avg = 4. overall =.595 max = 4 F(1,11) = 6.18 corr(u_i, Xb) = -.74 Prob > F = bmi Coef. Std. Err. t P> t [95% Conf. Interval] aid _cons sigma_u sigma_e rho (fraction of variance due to u_i) F test that all u_i=: F(3, 11) = Prob > F =. 5-22

111 One step back: a causal model With cross-sectional data: only between-estimation: Crucial assumption: Y i, Tt ( reatment) C( ontrol) Yi,t Random sample (no unobserved heterogeneity) With Panel data I: within-estimation (before and after) T C Y i, t 1 Yi,t problem: time effects, panel conditioning With Panel data II: difference-in-difference (DID): ( T C C C Y i, t 1Yi,t j,t 1 Yj, t ) ( Y ) 5-23

112 Now: causal effect of aid We have - Before-after comparison (within) - Treatment- and control groups (between) We compare the within-effect of aid ( treatment ) with that without aid ( control ) i.e., we calculate treatment effect and control for time DID estimator: =(after aid -before aid ) (after noaid before noaid )

113 DiD effects. xi i.time, noomit. xtreg bmi aid _I*, fe note: _Itime_4 omitted because of collinearity Fixed-effects (within) regression Number of obs = 16 Group variable: id Number of groups = 4 R-sq: within =.8876 Obs per group: min = 4 between =.54 avg = 4. overall =.1528 max = 4 F(4,8) = 15.8 corr(u_i, Xb) = -.55 Prob > F = bmi Coef. Std. Err. t P> t [95% Conf. Interval] aid _Itime_ _Itime_ _Itime_ _Itime_4 (omitted) _cons sigma_u sigma_e rho (fraction of variance due to u_i) F test that all u_i=: F(3, 8) = Prob > F =. 5-25

114 FE easier, no control group Summary: within estimators DID: control group can control simultaneous effects (like time): find statistical twin, such that research variables is only variable -> DID useful for estimating causal effects from nonexperimental data. Especially for small samples Excursus: First-Difference (FD) estimators: Stable similarities of adjacent observations eliminated. Problem: level differences not taken into account (6-5 children = 1- ch) for lasting effects (like children): FD not useful because only immediate changes taken into account 5-26

115 6 Introducing Random Effects Models 6-1

116 Towards RE: error components in panel models We have both -within and -between variance: y it = a + e it = a + u i + ε it fixed effects = within: u i for each person ( ANOVA) two residuals: - it on lowest (1.) level: time point a u u i on highest (2.) level: individuals 6-2

117 Necessary, if data have different levels with - observations are not independent of levels - true social interactions Examples: Motivation: multilevel models Schools classes students: first applications Networks: people are influenced by their peers Spatial context: from environment (e.g., poor people are less happy if they live in a rich environment) US: neighborhood-effects Interviewer - effects: respondents clustered in interviewers Panel-surveys: waves clustered in respondents (households) 6-3

118 Levels in clustered data Hierarchical - Households in neighborhoods - Students in schools in classes (three levels) - Respondents in interviewers - Panel Surveys: Waves in respondents (crossed?) Crossed - Questions in respondents longitudinal? Attention: Do not confuse variables and levels: (total) variance can be attributed to levels, not to variables! E.g., 1 hospitals are probably a level, 7 nations are probably dummy variables. Think of a population the sample represents! Note: 3 rule of thumb in contexts: 3 second level units, randomly chosen 6-4

119 Multilevel models: analytic advantages Improved regression models - unbiased estimators for regression coefficients - unbiased estimators for standard errors (usually higher std.err. than OLS) - model true covariance structure (autocorrelation, heteroscedasticity, ) Decomposition of total variance into those in different levels (within-between) Similar to Analysis of variance (ANOVA), but parsimonious (number of estimation parameter independent on number of contexts) can handle large number of contexts (in parts) modeling of unobserved heterogeneity / self selection 6-5

120 Typical results: one-level vs. multilevel model Dependent Variable mixed school boy school girl school Underestimated variance: Kish design effect deff : larger N necessary 6-6

121 Illustration: variance decomposition between levels Example: 2 individuals each asked 2x about their happiness (continuous), here measurements not time ordered! Which variance is due to individuals, which to observations? 3 Happiness Total ( Grand ) Mean (=) Indiv. 1 Indiv. 2 Total Mean= (3+2+(-1)+(-4)) / 4 = 6-7

122 Illustration: calculation of the total variance Happiness n i 1 T i ( t 1 y it T yy y ) 2 n T i i1 t1 ( y it W y yy Total Mean (=) i ) 2 n T i i1 t1 B yy (y i y) 2 Indiv. 1 Indiv. 2 Total Variance is equal to the Square of the Differences of all Observations from the Total Mean divided by the Sample Size (4) = { (3-) 2 + (2-) 2 + (-1-) 2 +(-4-) 2 } / 4 = ( )/4 =

123 Happiness Illustration: individual specific variance ( between ) n T i i1 t1 ( y T it yy y) 2 n i i1 t1 ( y W y yy Total Mean (=) -2.5 T it i ) 2 i 1t n Ti ( 1 B yy yi y ) 2 Indiv. 1 Indiv. 2 variance of the individual means = between -variance ( between = u ) ( (-2.5) 2 )/2 = 6.25 (remember: total variance = 7.5) 6-9

124 Happiness Illustration: measurement specific variance ( within ) n T i i1 t1 ( y T it yy -2.5 y) 2 i 1t n Ti ( 1 yit W yy yi ) 2 n T i i1 t1 B yy (y i y) 2 Indiv. 1 Indiv. 2 variance of measurements within individuals= within -Variance (later: within = ) ((3-2.5) 2 + (2-2.5) 2 + (-1-(-2.5)) 2 + (-4-(-2.5)) 2 )/4 =1.25 (=18% of total variance) variance of individual means = ( (-2.5) 2 )/2 = 6.25 (=82% of total variance = ρ = ICC (intra-class-correlation) 6-1

125 ICC = (zero clustering use OLS) Examples of Θ ICC.8 ICC.2 ICC = 1 (maximum clustering) 6-11

126 y where : u ε it i it u Starting point: null ( Variance Components (VC)) model i it individual specific random variable (N(,σ deviation from individual (note : no intercept a in VC model) u specific mean (N(,σ ) assumed) ε ) assumed) the VC model allows for variance decomposition : ρ correlation between different time points t within an individual i: 2 σu ρ ( ICC intra - class - correlation autocorrelation in Panels) 2 2 σ σ u ε (note : ρ significant multilevel model necessary) 6-12

127 Idea RE: better estimate of u i (modeling intercept) random intercept ( borrowing strength from others): u high edu To estimate mean of individual i (=u i ) only within (FE) suboptimal if - sample small (T i small) - variance high - n large (inefficient) a low edu idea: use information u j from other sample members (between) in the same population group (e.g., education) 6-13

128 Idea RE: weighted within and between RE - Regression is equivalent to pooled OLS after the Transformation : (y it with θ y ) β i (1 θ) β (x θ 1, and σ 1 it θ x ) (u (1 θ) (ε 2 ε i 2 σε Tσ 2 u i, θ 1 it θ ε i )) RE uses optimal combination of within and between variation RE allows estimation of time invariant RE biased because u i variables u remains in error term (if cov(x,u ) ) i i 6-14

129 . xtreg bmi children, re theta Example ρ based on SHP Random-effects GLS regression Number of obs = 18 Group variable: id Number of groups = 3 R-sq: within =.3921 Obs per group: min = 6 between =.1125 avg = 6. overall =.24 max = 6 Wald chi2(1) = 9.76 corr(u_i, X) = (assumed) Prob > chi2 =.18 theta = bmi Coef. Std. Err. z P> z [95% Conf. Interval] children _cons sigma_u sigma_e rho (fraction of variance due to u_i) 6-15