Introduction to Panel Data Analysis
Oliver Lipps / Ursina Kuhn
Swiss Centre of Expertise in the Social Sciences (FORS), c/o University of Lausanne
Lugano Summer School, August 27-31, 2012
Introduction: panel data, data management
1 Introducing panel data (OL)
2 The SHP (OL)
3 Data management with Stata (UK)
Regressions with panel data: basics
4 Regression refresher (UK)
5 Fixed effects (FE) models (OL)
6 Introducing random effects (RE) models (OL)
7 Nonlinear regression (UK)
Additional topics
8 Level 1 and 2 growth models (OL)
9 Missing data (OL)
10 Dynamic models (UK)
Schedule: start 8.30; breaks 10.30-10.45 and 15.30-15.45; lunch break 12.30-13.30; end 17.30
1 Introducing panel data
Surveys over time: repeated cross-sections vs. panels
Cross-section: survey conducted at several points in time ("rounds") using different sample members in each round
Panel: survey conducted at several points in time ("waves") using the same individuals over waves
-> panel data come mostly from panel surveys
-> if from cross-sectional surveys: retrospective ("biographical") questionnaire
Panel surveys: distinctions
Length and sample size:
- Time series: N small (mostly N=1), T large (T -> infinity): time series models
- Panel surveys: N large, T small (N -> infinity): social science panel surveys
Sample:
- General population, rotating: only a few (pre-defined number of) waves per individual (in CH: SILC, LFS)
- General population, indefinitely long (in CH: SHP)
- Special population, e.g. age/birth cohorts (in CH e.g. TREE, SHARE, COCON): representative for the population of a special age group / birth years
Panel surveys increasingly important
Changing focus in social sciences
- Life course research: effects of events within individuals
- Large investments in social science household panel surveys, high data quality!
- Concern about causality in cross-sectional studies
Analysis potential of panel data
- close to experimental design: before-and-after studies
- control for unobserved individual characteristics (exogenous independent variables; implicit in regression analyses)
Identification of individual dynamics (poverty in the SHP)
[Figure: yearly transitions between "poor" and "not poor" in the SHP, 1999-2003: around 90% of the non-poor stay non-poor from one wave to the next, while roughly half of the poor move out of poverty each year]
-> individual dynamics can only be measured with panel data!
Identification of age, time, and (birth) cohort effects
Fundamental relationship: A_it = t - c_i (age = period - birth cohort)
- Effects from formative years (childhood, youth) -> cohort effect (e.g. taste in music)
- Time may affect behavior -> time effect (e.g. computer performance)
- Behavior may change with age -> age effect (e.g. health)
In a cross-section, t is constant: age and cohort are collinear (only their joint effect is estimable)
In a cohort study, cohort is constant: age and time are collinear (only their joint effect is estimable)
In a panel, t varies, but A_it, t, and c_i are collinear: only two of the three effects can be estimated
-> we can use (t, c_i), (A_it, c_i), or (A_it, t), but not all three
Age, time, cohort effects: interpretation
[Figure: stylized trajectories of an outcome by birth cohort, illustrating the possible combinations: no cohort effect, cohort effect, age effect, aging effect, no age/aging effect, and aging plus cohort effect]
Problems of panels
Fieldwork / data quality related
- High costs (panel care, tracking households, incentives)
- Initial nonresponse (wave 1) and attrition (= drop-out from the panel after wave 1)
- Finally: you design a panel for the next generation
Modeling related
- Sometimes strong assumptions necessary for applicability of appropriate models (later)
Advantages of panel data
+ Allow studying individual dynamics
+ Control for unobserved characteristics between individuals by repeated observation
+ Higher precision of change measurement
+ Data can be pooled
2 Introducing the Swiss Household Panel (SHP)
Swiss Household Panel: history
- Primary goal: observe social change and changing life conditions in Switzerland
- Started as a joint project of the Swiss National Science Foundation, the Swiss Federal Statistical Office, and the University of Neuchâtel
- First wave in 1999, more than 5,000 households
- Refreshment sample in 2004, more than 2,500 households, several new questions
- Since 2008, integrated into FORS (Swiss Centre of Expertise in the Social Sciences), c/o University of Lausanne
Disciplines working with SHP data
SHP sample and methods
- Representative of the Swiss residential population
- Each individual surveyed every year (Sept.-Jan.)
- All household members from 14 years on surveyed (proxy questionnaire if child or unable to respond)
- Telephone interviews (central CATI), languages D/F/I
- Metadata: biographic, interviewers
- Paradata: call data (from address management)
- Following rules: OSM followed if moving; from 2007 on, all individuals
- All new household entrants surveyed
SHP sample size (households) and attrition
[Figure: responding households ("Befragte Haushalte") per wave, 1999-2011, for SHP I (starting above 5,000 households in 1999) and SHP II (starting with about 2,500 households in 2004), both declining over the waves]
Cross-sectional and longitudinal (SHP I and SHP I+II combined) weights account for nonresponse and attrition
SHP: survey process and questionnaires
- Grid questionnaire (inventory and characteristics of household members) and household questionnaire (housing, finances, family roles, ...): answered by the "reference person" (18+ years)
- Individual questionnaires (work, income, health, politics, leisure, satisfaction with life): persons 14+ years
- Individual proxy questionnaires (school, work, income, health, ...): persons 13 years or younger + persons "unable to respond"
SHP: questionnaire content
- Social structure: socio-demography, socio-economy, work, education -> social stratification and social mobility
- Life events: marriages, births, deaths, diseases, accidents, conflicts with close persons, etc. -> life course
- Social participation: politics (attitudes, elections, party preferences and choice), culture, social network, leisure -> social integration, political attitudes and behavior
- Perception and values: trust, confidence, gender values -> social capital
- Satisfaction and health: physical and mental health self-evaluation, chronic problems, different satisfaction issues -> quality of life
SHP household: composition and housing
"Objective" elements
- Characteristics of household members (sex, age, civil status, education, occupation)
- Relationships between all household members
- Since when at this place
- Type of house
- Size, number of rooms
- State of house (heating, noise, pollution, etc.)
- Costs and subsidy, etc.
- External help for domestic work
- Child care
- Division of labor
- Who takes decisions, etc.
"Subjective" elements
- Satisfaction with house, noise, pollution, etc.
- Assessment of state of house
SHP household: standard of living
"Objective" elements
- Activities and (durable) goods
- Reason for absence of goods (financial, other)
- Debts (+ reason)
- Total household income
- Taxes
- Social and private financial transfers
"Subjective" elements
- Satisfaction with financial situation
- Financial difficulties
SHP individual level: family
"Objective" elements: children out of the house; division of housework, care for dependents; disagreement about family problems, etc.
"Subjective" elements: satisfaction with private situation; satisfaction with living alone or together; satisfaction with division of housework
SHP individual level: health, well-being
"Objective" elements: health problems; physical activities; doctor visits, hospitalization; improvement of health; long and short term handicaps
"Subjective" elements: subjective health state; satisfaction with health; satisfaction with life in general
SHP individual level: social origin and education
Social origin: profession of parents; level of education of parents; nationality of parents; financial problems in childhood
Education: education level; current training; language capabilities
SHP individual level: leisure
"Objective" elements: activities (holidays, invitation of and meeting friends, reading, Internet use, restaurant, etc.)
"Subjective" elements: satisfaction with leisure time and leisure activities; satisfaction with work-life balance
SHP individual level: work
"Objective" elements: job sector; social stratification; private or public; position; working time, commuting time; size of company
"Subjective" elements: satisfaction with work (general, income, interests, working conditions, amount, atmosphere); risk of unemployment; job security; chances to get promoted
SHP individual level: income
"Objective" elements: total personal income; total personal income from work; social transfers received; private transfers received; other income
"Subjective" elements: satisfaction with financial situation; assessment whether financial situation improved or not; possible reactions to financial problems
SHP individual level: values and politics
"Objective" elements: right to vote; political activities; member of a political party
"Subjective" elements: satisfaction with democracy; confidence in federal government; political interest; left-right political positioning; opinions on political questions
SHP individual level: participation, integration, social network
"Objective" elements: frequency of contacts; voluntary work outside the household; participation and membership in associations; belief and religious participation
"Subjective" elements: satisfaction with personal relationships; assessment of amount of practical help received from partner, parents, friends, etc.; assessment of amount of emotional help received from partner, parents, friends, etc.; general trust in people
Biographical (retrospective) questionnaire
- N = 5 56; written questionnaire in 2001/2002, sent to all individuals surveyed in 2000, aged 14 or over
- Questions since birth about family, education, and professional biography:
  - with whom lived together
  - periods out of Switzerland
  - changes of civil status
  - learned professions
  - education
  - professional and non-professional biography
  - family life events (divorce / re-marriage of parents)
International context
SHP is part of the Cross National Equivalent File (CNEF): general population panel surveys from:
- USA (PSID, since 1980)
- D (SOEP, since 1984)
- UK (BHPS, since 1991)
- Canada (SLID, since 1993)
- CH (SHP, since 1999)
- Australia (HILDA, since 2001)
- Korea (KLIPS, since 2007)
- More countries will join (Russia, South Africa, ...)
Each panel includes a subset of all variables (variables from the original files can be merged)
Variables are ex-post harmonized (names, categories)
Missing income variables are imputed
Frick, Jenkins, Lillard, Lipps and Wooden (2007): The Cross-National Equivalent File (CNEF) and its member country household panel studies. Journal of Applied Social Science Studies (Schmollers Jahrbuch)
SHP questionnaire: module rotation
Module (waves fielded, 2010-2020):
Social network         X X X X
Religion               X X X
Social participation   X X X X
Politics               X X X X
Leisure                X X X X
Psychological scales   X X X
Outlook: new sample (LIVES)
- SHP III (2013, based on individual register)
- Biographic questionnaire in the 1st wave of SHP III (with NCCR LIVES)
NCCR LIVES
- (Precarious) life courses
- Universities of Lausanne and Geneva
- 15 research projects, 12 years
- Use of SHP
SHP structure of the data
2 yearly files (currently available: 1999-2010, + beta 2011): household, individual
5 unique files: master person (mp), master household (mh), social origin (so), last job (lj), activity (employment) calendar (ca)
Complementary files: biographical questionnaire, interviewer data (2000, and yearly since 2003), call data (since 2005), CNEF SHP data variables
Documentation (website: D/E/F): http://www.swisspanel.ch/
e.g.: user guide, questionnaires, variable search (by variable name and topic), construction of variables, syntax examples
- Merge data files with SPSS, SAS, Stata
- Documentation "Data Management with SHP"
SHP data delivery
- Data ready about 1 year after end of fieldwork, downloadable from the SHP server: http://www.swisspanel.ch/spip.php?rubrique181&lang=en
- Signed contract with FORS; upon contract receipt, login and password sent by e-mail
- Data free of charge
- Users become members of the SHP scientific network and document all publications based on SHP data
Data on request: imputed income, call data, interviewer matching ID, context data (special contract; data is matched at FORS)
More info: http://www.swisspanel.ch
3 Stata and panel data
Why Stata?
Capabilities
- Data management
- Broad range of statistics
- Powerful for panel data!
- Many commands ready for analysis
- User-written extensions
Beginners and experienced users
- For beginners: analysis through menus (point and click)
- For advanced users: good programming capabilities
Starting with Stata
Basics: look at the data and variables, descriptive statistics, regression analysis -> handout "Stata basics"
Working with panel data: merge, creating "long files", working with the long file, adding information from other household members -> handout "Data management of SHP with Stata" (includes syntax examples, exercises)
1. Merge: the _merge variable

Master file          Using file
idpers  p7c44        idpers  p8c44
42      8            44      9
43      7            45      5
44      9            46      1
45      1            47      3

Merged result
idpers  p7c44  p8c44  _merge
42      8      .      1
43      7      .      1
46      .      1      2
47      .      3      2
44      9      9      3
45      1      5      3

_merge variable: 1 = only in master file, 2 = only in using file, 3 = in both files
Merge: identifier
The identifier variable (here idpers) links the observations of the master and the using file (same example as above).
Merge files: identifiers

File                                          Filename            Identifiers
Individual master file                        shp_mp              idpers, idhous$$, idfath, idmoth
Individual annual files                       shp$$_p_user        idpers, idint, idhous$$, idspou, refper$$
Additional individual files                   shp_so, shp_lj,     idpers
(social origin, last job, calendar, ...)      shp_ca, shp_*
Interviewer data                              shp$$_v_user        idint
Household annual files                        shp$$_h_user        idhous$$, refpers, idint, canton$$, (gdenr)
Biographic files                                                  idpers
CNEF files                                    shpequiv_$$$$       x1111ll (= idpers)
The Stata merge command
merge [type] [varlist] using filename [filename ...] [, options]
- varlist: identifier(s), e.g. idpers
- filename: data set to be merged
- type:
  1:1   each observation has a unique identifier in both data sets
  1:m, m:1   some observations share the same identifier in one data set
Basic merge example I: two annual individual files

use shp08_p_user, clear
merge 1:1 idpers using shp00_p_user

     _merge |   Freq.  Percent   Cum.
------------+-------------------------
          1 |   5,845    34.93   34.93
          2 |   5,056    30.21   65.14
          3 |   5,833    34.86  100.00
------------+-------------------------
      Total |  16,734   100.00
Basic merge example II: annual individual file and individual master file

use shp08_p_user, clear          // opens the file (master)
count                            // there are 10,889 cases
merge 1:1 idpers using shp_mp    // identifier & using file
tab _merge

     _merge |   Freq.  Percent   Cum.
------------+-------------------------
          2 |  11,111    50.50   50.50
          3 |  10,889    49.50  100.00
------------+-------------------------
      Total |  22,000   100.00

drop if _merge==2                // if only individuals from 2008 wanted
drop _merge
Basic merge example III: annual individual file and annual household file

use shp08_p_user, clear                  // master file
merge m:1 idhous08 using shp08_h_user    // identifier & using file

     _merge |   Freq.  Percent   Cum.
------------+-------------------------
          3 |  10,889   100.00  100.00
------------+-------------------------
      Total |  10,889   100.00
More on merge
Options of the merge command
- keepusing(varlist): selection of variables from the using file
- keep(): selection of observations from the master and/or using file
- for more options: type help merge
Merge many files: loops (see handout)
Create partner files (see handout)
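Merging many annual files in a loop might look like the following sketch (file and variable names follow the SHP pattern above; the wave range and the kept income variable are assumptions):

```stata
* Sketch: merge the annual individual files 2000-2008 onto the 1999 file,
* keeping one income variable per wave (variable names assumed)
use shp99_p_user, clear
foreach yy in 00 01 02 03 04 05 06 07 08 {
    merge 1:1 idpers using shp`yy'_p_user, keepusing(i`yy'empyn) nogenerate
}
```

With nogenerate, the _merge variable is dropped after each step so the next merge does not fail; individuals missing in some waves simply get missing values for those years.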
2. Wide and long format

Wide format
idpers  i04empyn  i05empyn  i06empyn  i07empyn
411     1319      1773      1134      12247
4211    6318      695       .         .
5612    35473     .         414       455

Long format (person-period file)
idpers  year  iempyn
411     2004  1319
411     2005  1773
411     2006  1134
411     2007  12247
4211    2004  6318
4211    2005  695
5612    2004  35473
5612    2006  414
5612    2007  455
Use of the long data format in Stata
All panel applications: xt commands
- descriptives for panel data
- panel data models: fixed effects models, random effects, multilevel
- discrete time event-history analysis
Declare the panel structure (panel identifier, time identifier):
xtset idpers wave
Convert wide format to long format
The reshape long command in Stata:
reshape long varlist, i(idpers) j(wave)
But: Stata does not automatically detect years in variable names:
reshape long i@wyn p@w32 age@ status@, ///
    i(idpers) ///
    j(wave "99" "00" "01" "02" "03" "04" ///
    "05" "06" "07" "08") atwl()
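As a smaller, self-contained sketch (the wide variable names i04empyn-i07empyn come from the example above; everything else is an assumption):

```stata
* Sketch: wide-to-long for the income variables 2004-2007
reshape long i@empyn, i(idpers) j(year "04" "05" "06" "07") string
rename iempyn empyn            // wave-neutral long-format name
destring year, replace         // "04" -> 4
replace year = year + 2000     // 4 -> 2004
```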
Create a long file with append
1. Modify the data set for each wave (e.g. temp1.dta, temp2.dta, ...): keep the identifier, rename the wave-specific variable (i99wyn -> iwyn), and add a wave indicator, so that each file has the columns idpers, wave, iwyn
2. Stack the data sets:

use temp1, clear
forval y = 2/10 {
    append using temp`y'
}
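Step 1 for a single wave might look like this sketch (the variable i99wyn is taken from the slides; the kept variables and file names are assumptions):

```stata
* Sketch: prepare the 1999 wave for appending
use shp99_p_user, clear
keep idpers i99wyn
rename i99wyn iwyn     // wave-neutral name
gen wave = 1           // wave indicator
save temp1, replace
```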
Work with time lags
If the data are in long format and declared as panel data (xtset), l. indicates a time lag.
Example: social class of last job (see handout)
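A minimal sketch of the lag operator (the analysis variable names are assumptions):

```stata
* Sketch: one-wave lag after declaring the panel structure
xtset idpers wave
gen lagincome = l.income              // income in the previous wave
gen incomechange = income - l.income  // change since the previous wave
```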
Missing data in the SHP
Missing data in the SHP: negative values
-1 does not know
-2 no answer
-3 inapplicable (question has not been asked)
-8/-4 other missings
Missing data in Stata: . .a .b .c .d etc.
- negative values are treated as real values
- missing data (., .a, .b, etc.) are defined as the highest possible values: . < .a < .b < .c < .d
-> recode to missing, or analyze only positive values, e.g. sum i08empyn if i08empyn>=0
Care with the operator >: e.g. count if i08empyn>1 also counts missing values
-> write <. instead of !=.
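Recoding the SHP's negative codes to Stata's extended missing values might be sketched as follows (the variable name i08empyn is taken from the slides; the mapping of codes to .a-.d is an assumption):

```stata
* Sketch: map SHP negative codes to extended missing values
mvdecode i08empyn, mv(-1=.a \ -2=.b \ -3=.c \ -8/-4=.d)
sum i08empyn            // sum now ignores the recoded values
count if i08empyn < .   // number of valid (non-missing) observations
```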
Longitudinal data analysis with Stata: xt commands
Descriptive statistics: xtdescribe, xtsum, xttab, xttrans
Regression analysis: xtreg, xtgls, xtlogit, xtpoisson, xtcloglog, xtmixed, xtmelogit, ...
Diagrams: xtline
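A typical descriptive first pass might look like this sketch (idpers and wave come from the slides; the analysis variables are assumptions):

```stata
* Sketch: panel descriptives after declaring the panel structure
xtset idpers wave
xtdescribe                 // participation patterns across waves
xtsum satisfaction         // overall, between, and within variation
xttrans employed           // transition probabilities between waves
```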
Descriptive analysis
- Get to know the data
- Usually: similar findings to complicated models
- Visualisation
- Accessible results for a wider public
- Assumptions more explicit than in complicated models
Example: variability of party preferences
Kuhn (2009), Swiss Political Science Review 15(3): 463-494
Example: becoming unemployed
[Figure: happiness with life (scale 5.0-8.5) from employment through the year before unemployment to the 1st and 2nd year of unemployment; left panel: Germany, 1984-2010 (West vs. East, SOEP); right panel: Switzerland, 2000-2010 (German- vs. French-speaking, SHP)]
Oesch and Lipps (2012), European Sociological Review (online first)
Example: income mobility

Switzerland          Low income 2009  Middle income 2009  High income 2009  Total
Low income 2005      56.2 %           40.8 %              3.1 %             100 %
Middle income 2005   13.5 %           75.8 %              10.8 %            100 %
High income 2005     4.4 %            34.4 %              61.1 %            100 %

Germany              Low income 2009  Middle income 2009  High income 2009  Total
Low income 2005      61.7 %           36.4 %              1.9 %             100 %
Middle income 2005   12.4 %           78.4 %              9.2 %             100 %
High income 2005     2.6 %            29.6 %              67.8 %            100 %

Grabka and Kuhn (2012), Swiss Journal of Sociology 38(2), 311-334
4 Linear regression (refresher course)
Aim and content
Refresher course on linear regression
- What is a regression?
- How do we obtain regression coefficients?
- How do we interpret regression coefficients?
- Inference from sample to population of interest (significance tests)
- Assumptions of linear regression
- Consequences when assumptions are violated
What is a regression?
A statistical method for studying the relationship between a single dependent variable and one or more independent variables.
- Y: dependent variable
- X: independent variable(s)
Simplest form: bivariate linear regression - a linear relationship between a dependent and one independent variable for a given set of observations
Examples
- Does the wage level affect the number of hours worked?
- Gender discrimination in wages?
- Do children increase happiness?
Linear regression: fitting a line
[Figure: scatter of observations (x_i, y_i) with a fitted line, e.g. Y = 1 + 2x; the fitted value ŷ_i lies on the line, the residual e_i is the vertical distance y_i - ŷ_i, and the slope is the change in Y per 1 unit of X]
[Figure: scatter plot of yearly income from employment against number of years spent in paid work]
[Figure: the same scatter plot with the fitted regression line; a is the intercept, b the change in ŷ per 1 unit of x]
Regression line: ŷ_i = a + b·x_i = 51375 + 693·x_i
[Figure: scatter plot with regression line ŷ_i = a + b·x_i = 51375 + 693·x_i]
Estimated regression equation: y_i = a + b·x_i + e_i
Components of the (linear) regression equation
Estimated regression equation: y_i = a + b·x_i + e_i
- y: dependent variable
- x: independent variable(s) (predictor(s), regressor(s))
- a: intercept (predicted value of y if x = 0)
- b: regression coefficient (slope), measure of the effect of x on y
- e: part of y not explained by x (residual), due to omitted variables, measurement errors, stochastic shocks (disturbance)
Scales of independent variables
- Continuous variables: linear
- Binary variables (dummy variables): (0, 1) (e.g. female=1, male=0)
- Ordinal or categorical variables (n categories): create n-1 dummy variables (base category)
Example: educational levels
1 low educational level
2 intermediate educational level
3 high educational level
-> include 2 dummy variables in the regression model
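Creating the two dummies could be sketched like this (the variable names educlevel, midedu, and highedu are assumptions):

```stata
* Sketch: n-1 dummies for a 3-category education variable
gen midedu  = (educlevel == 2) if !missing(educlevel)
gen highedu = (educlevel == 3) if !missing(educlevel)
reg salary midedu highedu      // level 1 ("low") is the base category
```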
Example: multivariate regression
Including other covariates: regression coefficients represent the portion of y explained by an x that is not explained by the other x's
Example: gender wage gap (sample: full-time employed, yearly salary between 20'000 and 200'000 CHF)
Bivariate model: y = a + b·x + e_i
salary = 98'790 - 17'737·female + e_i
Multivariate model: y = a + b1·x1 + b2·x2 + b3·x3 + ... + e_i

                                   b
constant                           45'369
female                             -9'900
education (ref.: compulsory)
  secondary education              9'197
  tertiary education               30'786
supervision                        17'128
financial sector                   15'592
number of years in paid work       729
Assumptions for OLS estimation: coefficients
(necessary to calculate slope coefficients)
1) No perfect multicollinearity (none of the regressors can be written as a linear function of the other regressors)
2) E(e) = 0
3) None of the x is correlated with e: Cov(x,e) = 0 (all x's are exogenous)
If assumptions 1-3 hold: OLS is consistent (regression coefficients asymptotically unbiased)
Inference from linear regression I
Inference from OLS estimations if random sample
But: OLS coefficients are estimates
- Estimated regression equation: y_i = a + b·x_i + e_i
- True regression equation: y_i = α + β·x_i + ε_i
True coefficients (α, β) unknown, true "error term" unknown
Sampling distribution of the coefficients (a, b): b is distributed around E(b) = β with variance Var(b) = σ_β²
[Figure: sampling distribution of b, centered at E(β) with spread σ_β]
Inference from linear regression II
Var(b) = σ_β² = σ̂² / Σ_i (x_i - x̄)²,   where σ̂² = Σ_i e_i² / (n - p)
Variation of b (σ_β²) decreases if
- n increases
- the x are more spread out
- the squared residuals decrease
Distribution of b: Student t-distribution
- depends on n and the number of x's
- = normal distribution if n is large
Inference from linear regression: testing whether b ≠ 0
If β = 0 (in the population), there is no relationship between x and y
-> test how likely it is that β = 0
H0: β = 0
The distribution of b if β = 0 gives critical values for the coefficients
Standardize the coefficient: t-value = b / σ̂_b
Compare the estimated coefficient with the critical value: if abs(t) > abs(critical value), b is significant
Critical value for the standardized normal distribution and 95% confidence level: 1.96
Inference from linear regression: example
[Figure: scatter plot of yearly income from employment against number of years spent in paid work, with fitted line]
Regression line: ŷ_i = a + b·x_i, example: ŷ_i = 51375 + 693·x_i
Inference from linear regression: example

Sample n=53
           |    Coef.   St.e.    t      P>|t|   [95% Conf. Interval]
-----------+--------------------------------------------------------
years work |    692.6   289.1    2.40   0.020    112.1      1273.0
     _cons |  51375.9  7340.4    7.00   0.000   36639.4    66112.5
R²: 0.101

Sample n=1787
           |    Coef.   St.e.    t      P>|t|   [95% Conf. Interval]
-----------+--------------------------------------------------------
years work |    931.4    50.6   18.40   0.000    832.1      1030.7
     _cons |  51218.6  1271.2   40.29   0.000   48725.5    53711.7
R²: 0.159
Inference: assumptions
Assumptions on the error terms
- Independence of the error terms, no autocorrelation: Cov(ε_i, ε_k) = 0 for all i, k with i ≠ k
- Constant error variance: Var(ε_i) = σ_ε² for all i (homoscedasticity)
- Preferentially: e is normally distributed
In matrix form: the covariance matrix of the error terms is diagonal, with σ² on the diagonal and 0 elsewhere (Var(ε) = σ²·I)
Autocorrelation
With autocorrelation, the covariance matrix of the error terms has non-zero off-diagonal elements (Cov(ε_i, ε_k) ≠ 0 within clusters); without autocorrelation it is diagonal
Reason: nested observations (e.g. households, schools, time, communities)
-> standard errors underestimated
-> OLS, adjust standard errors
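In Stata, adjusting the standard errors for clustering might be sketched as follows (the variable names are assumptions):

```stata
* Sketch: OLS with standard errors clustered within individuals
reg workhours lnwage female, vce(cluster idpers)
```

The coefficients are identical to plain OLS; only the standard errors change.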
Heteroskedasticity
With heteroskedasticity, the error variance is not constant: the diagonal elements σ_i² of the error covariance matrix differ across observations (under homoskedasticity they are all equal to σ²)
-> standard errors overestimated or underestimated
-> OLS, adjust standard errors (White standard errors)
-> weighted least squares (WLS)
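A sketch of the two remedies (White/robust standard errors; WLS with an assumed weight variable):

```stata
* Sketch: heteroskedasticity-robust (White) standard errors
reg workhours lnwage female, vce(robust)

* Sketch: weighted least squares via analytic weights (weight variable w assumed)
reg workhours lnwage female [aweight = w]
```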
Summary: assumptions of OLS regression
General
- Continuous dependent variable
- Random sample
Coefficient estimation (violations -> coefficients biased / inconsistent)
- No perfect multicollinearity
- E(e) = 0
- No endogeneity, Cov(x,e) = 0: omitted variables, measurement error in independent variables, simultaneity, nonlinearity in parameters
Inference (violations -> standard errors of coefficients biased)
- No autocorrelation: Cov(e_i, e_k) = 0
- Constant variance (no heteroskedasticity)
- Preferentially: residuals normally distributed
Endogeneity
Traditional meaning: a variable is determined within the model
Econometrics: any situation where an explanatory variable is correlated with the residual
If a variable is endogenous: care with interpretation, the model cannot be interpreted as causal
Endogeneity: reasons
- Omitted variables
- Measurement error (in explanatory variables)
- Simultaneity
- Nonlinearity in parameters
- x contains lagged values of y (see later, dynamic models)
Endogeneity: consequences and detection
Consequence of endogeneity
- ALL estimators may be biased! (exception: the coefficient of a variable that is completely exogenous after controlling for the endogenous variable)
Detection of endogeneity
- Difficult to detect and correct!
- Caution with causal interpretation
- Theory, literature (variable selection and interpretation)!
Endogeneity: correction
- Test for nonlinear relationships and interactions
Omitted variables
- Proxy variables, instrumental variables (2SLS estimation)
- Panels: FE models (within estimators), difference-in-differences models
- Propensity score analysis, regression discontinuity, Heckman selection models, ...
Simultaneity
- Structural equation modelling
- Panel data for time ordering
Theory, literature (variable selection and interpretation)!
Example: control of unobserved heterogeneity
Example: effect of partnership on happiness
happiness = e(partnership)
We know (from the literature, e.g.): happiness = f(attractiveness, leisure activities, health, ...) [not measured!]
Similarly: partnership = g(attractiveness, leisure activities, health, ...)
Cross-sectional data
- happiness = e(partnership), but we know that there are positive effects on partnership from income, fitness, attractiveness, health, ...
-> which part of the effect is due to attractiveness, leisure activities, health, ...?
Panel data
1. Happiness (at times with a partner)
2. Happiness (at times without a partner)
of the same individuals
Care: reverse causality, time-dependent unmeasured effects!
Example: regression analysis using Stata
Sample: individuals in employment, 20 to 60 years
Dependent variable: working hours per week (paid work)
Independent variables: hourly wage, number of children, married, sex, age

Summary statistics
sum workhours wage age8 married_nokid onekid twokids threepkids ///
    if workhours>0 & age8>=20 & age8<=60

      Variable |  Obs    Mean      Std. Dev.   Min        Max
---------------+-----------------------------------------------
     workhours |  3712   36.97117  14.74141    1          98
          wage |  3199   44.58235  40.1537     .2325581   150
          age8 |  3922   42.25268  10.65961    20         60
married_nokids |  3922   .2353391  .4242647    0          1
        onekid |  3922   .1573177  .3641465    0          1
       twokids |  3922   .1932687  .3949123    0          1
   three+ kids |  3922   .0739419  .261796     0          1
Example: check hourly wages
graph box wage
[Figure: box plot and histogram of gross hourly wage, showing a strongly right-skewed distribution with outliers]
Example: transform hourly wages
histogram wage, freq
histogram lnwage, freq
[Figure: histograms of gross hourly wage and of lnwage; the log transformation makes the distribution approximately symmetric]
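Creating the transformed regressors used in the model below might be sketched as (variable names lnwage and lnwagesq follow the output below):

```stata
* Sketch: log-transform the right-skewed wage variable
gen lnwage   = ln(wage)
gen lnwagesq = lnwage^2    // quadratic term for a nonlinear wage effect
```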
reg workhours female lnwage lnwagesq onekid twokids threepkids ///
    married_nokid agerec if workhours>0 & age8>20 & age8<=60

      Source |       SS       df       MS              Number of obs =    3069
-------------+------------------------------           F(  8,  3060) =  216.53
       Model |  213427.567     8  26678.4459           Prob > F      =  0.0000
    Residual |  377024.505  3060  123.21459            R-squared     =  0.3615
-------------+------------------------------           Adj R-squared =  0.3598
       Total |  590451.572  3068  192.45488            Root MSE      =   11.10

------------------------------------------------------------------------------
   workhours |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -12.41099     .41451   -29.94   0.000    -13.22372   -11.59826
      lnwage |   16.26171    2.97445     5.47   0.000      10.4338     22.0934
    lnwagesq |  -1.371871   .4193693    -3.27   0.001    -2.194145    -.549597
      onekid |    -5.1246   .6194891    -8.27   0.000    -6.335117     -3.9583
     twokids |  -6.561456   .5773265   -11.37   0.000    -7.693443   -5.429469
  threepkids |    -7.2494    .857949    -8.40   0.000     -8.88732   -5.522776
married_no~d |  -3.181037    .593523    -5.36   0.000    -4.344781   -2.017293
      agerec |   -.426564    .214263    -1.99   0.047     -.846678    -.006449
       _cons |   6.083153   5.289422     1.15   0.250     -4.28826    16.45433
------------------------------------------------------------------------------
Diagnostic plot: heteroskedasticity
[Figure: standardized residuals plotted against fitted values; the spread of the residuals varies with the fitted values, indicating heteroskedasticity]
Diagnostic plot: normal distribution of residuals
[Figure: density of the standardized residuals and normal quantile plot (standardized residuals against normal scores)]
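Such diagnostic plots might be produced after the regression as follows (the created variable names are assumptions):

```stata
* Sketch: residual diagnostics after reg
predict yhat, xb               // fitted values
predict rstd, rstandard        // standardized residuals
scatter rstd yhat, yline(0)    // heteroskedasticity check
qnorm rstd                     // normality check
estat hettest                  // Breusch-Pagan test for heteroskedasticity
```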
Regression with panel data: data structure

Wide data format
idpers  wage04  wage05  wage06  wage07
411     1319    1773    1134    12247
4211    6318    695     .       .
5612    35473   .       414     455

Long data format (person-period file, pooled data)
idpers  year  wage
411     2004  1319
411     2005  1773
411     2006  1134
411     2007  12247
4211    2004  6318
4211    2005  695
5612    2004  35473
5612    2006  414
5612    2007  455
OLS with pooled panel data: problems I
OLS for cross-sectional analysis (one wave): no particular problem!
OLS for pooled data (different years in one file)
- Problem: the assumption of independent observations is violated (autocorrelation)
- Possible correction: correct for clustering in the error terms (coefficients unaffected)
- But: OLS is not the best estimator for pooled data (not efficient)

number of working             OLS              OLS, cluster in se
hours per week                b       t        b       t
lnwage                        17.44   (13.76)  17.44   (7.85)
lnwage squared                -1.71   (-9.57)  -1.71   (-5.55)
female                        -4.48   (-15.34) -4.48   (-11.4)
1 child                       .78     (2.5)    .78     (1.53)
female*1 child                -10.72  (-20.43) -10.72  (-14.0)
2 children                    1.74    (4.91)   1.74    (3.84)
female*2 children             -16.36  (-33.4)  -16.36  (-16.36)
3+ children                   2.85    (5.88)   2.85    (2.85)
female*3+ children            -18.8   (-26.73) -18.8   (-18.84)
married, no child             2.4     (5.63)   2.4     (4.53)
female*married, no child      -9.56   (-19.95) -9.36   (-13.63)
age                           -.6     (-6.89)  -.6     (-4.39)
OLS with panel data: problems II
OLS does not take advantage of the panel structure
Two different types of variation in panel data
- Variation within individuals
- Variation between individuals
Control for unobservable variables (stable personal characteristics)
- Fixed effects models (only within variation)
- Random effects models (multilevel / random intercept / frailty for event history)
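The corresponding Stata calls might be sketched as follows (the variable names are assumptions):

```stata
* Sketch: exploiting the panel structure instead of pooled OLS
xtset idpers year
xtreg workhours lnwage female, fe   // fixed effects: time-constant female is dropped
xtreg workhours lnwage female, re   // random effects: between + within variation
```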
Comparisons of different regression models

number of working          OLS              OLS, cluster     Random effects   Fixed effects
hours per week             b       t        b       t        b       t        b      t
lnwage                     17.44   (13.76)  17.44   (7.85)   15.51   (14.2)   13.3   (10.26)
lnwage squared             -1.71   (-9.57)  -1.71   (-5.55)  -1.84   (-12.3)  -1.75  (-9.91)
female                     -4.48   (-15.34) -4.48   (-11.4)  -6.59   (-16.19) dropped
1 child                    .78     (2.5)    .78     (1.53)   1.33    (3.31)   .22    (.39)
female*1 child             -10.72  (-20.43) -10.72  (-14.0)  -10.49  (-19.0)  -6.8   (-7.76)
2 children                 1.74    (4.91)   1.74    (3.84)   1.84    (4.23)   .19    (.3)
female*2 children          -16.36  (-33.4)  -16.36  (-16.36) -13.92  (-23.3)  -7.62  (-8.17)
3+ children                2.85    (5.88)   2.85    (2.85)   2.55    (4.23)   -.23   (.26)
female*3+ children         -18.8   (-26.73) -18.8   (-18.84) -16.87  (-19.28) -8.87  (-6.73)
married, no child          2.4     (5.63)   2.4     (4.53)   1.79    (4.23)   .21    (.36)
female*married, no child   -9.56   (-19.95) -9.36   (-13.63) -6.69   (-11.71) -.54   (-.63)
age                        -.6     (-6.89)  -.6     (-4.39)  -.3     (-2.42)  .12    (3.2)
constant                   1.86    (.83)    1.86    (.46)    1.96    (5.54)   14.58  (2.32)
R squared                  .41              .41              .40              .24
5 Introducing fixed effects models ("within" effects)
Example: BMI after stopping smoking
Hypothesis: BMI increases after stopping smoking
Hypothetical data: random sample of former smokers with years after stopping and BMI at that time, for 3 individuals

time  bmi1  bmi2  bmi3
0     26    .     .
1     26    24    20
2     27    24    21
3     28    25    22
4     29    25    22
5     31    26    24
6     .     26    24
7     .     .     25
BMI after stopping smoking: pooled OLS
[Figure: scatter of BMI against years after stopping smoking, with the pooled OLS fitted line]
5-3
Pooled regression
. * pooled regression:
. reg bmi time

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    0.26
       Model |  1.70930233     1  1.70930233           Prob > F      =  0.6149
    Residual |  103.901809    16  6.49386305           R-squared     =  0.0162
-------------+------------------------------           Adj R-squared = -0.0453
       Total |  105.611111    17  6.21241830           Root MSE      =  2.5483

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |   .1627907   .3173012     0.51   0.615    -.5098578    .8354392
       _cons |   24.70801   1.262577    19.57   0.000     22.03147    27.38455
------------------------------------------------------------------------------

No significant BMI increase over time
5-4
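The pooled slope can be checked by hand. A minimal Python sketch (illustration only; the course material uses Stata), with the toy data of the example hard-coded:

```python
# Hypothetical BMI data from the example: (person, years since stopping, BMI)
data = [
    (1, 0, 26), (1, 1, 26), (1, 2, 27), (1, 3, 28), (1, 4, 29), (1, 5, 31),
    (2, 1, 24), (2, 2, 24), (2, 3, 25), (2, 4, 25), (2, 5, 26), (2, 6, 26),
    (3, 2, 21), (3, 3, 22), (3, 4, 22), (3, 5, 24), (3, 6, 24), (3, 7, 25),
]

def ols_slope(xs, ys):
    """Simple bivariate OLS slope: Sxy / Sxx."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

t = [row[1] for row in data]
bmi = [row[2] for row in data]
print(round(ols_slope(t, bmi), 4))  # 0.1628: essentially no BMI increase
```

The pooled slope ignores who the observations belong to, which is exactly the problem the next slides address.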
BMI after stopping smoking: individual data
[Figure: BMI against years after stopping smoking for P1, P2, P3, with individual OLS fits and the pooled OLS line]
Autocorrelation with pooled regression: unobserved individual heterogeneity
5-5
Individual regressions
. forval j=1/3 {     /* loop over each individual */
  2.   reg bmi`j' time
  3. }

        bmi1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |          1   .1380131     7.25   0.002     .6168142    1.383186
       _cons |   25.33333   .4178554    60.63   0.000     24.17318    26.49349
------------------------------------------------------------------------------

        bmi2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |   .4571429   .0699854     6.53   0.003     .2628322    .6514535
       _cons |       23.4   .2725541    85.85   0.000     22.64327    24.15673
------------------------------------------------------------------------------

        bmi3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |         .8   .1069045     7.48   0.002     .5031855    1.096814
       _cons |       19.4   .5145502    37.70   0.000     17.97138    20.82862
------------------------------------------------------------------------------

All individuals have a significant BMI increase over time
5-6
Excursus: unobserved heterogeneity
Omitted variable bias: many individual characteristics are not observed, e.g. enthusiasm, ability, willingness to take risks; in our example: physical activity, calorie intake, muscle mass, genes, cohort.
These generally have an effect on the dependent variable and are correlated with the independent variables. Then the regression coefficients will be biased!
Note: these (formerly) unobserved measures are increasingly included in surveys
5-7
What about the between-effect?
[Figure: BMI against years after stopping smoking for P1, P2, P3, with individual OLS fits and the pooled OLS line]
5-8
Which models are appropriate to analyze the effects of time? Data transformation necessary 5-9
Panel data and "within" regression (FE) 5-10
Error components in panel data models
We separate the error components: e_it = u_i + ε_it
- u_i = person-specific unobserved heterogeneity (level) = fixed effect (e.g., physical activity, calorie intake, genetics, cohort)
- ε_it = residual
Model: bmi_it = β0 + β1·x_it + u_i + ε_it
Remember: pooled OLS assumes that x is correlated with neither error component, u_i nor ε_it (otherwise: omitted variable bias)
5-11
Fixed effects regression
We can eliminate the fixed effects u_i by estimating them as person-specific dummies -> only within-variation remains
This corresponds to de-meaning for each individual:
(1) bmi_it = β0 + β1·x_it + u_i + ε_it
(2) individual mean: bmi‾_i = β0 + β1·x‾_i + u_i + ε‾_i
Subtract (2) from (1): bmi_it − bmi‾_i = β1·(x_it − x‾_i) + (ε_it − ε‾_i)
-> the fixed (all time-invariant) effects u_i disappear, i.e. time-constant unobserved heterogeneity is eliminated
5-12
De-meaned values with OLS regression
[Figure: left, raw BMI per person; right, de-meaned BMI against de-meaned years after stopping smoking, with a single OLS fit through the origin]
5-13
OLS on individually de-meaned data
We de-mean the data and regress:
. bysort id: egen bmi_m=mean(bmi)
. gen bmi_dem=bmi-bmi_m
. bysort id: egen time_m=mean(time)
. gen time_dem=time-time_m
. reg bmi_dem time_dem

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =   92.98
       Model |  29.7190476     1  29.7190476           Prob > F      =  0.0000
    Residual |  5.11428571    16  .319642857           R-squared     =  0.8532
-------------+------------------------------           Adj R-squared =  0.8440
       Total |  34.8333333    17  2.04901961           Root MSE      =  .56537

------------------------------------------------------------------------------
     bmi_dem |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    time_dem |    .752381   .0780284     9.64   0.000     .5869681    .9177938
       _cons |  -2.12e-07   .1332589    -0.00   1.000    -.2824965    .2824961
------------------------------------------------------------------------------
5-14
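The de-meaning steps are easy to replicate outside Stata. A Python sketch (illustration only), again with the toy data hard-coded:

```python
from collections import defaultdict

# (person, time, bmi) observations from the hypothetical example
data = [
    (1, 0, 26), (1, 1, 26), (1, 2, 27), (1, 3, 28), (1, 4, 29), (1, 5, 31),
    (2, 1, 24), (2, 2, 24), (2, 3, 25), (2, 4, 25), (2, 5, 26), (2, 6, 26),
    (3, 2, 21), (3, 3, 22), (3, 4, 22), (3, 5, 24), (3, 6, 24), (3, 7, 25),
]

# person-specific means of time and bmi
groups = defaultdict(list)
for pid, t, y in data:
    groups[pid].append((t, y))
means = {pid: (sum(t for t, _ in obs) / len(obs),
               sum(y for _, y in obs) / len(obs))
         for pid, obs in groups.items()}

# de-mean, then OLS through the origin: the within (FE) estimator
t_dem = [t - means[pid][0] for pid, t, y in data]
y_dem = [y - means[pid][1] for pid, t, y in data]
beta_within = (sum(td * yd for td, yd in zip(t_dem, y_dem))
               / sum(td ** 2 for td in t_dem))
print(round(beta_within, 6))  # 0.752381
```

This matches the `time_dem` coefficient in the Stata output above: only within-person variation is used.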
Direct modeling of fixed effects in Stata: xtreg bmi time, fe
(calculates the correct df; this causes higher Std. Err.)
. xtreg bmi time, fe

Fixed-effects (within) regression               Number of obs      =        18
Group variable: id                              Number of groups   =         3
R-sq:  within  = 0.8532                         Obs per group: min =         6
       between = 0.0992                                        avg =       6.0
       overall = 0.0162  (from pooled OLS)                     max =         6
                                                F(1,14)            =     81.35
corr(u_i, Xb)  = -0.0431                        Prob > F           =    0.0000

(time since stopping smoking explains part of the individual heterogeneity!)

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |    .752381   .0834159     9.02   0.000     .5734716    .9312903
       _cons |   22.64444   .3248582    69.71   0.000     21.94769    23.34120
-------------+----------------------------------------------------------------
     sigma_u |  3.1781651
     sigma_e |  .6044059
         rho |  .9650965   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(2, 14) =   137.55             Prob > F = 0.0000
5-15
Alternative: OLS with individual dummies controlled. xi i.id, noomit. reg bmi time _I*, noconst Source SS df MS Number of obs = 18 -------------+------------------------------ F( 4, 14) = 7939.84 Model 1161.8857 4 29.47143 Prob > F =. Residual 5.11428571 14.36536122 R-squared =.9996! -------------+------------------------------ Adj R-squared =.9994 Total 1167 18 644.833333 Root MSE =.6441 ------------------------------------------------------------------------------ bmi Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- time.752381.834159 9.2..5734716.931293 _Iid_1 25.95238.323684 8.33. 25.25947 26.64529 _Iid_2 22.36667.3822597 58.51. 21.5468 23.18653 _Iid_3 19.61429.449284 43.66. 18.6583 2.57774 ------------------------------------------------------------------------------ useful for small N, the u i are estimated (only approximate) 5-16
Summary: fixed effects estimation
FE estimation can solve the problem of unobserved heterogeneity. But:
- if the number of groups is large, many extra parameters (dummy approach)
- enough variance needed in the data
- with FE regressions, estimation of time-constant covariates is not possible; they are dropped from the model
- but: interactions can be used (like male*nrchildren)
- what about comparing with people who never stopped smoking, or who never smoked? (later)
5-17
No identification of time-invariant covariates z_i
Consider the model:
y_it = a·z_i + b·x_it + u_i + ε_it   (1)
Let γ be an arbitrary number; add and subtract γ·z_i on the rhs:
y_it = (a·z_i + γ·z_i) + b·x_it + (u_i − γ·z_i) + ε_it
and rewrite this as:
y_it = a*·z_i + b·x_it + u_i* + ε_it, with a* = a + γ and u_i* = u_i − γ·z_i   (2)
But (1) and (2) have exactly the same form, so it is not clear whether a or a* = a + γ is estimated
-> the effects of a·z_i and u_i cannot be distinguished without further assumptions (e.g., no correlation between z_i and u_i)
5-18
Example: DiD (control group comparison)
Hypothesis: financial support increases the BMI of low-income women (Schmeisser 2009)*
(Hypothetical) experiment: sample of low-income female patients of doctors' surgeries, randomized into a social aid program (50% in the program and 50% not)
4 measurements of BMI (2 before program start, 2 after)
*Expanding wallets and waistlines: the impact of family income on the BMI of women and men eligible for the earned income tax credit. Health Econ. 18: 1277-1294
5-19
Data: BMI increase through social aid?
[Figure: BMI of low-income women over 4 time points, aid vs. no-aid group; the social aid program starts between t2 and t3]
Effects:
- causal effect: BMI increase due to more fast food (more money available)
- time (age) effect: BMI increases with age
5-20
Pooled regression
. reg bmi aid

      Source |       SS       df       MS              Number of obs =      16
-------------+------------------------------           F(  1,    14) =    0.89
       Model |  7.52083333     1  7.52083333           Prob > F      =  0.3627
    Residual |  118.916667    14  8.49404762           R-squared     =  0.0595
-------------+------------------------------           Adj R-squared = -0.0077
       Total |  126.437500    15  8.42916667           Root MSE      =  2.9145

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         aid |   1.583333   1.682661     0.94   0.363    -2.025616    5.192283
       _cons |   25.41667   .8413374    30.21   0.000     23.61219    27.22114
5-21
Fixed effects: xtreg bmi aid, fe
. xtreg bmi aid, fe

Fixed-effects (within) regression               Number of obs      =        16
Group variable: id                              Number of groups   =         4
R-sq:  within  = 0.3596                         Obs per group: min =         4
       between = 0.0054                                        avg =       4.0
       overall = 0.0595                                        max =         4
                                                F(1,11)            =      6.18
corr(u_i, Xb)  = -0.074                         Prob > F           =    0.0303

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         aid |          2   .8048151     2.49   0.030     .2286140    3.771386
       _cons |    25.3125   .3484951    72.63   0.000     24.54547    26.07953
-------------+----------------------------------------------------------------
     sigma_u |  2.966798
     sigma_e |  1.138184
         rho |  .871241   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(3, 11) =    26.93             Prob > F = 0.0000
5-22
One step back: a causal model
With cross-sectional data: only between-estimation:
Y^T(reatment)_i,t − Y^C(ontrol)_j,t
Crucial assumption: random sample (no unobserved heterogeneity)
With panel data I: within-estimation (before and after):
Y^T_i,t+1 − Y^C_i,t
Problem: time effects, panel conditioning
With panel data II: difference-in-differences (DiD):
(Y^T_i,t+1 − Y^C_i,t) − (Y^C_j,t+1 − Y^C_j,t)
5-23
Now: the causal effect of aid
We have:
- before-after comparison (within)
- treatment and control groups (between)
We compare the within-effect with aid ("treatment") to that without aid ("control"), i.e., we calculate the treatment effect and control for time.
DiD estimator: Δ = (after_aid − before_aid) − (after_noaid − before_noaid)
- after_aid = (28+29+25+26)/4 = 27
- before_aid = (26+26+24+24)/4 = 25
- after_noaid = (30+31+23+23)/4 = 26.75
- before_noaid = (27+28+21+22)/4 = 24.5
Δ = (27 − 25) − (26.75 − 24.5) = 2 − 2.25 = −0.25
5-24
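The same arithmetic as a Python sketch (illustration only; the individual measurements below are the hypothetical values behind the four group means shown on the slide):

```python
# Two BMI measurements before and two after program start, per group
after_aid    = [28, 29, 25, 26]
before_aid   = [26, 26, 24, 24]
after_noaid  = [30, 31, 23, 23]
before_noaid = [27, 28, 21, 22]

def mean(xs):
    return sum(xs) / len(xs)

# DiD = (change in treatment group) - (change in control group)
did = (mean(after_aid) - mean(before_aid)) - (mean(after_noaid) - mean(before_noaid))
print(did)  # -0.25
```

The result matches the `aid` coefficient in the FE regression with time dummies on the next slide: the raw before-after change of +2 in the aid group is almost entirely a time effect.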
DiD effects
. xi i.time, noomit
. xtreg bmi aid _I*, fe
note: _Itime_4 omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =        16
Group variable: id                              Number of groups   =         4
R-sq:  within  = 0.8876                         Obs per group: min =         4
       between = 0.0054                                        avg =       4.0
       overall = 0.1528                                        max =         4
                                                F(4,8)             =     15.80
corr(u_i, Xb)  = -0.055                         Prob > F           =    0.0007

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         aid |       -.25    .5590170   -0.45   0.667    -1.539096    1.039096
    _Itime_1 |     -2.875    .4841229   -5.94   0.000    -3.991389   -1.758611
    _Itime_2 |     -2.375    .4841229   -4.91   0.001    -3.491389   -1.258611
    _Itime_3 |       -.75    .3952847   -1.90   0.094    -1.661528     .1615282
    _Itime_4 |  (omitted)
       _cons |     27.375    .3952847   69.25   0.000     26.46347    28.28653
-------------+----------------------------------------------------------------
     sigma_u |  2.952753
     sigma_e |  .5591699
         rho |  .96539792   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(3, 8) =    111.7              Prob > F = 0.0000
5-25
Summary: within estimators
FE: easier, no control group needed
DiD: the control group can control for simultaneous effects (like time): find a "statistical twin", such that the research variable is the only difference
-> DiD useful for estimating causal effects from nonexperimental data, especially for small samples
Excursus: first-difference (FD) estimators: differencing adjacent observations eliminates stable characteristics. Problem: level differences are not taken into account (6−5 children = 1−0 children); for lasting effects (like children) FD is not useful, because only immediate changes are taken into account
5-26
6 Introducing Random Effects Models 6-1
Towards RE: error components in panel models
We have both within- and between-variance:
y_it = a + e_it = a + u_i + ε_it
fixed effects = within: one u_i for each person (cf. ANOVA)
two residuals:
- ε_it on the lowest (1st) level: time points
- u_i on the highest (2nd) level: individuals
6-2
Motivation: multilevel models
Necessary if data have different levels, with
- observations not independent within levels
- true social interactions
Examples:
- Schools: classes, students (the first applications)
- Networks: people are influenced by their peers
- Spatial context: effects of the environment (e.g., poor people are less happy if they live in a rich environment); US: neighborhood effects
- Interviewer effects: respondents clustered in interviewers
- Panel surveys: waves clustered in respondents (households)
6-3
Levels in clustered data
Hierarchical:
- households in neighborhoods
- students in classes in schools (three levels)
- respondents in interviewers
- panel surveys: waves in respondents (crossed?)
Crossed:
- questions in respondents
Attention: do not confuse variables and levels: (total) variance can be attributed to levels, not to variables! E.g., 100 hospitals are probably a level, 7 nations are probably dummy variables. Think of a population the sample represents!
Note: "30" rule of thumb for contexts: at least 30 second-level units, randomly chosen
6-4
Multilevel models: analytic advantages
Improved regression models:
- unbiased estimators for regression coefficients
- unbiased estimators for standard errors (usually higher std. err. than OLS)
- model the true covariance structure (autocorrelation, heteroscedasticity, ...)
Decomposition of total variance into the different levels (within-between)
Similar to analysis of variance (ANOVA), but parsimonious (the number of estimated parameters is independent of the number of contexts); can handle a large number of contexts
(In part) modeling of unobserved heterogeneity / self-selection
6-5
Typical results: one-level vs. multilevel model
[Table: one-level vs. multilevel estimates of a dependent variable by school type (mixed school, boys' school, girls' school); details not recoverable]
Underestimated variance: Kish design effect "deff": larger N necessary
6-6
Illustration: variance decomposition between levels
Example: 2 individuals, each asked twice about their happiness (continuous); here the measurements are not time-ordered! Which variance is due to individuals, which to observations?
Observations: individual 1: 3 and 2; individual 2: −1 and −4
Total ("grand") mean = (3 + 2 + (−1) + (−4)) / 4 = 0
6-7
Illustration: calculation of the total variance
T_yy = Σ_i Σ_t (y_it − y‾)²   (total)
W_yy = Σ_i Σ_t (y_it − y‾_i)²   (within)
B_yy = Σ_i T_i (y‾_i − y‾)²   (between)
The total variance is the sum of squared deviations of all observations from the total mean, divided by the sample size (4):
= {(3−0)² + (2−0)² + (−1−0)² + (−4−0)²} / 4 = (9+4+1+16)/4 = 7.5
6-8
Illustration: individual-specific variance ("between")
Individual means: 2.5 (individual 1) and −2.5 (individual 2); total mean = 0
Variance of the individual means = "between" variance ("between" = σ_u²):
(2.5² + (−2.5)²) / 2 = 6.25   (remember: total variance = 7.5)
6-9
Illustration: measurement-specific variance ("within")
Variance of the measurements within individuals = "within" variance (later: within = σ_ε²):
((3−2.5)² + (2−2.5)² + (−1−(−2.5))² + (−4−(−2.5))²) / 4 = 1.25   (≈17% of total variance)
Variance of the individual means = (2.5² + (−2.5)²) / 2 = 6.25   (≈83% of total variance = ρ = ICC, intra-class correlation)
6-10
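The whole decomposition fits in a few lines. A Python sketch (illustration only) with the values from the happiness example:

```python
# Happiness measurements: two per individual (from the illustration)
obs = {1: [3, 2], 2: [-1, -4]}

all_vals = [y for ys in obs.values() for y in ys]
n = len(all_vals)
grand_mean = sum(all_vals) / n                                   # 0.0

total_var = sum((y - grand_mean) ** 2 for y in all_vals) / n     # 7.5

# between: variance of the individual means around the grand mean
means = {i: sum(ys) / len(ys) for i, ys in obs.items()}
between = sum(len(ys) * (means[i] - grand_mean) ** 2
              for i, ys in obs.items()) / n                      # 6.25

# within: variance of the measurements around their individual means
within = sum((y - means[i]) ** 2
             for i, ys in obs.items() for y in ys) / n           # 1.25

icc = between / total_var
print(total_var, between, within, round(icc, 3))  # 7.5 6.25 1.25 0.833
```

Total variance = between + within (7.5 = 6.25 + 1.25), and the ICC is the between share of the total.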
Examples of the ICC
- ICC = 0: no clustering -> use OLS
- ICC ≈ 0.8: strong clustering
- ICC ≈ 0.2: weak clustering
- ICC = 1: maximum clustering
[Figure: illustrative scatter patterns for the different ICC values]
6-11
Starting point: the null ("variance components", VC) model
y_it = u_i + ε_it
where:
- u_i = individual-specific random variable, deviation from the overall mean (N(0, σ_u²) assumed)
- ε_it = deviation from the individual-specific mean (N(0, σ_ε²) assumed)
(note: no intercept a in the VC model)
The VC model allows for variance decomposition:
ρ = correlation between different time points t within an individual i:
ρ = σ_u² / (σ_u² + σ_ε²)   (= ICC = intra-class correlation = autocorrelation in panels)
(note: ρ significant -> multilevel model necessary)
6-12
Idea RE: better estimate of u_i (modeling the intercept)
Random intercept ("borrowing strength" from others)
To estimate the mean of individual i (= u_i), using only within information (FE) is suboptimal if
- the sample is small (T_i small)
- the variance is high
- n is large (inefficient)
Idea: use information u_j from other sample members (between) in the same population group (e.g., education)
[Figure: intercepts for high- vs. low-education groups around the overall intercept a]
6-13
Idea RE: weighted within and between
RE regression is equivalent to pooled OLS after the transformation:
(y_it − θ·y‾_i) = β0·(1 − θ) + β1·(x_it − θ·x‾_i) + ((1 − θ)·u_i + (ε_it − θ·ε‾_i))
with θ = 1 − sqrt( σ_ε² / (σ_ε² + T·σ_u²) ), 0 ≤ θ ≤ 1 (θ = 0: pooled OLS; θ = 1: FE)
- RE uses an optimal combination of within and between variation
- RE allows estimation of time-invariant variables
- RE is biased if cov(x_i, u_i) ≠ 0, because u_i remains in the error term
6-14
Example: ρ (based on SHP)
. xtreg bmi children, re theta

Random-effects GLS regression                   Number of obs      =        18
Group variable: id                              Number of groups   =         3
R-sq:  within  = 0.3921                         Obs per group: min =         6
       between = 0.1125                                        avg =       6.0
       overall = .24                                           max =         6
                                                Wald chi2(1)       =      9.76
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =  0.0018
theta          = .84481818

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    children |    1.57256     .50338    3.12   0.002      .586055    2.558958
       _cons |   24.10428    1.84525   13.08   0.000     20.52590    27.75826
-------------+----------------------------------------------------------------
     sigma_u |  3.1963635
     sigma_e |  1.2298887
         rho |  .8713938   (fraction of variance due to u_i)
6-15
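The reported theta can be verified from sigma_u and sigma_e. A Python sketch (illustration only; values taken from the output above, T = 6 observations per person):

```python
import math

def theta(sigma_u, sigma_e, T):
    """RE quasi-demeaning weight: theta = 1 - sqrt(s_e^2 / (s_e^2 + T*s_u^2)).
    theta = 0 reduces RE to pooled OLS; theta -> 1 approaches the FE estimator."""
    return 1 - math.sqrt(sigma_e**2 / (sigma_e**2 + T * sigma_u**2))

# values reported by `xtreg bmi children, re theta`
print(round(theta(3.1963635, 1.2298887, 6), 4))  # 0.8448
```

With rho around .87, theta is close to 1: the RE estimate here leans heavily toward the within estimator.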
Random-Effects estimation 6-16
Illustration: how is a random intercept model fitted?
Model: y_it = b·x_it + u_i + ε_it with u_i ~ N(0, σ_u²), ε_it ~ N(0, σ_ε²), cov(x, u_i) = 0
Parameters to be estimated: b, σ_u², σ_ε²
Parameter estimation: IGLS (iterative generalized least squares)
- first iteration: OLS, assuming σ_u² = 0
- then iterate with the current covariance matrix (from the residuals) and GLS-estimated σ_u², σ_ε²
or Maximum Likelihood
6-17
Intuition of the multilevel (random intercept) model
[Figure: regression lines with person-specific intercepts u_i around the overall line, and residuals ε_it around each person's line]
6-18
FE versus RE models
Regression equation: y_it = α + β·x_it + u_i + ε_it
Fixed effects models:
- OLS estimated; only variance within individuals is used
- controls for unobserved heterogeneity (consistent also if cov(u, x) ≠ 0)
- effects of time-invariant characteristics cannot be estimated (e.g., sex, cohort)
- if the research interest is longitudinal
Random effects models:
- cannot be estimated with OLS
- assumes u_i ~ N(0, σ_u²)
- assumes exogeneity: cov(u, x) = 0 (no effects from unobserved variables)
- effects of time-invariant and time-varying covariates
- more efficient in unbalanced panels (additional observations for the same individual do not add information linearly!)
- if the research interest is cross-sectional
-> the method must depend on the research question (longitudinal variables: FE; true cross-sectional variables: robust OLS, RE)
-> formally: test the RE against the FE model (Hausman test)
Hausman test
Test whether FE or RE model (basic assumption: FE unbiased)
Test H0: E(u_i|x_it) = 0, i.e. cov(u_i, x_it) = 0
- cov(u_i, x_it) = 0 -> FE and RE unbiased, FE inefficient -> RE
- cov(u_i, x_it) ≠ 0 -> FE unbiased, RE biased -> FE
If H0 is true (between-coeff. = within-coeff.), no differences between the FE and RE estimation coefficients
Equivalently, Hausman compares the coefficients:
- if H0: β̂_FE and β̂_RE unbiased, but β̂_RE more efficient (var(β̂_FE) ≥ var(β̂_RE))
- if H1: β̂_FE unbiased, β̂_RE not
Note:
- H0 is almost always rejected if the sample size is high enough (even with small differences)
- the test does not replace a research-question-driven check of model appropriateness
6-20
Hausman test: example
* Hausman test: RE or FE estimate?
. qui xtreg bmi children, re
. estimates store random
. qui xtreg bmi children, fe
. estimates store fixed
. hausman fixed random

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed        random       Difference          S.E.
-------------+----------------------------------------------------------------
    children |   1.575758      1.57256        .0032511        .1473474
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test:  Ho:  difference in coefficients not systematic

        chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 0.00
        Prob>chi2 = 0.9824
6-21
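With a single coefficient the Hausman statistic is just a squared ratio, and the chi2(1) p-value has a closed form via the complementary error function. A Python sketch (illustration only, not Stata's computation), reading the difference as 0.0032511 and its standard error as 0.1473474:

```python
import math

# from `hausman fixed random`: coefficient difference (b - B) and its S.E.
b_diff = 0.0032511
se_diff = 0.1473474

# chi2(1) statistic and its p-value: P(X > x) = erfc(sqrt(x/2)) for 1 d.o.f.
chi2_stat = (b_diff / se_diff) ** 2
p_value = math.erfc(math.sqrt(chi2_stat / 2))
print(round(chi2_stat, 2), round(p_value, 4))  # 0.0 0.9824
```

The tiny statistic and p ≈ 0.98 mean H0 is not rejected here: FE and RE give essentially the same coefficient, so the more efficient RE model can be used.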
Summary: RE estimation
- RE is more efficient than FE because RE also models person-specific time-invariant effects (it uses more information)
- RE uses a combination of the within- and between-estimator
- RE gives coefficients of time-invariant regressors (between variation)
- RE is not useful for causal effects of time-variant regressors
- Main assumption of RE (and limitation vs. FE): the person-specific effect u_i must not correlate with the regressors
- The Hausman test can test this assumption
6-22
Summary: models for panel data
RE models:
- control for unobserved heterogeneity (under strong assumptions)
- unbiased standard errors (and inferences)
- decomposition of total variance into the different levels
- modeling of variance
- allow a covariance structure that is appropriate for panel data (autocorrelation, heteroscedasticity, etc.)
Longitudinal models (e.g., FE, DiD):
- full control for unobserved heterogeneity
- fixed effects models: within-variance only
- difference-in-differences (DiD): before-after comparisons (within treatment effect), controlled for observed (between) effects
- other models possible (FD)
6-23
Specifically for linear regression models
Within research questions (causal effects of time-variant variables): modeling intrapersonal change (FE models); if in addition interested in time-invariant variables: hybrid models (RE, include individual means)
Cross-sectional research questions (effects of time-invariant variables, "correlations"): OLS with robust standard errors; in unbalanced panels: RE models
Interpretation (e.g., children):
- within: effect of additional children
- between: differences between people with a different number of children
6-24
Models result example
* Data: bmi sample
estout m1 m2 m3 m4, cells(b se)

                 OLS        OLS_cluster_pers     FE           RE
working       -.5276885       -.5276885      -.2835562    -.3273556
               .12146          .165462        .774774      .71689
working_bar
working_dem
_cons          25.92231        25.92231       25.7677      25.6964
               .799125         .139252        .519713      .941936
6-25
Hybrid models ("regression with context variables")
FE is preferred for causal statements, BUT: no inclusion of time-constant variables
Idea: hybrid models run the within and between regressions simultaneously:
y_it = a + β_W·(x_it − x‾_i) + β_B·x‾_i + γ·z_i + u_i + ε_it
Estimated as an RE model
- Hybrid models separate within- and between-effects
- Hybrid models eliminate the part of unobserved heterogeneity that is correlated with the independent variables (-> unbiased estimators)
6-26
+ hybrid (1)
* Data: bmi sample
estout m1 m2 m3 m4 m5, cells(b se) stats(N r2 r2_w r2_b sigma_u sigma_e theta_max)

                 OLS        OLS_cluster_pers     FE           RE          RE (hybrid)
working       -.5276885       -.5276885      -.2835562    -.3273556
               .12146          .165462        .774774      .71689
working_bar                                                                -
working_dem                                                                -
_cons          25.92231        25.92231       25.7677      25.6964
               .799125         .139252        .519713      .941936
N              9189            9189           9189         9189
r2             .389            .389           .21589
r2_w                                          .21589       .21589
r2_b                                          .32241       .32241
sigma_u                                       4.517998     4.393536
sigma_e                                       1.586554     1.586554
theta_max                                                  .8223176
6-27
hybrid (2)
SHP 2000-2010, happiness:

                  RE (combin.)   FE (dem(x))   BE (mean(x))   hybrid (both)
nb children         .13**          -.17*          .12***
health              .391***        .288***        .954***        .393***
male                .65***                        .128***        .62**
intermed. educ.    -.89***        -.192***       -.65**         -.96***
high educ.         -.83***        -.255***       -.62*          -.88***
childmean                                                        .64***
childwithin                                                     -.16*
_cons              6.482***       7.49***        4.72***        6.438***
N                  74315          74315          74315          74315
6-28
7 Non linear regression 7-1
Non-linear regression: motivation
Linear regression requires a continuous dependent variable, e.g. BMI, income, satisfaction on a scale from 0 to 10 (?)
Most variables in social science are not continuous but discrete:
- opinions: agree vs. disagree
- poverty status
- party voted for
- number of visits to the doctor
- having a partner
We need appropriate regression models!
7-2
Non-linear models
Dependent variable is not continuous -> non-linear regression:
- Binary variables (dummy variables, 0 or 1; e.g. yes/no): logistic regression, probit regression, complementary log-log regression
- Multinomial (unordered) variables (e.g. vote choice, occupation): multinomial logistic regression, multinomial probit regression
- Ordinal (e.g. satisfaction): ordinal regression
- Count variables (e.g. doctor visits): Poisson regression, negative binomial regression
7-3
Linear probability model for binary variables
Binary variables (y = 0 or 1): E(y) = π (probability that y = 1)
Linear probability model: π = α + β1·x1 + ...
[Figure: linear fit to 0/1 outcomes against x]
7-4
Problems of the linear probability model
Violation of regression assumptions:
- The variance of y for binary variables is π(1−π): the residual variance depends on x_i -> heteroskedasticity
- The relationship between response probability and x may not be linear, especially for π close to 0 and π close to 1
- Predicted response probabilities may be negative or greater than one
- Residuals can only take two values for fixed x -> residuals are not normally distributed
7-5
S-shaped function
[Figure: S-shaped curve of P(y=1) against x, e.g. logistic regression]
- effect size depends on x
- values between 0 and 1
E.g. cumulative logistic function, cumulative normal distribution
7-6
Logistic transformation
Non-linear transformations for binary variables:
If the e_i follow a logistic distribution: logit model
Pr(y=1) = P = 1 / (1 + e^{−(α + β1·x1 + β2·x2 + ...)})
Logit = ln( P / (1 − P) ) = α + β1·x1 + β2·x2 + ...
If the e_i follow a normal distribution: probit model
Pr(y=1) = Φ(α + βx), where Φ is the cdf of the standard normal distribution
For practical purposes, logit and probit models are equivalent for statistical analysis (very similar predicted probabilities)
7-7
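The practical equivalence of logit and probit can be seen numerically: rescaling the index by a factor of roughly 1.7 makes the two cdfs nearly coincide. A Python sketch (illustration only; the 1.7 factor is the usual rule of thumb, not from the slides):

```python
import math

def logit_prob(z):
    """Logistic cdf: Pr(y=1) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_prob(z):
    """Standard normal cdf, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# logit coefficients are roughly 1.6-1.8 times the probit coefficients
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, round(logit_prob(z), 3), round(probit_prob(z / 1.7), 3))
```

Across the whole range the two predicted probabilities differ only in the third decimal, which is why the choice between logit and probit rarely matters in practice.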
Generalised linear model
A latent (unobserved) continuous variable y* underlies the observed data:
y* = a + b1·x1 + ... + e*
Assume y_i* is generated by a linear regression structure
Link function between y and y*: E(y) = g⁻¹(y*) = g⁻¹(a + b1·x1 + ... + e*)
The choice of the link function depends on the distribution of e*, e.g. logit, probit, poisson, negative binomial
The scale of y* is arbitrary; the variance of e_i* is assumed:
- probit model: assume e_i* ~ N(0,1)
- logit model: assume e_i* standard logistic (with var = π²/3 ≈ 3.29)
-> in general, coefficients cannot be compared across different models
7-8
Most frequent link functions

Distribution   Link function   Inverse
Normal         identity        y = y*
Poisson        log             y = exp(y*)
Binomial       logistic        y = 1/(1+exp(−y*))
Binomial       probit          y = Φ(y*)
7-9
Maximum likelihood estimation Usually, non linear models are estimated by maximum likelihood (MLE) Advantages of MLE MLE is extremely flexible and can easily handle both linear and non linear models Desirable asymptotic properties: consistency, efficiency, normality, (consistent if missing at random MAR) Disadvantages of ML estimation Requires assumptions on distribution of residuals Desired properties hold only if model correctly specified Best suited for large samples 7-1
Parameter estimation with ML
Principle of MLE: which set of parameters has the highest likelihood of generating the data actually observed (x_i, y_i)?
The observed values (x_i, y_i) are a function of the unknown model parameters θ: L = f(x_i, y_i | θ) (likelihood function)
The maximum likelihood estimate θ̂ is the value that maximizes L
Often there is no closed-form (algebraic) solution; coefficients have to be estimated through numerical methods (iterative algorithms)
For practical reasons, often the log-likelihood is used instead of the likelihood
Linear model: MLE = OLS estimator
7-11
Example: logistic regression (y: 1 = vote SVP; 0 = vote other party)

                               M1          M2
Female                       -.51***     -.38***
18-30 years                  -.4          .87
46-60 years                   .13         .81
60+ years                     .91         .192
Education: intermed          -.239*      -.24
Education: high             -1.121***    -.929***
Income (hh, log)             -.395***    -.281***
EU: against joining                      2.11***
Foreigners: prefer Swiss                 1.7***
Nuclear energy: pro                       .57***
Satisfaction with democracy              -.233***
constant                     3.51***     -.233***
N                            3879        3879
7-12
Example: logit vs. probit

                       logit                probit
                     b        t           b        t
female            -.38***  (-3.7)     -.222***  (-3.8)
18-30 years        .87      (.6)       .41       (.6)
46-60 years        .81      (.6)       .47       (.6)
over 60 years      .192    (1.4)       .18      (1.4)
intermed. edu.    -.24    (-1.8)      -.146    (-1.9)
high edu.         -.929*** (-5.6)     -.552*** (-5.9)
hh income (log)   -.281*** (-3.5)     -.163*** (-3.6)
stay outside EU   2.11*** (15.0)      1.123*** (16.1)
prefer Swiss      1.7***  (11.0)       .625*** (11.2)
pro nuc. energy    .57***  (5.6)       .334***  (5.8)
satisf. democracy -.233*** (-9.1)     -.132*** (-9.1)
_cons             1.26     (1.4)       .81      (1.6)
Pseudo R-squared   .253                .257
N                 3879                3879
7-13
Interpretation of non-linear models
Because of the non-linearity, coefficients cannot be interpreted directly
Predicted probabilities have to be computed through a non-linear transformation according to the link function, e.g. π_i = F(α + β·x_i)
Interpretation of odds ratios (OR) is problematic
7-14
Odds ratios (OR)
OR often misunderstood as relative risk:

             P      Odds     OR     RR
A  Group 1   .1     .11     2.1    2.0
   Group 2   .05    .05
B  Group 1   .4     .67     2.7    2.0
   Group 2   .2     .25
C  Group 1   .8    4.0      6.0    2.0
   Group 2   .4     .67
D  Group 1   .6    1.5      6.0    3.0
   Group 2   .2     .25
E  Group 1   .4     .67     6.0    4.0
   Group 2   .1     .11

Ref.: Best and Wolf 2012, Kölner Zeitschrift für Soziologie
7-15
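The distinction is easy to verify. A Python sketch (illustration only) recomputing two rows of the table:

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    return odds(p1) / odds(p2)

def relative_risk(p1, p2):
    return p1 / p2

# row C: same relative risk as row A, but a very different odds ratio
print(round(odds_ratio(0.8, 0.4), 1), round(relative_risk(0.8, 0.4), 1))    # 6.0 2.0
print(round(odds_ratio(0.1, 0.05), 1), round(relative_risk(0.1, 0.05), 1))  # 2.1 2.0
```

Only for small probabilities (row A) does the OR approximate the relative risk; for common outcomes (row C) it overstates it badly.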
Predicted probabilities: example
Logit model:
Pr(y=1) = 1/(1 + e^{−(α + β1·x1 + β2·x2)}) = exp(logit) / (1 + exp(logit))
Example: individual i with x1 = 2 and x2 = 1; estimates α̂ = 3, β̂1 = −2, β̂2 = .5:
π̂_i = 1/(1 + e^{−(3 + (−2)·2 + .5·1)}) = 1/(1 + e^{0.5}) ≈ .38
7-16
Calculate predicted probabilities
Remember: predicted probabilities depend on the values of x and the parameter estimates (and unobserved heterogeneity)
It is often useful to vary only one covariate and hold the others constant at realistic values (mean for continuous variables, median for ordinal variables, mode for nominal variables)
Note: regression parameters are estimates (sampling variability, confidence intervals) -> predicted probabilities are also estimates -> confidence intervals also for predicted probabilities
7-17
How does change in x affect predicted probability (y)? Discrete change The change in predicted probability as one moves from one (discrete) value of a predictor to another, holding all else equal. Marginal effect (MFX) or Partial effect The slope of y* at x, holding all else equal. Average partial effect (APE) The average change in predicted probability at different given values of a predictor (calculate effect for each individual i). Average marginal effect (AME) The average slope of y* at x (calculate effect for each individual i). (in Stata: margins) 7-18
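These quantities can be computed by hand for a simple logit. A Python sketch with a hypothetical one-covariate model (the coefficients and sample values are invented for illustration; Stata's margins does this for fitted models):

```python
import math

# hypothetical logit model: logit(p) = a + b*x
a, b = -1.0, 0.8
xs = [0.0, 0.5, 1.0, 1.5, 2.0]   # hypothetical sample values of x (mean = 1.0)

def p(x):
    """Predicted probability under the logit model."""
    return 1 / (1 + math.exp(-(a + b * x)))

# discrete change: move x from 0 to 1, all else equal
discrete = p(1.0) - p(0.0)

# marginal effect at the mean of x: slope dp/dx = b * p * (1 - p)
mfx_at_mean = b * p(1.0) * (1 - p(1.0))

# average marginal effect: average the slope over the sample
ame = sum(b * p(x) * (1 - p(x)) for x in xs) / len(xs)

print(round(discrete, 3), round(mfx_at_mean, 3), round(ame, 3))
```

Because the logit curve is non-linear, the three numbers differ: the discrete change, the slope at one point, and the average slope over the sample are distinct summaries of the same model.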
Example: predicted discrete change
Vote intention for SVP 2009
margins i.eu_contra i.for_contra i.nuc_pro, ///
    atmeans at(female=0 edulevel=2 agegr=2)
Discrete change: variation of one variable, others at their mean or mode (values in %, with 95% confidence interval):

Sex: men 13.3 [10.2, 16.4]; women 9.5 [7.3, 11.7]
Education: low 16.4 [11.4, 21.3]; intermediate 13.3 [10.2, 16.4]; high 7.2 [5.3, 9.0]
EU: pro or neither 4.7 [3.1, 6.2]; contra 28.8 [23.6, 34.0]
Foreigners: same chances or neither 10.2 [7.7, 12.8]; better chances for Swiss 24.9 [19.4, 30.3]
Satisfaction democracy: minimal (0) 39.7 [30.3, 49.1]; maximal 6.0 [4.0, 8.0]
7-19
Example: average partial effects
Vote intention for SVP 2009
margins i.female i.edulevel i.agegr
APE (values in %, with 95% confidence interval):

Sex: men 20.4 [18.8, 22.1]; women 16.2 [14.7, 17.6]
Education: low 23.0 [19.8, 26.2]; intermediate 20.0 [18.4, 21.5]; high 12.6 [10.8, 14.5]
EU: pro or neither 36.8 [35.0, 38.7]; contra 53.2 [51.5, 54.8]
Foreigners: same chances or neither 42.9 [41.4, 44.3]; better chances for Swiss 53.4 [51.8, 55.8]
Satisfaction democracy: minimal 70.1 [67.1, 74.6]; maximal 30.1 [27.5, 32.8]
7-20
Marginal effects
[Figure: conditional marginal effects of satdem (satisfaction with democracy, 1-10) on Pr(SVP), with 95% CIs; the effects are negative throughout]
margins, atmeans at(satdem=(1(1)10)) dydx(satdem)
Average marginal effect
[Figure: average marginal effects of satdem (satisfaction with democracy, 1-10) on Pr(SVP), with 95% CIs; the effects are negative throughout]
margins, at(satdem=(1(1)10)) dydx(satdem)
Model performance
Linear regression: R2, adjusted R2 = explained variance in y / total variance in y
Nonlinear regression: the variance of the residual is not observed -> many so-called pseudo-R2 measures in (0,1) (Stata: fitstat)
Example: voting model
                                M1        M2
McFadden's R2                  .5        .253
McFadden's Adj R2              .44       .247
ML (Cox-Snell) R2              .46       .214
Cragg-Uhler (Nagelkerke) R2    .46       .214
Count R2                       .816      .827
Adj Count R2                   .3        .51
AIC                            3513.6    2772
7-23
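McFadden's pseudo-R2 compares the log likelihood of the fitted model with that of the empty (intercept-only) model; the log likelihoods below are hypothetical:

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R^2 = 1 - lnL(model) / lnL(null model)."""
    return 1.0 - ll_model / ll_null

# hypothetical log likelihoods for a fitted and an empty model
r2 = mcfadden_r2(-900.13, -1205.0)   # about 0.253
```

It equals 0 when the model adds nothing over the empty model and approaches 1 as the model's likelihood improves.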
Model performance ctd.
MLE: the likelihood as a measure of fit
Compare the likelihood of the estimated model to the likelihood of the empty model (no covariates): equivalent to McFadden's R2
Compare likelihoods of nested models: likelihood ratio (LR) test; tests joint significance of variables, or nested models
Test non-nested models: AIC, BIC
How many observations are predicted correctly? Best prediction without covariates: the mode (correct prediction in a share π of cases). How much better is the prediction compared to guessing the mode? 7-24
The likelihood ratio test
To test hypotheses involving several predictors (multiple constraints), e.g. test β2 = β3 = 0
Compare the log likelihoods of the constrained and the unconstrained model, e.g.
M_u: π = f(α + β1x1 + β2x2 + β3x3)
M_c: π = f(α + β1x1)
Generally: L_c ≤ L_u
Constraints valid: L_u - L_c ≈ 0; constraints invalid: L_u - L_c > 0
Test statistic: LR = 2(L_u - L_c) ~ χ2(q)  (q: number of constraints, degrees of freedom)
Prerequisites of the LR test: the models are based on the same sample; the models are nested 7-25
LR test example: test for joint significance
Vote intention model, probability to support the SVP vs. supporting another party
π = f(α + β1(contra EU) + β2(satisfaction democracy) + β3(female) + β4(lnincome) + β5(age))
Do the demographic variables (age, sex) matter? H0: β3 = β5 = 0
Model fits: L_c = -911.90, L_u = -900.13
LR = 2*(-900.13 - (-911.90)) ≈ 23.5
Chi-squared distribution with 2 degrees of freedom: p < 0.001
-> we reject H0 (the demographic variables seem to affect the probability to vote for SVP) 7-26
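The same computation sketched in Python (log likelihoods as reconstructed from the slide, signs restored; for 2 degrees of freedom the χ2 survival function has the closed form e^(-x/2)):

```python
import math

def lr_stat(ll_unconstrained, ll_constrained):
    """Likelihood ratio statistic LR = 2 * (L_u - L_c)."""
    return 2.0 * (ll_unconstrained - ll_constrained)

def chi2_sf_2df(x):
    """P(X > x) for X ~ chi-squared with 2 d.o.f. (exponential with mean 2)."""
    return math.exp(-x / 2.0)

lr = lr_stat(-900.13, -911.90)
pval = chi2_sf_2df(lr)   # far below 0.001 -> reject H0
```

For other degrees of freedom, a chi-squared survival function from a statistics library would replace the closed-form helper.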
Difficulties of nonlinear models: frequent mistakes
Interpretation of coefficients (logits, OR): comparison of estimates across models and samples (estimates also reflect unobserved heterogeneity). Be cautious with interpretation; use different measures to show effects (MFX, AME). Correction proposed by Karlson et al. (2012). References: Mood 2010, Best and Wolf 2012, Karlson et al. 2012; Stata: ado khb
Interaction effects: cannot be interpreted as in linear models. References: Ai and Norton 2003, Norton, Wang and Ai 2004; Stata: ado inteff
Model performance 7-27
Multinomial dependent variables
More than two response categories (m categories)
Unordered: e.g. voting preference (different parties, no party), type of education, social class -> multinomial regression: compare each pair of response categories; estimate the probability of each category (1 reference category, m - 1 equations)
Ordered: e.g. opinions (strongly agree, agree, neither, disagree, strongly disagree), health status -> ordinal regression: latent variable with m - 1 thresholds; estimate cumulative probabilities Pr(y ≤ m) (one equation with dummies for the m - 1 thresholds) 7-28
Example: multinomial regression
Voting: FDP/CVP, SP/Greens, other parties, no party; base category: vote for SVP (t-values in parentheses; minus signs restored from the t-values)
              FDP & CVP         SP & Greens        Other party        No party
female        .377*** (3.5)     .46*** (3.6)       .274* (2.0)        .583*** (5.9)
age1830       -.162 (-1.0)      .7 (.4)            .167 (.8)          -.22 (-1.5)
age4660       .92 (.7)          .31 (.2)           .129 (.7)          .75 (.6)
age60plus     .52 (.4)          -.413** (-2.8)     -.584** (-3.1)     -.274* (-2.1)
edu2          .179 (1.2)        .3* (2.0)          .26 (1.0)          -.135 (-1.1)
edu3          .78*** (4.5)      .982*** (5.4)      1.114*** (4.8)     .287 (1.8)
lnincome      .317*** (3.7)     .229** (2.6)       .397*** (3.5)      .113 (1.5)
eu_contra     -1.74*** (-11.8)  -2.79*** (-18.9)   -1.559*** (-9.2)   -1.665*** (-11.9)
for_contra    -.746*** (-7.2)   -1.68*** (-13.4)   -1.254*** (-8.4)   -.862*** (-9.1)
nuc_pro       .64 (.6)          -1.542*** (-13.0)  -.967*** (-6.7)    -.57*** (-5.9)
satdem        .252*** (9.4)     .193*** (6.9)      .183*** (5.0)      .44 (1.9)
_cons         -3.138*** (-3.3)  .694 (.7)          -4.259*** (-3.3)   1.364 (1.6)
7-29
Refresher: panel data models
Fixed effects model: y_it = β1 x_it + u_i + ε_it
u_i: unobservable stable individual characteristics
Only the variance within individuals is taken into account
Equivalent to OLS with dummies for each individual
Controls for unobserved heterogeneity (consistent even if Cov(u_i, x_it) ≠ 0) -> causal interpretation
Effects of time-invariant characteristics cannot be estimated (e.g., sex, cohorts)
Random effects model: y_it = α + β1 x_it + u_i + ε_it, assuming u_i ~ N(0, σ_u2)
u_i: unobservable stable individual characteristics
Multilevel model with random intercepts
Controls for unobserved heterogeneity (but consistent only if Cov(u_i, x_it) = 0)
Effects of time-invariant and time-varying covariates
Test of RE against FE model: Hausman test 7-30
Fixed effects for nonlinear models
Linear model: by differencing out (or including dummy variables), the u_i disappear from the (FE) equation
Nonlinear model: there is no equivalent FE model; incidental parameter problem -> inconsistent estimates
Instead: conditional ML estimation (similar to FE), a technical trick to eliminate the individual-specific intercepts (also called the Chamberlain fixed effects model)
Only possible for logit and Poisson (they use exponentials)
Only the subsample of individuals with some change in y_it is used -> information loss -> potential bias because stable individuals are excluded 7-31
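Why conditioning eliminates u_i, sketched for the simplest case T = 2 (hypothetical β and x values): given that y changed exactly once, the probability that the change happened in period 2 depends only on the change in x, not on u_i:

```python
import math

def cond_prob_change_in_period2(b, x1, x2):
    """Conditional logit for T = 2: P(y_i1 = 0, y_i2 = 1 | y_i1 + y_i2 = 1).
    The individual intercept u_i cancels out of this ratio, so it never appears."""
    dx = x2 - x1
    return math.exp(b * dx) / (1.0 + math.exp(b * dx))

p_switch = cond_prob_change_in_period2(0.8, 1.0, 2.0)  # hypothetical values
```

Individuals whose y never changes contribute nothing to this conditional likelihood, which is exactly the information loss noted on the slide.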
Random effects for nonlinear models
RE model analogous to linear regression:
y_it = F(α + x_1it β1 + x_2it β2 + ... + u_i + ε_it) with u_i ~ N(0, σ_u2), Cov(u_i, x_it) = 0
But in contrast to linear models:
Predicted probabilities depend on the value of u_i: we have to assume a value for u_i to estimate predicted probabilities
Measures of variance decomposition are questionable: only the variance of the unobserved heterogeneity is estimated; the within variance is fixed (usually at 1), so the proportion of the variance of the error components (ρ) is not meaningful
Alternative: how much can the unexplained variance between individuals be reduced relative to the empty model? 7-32
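A numerical illustration (hypothetical linear predictor and σ_u) of why the assumed value of u_i matters: the probability for the "average individual" (u_i = 0) differs from the population-averaged probability obtained by integrating over u_i:

```python
import math, random

def p(xb):
    return 1.0 / (1.0 + math.exp(-xb))

random.seed(1)
xb, sigma_u = 0.5, 1.5   # hypothetical linear predictor and RE std. dev.

p_average_individual = p(xb)  # sets u_i = 0

# population-averaged probability: Monte Carlo integration over u_i ~ N(0, sigma_u^2)
draws = [p(xb + random.gauss(0.0, sigma_u)) for _ in range(100_000)]
p_population = sum(draws) / len(draws)  # pulled toward 0.5 relative to u_i = 0
```

The nonlinearity of F is what creates the gap: in a linear model averaging over u_i would change nothing.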
Stata commands for nonlinear panel models
Stata built-in commands
Random intercept models: xt prefix - xtlogit, xtprobit, xtpoisson, xttobit, xtcloglog, xtnbreg
Random slope models: xtmelogit, xtmepoisson
Other software necessary for multinomial panel models (run from Stata):
gllamm: add-on (Rabe-Hesketh and Skrondal) to Stata (very powerful, freely available (but Stata necessary), slow; become familiar with the syntax)
runmlwin: command to run the MLwiN software from within Stata (the MLwiN software needs to be purchased) 7-33
SVP vote intention: pooled logit, random effects (RE) and fixed effects (FE) estimates (b coefficients)
                     logit       RE (random)   FE
female               .116**      .214***       (dropped: time-invariant)
18-30 years          -.133**     -.33          .62***
46-60 years          -.119**     -.162**       -.36***
over 60 years        -.97*       -.25***       -.61***
med education        -.229***    -.497***      -.318*
high education       -.633***    -1.35***      -.537***
hh income, log       -.128***    -.159***      -.13
stay outside EU      .749***     .622***       .49
prefer Swiss         .471***     .42***        -.1
nuc_pro              .225***     .76           -.73
satisf. democracy    -.16***     -.153***      -.43***
_cons                1.19***     1.818***
lnsig2u _cons                    2.34***
Pseudo R2            .77                       .5
N                    55177       55177         2642
7-34
Example: RE model for event history analysis
Example of discrete-time event history analysis
Dependent variable: 0 = event has not occurred; 1 = event has occurred (since the last observation)
Independent variables: time until event occurrence; any other variables 7-35
8 The Multilevel model for growth 8-1
Motivation
Many phenomena in the social sciences show high variation in rates of growth between individuals over time, and there are many unobserved variables with effects on these rates of growth.
Given this, alternative hypotheses: do very motivated people have a high initial score and a high increase over time? Or does the high initial score of very motivated people rather level out over time? 8-2
Graphical interpretation of multilevel models
Random intercept model: distribution of individual intercepts is random; all lines parallel (same slope)
Random slope model: distributions of individual intercepts and slopes are random; intercepts and slopes may be correlated
Other name: mixed effects models 8-3
Level-1 and level-2 submodels: steps
Level-1 submodel: within-person behavior; question: what is the functional form?
Level-2 submodels: between-person behavior; question: how do the parameters of this function vary?
Model estimation
Interpretation of results: regression coefficients, variance components
If only within: no multilevel model! If in addition the level (u_0) varies between persons: random intercept. If in addition the slope (u_1) varies between persons: random slope 8-4
The multilevel idea II: letting the slope (time) become random
Suppose we change the model
y_it = u_0i + b x_it + ε_it
to
y_it = u_0i + b_i x_it + ε_it  with u_0i ~ N(0, σ_u0^2), b_i = b + u_1i and u_1i ~ N(0, σ_u1^2),
to give the random slope model
y_it = u_0i + (b + u_1i) x_it + ε_it = b x_it + (u_0i + u_1i x_it) + ε_it
Parameters to be estimated: b, σ_u0, σ_u1, σ_ε, plus the covariance between u_0 and u_1: cov(u_0, u_1) = σ_u01
Heteroscedasticity: the residual variance depends on x
All other covariances are assumed to be 0: cov(x_it, u_i) = cov(u_i, ε_it) = cov(ε_it, ε_jt) = 0 8-5
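The heteroscedasticity claim follows directly from the composite residual; writing out its variance with the usual notation (σ_u0^2 and σ_u1^2 for the intercept and slope variances, σ_u01 for their covariance):

```latex
\operatorname{Var}(y_{it}\mid x_{it})
  = \operatorname{Var}(u_{0i} + u_{1i}x_{it} + \varepsilon_{it})
  = \sigma_{u0}^{2} + 2\,\sigma_{u01}\,x_{it} + \sigma_{u1}^{2}\,x_{it}^{2} + \sigma_{\varepsilon}^{2}
```

This is a quadratic function of x_it, so the total residual variance changes along the time axis.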
Random parameters in the random slope model
y_it = a_i + b_i x_it + ε_it  with  a_i = a + u_0i,  b_i = b + u_1i
ε_it ~ N(0, σ^2),  u_0i ~ N(0, σ_u0^2),  u_1i ~ N(0, σ_u1^2),  σ_u01 = cov(u_0, u_1)
a, b: intercept and slope of the average line
σ_u0: standard deviation of random intercepts across individuals
σ_u1: standard deviation of random slopes across individuals
σ_u01: covariance between random intercepts and random slopes 8-6
Interpretation of the covariance in random slope models
σ_u01 > 0: fanning out
σ_u01 < 0: fanning in
σ_u01 not defined: no slope variation
σ_u01 = 0: no pattern 8-7
Example: body mass index (bmi) change
Data: subsample of the CNEF
- Adults from CNEF-CH (2004-2007), CNEF-US (1999/2001/2003/2005), CNEF-DE (2002/2004/2006/2008)
- Random 999 individuals per survey (country)
- Context (between persons): sex, age, health, education, working, wife in household, children under 16 in household, country
Design: each individual measured up to 4 times
Research question: level and change of bmi for different subgroups (e.g., cohorts by country) 8-8
Data structure: person-period: 1 line per person and period
. list survey pers1 year bmi age male, sepby(pers1)
Unbalanced panel, 1-4 waves per adult (age measured in years)
     +---------------------------------------------+
     | survey    pers1   year    bmi   age   male  |
     |---------------------------------------------|
  1. | PSID       2955   1999  20.38    85      1  |
     |---------------------------------------------|
  2. | PSID       1258   1999  25.64    34      1  |
  3. | PSID       1258   2005  28.86    34      1  |
  4. | PSID       1258   2001  26.37    34      1  |
  5. | PSID       1258   2003  23.38    34      1  |
     |---------------------------------------------|
  6. | PSID       1454   1999  20.87    59      0  |
  7. | PSID       1454   2005   21.8    59      0  |
  8. | PSID       1454   2003   21.8    59      0  |
  9. | PSID       1454   2001  19.96    59      0  |
     |---------------------------------------------|
 10. | SHP       13787   2004  25.88    72      1  |
 11. | SHP       13787   2005  25.56    72      1  |
     |---------------------------------------------|
 12. | SHP         977   2006   18.7    35      0  |
     |---------------------------------------------|
 13. | SHP       11737   2005  24.61    79      0  |
 14. | SHP       11737   2004  24.22    79      0  |
     |---------------------------------------------|
 15. | SHP       12575   2005  19.71    36      0  |
 16. | SHP       12575   2007  19.33    36      0  |
 17. | SHP       12575   2006  19.71    36      0  |
 18. | SHP       12575   2004  19.33    36      0  |
     |---------------------------------------------|
 19. | SOEP      34958   2006  33.46    48      0  |
 20. | SOEP      34958   2008   31.6    48      0  |
 21. | SOEP      34958   2004  33.46    48      0  |
 22. | SOEP      34958   2002  30.86    48      0  |
     +---------------------------------------------+
8-9
Level-1 model (within): we assume linear change
bmi_it = a_i + b_i time + ε_it  (i: individual, t: time)
Coefficients describe i's true individual growth curve:
a_i: intercept of i's true trajectory
b_i: slope of i's true trajectory
Random part ε_i1, ..., ε_i4: deviations from i's true trajectory, interpreted as measurement error of person i at time t
[Figure: bmi of one person plotted against wave, with the fitted trajectory a_i + b_i time] 8-10
Level-2 model (between): random intercept / slopes models
[Figure: individual bmi trajectories (bmi roughly 20-40) against participation year, in separate panels by survey: USA (PSID), Switzerland (SHP), Germany (SOEP); data: CNEF 2007 subsample] 8-11
Level-2 coefficients
Level-2 intercept: a_i = a + a_1 survey + u_0i  (grand mean of the intercepts a_i, plus a difference of intercepts depending on survey)
Level-2 slope: b_i = b + b_1 survey + u_1i  (grand mean of the slopes b_i, plus a difference of slopes depending on survey) 8-12
Level-2 random effects
Idea: individual-specific deviations from the mean (intercept u_0i and slope u_1i) may be correlated
a_i = a + a_1 survey + u_0i
b_i = b + b_1 survey + u_1i
(u_0i, u_1i)' ~ N(0, Σ)  with  Σ = [σ_u0^2, σ_u01; σ_u01, σ_u1^2]
σ_u0^2: variance of the intercepts a_i (survey controlled)
σ_u1^2: variance of the slopes b_i (survey controlled)
σ_u01: covariance of the intercepts a_i and slopes b_i 8-13
Parameters of the level-1 and level-2 model
Level 1: bmi_it = a_i + b_i time + ε_it
Level 2: a_i = a + u_0i,  b_i = b + u_1i
Composite residual: r_it = u_0i + u_1i time + ε_it
We model:
- Level-1 variance σ^2
- Level-2 grand intercept (a) and its variance σ_u0^2
- Level-2 grand slope (b) and its variance σ_u1^2
- Level-2 covariance cov(u_0, u_1) (check whether it differs from 0)
Estimation methods: depend on the software. Coefficients (a, b) are weighted means of the individual OLS results (maximally flexible) and of the population curves (maximally stable); the weights depend on the precision of the OLS estimation 8-14
Multilevel estimates as weighted individual OLS and population curves
[Figure: data points, individual OLS fits, ML random slope fits and the population average, in panels by survey, sex (male) and person]
Population average (OLS between, controlling for survey and sex): stable, efficient; gets a high weight if the individual OLS fit is imprecise
Individual OLS: unbiased but inefficient; gets a high weight if the OLS fit is precise
ML: biased but efficient 8-15
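The weighting logic can be sketched as an empirical-Bayes style compromise (a hypothetical helper, not the actual ML estimator): the weight on the individual OLS slope grows as its standard error shrinks:

```python
def shrunken_slope(b_i, se_i, b_grand, var_u1):
    """Compromise between individual OLS slope b_i and grand slope b_grand.
    var_u1: between-person slope variance; se_i: std. error of b_i."""
    w = var_u1 / (var_u1 + se_i ** 2)     # precision-based weight in [0, 1]
    return w * b_i + (1.0 - w) * b_grand

precise = shrunken_slope(2.0, 0.01, 0.5, 1.0)    # stays close to b_i = 2.0
imprecise = shrunken_slope(2.0, 10.0, 0.5, 1.0)  # pulled toward b_grand = 0.5
```

Persons observed in few waves (large se_i) are therefore pulled toward the population curve, exactly the "biased but efficient" behavior shown in the figure.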
Steps to do
Never without: empirical plots and fitted OLS curves. Is the person-period dataset OK?
First: models without covariates (unconditional). Why unconditional?
1. Null model (intercept only). Target: variance decomposition between the levels. Does systematic variation of the dependent variable exist, and on which level(s)?
2. Unconditional growth model: only TIME as level-1 predictor, intercept-only level-2 equations. Target: information on the level-2 variation of the grand slope. Find the true covariance matrix!
The magnitudes of the variation within and between are the basis for adding further predictors 8-16
Unconditional means/growth model (maximum likelihood estimation)
. xtreg bmi, i(pers)   /* VC model */
Random-effects GLS regression                   Number of obs      =      9189
Group variable: pers                            Number of groups   =      2997
R-sq:  within  = 0.0000                         Obs per group: min =         1
       between = 0.0000                                        avg =       3.1
       overall = 0.0000                                        max =         4
                                                Wald chi2(0)       =         .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =         .
------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   25.48776    .082556   308.73   0.000     25.32595    25.64957
-------------+----------------------------------------------------------------
     sigma_u |      4.413
     sigma_e |   1.588148
         rho |  .88473832   (fraction of variance due to u_i)
------------------------------------------------------------------------------
. xtmixed bmi year || pers: year, cov(unst) mle
Mixed-effects ML regression                     Number of obs      =      9189
Log likelihood = -21557.826                     Prob > chi2        =    0.0000
------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .1386832   .0112017    12.38   0.000     .1167283    .1606381
       _cons |    25.2275    .082938   304.17   0.000      25.0645    25.38961
------------------------------------------------------------------------------
  Random-effects Parameters   |   Estimate   Std. Err.    [95% Conf. Interval]
------------------------------+------------------------------------------------
pers: Unstructured            |
                     sd(year) |    .364826   .0120562     .3419454    .3892377
                    sd(_cons) |    4.38887     .06836     4.271238     4.59741
             corr(year,_cons) |  -.0842973    .032786    -.1467533   -.0211718
------------------------------+------------------------------------------------
                 sd(residual) |    1.33571     .14744      1.31992    1.359777
------------------------------------------------------------------------------
LR test vs. linear regression: chi2(3) = 111.72      Prob > chi2 = 0.0000
8-17
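The rho reported by xtreg is the share of the total variance due to the individual effect u_i; it can be recomputed from the reported sigma_u and sigma_e:

```python
sigma_u = 4.413      # between-person std. dev. from the xtreg output
sigma_e = 1.588148   # within-person (residual) std. dev.

# fraction of total variance due to u_i (intraclass correlation)
rho = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)   # about 0.885
```

Almost 90% of the variance in bmi lies between persons, which is why the growth model's slope variance is estimated on a relatively small within-person signal.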
Excursus: hypothesis testing
For regression coefficients: use the z-test (at the 5% level, reject if |z| > 1.96)
For random effects: use the likelihood ratio test: test the current model against a base model
- Important: the current model uses the same data ("nested") but more parameters (regression coefficients, random effects, covariance)
- Null hypothesis: the additional coefficients are all 0
- Test statistic: 2*[loglikelihood(current) - loglikelihood(base)]
- This statistic is χ2-distributed with the number of additional coefficients as degrees of freedom 8-18
_cons measures the grand mean for individuals with 0 on all variables (year = 0, survey = 0 = reference category)
Recommendation: de-mean (center) the independent variables
- Eliminates multicollinearity between independent variables (generally in regressions), and in ML models between level-specific and context variables
- Reduces the covariance between constant and slope (a and b)
- Improves estimation precision and speed
- Facilitates interpretation
Care! Model fit is assessed separately for the within and the between model (special pseudo-R2 measures for the total fit) 8-19
9 Missing Data: Incomplete panels, imputation techniques 9-1
Underrepresentation
Sampling frame errors (mostly undercoverage)
Unit nonresponse:
- First-wave ("initial") nonresponse
- Attrition: respondents drop out or are temporarily absent
  Reasons: non-contact (out of home, no telephone answer, or unsuccessful tracking of movers); refusal (also on behalf of others)
Item nonresponse: respondents fail to answer particular questions, e.g. income 9-2
Initial nonresponse vs attrition
Initial nonresponse: only little information on non-participants (from the sampling frame possibly sex, age, nationality, civil status, region)
Attrition: rich information on these individuals from at least one wave
Methodological research: attitudinal characteristics (e.g. political interest, political orientation, social integration) are more biased than socio-demographics
Correction of bias is easier for attrition than for undercoverage or initial nonresponse 9-3
Types of missingness
Missing completely at random (MCAR): response does not depend on observed or unobserved characteristics
Missing at random (MAR): conditional on observables (x_it), response is random = systematic differences in response can be explained by observable characteristics
Not missing at random (NMAR): non-ignorable: systematic differences in response remain after controlling for observables (x_it) 9-4
Consequences of missingness
Missing completely at random (MCAR): statistics unbiased; loss of efficiency (smaller n)
Missing at random (MAR): univariate statistics biased. Example: if the politically uninterested do not take part in the survey, we overestimate average political interest. Multivariate statistics: unbiased if we control for the variables which explain missingness
Not missing at random (NMAR): statistics biased 9-5
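A small simulation of the political-interest example (all numbers hypothetical): when the probability of responding rises with interest, the observed mean overstates the population mean:

```python
import random

random.seed(2)
population = [random.gauss(5.0, 2.0) for _ in range(100_000)]  # true interest scores

def responds(y):
    """Response propensity rises with political interest (MAR-style mechanism)."""
    prop = 0.2 + 0.06 * min(max(y, 0.0), 10.0)
    return random.random() < prop

respondents = [y for y in population if responds(y)]
mean_pop = sum(population) / len(population)
mean_obs = sum(respondents) / len(respondents)   # biased upward
```

Conditioning on the variables that drive the response propensity (here: interest itself, or good proxies for it) is what removes the bias in multivariate analyses under MAR.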
Strategies to deal with missingness
Multivariate models: control for the variables which explain the missingness mechanism (under the MAR assumption)
Weighting: mainly for unit nonresponse
Imputation: mainly for item nonresponse
Look at the data, analyze missingness, assess bias, interpret carefully! 9-6
Weighting
Univariate statistics: use weights
Multivariate statistics: weights may or may not be appropriate: depends on the research question, the model, and the construction of the weights
Use of weights is not always possible: not implemented in RE models in Stata, but possible in xtmixed or with the gllamm software
Sensitivity analysis: run models with and without weights 9-7
Imputation strategies I
Exclude cases from the analysis (listwise deletion in regression): assumes MCAR; loss of statistical power
Mean or median imputation: assumes MCAR; variance reduction, bias toward the mean
Regression imputation: assumes MAR; variance reduction, regression to the mean
Hot deck: identification of a donor with the same characteristics on co-varying observed variables; assumes MAR, variance reduction
Cold deck: the donor comes from another survey; assumes MAR, variance reduction, loss of statistical power 9-8
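The variance-reduction problem of mean imputation is easy to demonstrate on simulated data: filling 30% of the values with the observed mean mechanically shrinks the variance:

```python
import random, statistics

random.seed(3)
y_full = [random.gauss(0.0, 1.0) for _ in range(10_000)]

observed = y_full[:7_000]            # pretend the last 30% are missing
fill = statistics.mean(observed)     # mean imputation: one constant for all gaps
y_imputed = observed + [fill] * 3_000

var_full = statistics.pvariance(y_full)        # about 1.0
var_imputed = statistics.pvariance(y_imputed)  # about 0.7 * var(observed)
```

This understated variance is one motivation for the stochastic methods (multiple imputation) on the next slide.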
Imputation strategies II
Nearest neighbor: for continuous variables; the donor has the smallest distance; assumes MAR, variance reduction
Multiple imputation: several imputations per missing value, stochastic
In panel surveys: use information on the same unit (individual) from other waves
- Simple carryover: take the value from another wave (reduces longitudinal change)
- Row-and-column imputation: uses longitudinal and cross-sectional information, stochastic (reduces longitudinal change) 9-9
Testing for attrition bias
Compare individuals who drop out with individuals who stay, on first-wave characteristics
Example: individuals present in 2003 (stayers = 1) vs. not (stayers = 0): political interest in 1999
  stayers |      mean   se(mean)      N
----------+------------------------------
        0 |  4.688572   .0518871   3439
        1 |   5.29387   .0433594   4360
----------+------------------------------
    Total |  5.026927   .0333537   7799
------------------------------------------
9-10
Testing for non-random missingness
A missing value is random (MCAR, or MAR after controls) if response r_it is independent of u_i and ε_it
1. Check whether r_it explains the outcome y_it after controlling for other characteristics. Functions of r_it can be added to the equation and their significance tested, e.g. the lagged response indicator r_i,t-1 or the number of waves present; correction models (e.g. Heckman selection)
2. Compare results from the unbalanced panel with the balanced sub-panel: differences in the regressors (use a Hausman test) 9-11
10 Dynamic models 10_1
Longitudinal models
Within estimators: fixed effects (FE), difference-in-differences (DID)
- Long panels: causal influence
- Short panels: policy evaluation
Dynamic panel models (complex!)
- Model the pathways through which variables measured over time are associated
- Model stability over time, true state dependence
- Lagged dependent variable as a regressor
Cross-lagged models
- Lagged dependent and lagged independent variables as regressors
- Establish causality through time ordering 10_2
Dynamic panel models
Example: does unemployment cause future unemployment? Does poverty cause future poverty?
Dynamic model: y_it = a + b1 y_i,t-1 + b2 x_it + λ_t + u_i + e_it
- b1 represents state dependence (how quickly do people return to their equilibrium after a shock?)
- u_i represents stable characteristics (trait dependence)
- λ_t: year dummies (control for temporal shocks)
- e_it: person- and time-specific residual
Problems:
- (Strict) exogeneity fails: y_it is correlated with e_i,t-1
- y_i,t-1 and u_i are correlated
- Initial condition: y_0 assumed exogenous
-> all OLS estimators potentially biased
-> (true) state dependence cannot be estimated (confounded with the persistence of unobservables) 10_3
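What b1 means as "return to equilibrium", in a deterministic sketch (hypothetical a and b1): with |b1| < 1 the long-run level is a/(1 - b1), and a one-off shock decays by the factor b1 each period:

```python
a, b1 = 0.3, 0.6          # hypothetical intercept and state-dependence parameter
y_eq = a / (1.0 - b1)     # long-run equilibrium level (0.75 here)

y = y_eq + 2.0            # apply a one-off shock of +2
deviations = []
for _ in range(5):
    y = a + b1 * y        # iterate the dynamic equation (no new shocks)
    deviations.append(y - y_eq)
```

A large estimated b1 can reflect either true state dependence or uncontrolled u_i, which is exactly the confounding problem noted above.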
Instrumental variable dynamic models
Solution to the endogeneity problem of the lagged dependent variable: use previous values of y to instrument y_i,t-1
Arellano and Bond 1991; Arellano and Bover 1995; Blundell and Bond 1998 (GMM estimators)
Estimates are very sensitive to model specification
Specification tests: Sargan test, Hansen test
References: Wooldridge 2009, Wawro 2002 (application) 10_4
Cross-lagged models
y_it = a1 + b1 y_i,t-1 + b2 w_i,t-1 + e_it
w_it = a2 + b3 y_i,t-1 + b4 w_i,t-1 + e'_it
Examining questions of reciprocal causality: compare the sizes of b2 and b3
Each variable is regressed on its own lagged measure and the lagged other (independent) variable
Initial condition: y_0 and w_0 assumed exogenous
References: Finkel 1995, Carsey and Layman 2006 (application) 10_5