Introduction to Panel Data Analysis
|
|
|
- Eugenia Jenkins
- 10 years ago
- Views:
Transcription
1 Introduction to Panel Data Analysis Oliver Lipps / Ursina Kuhn Swiss Centre of Expertise in the Social Sciences (FORS) c/o University of Lausanne Lugano Summer School, August
2 Introduction panel data, data management 1 Introducing panel data (OL) 2 The SHP (OL) 3 Data Management with Stata (UK) Regressions with panel data: basic 4 Regression refresher (UK) 5 Fixed effects (FE) models (OL) 6 Introducing random effects (RE) models (OL) 7 Nonlinear regression (UK) Start: 8.3 Breaks: Lunch Breaks: End: 17.3 Additional topics 8 Level 1 and 2 growth models (OL) 9 Missing data (OL) 1 Dynamic models (UK)
3 1 Introducing panel data
4 Surveys over time: repeated cross-sections vs. panels Cross-Section: Survey conducted at several points in time ( rounds ) using different sample members in each round Panel: Survey conducted at several points in time ( waves ) using the same individuals over waves -> panel data mostly from panel surveys -> If from cross-sectional surveys: retrospective ( biographical ) questionnaire 1-4
5 Panel Surveys: to distinguish Length and sample size: Time Series: N small (mostly=1), T large (T ) time series models Panel Surveys: N large, T small (N ) social science panel surveys Sample General population: - rotating: only few (pre-defined number) waves per individual (in CH: SILC, LFS) - indefinitely long (in CH: SHP) Special population: - e.g., age/birth cohorts (in CH e.g.: TREE, SHARE, COCON) representative for population of special agegroup / birthyears 1-5
6 Panel surveys increasingly important Changing focus in social sciences Life course research: effects of events within individuals Large investments in social science household panels surveys, high data quality! Concern about causality in cross-sectional studies Analysis potential of panel data - close to experimental design: before and after studies - control unobserved individual characteristics (exogenous independent variables; implicit in regression analyses) 1-6
7 Identification of individual dynamics (poverty in SHP) 1% 96% 4% 89.5% 91.% 92.3% 91.2% 9.6% 5% 52% 49% 46% % 5% 48% 51% 54% 1.5% 9.% 7.7% 8.8% 9.4% poor not poor -> individual dynamics can only be measured with panel data! 1-7
8 Identification of age, time, and (birth) cohort effects Fundamental relationship: a it = t - c i Effects from formative years (childhood, youth) -> cohort effect (eg taste in music ) Time may affect behavior -> time effect (eg computer performance) Behavior may change with age -> age effect (eg health) In a cross-section, t is constant age and cohort collinear (only joint effect estimable) In a cohort study, cohort is constant age and time collinear (only joint effect estimable) In a panel, t varies, but A it, t, and c i collinear. only two of the three effects can be estimated we can use (t,c i ), (A it,c i ), or (A it,t), but not all three 1-8
9 Age, time, cohort effects: interpretation (no cohort effect) (aging effect) (age effect) (no aging effect) (no age effect) (aging and cohort effect) 1-9
10 Problems of panels Fieldwork / data quality related High costs (panel care, tracking households, incentives) Initial nonresponse (wave 1) and attrition (=drop-out of panel after wave 1) Finally: you design a panel for the next generation Modeling related Sometimes strong assumptions for applicability of appropriate models necessary (later) 1-1
11 Advantages of panel data Allow to study individual dynamics + Control for unobserved characteristics between individuals by repeated observation + Higher precision of change + Data can be pooled 1-11
12 2 Introducing the Swiss Household Panel (SHP)
13 Swiss Household Panel: history Primary goal: observe social change and changing life conditions in Switzerland Started as a common project of the Swiss National Science Foundation, Swiss Federal Statistical Office, University of Neuchâtel First wave in 1999, more than 5, households Refreshment sample in 24, more than 2,5 households, several new questions Since 28, integrated into FORS (Swiss Centre of Expertise in the Social Sciences ), c/o University of Lausanne 2-2
14 Disciplines working with SHP data 2-3
15 SHP sample and methods Representative of the Swiss residential population Each individual surveyed every year (Sept.-Jan.) All household members from 14 years on surveyed (proxy questionnaire if child or unable) Telephone interviews (central CATI), languages D/F/I Metadata: biographic, interviewers Paradata: call data (from address management) Following rules: OSM followed if moving, from 27 on all individuals All new household entrants surveyed 2-4
16 SHP sample size (households) and attrition Befragte Haushalte SHPI SHP II Cross-sectional and longitudinal (SHPI and SHP I+II combined) weights to account for nonresponse and attrition 2-5
17 SHP: Survey process and questionnaires Grid Questionnaire: Inventory and characteristics of hh-members Persons 18+ years «reference person» Persons 14+ years Persons 13- years + «unable to respond» Household Questionnaire: housing, finances, family roles, Individual Questionnaires: work, income, health, politics, leisure, satisfaction of life Individual Proxy Questionnaires: school, work, income, health, 2-6
18 SHP: Questionnaire Content Social structure: socio-demography, socio-ecomomy, work, education social stratification and social mobility Life events: marriages, births, deaths, deceases, accidents, conflicts with close persons, etc. life course Social participation: politics (attitudes, elections, party preferences and -choice), culture, social network, leisure social integration, political attitude and behavior Perception and values: trust, confidence, gender values and social capital Satisfaction and health: physical and mental health selfevaluation, chronic problems, different satisfaction issues quality of life 2-7
19 SHP Household: composition and housing «objective» elements Characteristics of household members (sex, age, civil status, education, occupation) Relationships between all household members «subjective» elements Satisfaction with house, noise, pollution, etc. Assessment of state of house Since when at this place Type of house Size, number of rooms of house State of house (heating, noise, pollution, etc.) Costs and subsidy, etc. External help for domestic work Child care Division of labor Who takes decisions, etc. 2-8
20 SHP Household: standard of living «objective» elements Activities and (durable) goods Reason why absence of goods (financial, other) «subjective» elements Satisfaction with financial situation Financial difficulties Debts (+reason) Total household income Taxes Social and private financial transfers 2-9
21 SHP Individual level: family «objective» elements Children out of the house Division of housework, care for dependents Disagreement about family problems etc. «subjective» elements Satisfaction with private situation Satisfaction with living alone or together Satisfaction with division of housework SHP Individual level: health, well-being «objective» elements Health problems Physical activities Doctor visits, hospitalization Improvement of health Long and short term handicaps «subjective» elements Subjective health state Satisfaction with health Satisfaction with life in general 2-1
22 Profession of parents Level of education of parents Nationality of parents Financial problems in childhood Social origin and education Education level Current training Language capabilities Leisure «objective» elements Activities: holidays, invitation of and meeting friends, reading, Internet use, restaurant, etc. «subjective» elements Satisfaction with leisure time and leisure activities Satisfaction with work-life balance 2-11
23 Individual level work «objective» elements Job sector Social stratification Private or public Position Working time, commuting time Size of company «subjective» elements Satisfaction with work (general, income, interests, working conditions, amount, atmosphere) Risk of unemployment Job security Chances to get promoted Income «objective» elements Total personal income Total personal income from work Social transfers received Private transfers received Other income «subjective» elements Satisfaction with financial situation Assessment whether financial situation improved or not Possible reactions on financial problems 2-12
24 Values and politics «objective» elements Right to vote Political activities Member of a political party «subjective» elements Satisfaction with democracy Confidence in federal government Political interest Left-right political positioning Opinions on political questions Participation, integration, social network «objective» elements Frequency of contacts Voluntary work outside the household Participation and membership in associations Belief and religious participation «subjective» elements Satisfaction with personal relationships Assessment of amount of practical help received from partner, parents, friends, etc. Assessment of amount of emotional help received from partner, parents, friends, etc. General trust in people 2-13
25 Biographical (retrospective) questionnaire N = 5 56 Written questionnaire in 21/22, sent to all individuals surveyed in 2, aged 14 or over Questions since birth about family, education, and professional biography: - with whom lived together - periods out of Switzerland - changes of civil status - learned professions - education - professional and non-professional biography - family life events (divorce-re-marriage of parents) 2-14
26 International Context SHP is part of the Cross National Equivalent File (CNEF): General population panel surveys from: USA (PSID since 198) D (SOEP, since 1984) UK (BHPS since 1991) Canada (SLID since 1993) CH (SHP since 1999) Australia (HILDA since 21) Korea (KLIPS since 27) More countries will join (Russia, South Africa, ) Each panel includes subset of all variables (variables from original files can be merged) Variables ex-post harmonized, names, categories Missing income variables are imputed Frick, Jenkins, Lillard, Lipps and Wooden (27): The Cross-National Equivalent File (CNEF) and its member country household panel studies. Journal of Applied Social Science Studies (Schmollers Jahrbuch) 2-15
27 SHP Questionnaire: Rotation Module Social network X X X X Religion X X X Social participation X X X X Politics X X X X Leisure X X X X Psychological Scales X X X 2-16
28 Outlook: new sample (LIVES) SHP III (213, based on individual register) Biographic questionnaire in 1. Wave SHP III (with NCCR LIVES) NCCR LIVES (Precarious) life course University of Lausanne and Geneva 15 research projects, 12 years Use of SHP 2-17
29 SHP structure of the data 2 yearly files (currently available: (+beta 211)) household Individual 5 unique files master person (mp) master household (mh) social origin (so) last job (lj) activity (employment) calendar (ca) Complementary files biographical questionnaire Interviewer data (2, and yearly since 23) Call data (since 25) CNEF SHP data variables 2-18
30 Documentation (Website: D/E/F) e.g.,: User Guide Questionnaires Variable Search (by variable name and topic) Construction of variables Syntax examples - Merge data files with SPSS, SAS, Stata - Documentation Data Management with SHP 2-19
31 SHP data delivery Data ready about 1 year after end of fieldwork downloadable from SHP-server: Signed contract with FORS Upon contract receipt, login and password sent by Data free of charge Users become member of SHP scientific network and document all publications based on SHP data Data on request: Imputed income Call data Interviewer matching ID Context data (special contract); data is matched at FORS More info: 2-2
32 3 Stata and panel data
33 Why Stata? Capabilities Data management Broad range of statistics Powerful for panel data! Many commands ready for analysis User-written extensions Beginners and experienced users For beginners: analysis through menus (point and click) Advanced users: good programmable capacities 3_2
34 Starting with Stata Basics Look at the data, variables Descriptive statistics Regression analysis Handout Stata basics Working with panel data Merge Creating «long files» Working with the long file Add information from other household members Handout data management of SHP with Stata (includes Syntax examples, exercises) 3_3
35 1. Merge: _merge variable Master file idpers p7c idpers p8c using file idpers p7c44 p8c _merge Merge variable 1 only in master file 2 only in user file 3 in both files 3_4
36 Merge: identifier Master file idpers p7c idpers p8c using file idpers p7c44 p8c _merge _5
37 Merge files: identifiers filename identifiers Individual master file shp_mp idpers, idhous$$, idfath, idmoth Individual annual files shp$$_p_user idpers, idint, idhous$$, idspou, refper$$ Additional ind. files (Social origin, last job, calendar, biographic) shp_so, shp_lj shp_ca, shp_* idpers Interviewer data shp$$_v_user idint Household annual files shp$$_h_user Biographic files idhous$$, refpers, idint, canton$$, (gdenr) idpers CNEF files shpequiv_$$$$ x1111ll (=idpers) 3_6
38 Stata merge command The merge command merge [type] [varlist] using filename [filename...] [, options] varlist filename identifier(s), e.g. idpers data set to be merged type 1:1 each observation has a unique identifier in both data sets 1:m, m:1 some observations have the same identifier in one data set 3_7
39 2 annual individual files Basic merge example I use shp8_p_user, clear merge 1:1 idpers using shp_p_user _merge Freq. Percent Cum , , , Total 16, _8
40 Basic merge example II annual individual file and individual master file use shp8_p_user, clear // opens the file (master) count //there are cases merge 1:1 idpers using shp_mp //identif. & using file tab _merge _merge Freq. Percent Cum , , Total 22, 1. drop if _merge==2 //if only ind. from 28 wanted drop _merge 3_9
41 Basic merge example III annual individual file and annual household file use shp8_p_user, clear //master file merge m:1 idhous8 using shp8_h_user /*identifier & using file */ _merge Freq. Percent Cum , Total 1, _1
42 More on merge Options of merge command keepusing (varlist): selection of variables from using file keep: selection of observations from master and/or using file for more options: type help merge Merge many files loops (see handout) Create partner files (see handout) 3_11
43 2. Wide and long format Wide format idpers i4empyn i5empyn i6empyn i7empyn Long format (person-period-file) idpers year iempyn _12
44 Use of long data format in stata All panel applications: xt commands descriptives panel data models fixed effects models, random effects, multilevel discrete time event-history analysis declare panel structure panel identifier, time identifier xtset idpers wave 3_13
45 Convert wide form to long form reshape long command in stata reshape long varlist, i(idpers) j(wave) But: stata does not automatically detect years in varname reshape long /// i (idpers) /// j(wave "99" "" "1" "2" "3" "4" /// "5" "6" "7" "8" ),atwl () 3_14
46 Create a long file with append 1. Modify datasets for each wave idpers i99wyn idpers wave iwyn temp1.dta 2. Stack data sets use temp1, clear forval y = 2/1 { append using temp`y' } idpers i99wyn idpers wave iwyn temp2.dta 3_15
47 Work with time lags If data in long format and defined as panel data (xtset) l. indicates time lag Example: social class of last job (see handout) 3_16
48 Missing data in the SHP Missing data in the SHP: negative values -1 does not know -2 no answer -3 inapplicable (question has not been asked) -8/-4 other missings Missing data in Stata:..a.b.c.d etc negative values are treated as real values missing data (..a.b etc) are defined as the highest possible values;. <.a <.b <.c <.d recode to missing or analyses only positive values e.g. sum i8empyn if i8empyn>= care with operator > e.g. count if i8empyn>1 counts also missing values write <. instead of!=. 3_17
49 Longitudinal data analysis with Stata xt commands descriptive statistics xtdescribe xtsum, xttab, xttrans regression analysis xtreg, xtgls, xtlogit, xtpoisson, xtcloglog xtmixed, xtmelogit, diagrams: xtline 3_18
50 Descriptive analysis Get to know the data Usually: similar findings to complicated models Visualisation Accessible results to a wider public Assumptions more explicit than in complicated models 3_19
51 Example: variability of party preferences Kuhn (29), Swiss Political Science Review 15(3): _2
52 Happiness with life 8.5 Example: becoming unemployed West East German-speaking French-speaking Employed Year before unemployed 1st year unemployed 2nd year unemployed 5. Employed Year before unemployed 1st year unemployed Germany, Switzerland, nd year unemployed Oesch and Lipps (212), European Sociological Review (online first) 3_21
53 Example: Income mobility Switzerland Low income 29 Middle income 29 High income 29 Total Low income % 4.8 % 3.1 % 1 % Middle income % 75.8 % 1.8 % 1 % High income % 34.4 % 61.1 % 1 % Germany Low income 29 Middle income 29 High income 29 Total Low income % 36.4 % 1.9 % 1 % Middle income % 78.4 % 9.2 % 1 % High income % 29.6 % 67.8 % 1 % Grabka and Kuhn (212), Swiss Journal of Sociology 38(2), _22
54 4 Linear regression (Refresher course) 4_1
55 Aim and content Refresher course on linear regression What is a regression? How do we obtain regression coefficients? How to interpret regression coefficients? Inference from sample to population of interest (significance tests) Assumptions of linear regression Consequences when assumptions are violated 4_2
56 What is a regression? A statistical method for studying the relationship between a single dependent variable and one or more independent variables. Y: dependent variable X: independent variable(s) Simplest form: bivariate linear regression linear relationship between a dependent and one independent variable for a given set of observations Example Does the wage level affect the number of hours worked? Gender discrimination in wages? Do children increase happiness? 4_3
57 Y yi ŷi Linear regression: fitting a line Y = x 1 unit X slope xi ei 4_4
58 yearly income from employment number of years spent in paid work scatter plot of observations 4_5
59 yearly income from employment a b 1 unit x number of years spent in paid work Regression line: ŷ i = a + bx i = *x i 4_6
60 yearly income from employment number of years spent in paid work Regression line: ŷ i = a + bx i = *xi Estimated regression equation: y i = a + bx i + e i 4_7
61 Components (linear) regression equation Estimated regression equation: y i = a + bx i + e i y dependent variable x independent variable(s) (predictor(s), regressor(s)) a intercept (predicted value of Y if x =) b regression coefficients (slope) measure of the effect of X on Y e part of y not explained by x (residual), due to -omitted variables - measurement errors - stochastic shock -disturbance 4_8
62 Scales of independent variables Independent variables Continuous variables: linear Binary variables (Dummy variables) (, 1) (e.g. female=1, male=) Ordinal or multivariate variables (n categories) Create n-1 dummy variables (base category) Examples: educational levels 1 low educational level 2 intermediate educational level 3 high educational level Include 2 dummy variables in regression model 4_9
63 Example: multivariate regression Including other covariates: Regression coefficients represent the portion of y explained by x that is not explained by the other x s Example: gender wage gap (sample: full-time employed, yearly salary between 2 and 2 CHF) Bivariate model y = a + b x + e i salary = female + e i Multivariate model b constant 45'369 y = a + b 1 x 1 + b 2 x 2 + b 3 x e i female -9'9 education (Ref: compulsory) secondary education 9'197 tertiary education 3'786 supervision 17'128 financial sector 15'592 number of years in paid work 729 4_1
64 Assumptions for OLS-estimations: coefficients Assumptions for OLS-estimation (necessary to calculate slope coefficients) 1) No perfect multicollinearity (None of the regressors can be written as a linear function of the other regressors) 2) E(e) = 3) None of the x is correlated with e; Cov(x,e) = ; (all x s are exogenous) If assumptions 1-3 hold: OLS is consistent (regression coefficients asymptotically unbiased) 4_11
65 Inference from linear regression I Inference from OLS-estimations if random sample But: OLS coefficients are estimations Estimated regression equation: y i = a + bx i + e i True regression equation: y i = α + βx i + ε i True coefficients (α, β) unknown, true «error term» unknown Distribution of coefficients (a, b) E( b) Var( b) ˆ 2 σ β E(β) 4_12
66 Inference from linear regression II Var ( b) ( i ) where 2 ( xi x) n p Variation of b (σ β2 ): decreases if n increases x are more spread out squared residuals decrease Distribution of b Student t-distribution Depends on n and number of x s = normal distribution if n large σ β E(β) 4_13
67 Inference from linear regression: testing whether b If β = (in population), there is no relationship between x and y test how likely it is, that β = H : Distribution if β = critical values for coefficients compare estimated coefficient with critical value if abs(b) >abs(critical value), b significant b stand b t value b b Critical value for standardized normal distribution and 95% confidence level: _14
68 Inference from linear regression: example yearly income from employment number of years spent in paid work Regression line: ŷ i = a + b *x i example: ŷ i = *x i 4_15
69 Inference from linear regression: example Sample n=53 Coef. st.e. t P> t [95% Conf. Interval] years work _cons R 2 :.11 Sample n=1787 Coef. St.e. t P> t [95% Conf. Interval] years work _cons R 2 :.159 4_16
70 4_17 Inference : assumptions Assumptions on error terms Independence of error terms, no autocorrelation: Cov (ε i, ε k ) = for all i,k, i k Constant error variance : Var(ε i )=σ 2 ε for all i; (Homoscedasticity) Preferentially: e is normally distributed Matrix of error terms ; n n n n k i
71 4_18 Autocorrelation Reason: Nested observations (e.g. households, schools, time, communities) standard errors underestimated OLS, adjust standard errors 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ 2 σ ; n n n n k i ; n n n n k i autocorrelation no autocorrelation
72 4_19 Heteroskedasticity Variance is not consistent standard errors overestimated or underestimated OLS, adjust standard errors (White standard errors) Weighted least squares (WLS) σ σ 1 σ 5 σ 4 σ 3 σ 2 σ ; n n n n k i ; n n n n k i Homoskedasticity Heteroskedasticity
73 Summary: assumptions of OLS regression General Continuous dependent variable Random sample Coefficient estimation No perfect multicollinearity E(e) = No endogeneity Cov(x,e) = Omitted variables Measurement error in indep. variables Simultaneity Nonlinearity in parameters Inference No autocorrelation Cov (ei, ek)= Constant variance (no heterogeneity) Preferentially: residuals normally distributed Coefficients biased (inconsistent) Standard errors of coefficients biased 4_2
74 Endogeneity Traditional meaning Variable is determined within a model Econometrics Any situation where an explanatory variable is correlated with the residual If a variable is endogenous Care with interpretation: model cannot be interpreted as causal 4_21
75 Endogeneity: reasons Omitted variables Measurement error (in explanatory variables) Simultaneity Nonlinearity in parameters x contains lagged values of y (see later, dynamic models) 4_22
76 Endogeneity: consequences and detection Consequence of endogeneity ALL estimators may be biased! (exception: if a variable is completely exogenous controlling for the endogenous variable) Detection of endogeneity Difficult to detect and correct! Caution for causal interpretation Theory, literature (variable selection and interpretation)!!!! 4_23
77 Endogeneity: correction Test for nonlinear relationship and interactions Omitted variables Proxy variables, instrumental variables (2sls estimation) Panels: FE-models (within estimators), Difference-in-Difference models Propensity score analysis, Regression discontinuity, Heckman selection models, Simultaneity Structural equations modelling, Panel data for time ordering Theory, literature (variable selection and interpretation)!!!! 4_24
78 Example: Control of unobserved heterogeneity Example: effect of partnership on happiness happiness = e (partnership) We know (from literature, e.g.): happiness = f (attractiveness, leisure activities, health,...) [not measured!] Similarly: partnership = g (attractiveness, leisure activities, health,...) Cross-sectional data happiness = e(partnership) but we know that there are positive effects on partnership from income, fitness, attractiveness, health,... -> which part of effects are due to attractiveness, leisure activities, health,...? Panel data 1. Happiness (at times with a partner) 2. Happiness (at times without partner) of the same individuals Care: reversed causality, time-dependent unmeasured effects!) 4_25
79 Example: Regression analysis using Stata Sample: individuals in employment, 2 to 6 years Dependent variable: Working hours per week (paid work) Independent variables: hourly wage, number of children, married, sex, age Summary statistics sum workhours wage age8 married_nokid onekid twokids threepkids /// if workhours> & age8>=2 & age8<=6 Variable Obs Mean Std. Dev. Min Max workhours wage age married_no kids onekid twokids three+ kids _26
80 Example: check hourly wages graph box wage 4_ gross hourly wage Frequency gross hourly wage
81 Example: transform hourly wages Frequency Frequency gross hourly wage histogram wage, freq histogram lnwage, freq lnwage 4_28
82 reg workhours wagesd wagesq agerec agerecsq married_nokid /// onekid twokids threepkids if workhours> & age8>2 & age8<=6 Source SS df MS Number of obs = F( 8, 36) = Model Prob > F =. Residual R-squared = Adj R-squared =.3598 Total Root MSE = workhours Coef. Std. Err. t P> t [95% Conf. Interval] female lnwage lnwagesq onekid twokids threepkids married_no~d agerec _cons _29
83 Diagnostic plot: heteroskedasticity standardised residuals Fitted values 4_3
84 Diagnostic plot: normal distribution of residuals Density standardised residuals Standardized residuals normal scores 4_31
85 Regression with panel data: Data structure Wide data format Long data format (person-period-file, pooled data) idpers wage4 wave5 wave6 wage idpers year wage _32
86 OLS with pooled panel data: problems I OLS for cross-sectional analysis (one wave) no particular problem! OLS for pooled data (different years in one file) Problem: assumption of independent observations violated (autocorrelation) Possible correction: Correct for clustering in error terms (coefficients unaffected) But: OLS is not the best estimator for pooled data (not efficient) number of working hours OLS OLS, cluster in se per week b t b t lnwage (13.76) lnwage squared (-9.57) (-5.55) female (-15.34) (-11.4) 1 child.78 (2.5).78 (1.53) female*1 child (-2.43) (-14.) 2 children 1.74 (4.91) 1.74 (3.84) female*2 children (-33.4) (-16.36) 3+ children 2.85 (5.88) 2.85 (2.85) female*3+ children (-26.73) (-18.84) married, no child 2.4 (5.63) 2.4 (4.53) female*married, no child (-19.95) (-13.63) age -.6 (-6.89) -.6 (-4.39) 4_33
87 OLS with panel data: problems II OLS does not take advantage of panel structure Two different types of variation in panel data Variation within individuals Variation between individuals Control for unobservable variables (stable personal characteristics) Fixed Effects Models (only within variation) Random Effect Models (multilevel /random intercept / frailty for event history) 4_34
88 Comparisons of different regression models number of working OLS OLS, cluster Random effects Fixed effects hours per week b t b t b t b t lnwage (13.76) (14.2) 13.3 (1.26) lnwage squared (-9.57) (-5.55) (-12.3) (-9.91) female (-15.34) (-11.4) (-16.19) dropped 1 child.78 (2.5).78 (1.53) 1.33 (3.31).22 (.39) female*1 child (-2.43) (-14.) (-19.) -6.8 (-7.76) 2 children 1.74 (4.91) 1.74 (3.84) 1.84 (4.23).19 (.3) female*2 children (-33.4) (-16.36) (-23.3) (-8.17) 3+ children 2.85 (5.88) 2.85 (2.85) 2.55 (4.23) -.23 (.26) female*3+ children (-26.73) (-18.84) (-19.28) (-6.73) married, no child 2.4 (5.63) 2.4 (4.53) 1.79 (4.23).21 (.36) female*married, no chi (-19.95) (-13.63) (-11.71) -.54 (-.63) age -.6 (-6.89) -.6 (-4.39) -.3 (-2.42) constant R squared _35
89 5 Introducing Fixed Effects Models ( within effects)
90 Hypothesis: Example: BMI after stopping smoking BMI increases after stopping smoking Hypothetical data: Random sample of former smokers with year after stopping and BMI at that time for 3 individuals time bmi1 bmi2 bmi
91 BMI after stopping smoking: pooled OLS BMI years after stop smoke bmi Fitted values 5-3
92 Pooled regression. * pooled regression:. reg bmi time Source SS df MS Number of obs = F( 1, 16) =.26 Model Prob > F =.6149 Residual R-squared = Adj R-squared = Total Root MSE = bmi Coef. Std. Err. t P> t [95% Conf. Interval] time _cons No BMI increase over time 5-4
93 BMI after stopping smoking: individual data BMI years after stop smoke P1 P2 P3 OLS P1 OLS P2 OLS P3 OLSpooled Autocorrelat. with pooled regression-unobserved individual heterogeneity 5-5
94 Individual regressions. forval j=1/3 { /* loop over each individual*/ 2. reg bmi`j' time 3. } bmi1 Coef. Std. Err. t P> t [95% Conf. Interval] time _cons bmi2 Coef. Std. Err. t P> t [95% Conf. Interval] time _cons bmi3 Coef. Std. Err. t P> t [95% Conf. Interval] time _cons All individuals have significant BMI increase over time 5-6
95 Excursus: unobserved heterogeneity Omitted variables bias: Many individual characteristics are not observed e.g. enthusiasm, ability, willingness to take risks, our example: physical activities, calories intake, muscle mass, genes, cohort These have generally an effect on dependent variable, and are correlated with independent variables. Then regression coefficients will be biased! Note: these (formerly) unobserved measures are increasingly included in surveys 5-7
96 What about the between-effect? BMI years after stop smoke P1 P1 P2 P2 P3 P3 OLS P1 OLS P1 OLS P2 OLS OLS P2 P3 OLSpooled P3 5-8
97 Which models are appropriate to analyze the effects of time? Data transformation necessary 5-9
98 Panel Data and within- Regression (FE) 5-1
99 Error components in panel data models We separate the error components: e it = u i + ε u it, i = person-specific unobserved heterogeneity (level) = fixed effects (e.g., physical activities, calories intake, genetics, cohort) ε it = residual Model: bmi it 1 x it u i it Remember: Pooled OLS assumes that x is not correlated with both error components u i and ε it (omitted variable bias) 5-11
100 Fixed effects regression We can eliminated the fixed effects u i by estimating them as person specific dummies -> remains only within-variation Corresponds to de-meaning for each individual: bmi it 1 x it u i it (1) bmi i individual mean: (2) i 1 x i u i subtract (2) from (1): bmi it bmi i ( x x ) ( ) 1 it i it i -> Fixed (all time invariant) effects u i disappear, i.e. timeconstant unobserved heterogeneity is eliminated 5-12
101 De-meaned values with OLS regression de-meaned 25 BMI BMI de-meaned years after years stop after smoke stop smoke P1 P1 P2 P2 P3 P3 OLS P1 OLS OLS P2 P2 OLS OLS P3 P3 5-13
102 OLS of individually de-meaned Data We de-mean and regress the Data:. bysort id: egen bmi_m=mean(bmi). gen bmi_dem=bmi-bmi_m. bysort id: egen time_m=mean(time). gen time_dem=time-time_m. reg bmi_dem time_dem Source SS df MS Number of obs = F( 1, 16) = Model Prob > F =. Residual R-squared = Adj R-squared =.844 Total Root MSE = bmi_dem Coef. Std. Err. t P> t [95% Conf. Interval] time_dem _cons -2.12e
103 Direct modeling of fixed Effects in Stata xtreg bmi time, fe (calculates correct df; this causes higher Std. Err.). xtreg bmi time, fe Fixed-effects (within) regression Number of obs = 18 Group variable: id Number of groups = 3 R-sq: within =.8532 Obs per group: min = 6 between =.992 avg = 6 time since stop smoking explains parts of individual heterogeneity! max = 6 overall =.162 (from pooled OLS) F(1,14) = corr(u_i, Xb) = Prob > F = bmi Coef. Std. Err. t P> t [95% Conf. Interval] time _cons sigma_u sigma_e rho (fraction of variance due to u_i) F test that all u_i=: F(2, 14) = Prob > F =. 5-15
104 Alternative: OLS with individual dummies controlled. xi i.id, noomit. reg bmi time _I*, noconst Source SS df MS Number of obs = F( 4, 14) = Model Prob > F =. Residual R-squared =.9996! Adj R-squared =.9994 Total Root MSE = bmi Coef. Std. Err. t P> t [95% Conf. Interval] time _Iid_ _Iid_ _Iid_ useful for small N, the u i are estimated (only approximate) 5-16
105 FE estimation can solve the problem of unobserved heterogeneity But: Summary: Fixed Effects Estimation If number of groups large, many extra parameters Enough variance needed in data With FE-Regressions, estimation of time-constant covariates not possible. Are dropped from the model. But: possibility to use interactions (like male*nrchildren) What about comparing with people who never stopped smoking or who never smoked? (later) 5-17
106 No identification of time-invariant covariates z i Consider the model: y it = az i + bx it + u i + ε it (1) let be an arbitrary number; add and subtract z i on the rhs: y it = (az i + z i ) + bx it + (u i - z i ) + ε it and rewrite this as: y it = a*z i + bx it + u i *+ ε it with a* = a + and u i * = u i - z i (2) But (1) and (2) have exactly the same form so it is not clear if a or a* = a + is estimated -> separate effects of az i and u i cannot be distinguished without further assumptions (e.g., no correlation between z i and u i ) 5-18
107 Example: DiD (control group comparison) Hypothesis: financial support increases BMI of low-income women (Schmeisser 29)* (hypothetical) experiment: Survey: Sample of low income female patients of doctoral surgeries randomized into program social aid (5% in program and 5% not) 4 measurements of bmi (2 x before program start, 2 x after) *Expanding wallets and waistlines: the impact of family income on the bmi of women and men eligible for the earned income tax credit. Health Econ. 18:
108 Data: BMI-increase through social aid? bmi BMI of low income Women time Effects: causal effect: BMI increase due to more fast food (more money available) Time (age-) effect: BMI increases with age NO aid aid Start social aid Programm 5-2
109 Pooled Regression. reg bmi aid Source SS df MS Number of obs = F( 1, 14) =.89 Model Prob > F =.3627 Residual R-squared = Adj R-squared = -.77 Total Root MSE = bmi Coef. Std. Err. t P> t [95% Conf. Interval] aid _cons
110 Fixed effects xtreg bmi aid, fe. xtreg bmi aid, fe Fixed-effects (within) regression Number of obs = 16 Group variable: id Number of groups = 4 R-sq: within =.3596 Obs per group: min = 4 between =.54 avg = 4. overall =.595 max = 4 F(1,11) = 6.18 corr(u_i, Xb) = -.74 Prob > F = bmi Coef. Std. Err. t P> t [95% Conf. Interval] aid _cons sigma_u sigma_e rho (fraction of variance due to u_i) F test that all u_i=: F(3, 11) = Prob > F =. 5-22
111 One step back: a causal model With cross-sectional data: only between-estimation: Crucial assumption: Y i, Tt ( reatment) C( ontrol) Yi,t Random sample (no unobserved heterogeneity) With Panel data I: within-estimation (before and after) T C Y i, t 1 Yi,t problem: time effects, panel conditioning With Panel data II: difference-in-difference (DID): ( T C C C Y i, t 1Yi,t j,t 1 Yj, t ) ( Y ) 5-23
112 Now: causal effect of aid We have - Before-after comparison (within) - Treatment- and control groups (between) We compare the within-effect of aid ( treatment ) with that without aid ( control ) i.e., we calculate treatment effect and control for time DID estimator: =(after aid -before aid ) (after noaid before noaid )
113 DiD effects. xi i.time, noomit. xtreg bmi aid _I*, fe note: _Itime_4 omitted because of collinearity Fixed-effects (within) regression Number of obs = 16 Group variable: id Number of groups = 4 R-sq: within =.8876 Obs per group: min = 4 between =.54 avg = 4. overall =.1528 max = 4 F(4,8) = 15.8 corr(u_i, Xb) = -.55 Prob > F = bmi Coef. Std. Err. t P> t [95% Conf. Interval] aid _Itime_ _Itime_ _Itime_ _Itime_4 (omitted) _cons sigma_u sigma_e rho (fraction of variance due to u_i) F test that all u_i=: F(3, 8) = Prob > F =. 5-25
114 FE easier, no control group Summary: within estimators DID: control group can control simultaneous effects (like time): find statistical twin, such that research variables is only variable -> DID useful for estimating causal effects from nonexperimental data. Especially for small samples Excursus: First-Difference (FD) estimators: Stable similarities of adjacent observations eliminated. Problem: level differences not taken into account (6-5 children = 1- ch) for lasting effects (like children): FD not useful because only immediate changes taken into account 5-26
115 6 Introducing Random Effects Models 6-1
116 Towards RE: error components in panel models We have both -within and -between variance: y it = a + e it = a + u i + ε it fixed effects = within: u i for each person ( ANOVA) two residuals: - it on lowest (1.) level: time point a u u i on highest (2.) level: individuals 6-2
117 Necessary, if data have different levels with - observations are not independent of levels - true social interactions Examples: Motivation: multilevel models Schools classes students: first applications Networks: people are influenced by their peers Spatial context: from environment (e.g., poor people are less happy if they live in a rich environment) US: neighborhood-effects Interviewer - effects: respondents clustered in interviewers Panel-surveys: waves clustered in respondents (households) 6-3
118 Levels in clustered data Hierarchical - Households in neighborhoods - Students in schools in classes (three levels) - Respondents in interviewers - Panel Surveys: Waves in respondents (crossed?) Crossed - Questions in respondents longitudinal? Attention: Do not confuse variables and levels: (total) variance can be attributed to levels, not to variables! E.g., 1 hospitals are probably a level, 7 nations are probably dummy variables. Think of a population the sample represents! Note: 3 rule of thumb in contexts: 3 second level units, randomly chosen 6-4
119 Multilevel models: analytic advantages Improved regression models - unbiased estimators for regression coefficients - unbiased estimators for standard errors (usually higher std.err. than OLS) - model true covariance structure (autocorrelation, heteroscedasticity, ) Decomposition of total variance into those in different levels (within-between) Similar to Analysis of variance (ANOVA), but parsimonious (number of estimation parameter independent on number of contexts) can handle large number of contexts (in parts) modeling of unobserved heterogeneity / self selection 6-5
120 Typical results: one-level vs. multilevel model Dependent Variable mixed school boy school girl school Underestimated variance: Kish design effect deff : larger N necessary 6-6
121 Illustration: variance decomposition between levels Example: 2 individuals each asked 2x about their happiness (continuous), here measurements not time ordered! Which variance is due to individuals, which to observations? 3 Happiness Total ( Grand ) Mean (=) Indiv. 1 Indiv. 2 Total Mean= (3+2+(-1)+(-4)) / 4 = 6-7
122 Illustration: calculation of the total variance Happiness n i 1 T i ( t 1 y it T yy y ) 2 n T i i1 t1 ( y it W y yy Total Mean (=) i ) 2 n T i i1 t1 B yy (y i y) 2 Indiv. 1 Indiv. 2 Total Variance is equal to the Square of the Differences of all Observations from the Total Mean divided by the Sample Size (4) = { (3-) 2 + (2-) 2 + (-1-) 2 +(-4-) 2 } / 4 = ( )/4 =
123 Happiness Illustration: individual specific variance ( between ) n T i i1 t1 ( y T it yy y) 2 n i i1 t1 ( y W y yy Total Mean (=) -2.5 T it i ) 2 i 1t n Ti ( 1 B yy yi y ) 2 Indiv. 1 Indiv. 2 variance of the individual means = between -variance ( between = u ) ( (-2.5) 2 )/2 = 6.25 (remember: total variance = 7.5) 6-9
124 Happiness Illustration: measurement specific variance ( within ) n T i i1 t1 ( y T it yy -2.5 y) 2 i 1t n Ti ( 1 yit W yy yi ) 2 n T i i1 t1 B yy (y i y) 2 Indiv. 1 Indiv. 2 variance of measurements within individuals= within -Variance (later: within = ) ((3-2.5) 2 + (2-2.5) 2 + (-1-(-2.5)) 2 + (-4-(-2.5)) 2 )/4 =1.25 (=18% of total variance) variance of individual means = ( (-2.5) 2 )/2 = 6.25 (=82% of total variance = ρ = ICC (intra-class-correlation) 6-1
125 ICC = (zero clustering use OLS) Examples of Θ ICC.8 ICC.2 ICC = 1 (maximum clustering) 6-11
126 y where : u ε it i it u Starting point: null ( Variance Components (VC)) model i it individual specific random variable (N(,σ deviation from individual (note : no intercept a in VC model) u specific mean (N(,σ ) assumed) ε ) assumed) the VC model allows for variance decomposition : ρ correlation between different time points t within an individual i: 2 σu ρ ( ICC intra - class - correlation autocorrelation in Panels) 2 2 σ σ u ε (note : ρ significant multilevel model necessary) 6-12
127 Idea RE: better estimate of u i (modeling intercept) random intercept ( borrowing strength from others): u high edu To estimate mean of individual i (=u i ) only within (FE) suboptimal if - sample small (T i small) - variance high - n large (inefficient) a low edu idea: use information u j from other sample members (between) in the same population group (e.g., education) 6-13
128 Idea RE: weighted within and between RE - Regression is equivalent to pooled OLS after the Transformation : (y it with θ y ) β i (1 θ) β (x θ 1, and σ 1 it θ x ) (u (1 θ) (ε 2 ε i 2 σε Tσ 2 u i, θ 1 it θ ε i )) RE uses optimal combination of within and between variation RE allows estimation of time invariant RE biased because u i variables u remains in error term (if cov(x,u ) ) i i 6-14
129 . xtreg bmi children, re theta Example ρ based on SHP Random-effects GLS regression Number of obs = 18 Group variable: id Number of groups = 3 R-sq: within =.3921 Obs per group: min = 6 between =.1125 avg = 6. overall =.24 max = 6 Wald chi2(1) = 9.76 corr(u_i, X) = (assumed) Prob > chi2 =.18 theta = bmi Coef. Std. Err. z P> z [95% Conf. Interval] children _cons sigma_u sigma_e rho (fraction of variance due to u_i) 6-15
130 Random-Effects estimation 6-16
131 Illustration: how is a random intercept model fitted? model y it = bx it + u i + it with u i ~ N(, u ), it ~ N(, ), cov(x,u i )= Parameters to be estimated are: b, u, Parameter estimation: IGLS - first iteration: OLS, assuming u = - then iteration with current covariance matrix (from residuals) and GLS-estimated u, Maximum Likelihood 6-17
132 Intuition of multilevel (random intercept) ε it u i 6-18
133 FE versus RE models Regression equation: y it = α + βx it + u i + ε it Fixed effects models OLS estimated only variance within individuals used Controls for unobserved heterogeneity (consistent also if Cov(u,x) ) effects of time-invariant characteristics cannot be estimated (e.g., sex, cohort) If research interest is longitudinal Random effect models Cannot be estimated with OLS assumes u i ~N(,σ u ) assumes exogeneity: Cov(u,x)= (no effects from unobserved variables) Effects from time-invariant and timevarying covariates More efficient in unbalanced panels (additional observations for the same individual does not add information linearly!) If research interest is cross-sectional - method must depend on research question (longitudinal variables: FE, true cross-sectional variables : robust OLS, RE) - formally: test for RE against FE model (Hausman test)
134 Hausman test Test if FE or RE model (basic assumption: FE unbiased) Test H : E(u i x it ) = cov(u i, x it ) = cov(u i,x it ) = -> FE and RE unbiased, FE is inefficient -> RE cov(u i,x it ) -> FE is unbiased and RE is biased -> FE If H is true (between-coeff.=within-coeff.), no differences between FE and RE equivalently : Hausman compares if if H H 1, βˆ, βˆ FE FE βˆ RE and βˆ unbiased but βˆ estimation coefficients RE more efficient RE not βˆ FE (var(βˆ FE and βˆ ) RE var(βˆ Note: - H almost always rejected (sample size high enough even with small differences) - Test does not replace research question driven check for model appropriateness 6-2 RE ))
135 * Hausman Test: RE or FE estimate?. qui xtreg bmi children, re. estimates store random. qui xtreg bmi children, fe. estimates store fixed. hausman fixed random Hausman test: example ---- Coefficients ---- (b) (B) (b-b) sqrt(diag(v_b-v_b)) fixed random Difference S.E children b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Test: Ho: difference in coefficients not systematic chi2(1) = (b-b)'[(v_b-v_b)^(-1)](b-b) =. Prob>chi2 =
136 Summary: RE estimation RE more efficient than FE because RE also models person specific time-invariant effects (uses more information) RE uses combination of within- and between- estimator RE uses coefficients of time-invariant regressors (between- Variation) RE not useful for causal effects of time-variant regressors Main assumption of RE (and limitation vs. FE): Person specific effect u i must not correlate with regressors Hausman test can test this assumption 6-22
137 RE models RE controls for unobserved heterogeneity (under strong assumptions) Unbiased standard errors (and inferences) Decomposition of total variance on different levels Modeling of variance Summary: models for panel data Allows for covariance that is appropriate for panel data (autocorrelation, heteroscedasticity, etc.) Longitudinal models (e.g., FE, DiD) Full control for unobserved heterogeneity Fixed effects models: within-variance only Difference-in-Difference (DID): before after comparisons (withintreatment effect), controlled for observed (between) effects Other models possible (FD) 6-23
138 Specifically for linear regression models Within research questions causal effects of time-variant variables: modeling intrapersonal change (FE models) If in addition interest in time-invariant variables: hybrid models (RE, include individual means) Cross-sectional research questions effects of time-invariant effects, ( correlations ) OLS with robust standard errors In unbalanced panels: RE models Interpretation (eg children): within: effect of additional children between: differences between people with a different number of children 6-24
139 Models result example * Data: bmi sample estout m1 m2 m3 m4, cells(b se) OLS OLS_cluster_pers FE RE b/se b/se b/se b/se working working_bar working_dem _cons
140 Hybrid models ( regression with context variables ) FE is preferred for causal statements BUT: no inclusion of time-constant variables Idea: hybrid-models simultaneous within and between regression: y it a (x it xi ) x i z i it Estimated as RE model Hybrid-models separate within and between - effects Hybrid-models eliminate the part of unobserved heterogeneity correlated with the independent variables ( unbiased estimators) 6-26
141 + hybrid (1) * Data: bmi sample estout m1 m2 m3 m4 m5, cells(b se) stats(n r2 r2_w r2_b sigma_u sigma_e thta_max) OLS OLS_cluster_pers FE RE RE b/se b/se b/se b/se working working_bar - working_dem - _cons N r r2_w r2_b sigma_u sigma_e thta_max
142 RE (combin.) hybrid (2) FE (dem(x)) BE (mean(x)) hybrid (both) SHP 2-21 Happiness b b b b nb children.13** -.17*.12*** health.391***.288***.954***.393*** male.65***.128***.62** intermed. educ. -.89*** -.192*** -.65** -.96*** high educ. -.83*** -.255*** -.62* -.88*** childmean.64*** childwithin -.16* _cons 6.482*** 7.49*** 4.72*** 6.438*** N
143 7 Non linear regression 7-1
144 Non linear regression: motivation Linear regression: requires continuous dependent variable e.g. BMI, income, satisfaction on scale from 1 (?) Most variables in social science are not continuous but discrete Opinions: agree vs. disagree Poverty status Party voted for Number of visits to the doctor Having a partner We need appropriate regression models! 7-2
145 Non linear models Dependent variable is not continuous: non linear regression Binary variables (dummy variables, or 1) (e.g. yes no) Multinomial (unordered variables) (e.g. vote choice, occupation) Ordinal (e.g. satisfaction) Count variable (e.g. doctor visits) Logistic Reg., Probit Regression Complementary log log Regression Multinomal logistic Regression Multinomial probitregression Ordinal Regression Poisson Regression Negative Binomial Regression 7-3
146 Linear probability model for binary variables Binary variables (y = or 1) E(y)=π (probability y=1) Linear probability model: π = + β 1 x x 7-4
147 Problems of linear probability model Violation of regression assumptions The variance of y for binary variables is π(1 π) residual variance depends on x i heteroskedasticity Relationship between response probability and x may not be linear, especially for π close to and π close to 1 Predicted response probabilities may be negative or greater than one Residuals can take only take two values for fixed x residuals are not normally distributed 7-5
148 S shaped function 1 P(y=1) e.g. logistic regression Effect size depends on x Values between and 1 x E.g. cumulative logistic function, cumulative normal distribution 7-6
149 Logistic transformation Non linear transformations for binary variables e i follow a logistic distribution: logit model Normal transformation Pr( y 1) P 1 e P Logit Ln 1 P e i follow normal distribution: Probit model Pr(y=1) = Ф (α + βx) where Ф is the cdf of a standard normal distribution 1 x ( α β 1 x x 1...) β 2 x 2... For practical purposes, logit and probit models are equivalent for statistical analysis (very similar predicted probabilities) 7-7
150 Generalised linear model A latent (unobserved) continuous variable y* which underlies the observed data y* = a + b 1 x e* Assume y i * is generated by a linear regression structure Link function between y and y* E(y) = g 1 (y*) = g 1 (a + b 1 x e*) Choice of the link function depends on distribution of e* e.g. logit, probit, poisson, negative binomial Scale of y* is variable, variance of e i * assumed Probit model: assume e i *~N(,1) Logit model: assume e i * standard logistic (with var=π 2 /3 ~3.29) in general, coefficients cannot be compared across different models 7-8
151 Most frequent link functions Distribution Link function Normal Identity y=y* Poisson Log y=exp(y*) Binomial Logistic y=1/(1+exp(y*)) Probit y= Ф(y*) 7-9
152 Maximum likelihood estimation Usually, non linear models are estimated by maximum likelihood (MLE) Advantages of MLE MLE is extremely flexible and can easily handle both linear and non linear models Desirable asymptotic properties: consistency, efficiency, normality, (consistent if missing at random MAR) Disadvantages of ML estimation Requires assumptions on distribution of residuals Desired properties hold only if model correctly specified Best suited for large samples 7-1
153 Parameter estimation with ML Principle for MLE: Which set of parameters has the highest likelihood to generate the data actually observed (x i, y i )? Observed values (x i, y i ) are a function of the unknown model parameters θ L=f( x i, y i θ) (likelihood function) The maximum likelihood estimate θ is that value which maximizes L Often, there is no closed form (algebraic) solution. Coefficients have to be estimated through numerical methods (iterative algorithms) For practical reasons, often the log likelihood is used instead of the likelihood Linear model: MLE = OLS estimator 7-11
154 Example: logistic regression (y: 1 vote SVP; vote other party) M1 M2 Female -.51*** -.38*** 18-3 years years years Education: intermed -.239* -.24 Education: high *** -.929*** Income (hh, log) -.395*** -.281*** EU: against joining 2.11*** Foreigners: prefer Swiss 1.7*** Nuclear energy: pro.57*** Satisfaction with democracy -.233*** constant 3.51*** -.233*** N
155 Example: logit vs probit logit probit b t b t female -.38*** (-3.7) -.222*** (-3.8) 18-3 years.87 (.6).41 (.6) 46-6 years.81 (.6).47 (.6) over 6 years.192 (1.4).18 (1.4) intermed.edu (-1.8) (-1.9) high edu *** (-5.6) -.552*** (-5.9) hh income (log) -.281*** (-3.5) -.163*** (-3.6) stay outside EU 2.11*** (15.) 1.123*** (16.1) prefer Swiss 1.7*** (11.).625*** (11.2) pro nuc. energy.57*** (5.6).334*** (5.8) satisf.democracy-.233*** (-9.1) -.132*** (-9.1) _cons 1.26 (1.4).81 (1.6) Pseudo R-Square N
156 Interpretation of non linear models Because of the non linearity, coefficients cannot be interpreted directly Predicted probabilities have to be computed through a non linear transformation according to the link function e.g. π i = F(α + βx i ) Interpretation of Odds ratios (OR) problematic 7-14
157 Odds ratios (OR) OR Often misunderstood as relative risk P Odds OR RR A Group Group B Group Group C Group Group D Group Group E Group Group Ref.: Best and Wolf 212, Kölner Zeitschrift für Soziologie 7-15
158 Logit model Predicted probabilities: example 1 Pr( y 1) ( 1x1 2x2 ) 1e exp(logit) 1exp(logit) Example Individual i with x 1 =2 and x 2 =1, ^ ^ α 3; 1-2 ; 2 ^.5 ^ i 1e 1 1e.5 1 (3 ( 2)2.5)
159 Calculate predicted probabilities Remember: predicted probabilities depend on values of x and parameter estimates (and unobserved heterogeneity) It is often useful to vary only one covariate and hold the others constant at their mean at realistic values (mean for continuous variables, median for ordinal variables, mode for nominal variables) Note: Regression parameter are estimates (sampling variability, confidence intervals) also predicted probabilities are estimates confidence intervals also for predicted probabilities 7-17
160 How does change in x affect predicted probability (y)? Discrete change The change in predicted probability as one moves from one (discrete) value of a predictor to another, holding all else equal. Marginal effect (MFX) or Partial effect The slope of y* at x, holding all else equal. Average partial effect (APE) The average change in predicted probability at different given values of a predictor (calculate effect for each individual i). Average marginal effect (AME) The average slope of y* at x (calculate effect for each individual i). (in Stata: margins) 7-18
161 Example: predict discrete change Vote intention for SVP 29 margins i.eu_contra i.for_contra i.nuc_pro, /// atmeans at (female= edulevel=2 agegr=2 ) Discrete change Variation of one variable, others at their mean or mode 95% Confidence Intervall Sex Men Women Education Low Intermediate High EU Pro or neither Contra Foreigners Same changes or neither Satisfaction Democracy Better changes for Swiss Minimal () Maximal
162 Example: Average Partial effects Vote intention for SVP 29 margins i.female i.edulevel i. agegr APE 95% Confidence Intervall Sex Men Women Education Low Intermediate High EU Pro or neither Contra Foreigners Same changes or neither Satisfaction Democracy Better changes for Swiss Minimal Maximal
163 Marginal effects Conditional Marginal Effects of satdem with 95% CIs Effects on Pr(Svp) satisf. democracy margins, atmeans at (satdem=((1)1)) dydx (satdem)
164 Average Marginal effect Effects on Pr(Svp) Average Marginal Effects of satdem with 95% CIs satisf. democracy margins, at (satdem=((1)1)) dydx (satdem)
165 Model performance Linear regression: R 2, adjusted R 2 explained var. in y R 2 Non linear regression total var. in y - variance of the residual not observed - many so called pseudo R 2 (,1) (Stata: fitstat) Example: voting model M1 M2 McFadden's R McFadden's Adj R ML (Cox-Snell) R Cragg-Uhler(Nagelkerke) R Count R Adj Count R AIC
166 Model performance ctd. MLE: likelihood as a measure of fit Compare likelihood estimated model to the likelihood of empty model (no covariates): equivalent to McFadden R 2 Compare likelihood of nested models: likelihood ratio (LR) test Test joint significance of variables, or nested models Test non nested models: AIC, BIC How many observations are predicted correctly? Best prediction without covariates: mode (correct prediction in π of cases) How much better is the prediction than compared to modesguess? 7-24
167 The likelihood ratio test To test hypotheses involving several predictors (multiple constraints) (e.g. Test β 2 = β 3 = ) Compare log likelihoods of constrained and unconstrained model, e.g. M u : π=(f(α + β 1 x 1 +β 2 x 2 + β 3 x 3 ) M c = π=(f(α + β 1 x 1 ) Generally: L c L u Constraints valid: Lu Lc = Constraints invalid: Lu Lc > Test statistic: LR = 2 ( L u L c ) ~ Х 2 (q) ; (q: number of constraints, d.o.f.) Prerequisites of LR test Models are based on the same sample Models are nested 7-25
168 LR test example: test for joint significance Vote intention model, probability to support SVP vs. supporting another party π=f(α + β 1 (contra EU) +β 2 (satisfaction democracy) +β 3 (female) +β 4 (lnincome) +β 5 (age)) Do demographic variables (age, sex) matter? H: β 3 =β 5 = Model fits: L c = 9.13, L u = LR = 2*( 9.13 ( 911.9))=23.4 Chi squared distribution with 2 degrees of freedom: p=. > we reject H (demographic variables seem to affect the probability to vote for SVP) 7-26
169 Difficulties of nonlinear models: frequent mistakes Interpretation of coefficients (Logits, OR) Comparison of estimates across models and samples (estimates reflect also unobserved heterogeneity) Be cautious with interpretation Use different measures to show effects (MFX, AME) Correction proposed by Karlson et al. (212) References: Mood 21, Best and Wolf 212, Karlson et al. 212, Stata: ado khb Interaction effect Cannot be interpreted as in linear models References: Ai and Norton 23, Norton Wan and Ai 24, Stata: ado inteff Model performance 7-27
170 Multinomial dependent variables More than two response categories (m categories) Unordered e.g. Voting preference (different parties, no party), type of education, social class Multinomial regression compare each pair of response categories estimate probability for each category (1 reference category, m 1 equations) Ordered e.g. Opinions (strongly agree, agree, neither, disagree, strongly disagree), health status Ordinal regression latent variable with m 1 thresholds estimate cumulative probability (prob. y mi) (one equation with dummies for m 1 thresholds) 7-28
171 Example: multinomial regression Voting: FDP/CVP, SP/Greens, other parties; Base category: vote for SVP FDP & CVP SP & Greens Other party No party female.377*** (3.5).46*** (3.6).274* (2.).583*** (5.9) age ( 1.).7 (.4).167 (.8).22 ( 1.5) age (.7).31 (.2).129 (.7).75 (.6) age6plus.52 (.4).413** ( 2.8).584** ( 3.1).274* ( 2.1) edu2.179 (1.2).3* (2.).26 (1.).135 ( 1.1) edu3.78*** (4.5).982*** (5.4) 1.114*** (4.8).287 (1.8) lnincome.317*** (3.7).229** (2.6).397*** (3.5).113 (1.5) eu_contra 1.74*** ( 11.8) 2.79*** ( 18.9) 1.559*** ( 9.2) 1.665*** ( 11.9) for_contra.746*** ( 7.2) 1.68*** ( 13.4) 1.254*** ( 8.4).862*** ( 9.1) nuc_pro.64 (.6) 1.542*** ( 13.).967*** ( 6.7).57*** ( 5.9) satdem.252*** (9.4).193*** (6.9).183*** (5.).44 (1.9) _cons 3.138*** ( 3.3).694 (.7) 4.259*** ( 3.3) (1.6) 7-29
172 Refresher: Panel data models Fixed effects models Y it = β 1 x i + β 1 x i +u i + ε u i: unobservable stable individual characteristics Only variance within individuals taken account Equivalent to OLS with dummies for each individual Controls for unobserved heterogeneity (consistent also if Cov(u,x) ) > causal interpretation Effects of time invariant characteristics cannot be estimated (e.g., sex, cohorts) Random effect models Y it = α + β 1 x i + β 1 x i +u i + ε it assumes u i ~N(,σ u ) u i : unobservable stable individual characteristics Multilevel model with random intercepts Controls for unobserved heterogeneity (but consistent only if Cov(u,x)=) Effects of time invariant and timevarying covariates Test for RE against FE model: Hausman test 7-3
173 Fixed effects for non linear models Linear model: by differencing out (or including dummy variables), the u i disappears from (FE) equation Non linear model: there is no equivalent FE model Incidental parameter problem > inconsistent estimates Instead: Conditional ML estimation (similar to FE) Technical trick to eliminate individual specific intercepts (Also called Chamberlain fixed effects model) Only possible for logit and poisson (they use exponentials) Only subsample of individuals for whom there is some change in y it > information loss > potential bias because stable individuals are excluded 7-31
174 Random effects for non linear models RE model equivalent to linear regression y it = F(α + α i + x 1it β 1 + x 2it β u i + ε it ) with u i ~ N(, σ u ), Cov(u i,x i )= But in contrast to linear models Predicted probabilities depend on values of u i : we have to assume a value for u i to estimate predicted probabilities Measures for variance decomposition questionable only variance of unobserved heterogeneity estimated within variance is fixed (usually at 1) proportion of the variance of error components (ρ)not meaningful Alternative: How much can the unexplained variance between individuals be reduced relative to the empty model? 7-32
175 Stata commands for non linear panel models Stata built in commands Random intercept models: xt prefix xtlogit, xtprobit, xtpoisson, xttobit, xtcloglog, xtnbreg Random slope models: xtmelogit, xtmepoisson Other software necessary for multinomial panel models (run from Stata) gllamm add on (Rabe Hesketh and Skrondal) to Stata (very powerful, freely available (but Stata necessary), slow, become familiar with syntax) runmlwin: command to run mlwin software from within Stata (Mlwin software needs to be purchased) 7-33
176 svp logit random fe b b b female.116**.214***.o 18-3 years -.133** *** 46-6 years -.119** -.162** -.36*** over 6 years -.97* -.25*** -.61*** med education -.229*** -.497*** -.318* high education -.633*** -1.35*** -.537*** hh income, log -.128*** -.159*** -.13 stay outside EU.749***.622***.49 prefer Swiss.471***.42*** -.1 nuc_pro.225*** satisf. democracy -.16*** -.153*** -.43*** _cons 1.19*** 1.818*** Lnsig2u _cons 2.34*** Pseudo R-Square.77.5 N
177 Example: RE model for event history analysis Example for discrete event history analysis Dependent variable event has not occurred 1 event has occurred (since last observation) Independent variable Time until event occurrence Any other variable 7-35
178 8 The Multilevel model for growth 8-1
179 Motivation Many phenomena in social sciences have high variation of rates of growth between individuals over time and There are many unobserved variables with effects on these rates of growth given this, alternative hypotheses: do very motivated people have high initial score and high increase over time? or does high initial score of very motivated people rather level out over time? 8-2
180 Graphical interpretation of multilevel models Random intercept model - Distribution of individual intercepts random Random slope model - Distribution of individual intercepts and slopes random - all lines parallel (same slope) Other name: mixed effects models - Intercepts and slopes correlated 8-3
181 Level-1 and level-2 submodels steps Level-1 Submodel: within-behavior Q.: functional form? Level-2 Submodels: between-behavior Q.: parameters of this function? Model estimation Interpretation of results - regression coefficients - variance components If only within: no multilevel model! If in addition level (u ) between: random intercept If in addition slope (u 1 ) between: random slope 8-4
182 The Multilevel idea II: letting the slope (time) become random Suppose we change the model y it = u i + b x it + it y it = u i + b i x it + it with u i ~ N(, u ) and b = b + u i 1i and u 1i ~ N(, u1 ) to give the random slope model y it = u i + (b + u 1i ) x it + it = b x it + (u i + u 1i x it ) + it Parameters to be estimated are: b, u, u1,,+ covariance between u and u 1 = cov(u, u 1 ) = u1 heteroscedasticity: because variance depends on x All other covariances are assumed to be : cov(x,u i ) = cov(u i, i ) = cov( i, j ) = to 8-5
183 Random parameters in random slope model y it = a i + b i x it + it with a i = a + u i b i = b + u 1i u1 = cov(u, u 1 ) it ~ N(, ), u. i ~ N(, u. ) a, b : Intercept and slope of average line u1 u u1 : Standard Deviation of random slopes across individuals : Standard Deviation of random intercepts across individ. : Covariance between random intercepts and rand. slopes 8-6
184 Interpretation of covariance in random slope models u1 > fanning out u1 < fanning in u1 is not defined (no slope variation) u1 = no pattern 8-7
185 Example: body mass index (bmi) change Data: subsample of the CNEF - Adults from CNEF-CH (24-27), CNEF-US (1999/21/3/5), CNEF-DE (22/4/6/8) - Random 999 individuals per survey (country) - Context (between persons): sex, age, health, education, working, wife in household, children under 16 in household, country Design: Each individual measured up to 4 times Research Question: Level and change of bmi for different subgroups (e.g., cohorts by country) 8-8
186 Data structure: person-period: 1 line per person and period. list survey pers1 y bmi age male, sepby(pers1) Unbalanced Panel, 1-4 waves per adult (measured in years) survey pers1 year bmi age male PSID PSID PSID PSID PSID PSID PSID PSID PSID SHP SHP SHP SHP SHP SHP SHP SHP SHP SOEP SOEP SOEP SOEP
187 Level-1 model (within) we assume linear change Coefficients, true individual growth curve Random Part, interpreted as measurement error of person i at time t a i : intercept of i s true trajectory bmiit i s true trajectory Health bmi a i + b i time 2 + it i: individual t: time i1, i4: deviation of i s true trajectory b i: : slope of i s true trajectory time wave 8-1
188 Level-2 model (between): random intercept / slopes models PSID SHP SOEP Participation Year Graphs by Survey: USA (PSID), Germany (SOEP) or Switzerland (SHP) Data: CNEF 27 subsample8-11
189 Level-2 coefficients Level-2 Intercept ai a a1surveyui grand mean of intercepts a i (dep.survey) difference of intercepts ai (dep survey) Level-2 Slope bi b b1surveyu1 i grand mean of slopes b i ( dep survey) difference of slopes b i ( dep survey) 8-12
190 Level-2 random effects Idea: Individual specific deviations from mean (intercept u i and slope u 1i ) may be correlated ai a a1survey u i bi b b1survey u1i variance of intercepts a i (survey controlled) u u i 1i ~ N, u 2 u u1 2 1 u 1 variance of slopes b i (survey controlled) co-variance of intercepts a i and slopes b i 8-13
191 Parameters of level-1 and level-2 model ai au i bi bu1 i bmi it = a i + b i time + it r it = u i + u 1i time + it Level 2 Level 1 Residuals We model: Level-1 variance σ 2 Level-2 grand intercept (a) and its variance σ u 2 Level-2 grand slope (b) and its variance σ u1 2 Level-2 covariance cov(σ u,σ u1 ) (check if Estimation methods: depends on software Coefficients (a,b) are weighted means of the individual OLS results (max. flexible) and of the population curves (max. stable), weights depend on preciseness of the OLS-estimation) 8-14
192 Multilevel as weighted individual OLS and population curves PSID,, 363 PSID, 1, 21 SHP,, 213 SHP, 1, 719 SOEP,, 623 SOEP, 1, Data points Population average Participation Year Graphs by survey, Sex: Male, and Unique person number in each survey OLS ML random slope Population average (OLS between Var. survey, male): stable,efficient Weight high if OLS imprecise Individual OLS: unbiased but inefficient Weight high if OLS precise ML: biased but efficient 8-15
193 Steps to do Never without empirical Plots and fitted OLS-curves Person-Period Dataset ok? First: model without covariates. Find true covariance matrix! First: unconditional Why unconditional? 1. zero model Intercept only, Target: variancedecomposition between levels 1. Does systematic variation of dependent variable exist and, on which level(s)? 2. Unconditional growth model only TIME as level-1 predictor and intercept as level-2, Target: information on level-2 variation of grand slope 2. Magnitude of variation within and between are basis for further predictors 8-16
194 Unconditional means/growth model (max. likelihood estimation). xtreg bmi, i(pers) /* VC modfel */ Random-effects GLS regression Number of obs = 9189 Group variable: pers Number of groups = 2997 R-sq: within =. Obs per group: min = 1 between =. avg = 3.1 overall =. max = 4 Wald chi2() =. corr(u_i, X) = (assumed) Prob > chi2 = bmi Coef. Std. Err. z P> z [95% Conf. Interval] _cons sigma_u sigma_e rho (fraction of variance due to u_i) xtmixed bmi year pers:year, cov(unst) mle Mixed-effects ML regression Number of obs = 9189 Log likelihood = Prob > chi2 = bmi Coef. Std. Err. z P> z [95% Conf. Interval] year _cons Random-effects Parameters Estimate Std. Err. [95% Conf. Interval] pers: Unstructured sd(year) sd(_cons) corr(year,_cons) sd(residual) LR test vs. linear regression: chi2(3) = Prob > chi2 =. 8-17
195 For regression coefficients : use z-test (5% if IzI>1.96) Excursus: hypotheses testing For random effects : use the likelihood ratio test: -test current model against base model - Important: current models uses same data ( nested ) but more parameters (regression coefficients, random effects, covariance) - Null Hypothesis: these coefficients are all - Test statistic: = 2*[loglikelihood(current) - loglikelihood(base)] - This statistic is χ 2 -distributed with the additional number of coefficients as degree of freedom 8-18
196 _cons measures grand mean of individual with on all variables (year=survey==reference, bmi=) bmi= shows: de-mean independent variables - Eliminates multi-collinearity between independent variables (generally in regressions) and in ML models between first level-specific and context variables - Reduces covariance between const. and slope (a and b) - Improves estimation precision and speed - Facilitates interpretation Care! Model fit separately for within and between model, (special pseudo R 2 measures total fit) 8-19
197 9 Missing Data: Incomplete panels, imputation techniques 9-1
198 Underrepresentation Sampling frame errors (mostly undercoverage) Unit non-response: First wave ( initial ) nonresponse Attrition: respondents drop out or are temporary absent Reasons: - Non-contact (out-of-home, no telephone answer, or unsuccessful tracking of movers) -Refusal(also on behalf of others) Item non-response Respondents fail to answer particular questions, e.g. income 9-2
199 Initial nonresponse vs attrition Initial nonresponse Only little information on non-participants (from sampling frame eventually sex, age, nationality, civil status, region ) Attrition Rich information on these individuals from at least one wave Methodological research: attitudinal characteristics (e.g. political interest, political orientation, social integration) more biased than socio-demography Bias correction of bias easier for attrition than for undercoverage or initial nonresponse 9-3
200 Types of missingness Missing completely at random (MCAR) Missing at random (MAR) conditional on observables (x it ), response is random = systematic differences in response can be explained by observable characteristics. Not missing at random (NMAR) non-ignorable: systematic differences in response remain after controlling for observables (x it ) 9-4
201 Consequences of missingness Missing completely at random (MCAR) Statistics unbiased Loss of efficiency (smaller n) Missing at random (MAR) Univariate statistics: biased Example: if the politically uninterested do not take part in the survey we will overestimate the political interest rate Multivariate statistics: unbiased if controlled for variables which explain missingness Not missing at random (NMAR) Statistics biased 9-5
202 Strategies to deal with missingness Multivariate models: Control for variables which explain missing mechanism (under MAR assumption) Weighting: mainly for unit-nonresponse Imputation: mainly for item-nonresponse Look at the data, analyze missingness, assess bias, careful interpretation! 9-6
203 Univariate statistics: use weights Multivariate statistics Weighting Weights may or may not be appropriate: depends on research question, model, construction of weights Use of weights is not always possible not implemented in RE models in Stata, but possible in xtmixed or with gllamm software Sensitivity analysis: run models with and without weights 9-7
204 Imputation strategies I Exclude cases from analysis (listwise deletion in regression) Assumes MCAR Loss of statistical power Mean or median imputation Assumes MCAR, variance reduction, bias to the Mean Regression imputation Assumes MAR, variance reduction, regression to the mean Hot Deck Identification of donor with same characteristics of co-varying observed variables assumes MAR, variance reduction Cold deck donor comes from another survey assumes MAR, variance reduction, loss of statistical power 9-8
205 Imputation strategies II Nearest neighbor for continuous variables; donor has smallest distance assumes MAR, variance reduction Multiple Imputation several imputations per missing value, stochastic In Panel Surveys: information of the same unit (individual) from other waves Simple carryover: take value from another wave (longitudinal change reduction) Row and Column : uses longitudinal and cross-sectional information, stochastic (longitudinal change reduction) 9-9
206 Testing for attrition bias Compare individuals who drop out with individuals who stay: characteristics of first wave example: individuals present in 23 vs. not: political interest in 1999 stayers mean se(mean) N Total
207 Testing form non-random missing Missing value is random (MCAR or MAR after control) response r it is independent of u i or ε it 1. check whether r it explains the outcome y it after controlling for other characteristics Functions of r it can be added to the equation and their significance tested. E.g.: lagged response indicator r it-1 number of waves present correction models (eg Heckman Selection) 2. compare results from the unbalanced panel with the balanced sub-panel regressor differences (use Hausman test) 9-11
208 1 Dynamic models 1_1
209 Longitudinal models Within estimators: Fixed effects (FE), Difference in Difference (DID) Long panels: causal influence Short panels: policy evaluation Dynamic panel models (complex!) Model pathways through which variables measured over time are associated Model stability over time, true state dependence Lagged dependent variable as a regressor Cross-lagged models Lagged dependent and lagged independent variables as regressors Establish causality through time ordering 1_2
210 Dynamic panel models Example: does unemployment cause future unemployment? Does poverty cause future poverty? Dynamic model: y it = a + b 1 y i,t-1 + b 2 x i,t + λ t + u i + e it b 1 represents state dependence (how quickly do people return t their equilibrium after a shock?) u i represents stable characteristics (trait dependence) λ i year dummies (control for temporal shocks) e it person- and time-specific residual Problems (Strict) exogeneity fails: y i,t correlated with e i,t-1 Y i,t-1 and u i are correlated Initial condition: y assumed exogenous -> all OLS estimators potentially biased -> (true) state dependence cannot be estimated (counfounded with persistence of unobservables) 1_3
211 Instrumental variable dynamic models Solution to endogeneity problem of lagged dependent variable Use previous values of y to instrument y i,t-1 Arellano and Bond 1991 Arellano and Bover 1995, Blundel and Bond 1998 (GMM estimators) Estimates are very sensitive to model specification Specification tests: Sargan test, Hansen test References: Wooldridge 29, Wawro 22 (application) 1_4
212 Cross-lagged models y it = a + b 1 y i,t-1 + b 2 w i, t-1 + e it w it = a + b 3 y i,t-1 + b 4 w i, t-1 + e it Examining questions of reciprocal causality compare size of b 2 and b 3 Variables is regressed onto its lagged measure and the lagged other (independent) variables Initial condition: y and w assumed exogenous References: Finkel 1995, Carsey and Layman 26 (application) 1_5
ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results
IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is R-squared? R-squared Published in Agricultural Economics 0.45 Best article of the
Panel Data Analysis Josef Brüderl, University of Mannheim, March 2005
Panel Data Analysis Josef Brüderl, University of Mannheim, March 2005 This is an introduction to panel data analysis on an applied level using Stata. The focus will be on showing the "mechanics" of these
Department of Economics Session 2012/2013. EC352 Econometric Methods. Solutions to Exercises from Week 10 + 0.0077 (0.052)
Department of Economics Session 2012/2013 University of Essex Spring Term Dr Gordon Kemp EC352 Econometric Methods Solutions to Exercises from Week 10 1 Problem 13.7 This exercise refers back to Equation
DETERMINANTS OF CAPITAL ADEQUACY RATIO IN SELECTED BOSNIAN BANKS
DETERMINANTS OF CAPITAL ADEQUACY RATIO IN SELECTED BOSNIAN BANKS Nađa DRECA International University of Sarajevo [email protected] Abstract The analysis of a data set of observation for 10
Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:
Lab 5 Linear Regression with Within-subject Correlation Goals: Data: Fit linear regression models that account for within-subject correlation using Stata. Compare weighted least square, GEE, and random
Correlated Random Effects Panel Data Models
INTRODUCTION AND LINEAR MODELS Correlated Random Effects Panel Data Models IZA Summer School in Labor Economics May 13-19, 2013 Jeffrey M. Wooldridge Michigan State University 1. Introduction 2. The Linear
Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2)
Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2) Oscar Torres-Reyna [email protected] December 2007 http://dss.princeton.edu/training/ Intro Panel data (also known as longitudinal
Lecture 15. Endogeneity & Instrumental Variable Estimation
Lecture 15. Endogeneity & Instrumental Variable Estimation Saw that measurement error (on right hand side) means that OLS will be biased (biased toward zero) Potential solution to endogeneity instrumental
MULTIPLE REGRESSION EXAMPLE
MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if
Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software
STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used
Sample Size Calculation for Longitudinal Studies
Sample Size Calculation for Longitudinal Studies Phil Schumm Department of Health Studies University of Chicago August 23, 2004 (Supported by National Institute on Aging grant P01 AG18911-01A1) Introduction
Chapter 10: Basic Linear Unobserved Effects Panel Data. Models:
Chapter 10: Basic Linear Unobserved Effects Panel Data Models: Microeconomic Econometrics I Spring 2010 10.1 Motivation: The Omitted Variables Problem We are interested in the partial effects of the observable
Introduction to Regression Models for Panel Data Analysis. Indiana University Workshop in Methods October 7, 2011. Professor Patricia A.
Introduction to Regression Models for Panel Data Analysis Indiana University Workshop in Methods October 7, 2011 Professor Patricia A. McManus Panel Data Analysis October 2011 What are Panel Data? Panel
Introduction to Longitudinal Data Analysis
Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction
Handling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
From the help desk: Swamy s random-coefficients model
The Stata Journal (2003) 3, Number 3, pp. 302 308 From the help desk: Swamy s random-coefficients model Brian P. Poi Stata Corporation Abstract. This article discusses the Swamy (1970) random-coefficients
Milk Data Analysis. 1. Objective Introduction to SAS PROC MIXED Analyzing protein milk data using STATA Refit protein milk data using PROC MIXED
1. Objective Introduction to SAS PROC MIXED Analyzing protein milk data using STATA Refit protein milk data using PROC MIXED 2. Introduction to SAS PROC MIXED The MIXED procedure provides you with flexibility
Stata Walkthrough 4: Regression, Prediction, and Forecasting
Stata Walkthrough 4: Regression, Prediction, and Forecasting Over drinks the other evening, my neighbor told me about his 25-year-old nephew, who is dating a 35-year-old woman. God, I can t see them getting
Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares
Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation
ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics
ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Quantile Treatment Effects 2. Control Functions
Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015
Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Stata Example (See appendices for full example).. use http://www.nd.edu/~rwilliam/stats2/statafiles/multicoll.dta,
Clustering in the Linear Model
Short Guides to Microeconometrics Fall 2014 Kurt Schmidheiny Universität Basel Clustering in the Linear Model 2 1 Introduction Clustering in the Linear Model This handout extends the handout on The Multiple
August 2012 EXAMINATIONS Solution Part I
August 01 EXAMINATIONS Solution Part I (1) In a random sample of 600 eligible voters, the probability that less than 38% will be in favour of this policy is closest to (B) () In a large random sample,
17. SIMPLE LINEAR REGRESSION II
17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.
25 Working with categorical data and factor variables
25 Working with categorical data and factor variables Contents 25.1 Continuous, categorical, and indicator variables 25.1.1 Converting continuous variables to indicator variables 25.1.2 Converting continuous
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Introduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
I n d i a n a U n i v e r s i t y U n i v e r s i t y I n f o r m a t i o n T e c h n o l o g y S e r v i c e s
I n d i a n a U n i v e r s i t y U n i v e r s i t y I n f o r m a t i o n T e c h n o l o g y S e r v i c e s Linear Regression Models for Panel Data Using SAS, Stata, LIMDEP, and SPSS * Hun Myoung Park,
problem arises when only a non-random sample is available differs from censored regression model in that x i is also unobserved
4 Data Issues 4.1 Truncated Regression population model y i = x i β + ε i, ε i N(0, σ 2 ) given a random sample, {y i, x i } N i=1, then OLS is consistent and efficient problem arises when only a non-random
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY
Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship
Regression Analysis (Spring, 2000)
Regression Analysis (Spring, 2000) By Wonjae Purposes: a. Explaining the relationship between Y and X variables with a model (Explain a variable Y in terms of Xs) b. Estimating and testing the intensity
Interaction effects and group comparisons Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 20, 2015
Interaction effects and group comparisons Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Note: This handout assumes you understand factor variables,
Nonlinear Regression Functions. SW Ch 8 1/54/
Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General
Correlation and Regression
Correlation and Regression Scatterplots Correlation Explanatory and response variables Simple linear regression General Principles of Data Analysis First plot the data, then add numerical summaries Look
Models for Longitudinal and Clustered Data
Models for Longitudinal and Clustered Data Germán Rodríguez December 9, 2008, revised December 6, 2012 1 Introduction The most important assumption we have made in this course is that the observations
Efficient and Practical Econometric Methods for the SLID, NLSCY, NPHS
Efficient and Practical Econometric Methods for the SLID, NLSCY, NPHS Philip Merrigan ESG-UQAM, CIRPÉE Using Big Data to Study Development and Social Change, Concordia University, November 2103 Intro Longitudinal
1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
Multilevel Models for Longitudinal Data. Fiona Steele
Multilevel Models for Longitudinal Data Fiona Steele Aims of Talk Overview of the application of multilevel (random effects) models in longitudinal research, with examples from social research Particular
Standard errors of marginal effects in the heteroskedastic probit model
Standard errors of marginal effects in the heteroskedastic probit model Thomas Cornelißen Discussion Paper No. 320 August 2005 ISSN: 0949 9962 Abstract In non-linear regression models, such as the heteroskedastic
5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
From the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
Panel Data Analysis in Stata
Panel Data Analysis in Stata Anton Parlow Lab session Econ710 UWM Econ Department??/??/2010 or in a S-Bahn in Berlin, you never know.. Our plan Introduction to Panel data Fixed vs. Random effects Testing
MODEL I: DRINK REGRESSED ON GPA & MALE, WITHOUT CENTERING
Interpreting Interaction Effects; Interaction Effects and Centering Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Models with interaction effects
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
A Simple Feasible Alternative Procedure to Estimate Models with High-Dimensional Fixed Effects
DISCUSSION PAPER SERIES IZA DP No. 3935 A Simple Feasible Alternative Procedure to Estimate Models with High-Dimensional Fixed Effects Paulo Guimarães Pedro Portugal January 2009 Forschungsinstitut zur
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
2. Linear regression with multiple regressors
2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions
MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
Basic Statistical and Modeling Procedures Using SAS
Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom
1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ
STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material
Econometrics Simple Linear Regression
Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight
2013 MBA Jump Start Program. Statistics Module Part 3
2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just
Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade
Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 21, 2015
Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 21, 2015 References: Long 1997, Long and Freese 2003 & 2006 & 2014,
HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009
HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. A General Formulation 3. Truncated Normal Hurdle Model 4. Lognormal
Rethinking the Cultural Context of Schooling Decisions in Disadvantaged Neighborhoods: From Deviant Subculture to Cultural Heterogeneity
Rethinking the Cultural Context of Schooling Decisions in Disadvantaged Neighborhoods: From Deviant Subculture to Cultural Heterogeneity Sociology of Education David J. Harding, University of Michigan
MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal
MGT 267 PROJECT Forecasting the United States Retail Sales of the Pharmacies and Drug Stores Done by: Shunwei Wang & Mohammad Zainal Dec. 2002 The retail sale (Million) ABSTRACT The present study aims
Chapter 4: Vector Autoregressive Models
Chapter 4: Vector Autoregressive Models 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie IV.1 Vector Autoregressive Models (VAR)...
Moderation. Moderation
Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
Regression Analysis: A Complete Example
Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty
Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.
Analyzing Complex Survey Data: Some key issues to be aware of Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 24, 2015 Rather than repeat material that is
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma [email protected] The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association
Multinomial and Ordinal Logistic Regression
Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,
MULTIPLE REGRESSION WITH CATEGORICAL DATA
DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting
Module 5: Multiple Regression Analysis
Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College
MODELING AUTO INSURANCE PREMIUMS
MODELING AUTO INSURANCE PREMIUMS Brittany Parahus, Siena College INTRODUCTION The findings in this paper will provide the reader with a basic knowledge and understanding of how Auto Insurance Companies
Note 2 to Computer class: Standard mis-specification tests
Note 2 to Computer class: Standard mis-specification tests Ragnar Nymoen September 2, 2013 1 Why mis-specification testing of econometric models? As econometricians we must relate to the fact that the
Moderator and Mediator Analysis
Moderator and Mediator Analysis Seminar General Statistics Marijtje van Duijn October 8, Overview What is moderation and mediation? What is their relation to statistical concepts? Example(s) October 8,
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
Discussion Section 4 ECON 139/239 2010 Summer Term II
Discussion Section 4 ECON 139/239 2010 Summer Term II 1. Let s use the CollegeDistance.csv data again. (a) An education advocacy group argues that, on average, a person s educational attainment would increase
Interaction effects between continuous variables (Optional)
Interaction effects between continuous variables (Optional) Richard Williams, University of Notre Dame, http://www.nd.edu/~rwilliam/ Last revised February 0, 05 This is a very brief overview of this somewhat
Applied Panel Data Analysis
Applied Panel Data Analysis Using Stata Prof. Dr. Josef Brüderl LMU München April 2015 Contents I I) Panel Data 06 II) The Basic Idea of Panel Data Analysis 17 III) An Intuitive Introduction to Linear
Part 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
10. Analysis of Longitudinal Studies Repeat-measures analysis
Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
1 Simple Linear Regression I Least Squares Estimation
Simple Linear Regression I Least Squares Estimation Textbook Sections: 8. 8.3 Previously, we have worked with a random variable x that comes from a population that is normally distributed with mean µ and
SYSTEMS OF REGRESSION EQUATIONS
SYSTEMS OF REGRESSION EQUATIONS 1. MULTIPLE EQUATIONS y nt = x nt n + u nt, n = 1,...,N, t = 1,...,T, x nt is 1 k, and n is k 1. This is a version of the standard regression model where the observations
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
Forecasting in STATA: Tools and Tricks
Forecasting in STATA: Tools and Tricks Introduction This manual is intended to be a reference guide for time series forecasting in STATA. It will be updated periodically during the semester, and will be
Data Analysis Methodology 1
Data Analysis Methodology 1 Suppose you inherited the database in Table 1.1 and needed to find out what could be learned from it fast. Say your boss entered your office and said, Here s some software project
Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group
Introduction to Multilevel Modeling Using HLM 6 By ATS Statistical Consulting Group Multilevel data structure Students nested within schools Children nested within families Respondents nested within interviewers
Longitudinal Data Analysis: Stata Tutorial
Part A: Overview of Stata I. Reading Data: Longitudinal Data Analysis: Stata Tutorial use Read data that have been saved in Stata format. infile Read raw data and dictionary files. insheet Read spreadsheets
IMPACT EVALUATION: INSTRUMENTAL VARIABLE METHOD
REPUBLIC OF SOUTH AFRICA GOVERNMENT-WIDE MONITORING & IMPACT EVALUATION SEMINAR IMPACT EVALUATION: INSTRUMENTAL VARIABLE METHOD SHAHID KHANDKER World Bank June 2006 ORGANIZED BY THE WORLD BANK AFRICA IMPACT
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
International Statistical Institute, 56th Session, 2007: Phil Everson
Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: [email protected] 1. Introduction
Do Supplemental Online Recorded Lectures Help Students Learn Microeconomics?*
Do Supplemental Online Recorded Lectures Help Students Learn Microeconomics?* Jennjou Chen and Tsui-Fang Lin Abstract With the increasing popularity of information technology in higher education, it has
Linear Models in STATA and ANOVA
Session 4 Linear Models in STATA and ANOVA Page Strengths of Linear Relationships 4-2 A Note on Non-Linear Relationships 4-4 Multiple Linear Regression 4-5 Removal of Variables 4-8 Independent Samples
Module 14: Missing Data Stata Practical
Module 14: Missing Data Stata Practical Jonathan Bartlett & James Carpenter London School of Hygiene & Tropical Medicine www.missingdata.org.uk Supported by ESRC grant RES 189-25-0103 and MRC grant G0900724
Section 1: Simple Linear Regression
Section 1: Simple Linear Regression Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
STATA Commands for Unobserved Effects Panel Data
STATA Commands for Unobserved Effects Panel Data John C Frain November 24, 2008 Contents 1 Introduction 1 2 Examining Panel Data 4 3 Estimation using xtreg 8 3.1 Introduction..........................................
A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing Sector
Journal of Modern Accounting and Auditing, ISSN 1548-6583 November 2013, Vol. 9, No. 11, 1519-1525 D DAVID PUBLISHING A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing
Multiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
Overview of Factor Analysis
Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,
