PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu
The Empirical Research Process; Fundamental Methodological Issues

Theory; Data; Models/model selection; Estimation; Inference. (Order?) Examples: presidential approval; international conflict/civil war.

Identification: Can the quantities of interest be determined from the model/data, assuming a sufficient sample size? (An asymptotic concept.) Parameters in structural equation models, for example, are often of theoretical interest or directly encode causal assumptions. Can they be uniquely determined with the available measured variables?
Endogenous vs. exogenous variables; exclusion restrictions (certain causal links are ruled out); order condition (a necessary condition for identification: the number of excluded exogenous variables must be at least the number of included endogenous variables). A single-equation model can be considered part of a SEM (with some of the right-hand-side variables potentially endogenous). Standard models (parametric, or non-parametric matching) typically assume that a set of control variables is measured that makes identification of the causal parameter possible. What variables should be in the model? Is the same model good for both prediction and causal inference?
Standard practice: use the same (parametric) model for prediction and causal inference, often studying the causal effect of each independent variable in the model in turn. E.g.: Pr(Voting) = f(education, income, party ID, race, gender, etc.) But different objectives may require very different x's to enter the model. Prediction: all direct causes of y. Causal inference on x_i: all x_j's that confound the relationship between x_i and y.
[Figure: a hypothetical causal graph over x_1, x_2, x_3, and y, in which x_2 causes x_1, and x_1, x_2, x_3 each cause y.]

In this hypothetical causal structure: prediction of y: all x's; causal effect of x_1 on y: x_1 and x_2; causal effect of x_2 on y: x_2 only (controlling for x_1, its consequence, biases the estimate of the total effect); causal effect of x_3 on y: x_3 only.
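This consequence-control bias can be checked by simulation. The sketch below (the coefficient values are hypothetical, not from the notes) generates data from the structure just described — x_2 causes x_1, and x_1, x_2, x_3 each cause y — and compares the regression of y on x_2 with and without x_1 as a control:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical coefficients: x2 -> x1, and x1, x2, x3 -> y.
a, b1, b2, b3 = 0.8, 1.0, 0.5, 0.7

x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = a * x2 + rng.normal(size=n)
y = b1 * x1 + b2 * x2 + b3 * x3 + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients via np.linalg.lstsq."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Total effect of x2 on y is b2 + a*b1 = 1.3: recovered when x1 is NOT controlled.
total = ols(np.column_stack([np.ones(n), x2]), y)[1]
# Controlling for x1 (a consequence of x2) recovers only the direct effect b2 = 0.5.
direct = ols(np.column_stack([np.ones(n), x2, x1]), y)[1]
print(total, direct)
```

Note that omitting x_3 does not bias either regression here, since x_3 is independent of x_2.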
Finding the right set of control variables is hard.

In practice, the decision is often made informally, on a case-by-case basis, resting on folklore and intuition rather than on hard mathematics (Pearl 2009). Different studies of the same causal relationship often use different sets of control variables, guided by even slightly different substantive theories. This leads not only to changes in magnitude but even to reversals of sign in estimated effects (Simpson's Paradox). Pearl (2009) and related work (being introduced to political science) develop causal graph theory: the possibility of causal inference from observational data; the discovery of underlying causal graphs from data; graphical tools for control variable selection based on the causal graph.
Data source/measurement:

Experimental data: if done right, the gold standard. Random assignment makes treatment exogenous and the treatment and control groups comparable (for sufficient N). Can be expensive or infeasible (regime type change?). Issues like noncompliance, external validity, and the Hawthorne effect (the effect of being observed).

Observational data, such as from surveys: issues of sampling design, e.g., stratification with different sampling rates (weighting necessary); clustering (correlations within clusters); selection bias; response-based sampling (e.g., rare events data); missing data; sensitive questions. Cross-sectional, panel (small T), TSCS.

Measurement: e.g., party identification? Economic wellbeing? Ideal points? Power? Structural characteristics of the international system? Some are easier, some harder. E.g., party ID can be obtained directly from survey data; others require more sophisticated methods, as in recovering ideal points from roll call data (e.g., item response models). Social network analysis is useful for measuring structural characteristics (such as polarization, globalization).
Modeling:

Abstraction: no model is ever perfect (if it is, then it is not a "model"). Reality itself is infinitely rich and complex. Seek to capture the essential features of the data generating process; a model is a collection of assumptions about that process.

Systematic and stochastic components: e.g., linear regression:

Y = Xβ + ɛ  (1)

(Why ɛ? We could never measure all relevant variables; plus the universe is inherently probabilistic, according to quantum physics.)
Y: N × 1; X: N × k; β: k × 1; ɛ ~ N(0, σ²I). Equivalently, Y ~ N(Xβ, σ²I). For each individual i, i = 1, 2, ..., N: Y_i ~ N(X_iβ, σ²).
Also equivalent:

Y_i ~ f_N(y_i | µ_i, σ²), with µ_i = x_iβ

where y_i is an observed value of the random variable Y_i. Read: the density of Y_i at a particular location y_i is given by the normal density with mean µ_i = x_iβ and variance σ². We'll be looking at a variety of forms of systematic and stochastic components (distribution functions) suitable for different types of data Y (binary, multinomial, ordinal, counted, censored/truncated, duration, etc.).
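A minimal simulation of this data generating process (with made-up values for β and σ) draws Y_i ~ N(x_iβ, σ²) and recovers the parameters by solving the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 5_000, 3
beta_true = np.array([2.0, -1.0, 0.5])   # hypothetical parameter values
sigma = 1.5

X = rng.normal(size=(N, k))
# Stochastic component: Y_i ~ N(mu_i, sigma^2), systematic part mu_i = x_i' beta
Y = rng.normal(loc=X @ beta_true, scale=sigma)

# OLS (also the MLE of beta here): solve the normal equations X'X b = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sigma2_hat = np.mean((Y - X @ beta_hat) ** 2)
print(beta_hat, np.sqrt(sigma2_hat))   # near beta_true and sigma
```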
Parametric, semi-parametric, non-parametric

We've just seen an example of a parametric model: the data generating process is known up to a set of unknown parameters (in the regression model, {β, σ}). Estimation of these parameters (more below): OLS, least absolute deviations, MLE, Bayesian...

Semi-parametric models combine a parametric component with a non-parametric component; they are more flexible/robust than fully parametric models (but less efficient, if the parametric forms can be correctly specified). This can take the form of a partially specified functional form for the systematic part (as in the neural network model or the Cox proportional hazards model), or of avoiding distributional assumptions for the stochastic term.

Method of moments (and GMM, generalized method of moments) estimation is semi-parametric, more robust to distributional assumptions on the stochastic part. Moments: mean, variance, etc. The n-th moment:

M_n = ∫ xⁿ f(x) dx

Basic idea: make use of the fact that sample moments approximate population moments, regardless of the distribution. Find a set of equations known to hold in the population given a model; the equations involve population moments which are functions of the unknown parameters. Obtain estimates by substituting sample moments for population moments.

E.g., the OLS estimator is also a method of moments estimator. One of the key assumptions of the classical linear model is

E[ɛ_i x_i] = E[(y_i − x_iβ)x_i] = 0 (for simplicity, assuming x_i scalar)

Sample version:

(1/N) Σ_i (y_i − x_iβ)x_i = 0

This is the same as the OLS normal equation (first-order derivative = 0):
min_β Σ_i ɛ_i² = min_β Σ_i (y_i − x_iβ)²  ⟹  −2 Σ_i (y_i − x_iβ)x_i = 0  ⟹  (1/N) Σ_i (y_i − x_iβ)x_i = 0

Non-parametric models avoid such functional form assumptions as well as distributional assumptions. The less assumed, the more robust; but the less efficient, in case the parametric assumptions are correct.

E.g. 1: kernel smoothing.

m̂_h(x) = Σ_{i=1}^n K_h(x − x_i) y_i / Σ_{i=1}^n K_h(x − x_i)

(K: some kernel function; h: bandwidth.)
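A sketch of the kernel smoother, using a Gaussian kernel and an assumed true curve m(x) = sin(x) purely for illustration — no functional form for m is assumed by the estimator itself:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nw_smooth(x0, x, y, h):
    """Nadaraya-Watson estimate: m_h(x0) = sum K_h(x0-x_i) y_i / sum K_h(x0-x_i)."""
    w = gaussian_kernel((x0 - x) / h)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(2)
x = rng.uniform(0, 2 * np.pi, 500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)   # "unknown" m(x) = sin(x)

# The bandwidth h controls the bias-variance trade-off (smaller h: local, noisy).
est = nw_smooth(np.pi / 2, x, y, h=0.3)
print(est)  # close to sin(pi/2) = 1
```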
These are local methods.

E.g. 2: non-parametric matching; the propensity score approach; program evaluation (will discuss in detail later).

The vast majority of standard models used in political science are parametric (logit/probit/ordered logit/tobit/heckit/Poisson regression, etc.). Pros: if the assumptions are (approximately) right, more efficient inference; and with the precise functional relations in hand after estimation, one can do a lot, such as computing marginal effects and predictions. Cons: the assumptions can be wrong.
Examples of functional forms for the systematic part:
Functional complexity in social science data. Neural networks as universal learning machines.

[Figure 1: A one-hidden-layer feedforward neural network. Inputs x_1, x_2, x_3 connect to hidden units z_1, z_2 through weights β_jk; the hidden units connect to the output y through weights γ_1, γ_2.]

Model selection:
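The forward pass of a network like the one in Figure 1 can be sketched as follows; the weight values below are arbitrary illustrative numbers, since in practice the β's and γ's are estimated from data:

```python
import numpy as np

def forward(x, B, gamma):
    """One-hidden-layer feedforward network: z = tanh(B'x), y = gamma'z.
    Shapes follow Figure 1: B is (inputs x hidden units), gamma is (hidden units,).
    Bias terms are omitted for simplicity."""
    z = np.tanh(x @ B)        # hidden layer activations z_1, z_2
    return z @ gamma          # output layer

# Illustrative weights: 3 inputs -> 2 hidden units -> 1 output
B = np.array([[0.5, -1.0],
              [1.0,  0.3],
              [-0.7, 0.8]])
gamma = np.array([1.2, -0.4])

x = np.array([1.0, 0.5, -1.0])
print(forward(x, B, gamma))
```

With enough hidden units, such networks can approximate a wide class of functions, which is what "universal learning machines" refers to.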
Fitting vs. out-of-sample performance.

Bayesian model averaging: in the Bayesian framework, no single model is taken to be true; each is valid with a certain probability. Average the ones with relatively high probability of being true.

Estimation: (focusing on parametric models)

How to learn about the unknown parameters (i.e., the unknown part of the model) from data. Estimation criteria/principles: how to fit a line/curve to the scatter plot data? Visually:
[Figure: a scatter plot of y against x with two candidate fitted curves, Model 1 and Model 2.]

Least squares: minimize the sum of squared errors (seen already). Least absolute deviations: more robust with respect to outliers, but mathematically more difficult to handle than OLS. Maximum likelihood: the parameter values that maximize the probability of the observed data given the model are the most plausible.

These are point estimates. Confidence intervals can be constructed based on the sampling distribution of the estimators.

The Bayesian approach: start with a prior belief about the unknowns and update that knowledge according to Bayes' rule. As the posterior density is proportional to the likelihood times the prior, the data influence inference only through the likelihood function. When the data dominate the prior, the posterior resembles the likelihood. From the posterior distribution one can obtain point estimates (e.g., the posterior mean or the most probable value) and interval estimates (probability intervals based on the posterior distribution).
P(θ|y) = P(θ, y) / P(y) = P(y|θ)P(θ) / P(y) = P(y|θ)P(θ) / ∫ P(y|θ)P(θ) dθ

Computationally, the main distinction is optimization of a function vs. sampling from a distribution. Maximum likelihood estimation is obtained through optimization: find the values of the parameters that maximize the likelihood function. But one can also explore the likelihood function by sampling from
the entire distribution (e.g., the Gill & King paper on non-invertible Hessians: when the mode doesn't work, explore the mean instead).

MCMC uses computational algorithms to obtain samples from a distribution, and is heavily used in Bayesian inference. E.g., the Gibbs sampler (alternating conditional sampling); convergence has been proved. Software: BUGS (Bayesian inference Using Gibbs Sampling; WinBUGS is the Windows version), JAGS (Just Another Gibbs Sampler). Several R packages interface these with R or implement various specific models (e.g., MCMCpack).

Note that MCMC ≠ Bayesian inference. Where the posterior distribution is known or can be approximated through analytical methods, MCMC is unnecessary. When the posterior/likelihood is well behaved (such as being globally concave), optimization is more efficient and more reliable. For complex functions/distributions, MCMC returns some results when optimization is difficult to do; of course, where optimization may fail, the quality of the posterior approximation through sampling could be low too. There is no magic.

Special data features require special sampling and/or estimation strategies, e.g., rare events (logit estimates are biased) and endogenous dependence structures (the independence assumption doesn't hold).
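A minimal Gibbs sampler, for the textbook case of a standard bivariate normal with known correlation ρ (chosen here as a toy target, since both full conditionals are available in closed form), alternates between the two conditional draws:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws=20_000, burn=1_000, seed=3):
    """Gibbs sampler for a standard bivariate normal with correlation rho,
    alternating between the two full conditionals:
        theta1 | theta2 ~ N(rho*theta2, 1 - rho^2)
        theta2 | theta1 ~ N(rho*theta1, 1 - rho^2)"""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1 - rho**2)
    t1, t2 = 0.0, 0.0
    draws = np.empty((n_draws, 2))
    for i in range(n_draws):
        t1 = rng.normal(rho * t2, sd)   # draw theta1 given current theta2
        t2 = rng.normal(rho * t1, sd)   # draw theta2 given updated theta1
        draws[i] = t1, t2
    return draws[burn:]                 # discard burn-in

draws = gibbs_bivariate_normal(rho=0.8)
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # means near 0, corr near 0.8
```

In real applications the target is an intractable posterior, but the mechanics — cycle through the parameters, drawing each from its conditional given the current values of the others — are the same.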
Inference

Quantities of interest can be computed based on the model and the parameter estimates, e.g., the marginal effect of an x. Except in linear models with no higher-order terms, this is generally not the coefficient of x, but it is usually a function of the parameters. Uncertainty measures should be reported, based on the uncertainty measures for the parameters (and, for quantities pertaining to individual observations, also the fundamental uncertainty in the error term; e.g., E(Y_i|X_i) vs. Y_i|X_i).

Model dependence: to what extent does inference depend on the assumption that the model is true?

Data quality: what kinds of questions can be reliably answered from available data? Or, when can history be our guide?
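For a concrete instance of the marginal-effect point: in a logit model, Pr(y=1|x) = 1/(1+e^(−x'β)), so the marginal effect of x_j is β_j·p·(1−p), which varies with x and is not the coefficient itself. A small sketch with hypothetical coefficients:

```python
import numpy as np

def logit_marginal_effect(beta, x):
    """Marginal effects in a logit model at covariate vector x:
    dPr(y=1|x)/dx_j = beta_j * p * (1 - p), where p = 1/(1 + exp(-x'beta))."""
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    return beta * p * (1 - p)

# Hypothetical coefficients (intercept, education, income) for illustration
beta = np.array([-1.0, 0.8, 0.5])
x = np.array([1.0, 0.6, 0.2])       # one observation, intercept term first
print(logit_marginal_effect(beta, x))
```

Because the effect depends on x, it is commonly reported at representative values of the covariates or averaged over the observations.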