PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu
The Empirical Research Process; Fundamental Methodological Issues

Theory; Data; Models/model selection; Estimation; Inference. (Order?) Examples: presidential approval; international conflict/civil war.

Identification: Can the quantities of interest be determined from the model/data, assuming a sufficient sample size? (An asymptotic concept.) Parameters in structural equation models, for example, are often of theoretical interest or directly encode causal assumptions. Can they be uniquely determined with the available measured variables?
Endogenous vs. exogenous variables; exclusion restrictions (certain causal links are ruled out); order condition (a necessary condition for identification: the number of excluded exogenous variables must be at least the number of included endogenous variables). A single-equation model can be considered part of a SEM (with some of the right-hand-side variables potentially endogenous). Standard models (parametric, or non-parametric matching) typically assume that a set of control variables is measured that makes identification of the causal parameter possible. What variables should be in the model? Is the same model good for both prediction and causal inference?
Standard practice: use the same (parametric) model for prediction and causal inference, often studying the causal effect of each independent variable in the model in turn. E.g.: Pr(Voting) = f(education, income, party ID, race, gender, etc.) But different objectives may require very different x's to enter the model. Prediction: all direct causes of y. Causal inference on x_i: all x_j's that confound the relationship between x_i and y.
[Figure: a hypothetical causal graph over x_1, x_2, x_3, and y, in which x_2 causes x_1, and x_1, x_2, x_3 each cause y.]

In this hypothetical causal structure: prediction of y: all x's; causal effect of x_1 on y: x_1 and x_2; causal effect of x_2 on y: x_2 only (controlling for x_1, its consequence, biases the estimate of the total effect); causal effect of x_3 on y: x_3 only.
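This consequence-control bias can be checked by simulation. The sketch below (the coefficient values are hypothetical, not from the notes) generates data from the structure just described — x_2 causes x_1, and x_1, x_2, x_3 each cause y — and compares the regression of y on x_2 with and without x_1 as a control:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical coefficients: x2 -> x1, and x1, x2, x3 -> y.
a, b1, b2, b3 = 0.8, 1.0, 0.5, 0.7

x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = a * x2 + rng.normal(size=n)
y = b1 * x1 + b2 * x2 + b3 * x3 + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients via np.linalg.lstsq."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Total effect of x2 on y is b2 + a*b1 = 1.3: recovered when x1 is NOT controlled.
total = ols(np.column_stack([np.ones(n), x2]), y)[1]
# Controlling for x1 (a consequence of x2) recovers only the direct effect b2 = 0.5.
direct = ols(np.column_stack([np.ones(n), x2, x1]), y)[1]
print(total, direct)
```

Note that omitting x_3 does not bias either regression here, since x_3 is independent of x_2.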
Finding the right set of control variables is hard.

In practice, the decision is often made informally, on a case-by-case basis, resting on folklore and intuition rather than on hard mathematics (Pearl 2009). Different studies of the same causal relationship often use different sets of control variables, guided by even slightly different substantive theories. This leads not only to changes in magnitude but even to reversals of sign in estimated effects (Simpson's Paradox). Pearl (2009) and related work (being introduced to political science) develop causal graph theory: the possibility of causal inference from observational data; the discovery of underlying causal graphs from data; graphical tools for control variable selection based on the causal graph.
Data source/measurement:

Experimental data: if done right, the gold standard. Random assignment makes treatment exogenous and the treatment and control groups comparable (for sufficient N). Can be expensive or infeasible (regime type change?). Issues like noncompliance, external validity, and the Hawthorne effect (the effect of being observed).

Observational data, such as from surveys: issues of sampling design, e.g., stratification with different sampling rates (weighting necessary); clustering (correlations within clusters); selection bias; response-based sampling (e.g., rare events data); missing data; sensitive questions. Cross-sectional, panel (small T), TSCS.

Measurement: e.g., party identification? Economic wellbeing? Ideal points? Power? Structural characteristics of the international system? Some are easier, some harder. E.g., party ID can be obtained directly from survey data; others require more sophisticated methods, as in recovering ideal points from roll call data (e.g., item response models). Social network analysis is useful for measuring structural characteristics (such as polarization, globalization).
Modeling:

Abstraction: no model is ever perfect (if it is, then it is not a "model"). Reality itself is infinitely rich and complex. Seek to capture the essential features of the data generating process; a model is a collection of assumptions about that process.

Systematic and stochastic components: e.g., linear regression:

Y = Xβ + ɛ  (1)

(Why ɛ? We could never measure all relevant variables; plus the universe is inherently probabilistic, according to quantum physics.)
Y: N × 1; X: N × k; β: k × 1; ɛ ~ N(0, σ²I). Equivalently, Y ~ N(Xβ, σ²I). For each individual i, i = 1, 2, ..., N: Y_i ~ N(X_iβ, σ²).
Also equivalent:

Y_i ~ f_N(y_i | µ_i, σ²), with µ_i = x_iβ

where y_i is an observed value of the random variable Y_i. Read: the density of Y_i at a particular location y_i is given by the normal density with mean µ_i = x_iβ and variance σ². We'll be looking at a variety of forms of systematic and stochastic components (distribution functions) suitable for different types of data Y (binary, multinomial, ordinal, counted, censored/truncated, duration, etc.).
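A minimal simulation of this data generating process (with made-up values for β and σ) draws Y_i ~ N(x_iβ, σ²) and recovers the parameters by solving the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 5_000, 3
beta_true = np.array([2.0, -1.0, 0.5])   # hypothetical parameter values
sigma = 1.5

X = rng.normal(size=(N, k))
# Stochastic component: Y_i ~ N(mu_i, sigma^2), systematic part mu_i = x_i' beta
Y = rng.normal(loc=X @ beta_true, scale=sigma)

# OLS (also the MLE of beta here): solve the normal equations X'X b = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sigma2_hat = np.mean((Y - X @ beta_hat) ** 2)
print(beta_hat, np.sqrt(sigma2_hat))   # near beta_true and sigma
```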
Parametric, semi-parametric, non-parametric

We've just seen an example of a parametric model: the data generating process is known up to a set of unknown parameters (in the regression model, {β, σ}). Estimation of these parameters (more below): OLS, least absolute deviations, MLE, Bayesian...

Semi-parametric models combine a parametric component with a non-parametric component; they are more flexible/robust than fully parametric models (but less efficient, if the parametric forms can be correctly specified). This can take the form of a partially specified functional form for the systematic part (as in the neural network model or the Cox proportional hazards model), or of avoiding distributional assumptions for the stochastic term.

Method of moments (and GMM, generalized method of moments) estimation is semi-parametric, more robust to distributional assumptions on the stochastic part. Moments: mean, variance, etc. The n-th moment:

M_n = ∫ xⁿ f(x) dx

Basic idea: make use of the fact that sample moments approximate population moments, regardless of the distribution. Find a set of equations known to hold in the population given a model; the equations involve population moments which are functions of the unknown parameters. Obtain estimates by substituting sample moments for population moments.

E.g., the OLS estimator is also a method of moments estimator. One of the key assumptions of the classical linear model is

E[ɛ_i x_i] = E[(y_i − x_iβ)x_i] = 0 (for simplicity, assuming x_i scalar)

Sample version:

(1/N) Σ_i (y_i − x_iβ)x_i = 0

This is the same as the OLS normal equation (first-order derivative = 0):
min_β Σ_i ɛ_i² = min_β Σ_i (y_i − x_iβ)²  ⟹  −2 Σ_i (y_i − x_iβ)x_i = 0  ⟹  (1/N) Σ_i (y_i − x_iβ)x_i = 0

Non-parametric models avoid such functional form assumptions as well as distributional assumptions. The less assumed, the more robust; but the less efficient, in case the parametric assumptions are correct.

E.g. 1: kernel smoothing.

m̂_h(x) = Σ_{i=1}^n K_h(x − x_i) y_i / Σ_{i=1}^n K_h(x − x_i)

(K: some kernel function; h: bandwidth.)
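A sketch of the kernel smoother, using a Gaussian kernel and an assumed true curve m(x) = sin(x) purely for illustration — no functional form for m is assumed by the estimator itself:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nw_smooth(x0, x, y, h):
    """Nadaraya-Watson estimate: m_h(x0) = sum K_h(x0-x_i) y_i / sum K_h(x0-x_i)."""
    w = gaussian_kernel((x0 - x) / h)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(2)
x = rng.uniform(0, 2 * np.pi, 500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)   # "unknown" m(x) = sin(x)

# The bandwidth h controls the bias-variance trade-off (smaller h: local, noisy).
est = nw_smooth(np.pi / 2, x, y, h=0.3)
print(est)  # close to sin(pi/2) = 1
```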
These are local methods.

E.g. 2: non-parametric matching; the propensity score approach; program evaluation (will discuss in detail later).

The vast majority of standard models used in political science are parametric (logit/probit/ordered logit/tobit/heckit/Poisson regression, etc.). Pros: if the assumptions are (approximately) right, more efficient inference; and with the precise functional relations in hand after estimation, one can do a lot, such as computing marginal effects and predictions. Cons: the assumptions can be wrong.
Examples of functional forms for the systematic part:
Functional complexity in social science data. Neural networks as universal learning machines.

[Figure 1: A one-hidden-layer feedforward neural network. Inputs x_1, x_2, x_3 connect to hidden units z_1, z_2 through weights β_jk; the hidden units connect to the output y through weights γ_1, γ_2.]

Model selection:
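The forward pass of a network like the one in Figure 1 can be sketched as follows; the weight values below are arbitrary illustrative numbers, since in practice the β's and γ's are estimated from data:

```python
import numpy as np

def forward(x, B, gamma):
    """One-hidden-layer feedforward network: z = tanh(B'x), y = gamma'z.
    Shapes follow Figure 1: B is (inputs x hidden units), gamma is (hidden units,).
    Bias terms are omitted for simplicity."""
    z = np.tanh(x @ B)        # hidden layer activations z_1, z_2
    return z @ gamma          # output layer

# Illustrative weights: 3 inputs -> 2 hidden units -> 1 output
B = np.array([[0.5, -1.0],
              [1.0,  0.3],
              [-0.7, 0.8]])
gamma = np.array([1.2, -0.4])

x = np.array([1.0, 0.5, -1.0])
print(forward(x, B, gamma))
```

With enough hidden units, such networks can approximate a wide class of functions, which is what "universal learning machines" refers to.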
Fitting vs. out-of-sample performance.

Bayesian model averaging: in the Bayesian framework, no single model is taken to be true; each is valid with a certain probability. Average the ones with relatively high probability of being true.

Estimation: (focusing on parametric models)

How to learn about the unknown parameters (i.e., the unknown part of the model) from data. Estimation criteria/principles: how to fit a line/curve to the scatter plot data? Visually:
[Figure: a scatter plot of y against x with two candidate fitted curves, Model 1 and Model 2.]

Least squares: minimize the sum of squared errors (seen already). Least absolute deviations: more robust with respect to outliers, but mathematically more difficult to handle than OLS. Maximum likelihood: the parameter values that maximize the probability of the observed data given the model are the most plausible.

These are point estimates. Confidence intervals can be constructed based on the sampling distribution of the estimators.

The Bayesian approach: start with a prior belief about the unknowns and update that knowledge according to Bayes' rule. As the posterior density is proportional to the likelihood times the prior, the data influence inference only through the likelihood function. When the data dominate the prior, the posterior resembles the likelihood. From the posterior distribution one can obtain point estimates (e.g., the posterior mean or the most probable value) and interval estimates (probability intervals based on the posterior distribution).
P(θ|y) = P(θ, y) / P(y) = P(y|θ)P(θ) / P(y) = P(y|θ)P(θ) / ∫ P(y|θ)P(θ) dθ

Computationally, the main distinction is optimization of a function vs. sampling from a distribution. Maximum likelihood estimation is obtained through optimization: find the values of the parameters that maximize the likelihood function. But one can also explore the likelihood function by sampling from
the entire distribution (e.g., the Gill & King paper on non-invertible Hessians: when the mode doesn't work, explore the mean instead).

MCMC uses computational algorithms to obtain samples from a distribution, and is heavily used in Bayesian inference. E.g., the Gibbs sampler (alternating conditional sampling); convergence has been proved. Software: BUGS (Bayesian inference Using Gibbs Sampling; WinBUGS is the Windows version), JAGS (Just Another Gibbs Sampler). Several R packages interface these with R or implement various specific models (e.g., MCMCpack).

Note that MCMC ≠ Bayesian inference. Where the posterior distribution is known or can be approximated through analytical methods, MCMC is unnecessary. When the posterior/likelihood is well behaved (such as being globally concave), optimization is more efficient and more reliable. For complex functions/distributions, MCMC returns some results when optimization is difficult to do; of course, where optimization may fail, the quality of the posterior approximation through sampling could be low too. There is no magic.

Special data features require special sampling and/or estimation strategies, e.g., rare events (logit estimates are biased) and endogenous dependence structures (the independence assumption doesn't hold).
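A minimal Gibbs sampler, for the textbook case of a standard bivariate normal with known correlation ρ (chosen here as a toy target, since both full conditionals are available in closed form), alternates between the two conditional draws:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws=20_000, burn=1_000, seed=3):
    """Gibbs sampler for a standard bivariate normal with correlation rho,
    alternating between the two full conditionals:
        theta1 | theta2 ~ N(rho*theta2, 1 - rho^2)
        theta2 | theta1 ~ N(rho*theta1, 1 - rho^2)"""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1 - rho**2)
    t1, t2 = 0.0, 0.0
    draws = np.empty((n_draws, 2))
    for i in range(n_draws):
        t1 = rng.normal(rho * t2, sd)   # draw theta1 given current theta2
        t2 = rng.normal(rho * t1, sd)   # draw theta2 given updated theta1
        draws[i] = t1, t2
    return draws[burn:]                 # discard burn-in

draws = gibbs_bivariate_normal(rho=0.8)
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # means near 0, corr near 0.8
```

In real applications the target is an intractable posterior, but the mechanics — cycle through the parameters, drawing each from its conditional given the current values of the others — are the same.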
Inference

Quantities of interest can be computed based on the model and the parameter estimates, e.g., the marginal effect of an x. Except in linear models with no higher-order terms, this is generally not the coefficient of x, but it is usually a function of the parameters. Uncertainty measures should be reported, based on the uncertainty measures for the parameters (and, for quantities pertaining to individual observations, also the fundamental uncertainty in the error term; e.g., E(Y_i|X_i) vs. Y_i|X_i).

Model dependence: to what extent does inference depend on the assumption that the model is true?

Data quality: what kinds of questions can be reliably answered from available data? Or, when can history be our guide?
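For a concrete instance of the marginal-effect point: in a logit model, Pr(y=1|x) = 1/(1+e^(−x'β)), so the marginal effect of x_j is β_j·p·(1−p), which varies with x and is not the coefficient itself. A small sketch with hypothetical coefficients:

```python
import numpy as np

def logit_marginal_effect(beta, x):
    """Marginal effects in a logit model at covariate vector x:
    dPr(y=1|x)/dx_j = beta_j * p * (1 - p), where p = 1/(1 + exp(-x'beta))."""
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    return beta * p * (1 - p)

# Hypothetical coefficients (intercept, education, income) for illustration
beta = np.array([-1.0, 0.8, 0.5])
x = np.array([1.0, 0.6, 0.2])       # one observation, intercept term first
print(logit_marginal_effect(beta, x))
```

Because the effect depends on x, it is commonly reported at representative values of the covariates or averaged over the observations.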