The Identification Zoo - Meanings of Identification in Econometrics: PART 2



The Identification Zoo - Meanings of Identification in Econometrics: PART 2 Arthur Lewbel Boston College 2016 Lewbel (Boston College) Identification Zoo 2016 1 / 75

The Identification Zoo - Part 2 - sections 5, 6, 7, 8. (These are notes to accompany a survey article on econometric identification commissioned by the Journal of Economic Literature.) More than two dozen types of identification appear in the econometrics literature, including (in alphabetical order): Bayesian identification, causal identification, eventual identification, exact identification, first order identification, frequentist identification, generic identification, global identification, identification arrangement, identification at infinity, identification of bounds, ill-posed identification, irregular identification, local identification, nearly-weak identification, nonparametric identification, non-robust identification, nonstandard weak identification, over identification, parametric identification, partial identification, point identification, sampling identification, semiparametric identification, set identification, semi-strong identification, strong identification, structural identification, thin-set identification, and weak identification.

1. Introduction Goals: Summarize the different types of identification and their meanings. Show how they relate to each other. Look at issues closely related to identification, such as coherence. Study some examples of identification.

Table of Contents:
1. Introduction
2. Identifying Ordinary (Global, Point) Identification
 2.1 Historical Roots of Identification
 2.2 Unifying Notions of Ordinary Identification
 2.3 Demonstrating Ordinary Identification
 2.4 Reasons for Ordinary Nonidentification
 2.5 Identification by Functional Form
 2.6 Over, Under, and Exact Identification, Rank and Order Conditions
3. Coherence and Completeness
4. Functions and Sets
 4.1 Partial, Point, and Set Identification
 4.2 Nonparametric and Semiparametric Identification
 4.3 The Meaning and Role of Normalizations in Identification
 4.4 Examples: Some Special Regressor Models
Continued next slide

Table of Contents - continued:
5. Causal Inference vs Structural Model Identification
 5.1 Causal vs Structural Identification: An Example
 5.2 Causal vs Structural Simultaneous Systems
 5.3 Causal vs Structural Conclusions
6. Limited Forms of Identification
 6.1 Local and Global Identification
 6.2 Generic Identification
7. Identification Concepts that Affect Inference
 7.1 Weak vs Strong Identification
 7.2 Identification at Infinity or Zero; Irregular and Thin Set Identification
 7.3 Ill-Posed Identification
 7.4 Bayesian Identification
8. Conclusions

Part 1 had sections: 2. Identifying Ordinary Identification 3. Coherence and Completeness 4. Identification of Functions and Sets

5. Causal Inference vs Structural Model Identification Most identification theory was developed for structural models, and came before the rise of randomized/causal modeling in econometrics. Most surveys of identification don't touch causal identification. Much structural identification (like Cowles) is for simultaneous systems. In contrast, consider Rubin (1974) counterfactual notation: Y(t), the outcome Y that would have occurred if treatment T had equaled t. This notation generally assumes T affects Y, not the reverse. Causal inference doesn't care about identifying structural parameters. The causal estimands to be identified are summary measures. Example: the average treatment effect (ATE): θ = E(Y(1) − Y(0)).

Other common causal estimands: average treatment effects on the treated (ATT), local average treatment effects (LATE), marginal treatment effects (MTE), quantile treatment effects (QTE). Typical structural model obstacles to identification are simultaneity, nonuniqueness, and multiple solutions. The main obstacle to identification of causal parameters is that they are defined in terms of unobservable counterfactuals. Instead of seeing whether a structural parameter can be identified, the Imbens and Angrist (1994) LATE starts with a standard estimator, and asks whether one can give the identifiable estimand a causal interpretation.

5.1 Causal vs Structural Identification: An Example Here we compare identification of the standard causal LATE estimator (Imbens and Angrist 1994) to standard structural ATE estimation. One difficulty in comparing is that causal and structural methods use different notations. To facilitate the comparison, I'll rewrite the LATE assumptions completely in terms of the corresponding restrictions on the standard structural model. Let Y_i be an observed outcome for individual i. Let T_i be a binary endogenous regressor (treatment indicator). Let Z_i be a binary covariate (potential instrument), correlated with T.

Assume a DGP for which the second moments of Y, T, and Z exist and can be identified. Let c = cov(Z, Y)/cov(Z, T), which is therefore identified (by construction). No model has yet been specified, but this c would be the estimated coefficient of T in a linear instrumental variables regression of Y on a constant and T, using a constant and Z as instruments.
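This equivalence is easy to check numerically. Below is a minimal sketch with a hypothetical simulated DGP (not from the slides); the only point is that c = cov(Z, Y)/cov(Z, T) coincides exactly with the 2SLS slope from regressing Y on a constant and T, instrumenting with a constant and Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Z = rng.integers(0, 2, n).astype(float)    # binary instrument
V = rng.normal(size=n)
T = (0.8 * Z + V > 0.4).astype(float)      # binary treatment, correlated with Z
U = rng.normal(size=n)
Y = 1.0 + 2.0 * T + 0.7 * V + U            # T endogenous (shares V with Y)

# The estimand c, identified directly from second moments
c = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]

# Just-identified linear IV: solve (W'X) beta = W'Y with X = [1, T], W = [1, Z]
X = np.column_stack([np.ones(n), T])
W = np.column_stack([np.ones(n), Z])
beta_iv = np.linalg.solve(W.T @ X, W.T @ Y)

assert abs(beta_iv[1] - c) < 1e-8          # identical by construction
```

The two numbers agree up to floating-point error because the just-identified IV slope is algebraically the same covariance ratio.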

Start with the most general possible model for Y and T: Y = G(T, Z, Ũ) and T = R(Y, Z, Ṽ). Ũ and Ṽ are vectors of unobservable errors. G and R are arbitrary, unknown functions. Assume G does not depend directly on Z, and R does not depend directly on Y. The model becomes Y = G(T, Ũ) and T = R(Z, Ṽ). These are exclusion restrictions. They make the model triangular.

Have Y = G(T, Ũ) and T = R(Z, Ṽ). Let U_0 = G(0, Ũ), U_1 = G(1, Ũ) − G(0, Ũ), V_0 = R(0, Ṽ), V_1 = R(1, Ṽ) − R(0, Ṽ). Since both T and Z are binary, this lets us, without loss of generality, rewrite the model as Y = U_0 + U_1 T and T = V_0 + V_1 Z. This is a linear random coefficients model. Also: since T and Z are binary, we also have that V_0 and V_0 + V_1 are binary.
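The "without loss of generality" claim can be verified by simulation. This is a sketch with hypothetical G and R functions (any functions would do, so long as R returns a binary T): the constructed random coefficients reproduce Y and T exactly, observation by observation, not just in expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
Z = rng.integers(0, 2, n).astype(float)
V_tilde = rng.normal(size=n)
U_tilde = rng.normal(size=n)

def R(z, v):
    # Hypothetical treatment equation; only needs to return a binary T
    return ((0.7 * z + v) > 0).astype(float)

def G(t, u):
    # Hypothetical outcome equation; can be arbitrarily nonlinear
    return np.exp(0.3 * t + 0.2 * u) + t * np.abs(u)

T = R(Z, V_tilde)
Y = G(T, U_tilde)

# Random coefficients constructed exactly as on the slide
U0 = G(0.0, U_tilde)
U1 = G(1.0, U_tilde) - G(0.0, U_tilde)
V0 = R(0.0, V_tilde)
V1 = R(1.0, V_tilde) - R(0.0, V_tilde)

# Because T and Z are binary, the linear rewrite holds exactly,
# and V0 and V0 + V1 are themselves binary.
assert np.allclose(Y, U0 + U1 * T)
assert np.allclose(T, V0 + V1 * Z)
assert set(np.unique(V0)) <= {0.0, 1.0} and set(np.unique(V0 + V1)) <= {0.0, 1.0}
```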

Now start over. Take the same model, but write it causally, using unobserved counterfactuals instead of unobserved errors. Rubin (1974) causal notation: Y(t) is the random variable denoting outcome Y if T = t. T(z) is the random variable denoting treatment T if Z = z. How do the notations relate? Y(t) = G(t, Ũ) and T(z) = R(z, Ṽ). As in the structural model, this assumes the exclusion that Z does not directly affect Y (more precisely, Z affects neither Y(0) nor Y(1)), and that Y does not directly affect T. We could have started with Y(t, z); the exclusion is then Y(t, 1) = Y(t, 0).

So far, structural and causal are identical; both assumed nothing except exclusions/triangular structure. A required causal inference assumption: SUTVA (Stable Unit Treatment Value Assumption): any one person's outcome is unaffected by the treatment that other people receive. The term SUTVA was coined by Rubin (1980); the concept goes back at least to Cox (1958), and is perhaps implicit in both Fisher (1935) and Neyman (1935).

SUTVA (Stable Unit Treatment Value Assumption) formal definition: for any two different people i and j, Y_i(0) and Y_i(1) are independent of T_j. SUTVA essentially rules out social interactions, peer effects, and most general equilibrium effects. Structural analogs to SUTVA are restrictions on correlations between {U_0i, U_1i} and {V_0j, V_1j, Z_j}. A sufficient condition (much stronger than necessary) for SUTVA: {U_0i, U_1i, V_0i, V_1i, Z_i} independent across i. Assume this for simplicity.

Y = U_0 + U_1 T and T = V_0 + V_1 Z. Same model as Y(T) and T(Z). Have assumed exclusions, and independence across people. What else? By construction T depends on V_0 and V_1. Endogenous T could also correlate with U_0 and U_1. Want Z to be an instrument, so add the causal assumption: {Y(1), Y(0), T(1), T(0)} is independent of Z. This is called unconfoundedness. It would be implied by randomly assigned Z. Structurally, unconfoundedness is the same as: {U_1, U_0, V_0, V_1} is independent of Z.

Y = U_0 + U_1 T and T = V_0 + V_1 Z. Same model as Y(T) and T(Z). Unconfoundedness is: {Y(1), Y(0), T(1), T(0)} is independent of Z; equivalently, {U_1, U_0, V_0, V_1} is independent of Z. It is typically justified by randomly assigned Z, and is somewhat stronger than the usual structural assumptions. Assumptions so far: exclusions/triangular, SUTVA/independence, unconfoundedness. Assume also structurally that E(V_1) ≠ 0, or equivalently in the causal notation, that E(T(1) − T(0)) ≠ 0. This assumption ensures that the instrument Z is relevant.

Given just these assumptions, what does the instrumental variables estimand c equal?
c = cov(Z, Y)/cov(Z, T) = cov(Z, U_0 + U_1 T)/cov(Z, V_0 + V_1 Z) = cov(Z, U_0 + U_1(V_0 + V_1 Z))/cov(Z, V_0 + V_1 Z) = cov(Z, U_1 V_1 Z)/cov(Z, V_1 Z) = E(U_1 V_1) var(Z)/(E(V_1) var(Z)) = E(U_1 V_1)/E(V_1).
For this problem, the only difference between the structural and causal approaches is that different additional assumptions are made to interpret the equation c = E(U_1 V_1)/E(V_1).
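A simulation sketch of this result, with hypothetical type probabilities and effect distributions chosen purely for illustration: drawing (U_0, U_1, V_0, V_1) independently of Z and building Y and T from the random coefficients model, the sample IV estimand matches E(U_1 V_1)/E(V_1).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
# Hypothetical population of four types: compliers, defiers, always/never takers
types = rng.choice(4, n, p=[0.4, 0.1, 0.2, 0.3])
V0 = np.where((types == 1) | (types == 2), 1.0, 0.0)
V1 = np.where(types == 0, 1.0, np.where(types == 1, -1.0, 0.0))
U1 = 1.0 + 0.5 * V1 + rng.normal(size=n)   # effect size correlated with type
U0 = rng.normal(size=n)
Z = rng.integers(0, 2, n).astype(float)    # independent of everything above
T = V0 + V1 * Z
Y = U0 + U1 * T

c = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]
# IV estimand equals E(U1 V1) / E(V1), up to sampling error
assert abs(c - np.mean(U1 * V1) / np.mean(V1)) < 0.05
```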

Consider structural identification first. The model is still: Y = U_0 + U_1 T and T = V_0 + V_1 Z. Can rewrite this as: Y = a + bT + e, where a = E(U_0), b = E(U_1), and e = (U_1 − b)T + U_0 − a. Structural identification question: when does the IV estimand c = cov(Z, Y)/cov(Z, T) equal the structural coefficient b? Standard answer: Z being a valid structural model instrument requires cov(e, Z) = 0. And what does b mean? Recall the definition ATE = E[Y(1) − Y(0)]. Under our assumptions, E[Y(t)] = E(U_0 + U_1 t) = E(U_0) + E(U_1) t for t = 0 and for t = 1. Therefore E[Y(1) − Y(0)] = E(U_1) = b. So the structural coefficient b is precisely the causal ATE.

With Y = a + bt + e, have Z is valid structural instrument if cov (e, Z ) = 0, and then IV estimand c = b. But, given our previous, causal type assumptions, what does cov (e, Z ) = 0 mean? What does this structural restriction imply? Since Z is independent of {U 1, U 0, V 0, V 1 }, we have that cov (e, Z ) = cov (U 1, V 1 ) var (Z ), so cov (e, Z ) = 0 if cov (U 1, V 1 ) = 0. Note from above that c = E (U 1V 1 ) E (V 1 ) = E (U 1 ) + cov (U 1, V 1 ) E (V 1 ) So again c = b if cov (U 1, V 1 ) = 0. = E (U 1) E (V 1 ) + cov (U 1, V 1 ) E (V 1 ) = b + cov (U 1, V 1 ) E (V 1 ) Summary: the structural identifying assumption is cov (U 1, V 1 ) = 0. Under this assumption, c = b = ATE for the population. Lewbel (Boston College) Identification Zoo 2016 20 / 75

Now look at causal identification, and then compare. Define a complier: a person i for whom Z_i and T_i are the same random variable. If a complier has Z = 0, then he has T = 0, and if he has Z = 1, then he has T = 1. Define a defier: a person i for whom Z_i and 1 − T_i are the same random variable. We can't know who the compliers and defiers are, because they are defined in terms of counterfactuals. If someone has Z = 0, we can see whether they have T = 0 or T = 1, but we can't know what their T would have been if they had been assigned Z = 1.

Recall T = V_0 + V_1 Z. All the possibilities relating T to Z are:
Compliers: T = Z: V_0 = 0 and V_1 = 1.
Defiers: T = 1 − Z: V_0 = 1 and V_1 = −1.
Always takers: T = 1 for any Z: V_0 = 1 and V_1 = 0.
Never takers: T = 0 for any Z: V_0 = 0 and V_1 = 0.
Compliers are the only type that have V_1 = 1. Imbens and Angrist (1994) define the LATE (local average treatment effect) as the ATE just among compliers. (Note: their use of the word local is not the same as in local identification.) In our notation here, LATE = E[Y(1) − Y(0) | V_1 = 1].

Now revisit the IV coefficient c. Let P_v be the probability that V_1 = v, for v in {−1, 0, 1}. Then, by the definition of expectations,
c = E(U_1 V_1)/E(V_1)
= [E(U_1 V_1 | V_1 = 1) P_1 + E(U_1 V_1 | V_1 = 0) P_0 + E(U_1 V_1 | V_1 = −1) P_(-1)] / [E(V_1 | V_1 = 1) P_1 + E(V_1 | V_1 = 0) P_0 + E(V_1 | V_1 = −1) P_(-1)]
= [E(U_1 | V_1 = 1) P_1 − E(U_1 | V_1 = −1) P_(-1)] / (P_1 − P_(-1)).
Imbens and Angrist (1994) assume that there are no defiers in the population. This rules out the V_1 = −1 case, making P_(-1) = 0. Then the above simplifies to: c = E(U_1 | V_1 = 1) = LATE. So causal assumes V_1 ≠ −1 (no defiers) and identifies the LATE for just compliers.
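The no-defiers simplification can also be illustrated numerically. In this hypothetical DGP there are only compliers, always takers, and never takers, and compliers are given a deliberately different average treatment effect from everyone else: the IV estimand recovers the complier mean (the LATE), not the population ATE.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
# Hypothetical no-defier population: compliers, always takers, never takers
types = rng.choice(3, n, p=[0.5, 0.2, 0.3])
V0 = (types == 1).astype(float)
V1 = (types == 0).astype(float)          # V1 = 1 only for compliers
U1 = np.where(types == 0, 2.0, 5.0) + rng.normal(size=n)   # compliers differ
U0 = rng.normal(size=n)
Z = rng.integers(0, 2, n).astype(float)
T = V0 + V1 * Z
Y = U0 + U1 * T

c = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]
late = np.mean(U1[V1 == 1.0])            # ATE among compliers only (approx. 2.0)
ate = np.mean(U1)                        # population ATE (approx. 3.5)

assert abs(c - late) < 0.05              # IV recovers the LATE ...
assert abs(c - ate) > 1.0                # ... not the population ATE
```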

Comparing Structural and Causal: the model is Y = U_0 + U_1 T and T = V_0 + V_1 Z, where V_0 and V_0 + V_1 are binary. Given our assumptions (exclusions/triangular, SUTVA/independence, unconfoundedness): Structural makes the additional assumption cov(U_1, V_1) = 0. This restricts heterogeneity of the treatment effect U_1. It identifies the ATE for the population. Causal instead makes the additional assumption V_1 ≠ −1 (no defiers). This restricts heterogeneity of types of individuals. It identifies the LATE for compliers, and says nothing about the treatment effect on non-compliers.

Comparing Structural and Causal - continued: Limitations of LATE: Says nothing about the treatment effect on noncompliers (ok if Z is a policy variable?). We don't know who the compliers are (though we can identify some observable characteristics of their distribution). Neither the structural nor the causal assumptions above are a priori more plausible. Both require assumptions on unobservables: not just cov(U_1, V_1) = 0 or V_1 ≠ −1, but also assumptions like exclusion restrictions and SUTVA. Structural cov(U_1, V_1) = 0 restricts assumed heterogeneity of the treatment effect U_1. It assumes an individual's type V_1 is on average unrelated to the size of the personal treatment effect U_1. But it delivers b, the population ATE. Causal V_1 ≠ −1 (no defiers) restricts heterogeneity of types of individuals. Compared to cov(U_1, V_1) = 0, it has the advantage of not imposing restrictions on the Y equation. But it only delivers the LATE.

Another property of LATE: the definition of a complier depends on Z. Suppose we saw a different instrument Z̃ instead of Z. Let c = cov(Z, Y)/cov(Z, T) and c̃ = cov(Z̃, Y)/cov(Z̃, T). If Z̃ is a valid instrument in the structural sense, then c = c̃ = b = ATE in the population. One could then also test instrument validity: if we reject c = c̃, then either Z or Z̃ is not a (structurally) valid instrument. But even if both are causally valid, we will usually have c ≠ c̃. Both are LATEs, but for different (unknown) compliers. Are there any testable implications of causal instrument validity? Yes, there exist a few weak inequalities one can test, e.g., P_1 can't be negative. See, e.g., Kitagawa (2015).
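A sketch of the "different instruments, different LATEs" point, using hypothetical subgroups: Z and Z̃ are both causally valid, but each shifts treatment only for its own complier group, so the two IV estimands converge to different numbers.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
Z = rng.integers(0, 2, n).astype(float)
Z_tilde = rng.integers(0, 2, n).astype(float)   # independent second instrument
# Group 0 complies with Z, group 1 complies with Z_tilde, group 2 never takes
group = rng.choice(3, n, p=[0.4, 0.4, 0.2])
T = np.where(group == 0, Z, np.where(group == 1, Z_tilde, 0.0))
U1 = np.where(group == 0, 1.0, 4.0) + rng.normal(size=n)   # heterogeneous effects
Y = rng.normal(size=n) + U1 * T

c = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]
c_tilde = np.cov(Z_tilde, Y)[0, 1] / np.cov(Z_tilde, T)[0, 1]

# Each estimand is the LATE for its own complier group
assert abs(c - 1.0) < 0.05
assert abs(c_tilde - 4.0) < 0.05
```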

An Aside: We started with a nonparametric model Y = G(T, Ũ) with any G function and any error vector Ũ, and showed it was equivalent to a linear model Y = θT + e (possibly with T endogenous). Nonparametric model = linear model? Yes! But it's a very special property of binary regressors. Also, if we have Y = g(X) + e for a discrete vector X (with a finite number of mass points), we can also write it as a linear regression. Let D be a vector of dummies: for each value x that X can take on, let an element of D equal one when X = x and zero otherwise. Then Y = g(X) + e = c'D + e. This linear regression is called a saturated regression model. Since g(X) = c'D, identifying the vector c is the same as identifying g over the support of X.
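The saturated regression idea can be sketched in a few lines (hypothetical support points and g function): regressing Y on one dummy per mass point of X, with no intercept, recovers g at each support point.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150_000
support = np.array([1.0, 2.0, 5.0])          # hypothetical discrete support of X
X = rng.choice(support, n)

def g(x):
    # Arbitrary (hypothetical) regression function
    return np.log(x) + x ** 2

Y = g(X) + rng.normal(size=n)

# Saturated regression: one dummy per support point, no intercept
D = np.column_stack([(X == x).astype(float) for x in support])
c_hat, *_ = np.linalg.lstsq(D, Y, rcond=None)

# Each coefficient recovers g at the corresponding support point
assert np.allclose(c_hat, g(support), atol=0.02)
```

Here OLS on the dummies simply computes the mean of Y within each cell of X, which is why no functional form for g is needed.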

5.2 Causal vs Structural Simultaneous Systems Assume now a simultaneous system of equations, say Y = U_0 + U_1 X and X = H(Y, Z, V). (U_0, U_1) and V are unobserved error vectors. We observe Y, X, and instrument Z. (Using X rather than T because it might not be a treatment, and need not be binary.) As before, analyze the meaning of c = cov(Z, Y)/cov(Z, X). The structural analysis is exactly the same as before: have Y = a + bX + e where b = E(U_1) and e = U_0 − a + (U_1 − b)X. If cov(e, Z) = 0 then c = b = the average marginal effect of X on Y. In contrast, the causal analysis is much more complicated. Angrist, Graddy and Imbens (2000) show that under LATE type assumptions, c equals a complicated weighted average of U_1, with weights that depend on Z, and some weights are zero.

Why does causal become complicated? In part because the definition of potential outcomes imposes restrictions on the simultaneous system. Structurally we can write Y = G(X, U) and X = H(Y, Z, V). The corresponding counterfactuals would be Y(x) and X(y, z). Assuming the existence of these counterfactual random variables imposes restrictions on the structure. For example, if the structural model is incomplete (see section 3), then the counterfactuals are not well defined. Example: G and H describe actions of players X and Y in a game. If the game has multiple equilibria, then unique random variables Y(x) and X(y, z) do not exist. Any given Y(x) and X(y, z) variables correspond to assuming players follow a specific equilibrium selection rule. More generally, causal methods cannot be used to analyze potentially incomplete models.

Assuming coherence and completeness, consider constructing a semi-reduced form equation X = R(Z, U, V), defined by solving the equation X = H(G(X, U), Z, V) for X. Instead of counterfactuals Y(x) and X(y, z), in causal analyses it is far more common to define (and place assumptions directly on) counterfactuals Y(x) and X(z), corresponding to the semi-reduced form, triangular model Y = G(X, U) and X = R(Z, U, V). Indeed, causal inference methods are often called reduced form modeling.

One more limitation in applying causal analyses to simultaneous systems: SUTVA. Many structural models exist in which the treatment of one person affects the outcomes of others. Examples: peer effect models, social interactions models, network models, general equilibrium models. In causal frameworks such effects are generally ruled out by SUTVA. These difficulties may be why causal methods have become popular in traditionally partial equilibrium analyses (e.g., labor economics and micro development), but not where general equilibrium models are the norm (e.g., industrial organization and macroeconomics).

5.3 Causal vs Structural Identification: Summary Structural models summary: Provide information about underlying structure. Can incorporate behavioral restrictions implied by economic theory. Can gain identification from such restrictions, even without randomization. Structure, if correctly specified, provides external validity. But can be misspecified. Box (1979): "All models are wrong, but some are useful".

Structural models summary - continued: A common objection to the structural modeling approach: where are the deep parameters that structural models have uncovered? What are their values? One answer: calibrated parameters in macro models. Calibrated parameters are deep parameters with values that users of these models largely agree upon. They are generally based on structural model estimates, though a few, like rates of time preference and risk aversion, have been additionally reinforced by lab and field experiments. Also, experience from many structural models has led to consensus regarding the ranges of values for many parameters (such as price and income elasticities) that are widely recognized as reasonable.

Causal models summary: Can also be misspecified (e.g., there could be defiers). But requires far fewer behavioral restrictions, particularly on outcomes. External validity question: what is the relevance of the LATE to other data sets? How representative are compliers? Without structure, can't identify effects on unobservables of economic interest, e.g., utility, risk aversion, bargaining power, or expectations. Even with random assignment, treatment of large numbers of people or treatment spread over time may violate SUTVA. Example: Progresa in Mexico, where people move to communities that have the program, or change behavior from interacting with treated individuals, or in expectation that the program will expand to their own community (see, e.g., Behrman and Todd 1999).

Causal vs Structural Identification Conclusions Researchers do NOT need to choose between either searching for randomized instruments or building structural models. For identification, the best strategy is to exploit both approaches. 1. Causal inference can be augmented with structural econometric methods to cope with attrition, sample selection, measurement error, and contamination bias. 2. Structural identification often depends on independence assumptions, which are more likely to hold under randomization. 3. Causal effects can provide useful benchmarks for structure. For example, estimate a structural model from a large survey, then check whether treatment effects implied by the estimated structural parameters equal causal treatment effects from small randomized trials drawn from the same underlying population.

Causal vs Structural Identification Conclusions - continued 4. Can use structure to link estimated causal effects of a treatment (say, access to credit) on observable characteristics (like a household's expenditures) to effects on less observable characteristics (like women's utility and bargaining power in a household). 5. Economic theory and structure can provide guidance regarding the external validity of causally identified parameters. E.g., Dong and Lewbel (2015): with a mild structural assumption called local policy invariance, one can identify how treatment effects estimated by regression discontinuity would change if the cutoff point changed. 6. Big data analyses on large data sets can uncover promising correlations. Structural analyses of such data might then be used to uncover possible economic and behavioral mechanisms that underlie these correlations, and randomization can be used to investigate the causal direction of these correlations.

Causal vs Structural Identification Conclusions - continued Various attempts have been made over time to formally unify structural and causal methods. See, e.g., Heckman, Urzua and Vytlacil (2006), Pearl (2009), and Heckman (2010). Other notions of causality exist that are relevant to the economics and econometrics literature, going back to Hume (1739) and Mill (1851), and including temporal based definitions as in Granger (1969). See Hoover (2006) for a survey of alternative notions of causality in economics and econometrics.

6. Limited Forms of Identification It's often hard to prove ordinary identification. Possible responses: 1. Set identification (estimation and inference are hard). 2. Ignore the problem and assume identification. A third way: prove that some more limited form of identification holds. Then the leap of faith in assuming ordinary identification is not as large. Limited forms of identification include local identification and generic identification.

6.1 Local vs. Global Identification Another term for standard identification is global identification. Let Θ be the set of all possible values of θ. Global emphasizes that for each s in S, we can identify a unique value of θ over the entire range of possible values Θ (see, e.g., Komunjer 2012). Local identification means θ is identified if we restrict Θ to a small open neighborhood of the true θ_0. Local identification differs from (and predates) the term local as used in LATE (local average treatment effect). In LATE, local means the mean parameter value for a particular subpopulation.

Local and global identification: suppose m(X) is an identified, continuous function, θ is real, and m(θ) = 0. Consider three possible cases. Case a: suppose we know m(x) is strictly monotonic. Then θ (if it exists) is globally identified: strict monotonicity ensures only one value of θ can satisfy m(θ) = 0. Case b: suppose m is known to be a Jth order polynomial for some integer J. Then θ will typically not be globally identified: there are up to J different values of θ that satisfy m(θ) = 0. However, in this case θ is locally identified: there always exists a neighborhood Θ of the true θ small enough to exclude all other roots of m(x) = 0. Case c: suppose all we know about m is that it is continuous. Then θ might not even be locally identified, because m(x) could equal zero for all values of x in some interval.
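Case b can be illustrated concretely with a hypothetical cubic: m(θ) = 0 has several solutions, so θ is not globally identified, but each root is isolated, which is exactly local identification.

```python
import numpy as np

# Case b: m is a known cubic (hypothetical), m(x) = x^3 - 3x + 1
coeffs = [1.0, 0.0, -3.0, 1.0]
roots = np.roots(coeffs)
real_roots = np.sort(roots[np.isclose(roots.imag, 0.0)].real)

# Three observationally equivalent parameter values: no global identification
assert len(real_roots) == 3

# But the roots are isolated, so each is locally identified: a small enough
# neighborhood of the true theta contains no other root
assert np.all(np.diff(real_roots) > 0.5)
```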

If a real parameter vector θ is set identified, and the set has a finite number of elements, then θ is locally identified. Example: suppose there are multiple local optima when maximizing an extremum estimator objective function like maximum likelihood or GMM. If in the population there are a finite number of local optima, and they all have the same value of the objective function, then the parameters are locally but not globally identified.

In nonlinear models, it is often much easier to prove local rather than global identification. Local identification may be sufficient in practice if we have enough economic intuition about true parameter values to know that the correct θ should lie in a particular region. Example: Lewbel (2012). The parameter is a coefficient in a simultaneous system of equations, and the identified set has two elements, one positive and one negative. In this case we have only local identification, unless the economic model suffices to determine the sign of the parameter a priori (e.g., it is the price coefficient in a supply equation). Here we were helped because the regions where the different θ's lie are known.

The notion of local identification is described by Fisher (1966) for linear models. It was generalized to other parametric models by Rothenberg (1971) and Sargan (1983), as follows. Let θ be the vector of structural parameters and let φ be the set of reduced form parameters. Consider structures S(φ, θ) of the form r(φ(θ), θ) = 0 for some known vector valued function r. Let R(θ) = r(φ(θ), θ). A sufficient condition for local identification of θ is that R(θ) be differentiable and that the rank of ∂R(θ)/∂θ equal the number of elements of θ. Sargan (1983) calls violation of this condition first order (lack of) identification.
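This rank condition can be checked numerically. Below is a sketch using a hypothetical two-equation restriction R in two parameters (not from the slides): approximate ∂R/∂θ by central finite differences at a solution θ0 of R(θ) = 0 and check that its rank equals dim(θ). Note R below also vanishes at a second point, so θ is locally but not globally identified, matching the earlier discussion.

```python
import numpy as np

def R(theta):
    # Hypothetical restriction: R(a, b) = (a + b - 3, ab - 2), which
    # vanishes at (1, 2) and also at (2, 1) -- local, not global, identification.
    a, b = theta
    return np.array([a + b - 3.0, a * b - 2.0])

def jacobian(f, theta, eps=1e-6):
    # Central finite-difference approximation to dR/dtheta.
    theta = np.asarray(theta, dtype=float)
    J = np.zeros((len(f(theta)), len(theta)))
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        J[:, j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return J

theta0 = np.array([1.0, 2.0])                      # a solution of R(theta) = 0
rank = np.linalg.matrix_rank(jacobian(R, theta0))
print(rank)  # 2 = dim(theta): the rank condition for local identification holds
```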

For MLE models, the rank condition for local identification is equivalent to the condition that the information matrix evaluated at the true θ be nonsingular. Newey and McFadden (1994) and Chernozhukov, Imbens, and Newey (2007) extend local identification to other models. Chen, Chernozhukov, Lee, and Newey (2014) provide a rank condition for local identification of a finite parameter vector when the structure consists of conditional moment restrictions. Chen and Santos (2015) define a concept of local overidentification.

6.2 Generic Identification. Like local identification, generic identification is a weaker condition than ordinary identification that is sometimes easier to prove. Let Φ be the set of all possible φ. Define Φ̃ ⊂ Φ to be the set of all φ such that, for the given structure S, the parameters θ are not identified. Based on McManus (1992), θ is defined to be generically identified if Φ̃ is a measure zero subset of Φ. Interpretation: imagine nature picks a true φ0 by drawing from a continuous distribution over all possible φ ∈ Φ. Then θ is generically identified if the probability that it is not point identified is zero.

Generic identification is closely related to the order condition. Consider a linear regression system where the order condition holds. Then the rank condition holds, giving identification, if a certain matrix of coefficients is nonsingular. In the set of all possible values for that matrix, if the subset that is singular has measure zero, then the order condition suffices for generic identification of the model. Similarly, the coefficients in a linear regression model Y = Xβ + e with E(e | X) = 0 are generically identified if the probability is zero that nature chooses a distribution function for X with the property that E(XX′) is singular.
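A small simulated illustration of this last point (the design below is a hypothetical example): a "generic" design matrix has nonsingular sample analog of E(XX′), while exact collinearity, a measure-zero event when X is drawn from a continuous distribution, makes it singular and breaks identification of β.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)

# Generic case: regressors drawn from a continuous distribution.
X_generic = np.column_stack([np.ones(n), x1, rng.normal(size=n)])
# Non-generic (measure zero) case: the third regressor is exactly 2 * x1.
X_collinear = np.column_stack([np.ones(n), x1, 2 * x1])

for X in (X_generic, X_collinear):
    M = X.T @ X / n                  # sample analog of E(XX')
    print(np.linalg.matrix_rank(M))  # 3 (beta identified), then 2 (not identified)
```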

Another example: suppose iid observations of Y, X are observed, with X = X* + U and Y = X*β + e, where the unobserved model error e, the unobserved measurement error U, and the unobserved true covariate X* are mutually independent with mean zero. Despite not having instruments, β is identified when Y, X have any joint distribution except the normal (Reiersøl 1950). So β is generically identified if the set of possible Y, X distributions is sufficiently large, e.g., if e could have been drawn from any continuous distribution.

Given the same independence of U, e, and X*, Schennach and Hu (2013) show that the function m in the nonparametric regression model Y = m(X*) + e is nonparametrically identified as long as m and the distribution of e are not members of a certain parametric class of functions. So again one could say mutual independence of e, U, and X* leads to generic identification of m, as long as m could have been any smooth function or e could have been drawn from any smooth distribution. Generic identification is commonly used in social interactions models. See, e.g., the survey by Blume, Brock, Durlauf, and Ioannides (2011).

7. Identification Concepts that Affect Inference. Identification precedes estimation: we first show identification, then consider properties of estimators. However, sometimes the nature of the identification affects inference. The way θ is identified may impact the properties of any possible estimator θ̂. These identification concepts that affect inference include weak identification, identification at infinity, and ill-posedness. Note that many previously discussed identification concepts also relate to estimation: extremum based identification, rates of nonparametric vs parametric estimation, and set vs point estimation.

7.1 Weak vs. Strong Identification. Weakly identified and strongly identified parameters are point identified, so strictly speaking these should not be called forms of identification. They instead refer to specific difficulties relating to asymptotic inference. They're called forms of identification because they arise from features of the underlying structure s, and hence affect any possible estimator we might propose. Informally, weak identification arises in situations that are, in a particular way, close to being nonidentified.

Usual source of weak identification: low correlations among variables used to attain identification, like an instrument Z and covariate X. Parameters are not identified if the correlation is zero. Identification is weak (or the instruments Z are weak) when this correlation is close to zero. The first stage of 2SLS regresses X on Z to get fitted values X̂. Some coefficients may be weakly identified if E(X̂X̂′) is ill conditioned (nearly singular). More generally, in a GMM model weak identification may occur if the moments used for estimation yield noisy or generally uninformative estimates of the underlying parameters.

The key feature of weak identification is NOT imprecise estimates with large standard errors (though weakly identified estimates do have that feature). Rather, standard asymptotics poorly approximate the true precision of parameter estimates when identification is weak. (Recall all asymptotics are merely approximations to true small sample distributions.) Compare: nonparametric regressions are imprecise with large standard errors (due to slow convergence rates from the curse of dimensionality). But nonparametric regressions are not weakly identified, because standard asymptotic theory (e.g., a CLT with slow rates) adequately approximates their true estimation precision.

Suppose that θ is identified, and can be estimated at rate root-n using an extremum estimator (maximizing some objective function, e.g., least squares or GMM or maximum likelihood). If some parameters are weakly identified, then any objective function used to identify and estimate them will be close to flat in some directions. Flatness yields imprecision, but more relevantly, flatness also means that standard errors and t-stats (either analytic or bootstrapped) will be poorly estimated: they depend on the inverse of a matrix of derivatives of the objective function, and flatness makes this matrix close to singular. Strongly identified parameters are not weak, so standard estimated asymptotic distributions provide good approximations to their actual finite sample distributions.

Weak identification resembles multicollinearity, which in linear regression means E(XX′) is ill-conditioned. Like multicollinearity, it is not that parameters either "are" or "are not" weakly identified: weakness of identification depends on the sample size. For big enough n, standard asymptotic approximations must become valid. This is why weakness of identification is often judged by rules of thumb rather than formal tests. Staiger and Stock (1997) suggest, for linear two stage least squares models, the rule of thumb that one should be concerned about potential weak instruments if the F-statistic on the excluded instruments in the first stage is less than 10.
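A sketch of the statistic behind this rule of thumb: with a single instrument, the first-stage F is the squared t-statistic on Z in the regression of X on Z. The DGP below is hypothetical, chosen so one instrument is weak and one is strong.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_stage_F(beta, n=2000):
    # First stage: X = beta*Z + V, then F = (bhat / se(bhat))^2 for one instrument.
    Z = rng.normal(size=n)
    X = beta * Z + rng.normal(size=n)
    bhat = (Z @ X) / (Z @ Z)
    resid = X - bhat * Z
    se = np.sqrt((resid @ resid) / (n - 1) / (Z @ Z))
    return (bhat / se) ** 2

F_weak = first_stage_F(beta=0.05)   # weak instrument: F usually below 10
F_strong = first_stage_F(beta=1.0)  # strong instrument: F far above 10
print(F_weak, F_strong)
```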

Two separate things: 1. Parameters that are weakly identified at reasonable sample sizes. 2. The models econometricians use to deal with weak identification. In real data, weak identification disappears when n gets sufficiently large. This makes it difficult to provide a general asymptotic theory to deal with the problem. The econometrician's trick is an alternative asymptotic theory known as drifting parameter models.

Consider Y = θX + U and X = βZ + V, where the data are iid observations of the scalars Y, X, Z. The errors satisfy E(UZ) = E(VZ) = 0, and for simplicity all variables are mean zero. If β ≠ 0, then θ is identified by θ = E(ZY)/E(ZX), corresponding to standard IV. Since E(ZX) = βE(Z²), if β is close to zero then both E(ZY) and E(ZX) are close to zero, making θ weakly identified. But how close is close? The bigger the sample size, the more accurately E(ZX) and E(ZY) can be estimated, and hence the closer β can be to zero without causing trouble. To capture this idea asymptotically, pretend the true β is not a constant, but instead is βn = bn^(-1/2). The larger n gets, the smaller the coefficient βn becomes. This gives us a model that has the weak identification problem at all sample sizes, and so can be analyzed asymptotically.
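A Monte Carlo sketch of the drifting-parameter device (the DGP details are hypothetical): with βn = b/√n, the dispersion of the IV estimate of θ does not shrink as n grows, mimicking weak identification at every sample size.

```python
import numpy as np

rng = np.random.default_rng(1)

def iv_estimates(n, b=1.0, theta=1.0, reps=500):
    # Simulate Y = theta*X + U, X = beta_n*Z + V with beta_n = b / sqrt(n),
    # and compute the IV estimate, the sample analog of E(ZY)/E(ZX).
    beta_n = b / np.sqrt(n)
    est = []
    for _ in range(reps):
        Z = rng.normal(size=n)
        X = beta_n * Z + rng.normal(size=n)
        Y = theta * X + rng.normal(size=n)
        est.append((Z @ Y) / (Z @ X))
    return np.array(est)

mads = {}
for n in (500, 5000):
    e = iv_estimates(n)
    mads[n] = np.median(np.abs(e - 1.0))  # robust dispersion around true theta = 1
    print(n, mads[n])                     # dispersion does not shrink with n
```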

If it were true that βn = bn^(-1/2), then b, and hence βn, is often not identified, so tests and confidence regions for θ have been developed that are robust to weak instruments, that is, they do not depend on estimating βn. See, e.g., Andrews, Moreira, and Stock (2006) for an overview of such methods.

In the literature, when we say θ "is" weakly identified, we mean we are providing asymptotic theory for inference that allows for identification to be based on a parameter that drifts towards zero. So in the above model we would say θ is strongly identified when the asymptotics we use are based on βn = β, and we would say θ is weakly identified when the asymptotics used for inference are based on (or at least allow for) βn = bn^(-1/2).

Usually, we don't believe that parameters are actually drifting towards zero as n grows. Rather, when assuming weak identification, we're expressing the belief that the drifting parameter model provides better asymptotic approximations to the true distribution than standard asymptotics. Other related terms include semi-strong identification and nonstandard-weak identification. See, e.g., Andrews and Guggenberger (2014). These and other variants have parameters drift to zero at other rates, or have models with a mix of drifting, nondrifting, and purely unidentified parameters.

Historical notes on weak identification. The earliest recognition of the problem may be Sargan (1983): "we can conjecture that if the model is almost unidentifiable then in finite samples it behaves in a way which is difficult to distinguish from the behavior of an exactly unidentifiable model." Other early related work: Phillips (1983), Rothenberg (1984). Bound, Jaeger, and Baker (1995) specifically raised the issue of weak instruments in an empirical context; see also Staiger and Stock (1997). A survey of the weak instruments problem is Stock, Wright, and Yogo (2002).

7.2 Identification at Infinity or Zero; Irregular and Thin Set Identification. Identification at infinity is when identification is based on the distribution of data at points where one or more variables goes to infinity. Suppose iid scalar random variables Y, D, Z. Assume Y = Y*D and Z ⊥ Y*, where Y* is a latent unobserved variable, D is binary, and lim_{z→∞} E(D | Z = z) = 1. The goal is identification and estimation of θ = E(Y*). This is a selection model: Y* is selected (observed) only when D = 1. So D could be a treatment indicator, Y* is the outcome if one is treated, θ is the average outcome if everyone were treated, and Z is an observed variable (an instrument) that affects the probability of treatment, with the probability of treatment going to one as Z goes to infinity.

Recall Y = Y*D, Z ⊥ Y*, Y* is latent, D is binary, lim_{z→∞} E(D | Z = z) = 1, and θ = E(Y*). Y* and D may be correlated, so E(Y) confounds the two, and θ is identified only by lim_{z→∞} E(Y | Z = z). But everyone who has Z infinite is treated, so looking at the mean of just those people eliminates the problem. In real data we would estimate θ as the average value of Y just for people that have Z > c for some chosen c, and then let c → ∞ as the sample size grows. Or we could estimate θ as a weighted average of Y with weights that get arbitrarily large as Z gets arbitrarily large.
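A simulated sketch of this estimator (the selection DGP below is hypothetical): averaging Y over everyone with D = 1 is biased because selection is correlated with Y*, while averaging only over observations with Z above a large cutoff c nearly recovers θ = E(Y*).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Z = rng.normal(size=n)
U = rng.normal(size=n)
Ystar = 1.0 + U                       # latent outcome, theta = E(Y*) = 1
# Selection depends on both Z and U, so D is correlated with Y*.
D = (Z + 0.5 * U + rng.normal(size=n) > 0).astype(float)
Y = Ystar * D                          # Y* observed only when D = 1

naive = Y[D == 1].mean()               # biased: ignores selection on U
c = 2.5                                # large cutoff; in practice let c grow with n
tail = Y[(Z > c) & (D == 1)].mean()    # nearly everyone with Z > c is treated
print(naive, tail)                     # tail estimate is close to theta = 1
```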

First shown by Chamberlain (1986): estimators based on identification at infinity usually converge slower than root-n. The same estimation problem arises whenever identification is based on Z taking on a value or range of values that has probability zero. Khan and Tamer (2010) call this thin set identification. Example: Manski's (1985) maximum score estimator for binary choice models is thin set identified, because it gets identifying power only from information at one point (the median) of a continuously distributed variable.

Khan and Tamer (2010) use irregular identification to specifically describe cases where thin set identification leads to slower than root-n rates of estimation. Not all thin set identified or identification at infinity parameters are irregular. For example, estimates of θ = E(Y*) in the selection problem can converge at rate root-n if Z has a strictly positive probability of equaling infinity. More subtly, the impossibility theorems of both Chamberlain (1986) and Khan and Tamer (2010), showing that some thin set identified models cannot converge at rate root-n, assume that the variables in the DGP have finite variances. This is one way that the Lewbel (2000) special regressor estimator can converge at rate root-n.

Irregular identification is not the same as weak identification. Irregularly identified parameters converge slowly for essentially the same reasons that kernel regressions converge slowly: they're based on a vanishingly small subset of the entire data set. Also like nonparametric regression, given suitable regularity conditions, standard methods can usually be used to derive asymptotic theory for irregularly identified parameters. However, the rates of convergence of thin set identified parameters can vary widely, from extremely slow up to root-n, depending on details regarding the shape of the density function in the neighborhood of the identifying point.

Sometimes it's easier to prove a parameter is identified by looking at a thin set, but the parameter may still be strongly identified, because the data structure contains more information about the parameter than is being used in the proof. Example: Heckman and Honore (1989) showed that without parameterizing error distributions, the competing risks model is identified by data where covariates can drive individual risks to zero. Lee and Lewbel (2013) later showed that this model is actually strongly identified, using information over the entire DGP, not just where individual risks go to zero.

Notes on irregular identification: Identification at infinity of the selection model given above is discussed by Chamberlain (1986), Heckman (1990), Andrews and Schafgans (1998), Lewbel (2007), and Khan and Tamer (2010). Khan and Tamer point out that the average treatment effect model of Hahn (1998) and Hirano, Imbens, and Ridder (2003) is also generally irregularly identified, and so will not attain the parametric root-n rates derived by those authors unless the instrument has thick tails. Similarly, Lewbel's (2000) special regressor estimator requires thick tails to avoid being irregular.

7.3 Ill-Posed Identification. Let θ be (ordinary) identified from information φ by a structure s. Let s be such that, for some function g, θ = g(φ). Ill-posedness is when g is not sufficiently smooth. Ill-posedness is an identification concept like weak identification or identification at infinity: it's a feature of φ and s, and hence a property of the population. The concept of well-posedness (the opposite of ill-posedness) is due to Hadamard (1923). See Horowitz (2014) for a survey.

Example: iid observations of W, so φ is the distribution function of W, denoted F(w). A uniformly consistent, asymptotically root-n normal estimator of F is the empirical distribution function F̂(w) = Σ_{i=1}^n I(Wi ≤ w)/n. Suppose θ = g(F) for some known g. Then θ is (ordinary) identified. If g is continuous, then θ̂ = g(F̂) is consistent. If g is not continuous, then θ̂ is generally not consistent, and we have ill-posedness.

When ill-posed, a consistent θ̂ requires "regularization," smoothing out the discontinuity in g. Regularization introduces bias, and consistency needs a way to asymptotically shrink this bias to zero. Result: ill-posedness leads to slower convergence rates. Example: estimation of the density function f(w) = dF(w)/dw. There is no continuous g such that f = g(F). We can't take the derivative of F̂(w) = Σ_{i=1}^n I(Wi ≤ w)/n. Kernel estimation is a regularization. An example is f̂(w) = (F̂(w + b) − F̂(w − b))/(2b), where we need b → 0 as n → ∞. F̂ converges at rate n^(1/2); the kernel estimator f̂ converges at rate n^(2/5).
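The regularized estimator on this slide is easy to code directly. A minimal sketch, assuming W is standard normal so the true density at 0 is φ(0) ≈ 0.3989: estimate F by the empirical CDF and the density by the difference quotient f̂(w) = (F̂(w+b) − F̂(w−b))/(2b) with a fixed bandwidth b.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=100_000)       # assumed DGP: standard normal draws

def F_hat(w):
    # Empirical distribution function at w.
    return np.mean(W <= w)

def f_hat(w, b):
    # Regularized density estimate: difference quotient of the empirical CDF.
    return (F_hat(w + b) - F_hat(w - b)) / (2 * b)

true_f0 = 1 / np.sqrt(2 * np.pi)   # phi(0), about 0.3989
print(f_hat(0.0, b=0.1), true_f0)  # b must shrink to 0 as n grows for consistency
```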

Nonparametric regression is similarly ill-posed. Regularization is the kernel and bandwidth choice, or the selection of basis functions and the number of terms to include in nonparametric sieve estimators. In some problems, ill-posedness is more severe, causing much slower convergence rates, like ln(n). Examples of problems that are often severely ill-posed are: nonparametric instrumental variables, e.g., Newey and Powell (2003); models containing mismeasured variables where the measurement error distribution is unknown; and random coefficient models where the distribution of the random coefficients is nonparametrically estimated.

7.4 Bayesian Identification. Ordinary identification is sometimes called frequentist identification or sampling identification, to contrast with Bayesian identification. Bayesian: the parameter θ is random, and has a prior and a posterior. Lindley (1971): identification is irrelevant for Bayesians, since it has no effect on whether one can specify a prior and obtain a posterior. Florens, Mouchart, and Rolin (1990) and Florens and Simoni (2011): θ is Bayes identified if its posterior differs from its prior distribution. So θ is Bayes identified if the data provide any information that updates the prior.

Ordinary identification implies Bayes identification, because then the population updates the prior to a degenerate distribution. Set identified parameters are also typically Bayes identified: enough information to confine θ to a set should be enough to update a prior. Example: Moon and Schorfheide (2012): when θ is set identified, the support of the posterior will typically lie inside the identified set. Intuition: the data tell us nothing about where θ could lie inside the identified set, but the prior provides information inside the set, and the posterior reflects that information.

8. Conclusions. Why is identification a rapidly growing area within econometrics (as evidenced by the growing terminology)? The rise of big data and computation is creating ever more complicated models needing identification. Examples of increasing complexity: games and auction models, social interactions models, forward looking dynamic models, and models with nonseparable errors assigning behavioral meaning to unobservables. At the other extreme, the so-called credibility revolution: the search for sources of randomization in constructed or natural experiments, to gain identification under minimal modeling assumptions.

Unlike statistical inference, there is not a large body of general tools or techniques for proving identification. Identification theorems tend to be highly model specific. Some general techniques for identification: control function methods as in Blundell and Powell (2004), special regressors as in Lewbel (2000), completeness as in Newey and Powell (2003), observational equivalence characterizations as in Matzkin (2005), and characterizations of moments over unobservables as in Schennach (2014). Development of more such general techniques and principles would be a valuable area for future research. Given the increasing recognition of the importance of identification in econometrics, the identification zoo is likely to keep expanding.