Causal Inference on Total, Direct, and Indirect Effects


Rolf Steyer, Axel Mayer and Christiane Fiege

In: A. Michalos (ed.), Encyclopedia of Quality of Life Research. Version: May 27,

Abstract

The theory of causal effects (TCE) is a mathematical theory providing a methodological foundation for the design and analysis of experiments and quasi-experiments. TCE consists of two parts. In the first part, total, direct, and indirect effects are defined; the second part deals with causal inference, i. e., it shows how causal effects are identified by estimable quantities. Each part has two levels, a disaggregated and a re-aggregated one. In the definition part of TCE, the disaggregated level is called the atomic level. In this part we translate J. St. Mill's ceteris paribus clause into probabilistic concepts. For this purpose, we introduce a temporal order between events and/or random variables using the concept of a filtration. Defining an atomic total-effect variable, we isolate the effects of X on Y, controlling for all variables that are prior or simultaneous to X, while ignoring all intermediate variables between X and Y. In contrast, in the definition of an atomic t-direct-effect variable, we ignore all intermediate variables between t and Y, but control for all variables (potential confounders) that are prior or simultaneous to t. At the second level of the definition part of TCE, we aggregate these atomic effects, defining average effects as expectations and conditional effects as conditional expectations of the corresponding atomic effect variables. In the identification part of TCE we connect causal effects to estimable quantities, namely conditional expectations of Y given X, or of Y given X, covariates, and/or intermediate variables, via the unbiasedness assumption. At the disaggregated level of the identification part, we present a number of causality conditions, i. e., conditions that imply unbiasedness and identifiability of causal effects. At this level we condition on covariates such that one of these causality conditions holds.
Once identification of conditional causal effects is achieved by controlling for these covariates, we can again re-aggregate, taking expectations and/or conditional expectations of the conditional causal effects obtained at the disaggregated level. In this way we can coarsen the conditional effects obtained at the disaggregated level of the identification part. TCE has implications for the design and data analysis of empirical studies aimed at estimating and testing causal effects. The most important design techniques are randomization, conditional randomization, and covariate selection. All these design techniques aim at satisfying one of the causality conditions. Techniques of data analysis can also be selected and/or developed guided by TCE (for more details see the Conclusion).

1 Research Traditions

Studying the effect of a variable X on a variable Y, we distinguish between total, direct, and indirect effects (Wright, 1921, 1923). In a randomized experiment, the average total treatment effect is typically estimated, which is the average causal effect of a treatment variable X on a response variable Y, irrespective of mediation processes. As soon as we want to gain insight into transmitting pathways, intermediate variables have to be included in order to estimate direct effects of X on Y. Direct effects represent those parts of total effects that are not transmitted through the intermediate variables. In contrast, indirect effects are those components of total effects of X on Y that are not direct but are transmitted through mediators. This article aims at sharpening these intuitive ideas by presenting the stochastic theory of causal effects, which emerged from several research traditions.

The most important contribution of the Neyman-Rubin tradition (see, e. g., Splawa-Neyman, 1923/1990; Rubin, 2005) is its emphasis on defining causal effects such as

individual, conditional, and average treatment effects. Defining such effects is important for proving that certain methods of data analysis yield unbiased estimates of these effects if certain assumptions can be made. Are there conditions under which the analysis of change scores (between pre- and post-tests) and repeated-measures analysis of variance yield causal effects? Under which conditions do we test causal effects in the analysis of covariance? Which are the assumptions under which propensity score methods yield unbiased estimates of causal effects? Answers to all these questions presuppose that we have a clear-cut definition of causal effects.

The Campbellian tradition (see, e. g., Shadish, Cook, & Campbell, 2002), less formalized than the Neyman-Rubin tradition, addresses questions and problems beyond causality itself, which are also relevant in empirical causal research, such as: How to generalize beyond the study? What does the treatment variable mean? Which is the causal agent in a compound treatment variable comprising many aspects? What is the meaning of the response variable? Does it in fact measure the construct of interest? And, perhaps the most important question: Are there alternative explanations for the effects?

In the graphical modeling tradition (see, e. g., Pearl, 2009; Spirtes, Glymour, & Scheines, 2000), techniques have been developed for estimating causal effects, finding confounders, identifying causal effects, and searching for causal models if specific assumptions can be made. The fact that a randomized experiment does not guarantee the validity of causal inference on direct effects has been brought up by this research tradition. The structural equation modeling and psychometrics tradition showed how to use latent variables and structural equation models in testing causal hypotheses.
Although many scientists hope to find causal answers via structural equation modeling, it should be clearly stated that structural equation modeling (and this is also true for graphical modeling and other kinds of statistical modeling, including analysis of variance) neither necessarily means estimating and testing causal effects, nor provides a satisfactory theory of causal effects. Nevertheless, this research tradition contributes, just like graphical modeling and other areas of statistics, many techniques and tools that are useful in the analysis of causal effects.

Mediation analysis has roots in genetics, psychology, sociology, and epidemiology. Mediation is concerned with analyzing total, direct, and indirect effects. To date, much substantive research applying mediation analysis is based on influential papers by Baron and Kenny (1986), Bollen (1987), MacKinnon (2008), Preacher, Rucker, and Hayes (2007), and Sobel (1982), which are based on the original work of Sewall Wright cited above. Recently, ideas from the causal inference literature have entered the discourse on mediation analysis. Questions like "Is the effect of X on Y causally transmitted through a mediator?" or "What is the causal direct effect of X on Y?" have been raised in this field. Early concerns about the causal interpretability of mediation effects were already expressed by Judd and Kenny (1981) as well as Holland (1988).

All these research traditions (as well as others not mentioned) contributed to our knowledge about causal inference. In this article, we present a unified stochastic theory of causal effects, focusing on experimental and quasi-experimental designs in which the putative cause is a discrete random variable (see also Mayer, Thoemmes, Rose, Steyer, & West, 2012). It is presumed that the reader is familiar with some fundamental concepts of measure and probability theory as provided in textbooks such as Bauer (1996), Bauer (2001), Klenke (2008), or Steyer, Nagel,

Partchev, and Mayer (in press).

2 Preliminary Considerations

2.1 Basic Idea of Causal Effects

For the time being, consider a random variable X with two values, 0 and 1, assuming P(X=0), P(X=1) > 0. These values of X may represent two treatment (intervention, exposition) conditions, e. g., a treatment and a control. Therefore, X will be called a treatment variable. The random variable Y (assessing, e. g., quality of life) will be called the response variable and is assumed to have a finite expectation. This assumption implies that the regression E(Y | X) is defined, that the conditional expectations E(Y | X=1) and E(Y | X=0) exist and are finite, and that the difference E(Y | X=1) − E(Y | X=0) between the two conditional expectations of Y in the two treatment conditions is defined. Note that E(Y | X) can be written as a linear function α_0 + α_1·X with slope α_1 = E(Y | X=1) − E(Y | X=0).

However, the difference E(Y | X=1) − E(Y | X=0) is not necessarily identical to the (total) causal treatment effect comparing treatment (x=1) to control (x=0). The crucial problem is that X and Y may both depend on a covariate Z. In this case a regressive dependence of Y on X would be observed, i. e., α_1 ≠ 0 (see left-hand side of Fig. 1), even though Y is regressively independent of X given Z (see right-hand side of Fig. 1). In such a case, the slope of the regression, i. e., α_1 = E(Y | X=1) − E(Y | X=0), does not describe the total causal effect of X on Y. What would be the remedy? Clearly, if Z were the only variable biasing the dependence of Y on X, then keeping Z constant at one of its values z would eliminate this spurious dependence of Y on X. In this case, the differences E(Y | X=1, Z=z) − E(Y | X=0, Z=z) would describe the (Z=z)-conditional total treatment effects of X on Y. Furthermore, taking the expectation of these (Z=z)-conditional effects over the distribution of Z would yield the average total treatment effect.
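The difference between the unadjusted contrast E(Y | X=1) − E(Y | X=0) and the expectation of the (Z=z)-conditional effects can be illustrated numerically. The following Python sketch uses freely invented probabilities for a binary covariate Z; the numbers are not taken from this article's examples:

```python
# Hypothetical joint distribution over binary Z, X, Y; all numbers are
# invented for illustration and are NOT taken from this article's examples.
p_z = {0: 0.5, 1: 0.5}                 # P(Z=z)
p_x1 = {0: 0.2, 1: 0.8}                # P(X=1 | Z=z): Z drives treatment uptake
e_y = {(0, 0): 0.3, (1, 0): 0.5,       # E(Y | X=x, Z=z)
       (0, 1): 0.6, (1, 1): 0.8}

def p_x_given_z(x, z):
    return p_x1[z] if x == 1 else 1 - p_x1[z]

def e_y_given_x(x):
    """E(Y | X=x): mixes the Z-strata with weights P(Z=z | X=x)."""
    num = sum(p_z[z] * p_x_given_z(x, z) * e_y[(x, z)] for z in p_z)
    den = sum(p_z[z] * p_x_given_z(x, z) for z in p_z)
    return num / den

# Unadjusted contrast vs. expectation (over Z) of the conditional effects.
naive = e_y_given_x(1) - e_y_given_x(0)
ate = sum(p_z[z] * (e_y[(1, z)] - e_y[(0, z)]) for z in p_z)
print(round(naive, 2), round(ate, 2))   # .38 vs .2
```

With these numbers the unadjusted difference is .38, whereas the expectation of the (Z=z)-conditional effects is only .2, so ignoring Z overstates the treatment effect.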
Unfortunately, in experiments and quasi-experiments in psychology and other social sciences, there are many variables that may create bias. Conceptually, however, and for the purpose of defining atomic and various aggregated (average, conditional) causal effects, all these variables can be controlled, as will be shown in section 3. Of course, in empirical applications, controlling all these variables that may create bias is a challenge.

2.2 Conceptual Framework

Probability Space and Random Variables

In the sequel, the following kind of random experiment is considered:

(a) sample a person u out of a set of persons (the population of persons),
(b) observe the value z of a fallible (possibly multivariate) pre-treatment variable Z,
(c) assign the unit, or observe its assignment, to one of several experimental conditions (represented by the values x of the treatment variable X),

(d) observe the value m of a (possibly multivariate) intermediate variable M,
(e) observe the value y of the response variable Y.

Figure 1. Path diagrams representing the regression E(Y | X) (on the left) as well as E(X | Z) and E(Y | X, Z) (on the right).

This kind of random experiment is called a single-unit trial. It is the kind of empirical phenomenon typically considered in that part of the stochastic theory of causality that is devoted to causal effects in experiments and quasi-experiments. It does not represent a sampling process, which consists of repeating such a single-unit trial many times in some way or other and which would be considered in treating estimation and hypothesis testing. These issues are not considered in this article. Like all other random experiments, a single-unit trial as described above is represented by a probability space (Ω, A, P), which is the formal framework necessary to define, e. g., random variables, distributions, conditional expectations (see, e. g., Bauer, 2001, Klenke, 2008, or Steyer, Nagel, et al., in press), and causal effects.

Example 1: Joe and Ann With Self-Selection

The first column of Table 1 shows all eight possible outcomes ω ∈ Ω of a very simple random experiment: Sample a person from the set Ω_U := {Joe, Ann}, observe whether (yes) or not (no) the sampled person selects treatment, and whether (+) or not (−) the sampled person reaches a success criterion. This example suffices to illustrate many concepts used in this article. The eight triples displayed in the first column are the elements of the set Ω. The set of all subsets of Ω, the power set P(Ω), is chosen as the σ-algebra A. (For the concept of a σ-algebra see, e. g., chapter 1 of Steyer, Nagel, et al., in press.) It has 2^8 = 256 elements, representing all events that can be considered in this random experiment.
The second column displays the probabilities of all elementary events {ω}, ω ∈ Ω. These eight probabilities can be used to compute the probabilities of all 256 events using the additivity of the probability measure P: A → [0, 1] (see Steyer, Nagel, et al., in press, Def. 4.1). For example, the probability of the event that Joe is sampled and treated is P[{(Joe, yes, −), (Joe, yes, +)}] = P[{(Joe, yes, −)}] + P[{(Joe, yes, +)}] = .004 + .016 = .02. In fact, all other parameters displayed in Table 1 can be computed from the probabilities of the eight elementary events. Alternatively, all probabilities displayed in this table, including those for the elementary events, can be computed from the eight parameters of the first experiment displayed in Table 2.
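The additivity computation above can be reproduced mechanically. The following sketch encodes the eight elementary probabilities of Table 1 and sums them over an event:

```python
# The eight elementary outcomes of Example 1 and their probabilities (Table 1);
# an outcome is (person, selects treatment?, success?).
P = {
    ("Joe", "no", "-"): .144, ("Joe", "no", "+"): .336,
    ("Joe", "yes", "-"): .004, ("Joe", "yes", "+"): .016,
    ("Ann", "no", "-"): .096, ("Ann", "no", "+"): .024,
    ("Ann", "yes", "-"): .228, ("Ann", "yes", "+"): .152,
}
assert abs(sum(P.values()) - 1.0) < 1e-12   # the probabilities sum to 1

def prob(event):
    """Probability of an event by additivity: sum over its elementary events."""
    return sum(P[w] for w in event)

# The event that Joe is sampled and treated: .004 + .016 = .02.
joe_treated = {w for w in P if w[0] == "Joe" and w[1] == "yes"}
print(prob(joe_treated))
```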

Table 1. Joe and Ann with self-selection

Outcome ω       P({ω})  U    X  Y  E(Y|X,U)  E(Y|X)  E(X|U)  E^{X=0}(Y|U)  E^{X=1}(Y|U)  P^{X=0}({ω})  P^{X=1}({ω})
(Joe, no, −)    .144    Joe  0  0  .7        .60     .04     .7            .8            .24           0
(Joe, no, +)    .336    Joe  0  1  .7        .60     .04     .7            .8            .56           0
(Joe, yes, −)   .004    Joe  1  0  .8        .42     .04     .7            .8            0             .01
(Joe, yes, +)   .016    Joe  1  1  .8        .42     .04     .7            .8            0             .04
(Ann, no, −)    .096    Ann  0  0  .2        .60     .76     .2            .4            .16           0
(Ann, no, +)    .024    Ann  0  1  .2        .60     .76     .2            .4            .04           0
(Ann, yes, −)   .228    Ann  1  0  .4        .42     .76     .2            .4            0             .57
(Ann, yes, +)   .152    Ann  1  1  .4        .42     .76     .2            .4            0             .38

Note: The values in the regression columns are computed from the elementary probabilities in the second column.

Table 1 also displays several random variables, e. g., the observable random variables (the observables) U, X, and Y. The first, U, will be called the person variable. Its values are the persons sampled in the random experiment considered. The second, X, called the treatment variable, indicates whether or not the person sampled is treated. The third, Y, is called the response variable. In this example, it indicates whether or not the patient sampled gives a positive statement about his or her quality of life, six months after treatment. Other random variables are the regressions (synonymously, conditional expectations) E(Y | X, U), E(Y | X), and E(X | U). Because Y is a dichotomous regressand with values 0 and 1, these regressions are also denoted by P(Y=1 | X, U), P(Y=1 | X), and P(X=1 | U), respectively.

All the random variables mentioned above are mappings on Ω with values in a subset of the set ℝ of real numbers, except for U, which takes on its values in the set {Joe, Ann}. By definition, all random variables on (Ω, A, P) are measurable with respect to A and have a distribution denoted by P_U, P_X, P_Y, etc. (see chapter 5 of Steyer, Nagel, et al., in press). Let us use X to illustrate the concept of measurability with respect to A. First note that X: Ω → Ω_X is a mapping with domain Ω and range Ω_X = {0, 1}. Furthermore, the definition of a random variable does not only presuppose that there is a σ-algebra on Ω, but also a σ-algebra on Ω_X. In our example, this σ-algebra on Ω_X is A_X = {Ω_X, Ø, {0}, {1}}.
The definition of a random variable on (Ω, A, P) requires that

X^{-1}(A′) ∈ A,  for all A′ ∈ A_X,  (1)

where X^{-1}(A′) := {ω ∈ Ω: X(ω) ∈ A′} is the inverse image of A′ under X. In our example, (1) is trivially true, because A is the power set P(Ω). In examples in which A ≠ P(Ω) (and this is the case as soon as continuous random variables are involved), this requirement is not trivial. If it holds, then it follows that all events X^{-1}(A′), A′ ∈ A_X, have a probability, namely P[X^{-1}(A′)], because P is a mapping on

A, assigning a probability to all its elements. This fact is used to define the distribution of X by P_X: A_X → [0, 1] with

P_X(A′) = P[X^{-1}(A′)],  A′ ∈ A_X.  (2)

In our example, the distribution of X is specified by P_X({0}) = P[X^{-1}({0})] = P[{(Joe, no, −), (Joe, no, +), (Ann, no, −), (Ann, no, +)}] = .144 + .336 + .096 + .024 = .6. Analogously, P_X({1}) = .4, P_X(Ω_X) = 1, and P_X(Ø) = 0. Finally, the set

X^{-1}(A_X) := {X^{-1}(A′): A′ ∈ A_X}  (3)

is called the σ-algebra generated by X and is also denoted by σ(X). In our example, σ(X) = {Ω, Ø, X^{-1}({0}), X^{-1}({1})} = {Ω, Ø, {(Joe, no, −), (Joe, no, +), (Ann, no, −), (Ann, no, +)}, {(Joe, yes, −), (Joe, yes, +), (Ann, yes, −), (Ann, yes, +)}}.

Table 2. Four random experiments with Joe and Ann, compressed

Random experiment          u     P(U=u)  E^{X=0}(Y|U=u)  E^{X=1}(Y|U=u)  P(X=1|U=u)
1. With self-selection     Joe   .5      .7              .8              .04
                           Ann   .5      .2              .4              .76
2. No treatment for Joe    Joe                           99              0
                           Ann
3. With random assignment  Joe
                           Ann
4. Homogeneous             Joe
                           Ann

Filtration and Temporal Order

In contrast to many other random experiments, in a single-unit trial as described in section 2.2 there is additional structure: There are events that are prior to the treatment variable X, such as the event that Joe is sampled or the event that the person sampled is male. Random variables may also be prior to X, such as the fallible pre-test Z = quality of life before treatment, or the person variable U (taking on the values Joe, Ann, Jim, etc.) itself, which is prior to Z, because the person, his or her sex, race, etc. are determined before a fallible value z of Z is assessed. Furthermore, the response Y represents events, such as {Y=y} := {ω ∈ Ω: Y(ω) = y}, that may occur after treatment. Hence, Y is posterior to X.
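The values of P_X and the four elements of σ(X) derived in this section can be verified with a short script (elementary probabilities as in Table 1):

```python
# Elementary probabilities of Example 1 (Table 1); an outcome is
# (person, selects treatment?, success?).
P = {
    ("Joe", "no", "-"): .144, ("Joe", "no", "+"): .336,
    ("Joe", "yes", "-"): .004, ("Joe", "yes", "+"): .016,
    ("Ann", "no", "-"): .096, ("Ann", "no", "+"): .024,
    ("Ann", "yes", "-"): .228, ("Ann", "yes", "+"): .152,
}

def X(w):                      # treatment variable: 1 if treated, else 0
    return 1 if w[1] == "yes" else 0

def inv(A):                    # inverse image X^{-1}(A), A a subset of {0, 1}
    return frozenset(w for w in P if X(w) in A)

def P_X(A):                    # distribution of X: P_X(A) = P[X^{-1}(A)]
    return sum(P[w] for w in inv(A))

print(P_X({0}))                # .6, as computed in the text

# sigma(X): the inverse images of the four events in A_X
sigma_X = {inv(A) for A in (set(), {0}, {1}, {0, 1})}
print(len(sigma_X))            # 4 events: Omega, the empty set, and the two halves
```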

In more formal and general terms, this temporal order can be represented by a filtration (F_t)_{t∈T} in A, which is a fundamental concept in the theory of stochastic processes (see Table 3 and, e. g., Bauer, 1996; Klenke, 2008; Øksendal, 2007). In many applications it is sufficient to consider a filtration with an index set T = {1, ..., n_T}, where n_T is a natural number > 1. In other applications, T might be a subset of the set of real numbers.

Example 1 continued

In Example 1, we define F_1 := σ(U), F_2 := σ(U, X), and F_3 := σ(U, X, Y). Hence, in this example, the filtration (F_t)_{t∈T} consists of n_T = 3 σ-algebras: F_1 has only four elements, the event that Joe is sampled, the event that Ann is sampled, Ω (Joe or Ann is sampled), and Ø (neither Joe nor Ann is sampled). The σ-algebra F_2 has 2^4 = 16 elements: all elements in F_1, the event that the person sampled is treated, the event that the person sampled is not treated, as well as events such as Joe (Ann) is sampled and (not) treated. Finally, F_3 has 2^8 = 256 elements. It is identical to the power set of Ω and contains as elements all events that can be considered in this random experiment.

2.3 Preliminary Definitions

Order With Respect to a Filtration

Figure 2 depicts a filtration with n_T = 5, and it also shows in which σ-algebra F_t the events {U=u}, {Z=z}, {X=x}, etc. occur for the first time. For example, {Z=z} ∉ F_1 but {Z=z} ∈ F_2, and {X=x} ∉ F_2 but {X=x} ∈ F_3, etc. Using such a filtration, one can easily define terms such as U is prior to X, X is prior to Y, and X_1 is simultaneous to X_2, e. g., if X_2 is a second treatment variable and the second treatment is applied at the same time as the first one. The idea is to see in which σ-algebra F_t events such as {X=x}, {Z=z}, or {Y=y} occur for the first time. Using this criterion for the kind of single-unit trial described above, X is prior to M, which itself is prior to Y, whereas X is posterior to U and Z.
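On a finite Ω, the σ-algebra generated by a mapping is the collection of all unions of blocks of the partition the mapping induces, so the sizes 4, 16, and 256 of F_1, F_2, and F_3 in Example 1 can be computed directly:

```python
from itertools import combinations

# The eight outcomes of Example 1: (person, selects treatment?, success?).
Omega = [(u, x, y) for u in ("Joe", "Ann") for x in ("no", "yes") for y in ("-", "+")]

def generated_sigma_algebra(f):
    """Sigma-algebra generated by a mapping f on the finite space Omega:
    all unions of blocks of the partition {f^{-1}(v)}; its size is 2**(#blocks)."""
    blocks = {}
    for w in Omega:
        blocks.setdefault(f(w), set()).add(w)
    blocks = list(blocks.values())
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            events.add(frozenset().union(*combo))
    return events

F1 = generated_sigma_algebra(lambda w: w[0])          # sigma(U)
F2 = generated_sigma_algebra(lambda w: (w[0], w[1]))  # sigma(U, X)
F3 = generated_sigma_algebra(lambda w: w)             # sigma(U, X, Y)

print(len(F1), len(F2), len(F3))   # 4 16 256
assert F1 <= F2 <= F3              # a filtration: F_1 ⊆ F_2 ⊆ F_3
```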
Using the concept of the σ-algebra generated by a random variable V [denoted σ(V)] (see, e. g., Klenke, 2008), this idea is defined in more formal terms in Table 3. The σ-algebras generated by X, Y, etc. are subsets of the corresponding σ-algebras F_t. In contrast, the events {X=x}, {Y=y}, etc. are elements of the corresponding F_t (see again Fig. 2).

Global Covariates

The concept of a global t-covariate of X defined in Table 3 is crucial. It is denoted by C_{X,t} and will be used to define true-outcome variables and atomic causal effects, i. e., effects on the most fine-grained level (see section 3). Note that there are several time points t ∈ T with respect to which a global covariate of X can be considered. For example, defining atomic total effects, we control for C_{X,t_X}, which is defined such that it comprises all variables other than X that are prior or simultaneous to X. In contrast, defining atomic direct effects with respect to time t, we control for C_{X,t}, which is defined such that it comprises all variables other than X that are prior, simultaneous, or posterior to X, but not posterior to t (see Table 3). More precisely, C_{X,t} is a random variable on (Ω, A, P) and its most important property is σ(C_{X,t}, X) = F_t, i. e., C_{X,t} and X together generate F_t. The second assumption ensures that X is not comprised in C_{X,t} [i. e., σ(X) ⊄ σ(C_{X,t})], and the third

assumption implies that C_{X,t} is simultaneous or posterior to X and prior to Y. Intuitively speaking, C_{X,t} comprises all random variables on (Ω, A, P) that are prior or simultaneous to t, except for X itself, i. e., it comprises all potential confounders that could possibly bias t-direct effects (pertaining to pairs of values) of X on Y.

Figure 2. Venn diagram of a filtration with T = {1, ..., 5}: {U=u} ∈ F_1 ⊇ σ(U), {Z=z} ∈ F_2 ⊇ σ(Z), {X=x} ∈ F_3 = F_{t_X} ⊇ σ(X), {M=m} ∈ F_4 = F_{t_M} ⊇ σ(M), and {Y=y} ∈ F_5 ⊇ σ(Y).

Covariates and Intermediate Variables

In this framework, a t-covariate of X is defined as any random variable on (Ω, A, P), say Z_t, with σ(Z_t) ⊆ σ(C_{X,t}). This implies: all events A ∈ A that are represented by a t-covariate of X, such as {Z_t = z_t}, are elements of F_t. Similarly, using the filtration, (t_1, t_2)-intermediate variables can also be defined (see Table 3). Note again that T may also be a continuous (time) set.

Simplified Notation

For simplicity, the terms covariate of X and t_X-covariate will be used as synonyms. Similarly, C_X := C_{X,t_X} and Z := Z_{t_X} denote a global t_X-covariate of X (or simply, a global covariate of X) and a t_X-covariate of X (or simply, a covariate of X), respectively. In single-unit trials in which no fallible covariates of X are assessed, U can be a global covariate of X. Considering a random experiment in which a fallible covariate of X is assessed, (U, Z) can be a global covariate of X, where Z denotes the (possibly multivariate) random variable consisting of all fallible covariates of X.

Table 3. Framework and preliminary concepts

Let X, Y, and W be random variables on a probability space (Ω, A, P).

Filtration in A: A family (F_t)_{t∈T} of σ-algebras F_t ⊆ A such that F_s ⊆ F_t if s ≤ t.

X is prior to Y: X is called prior to Y (and Y posterior to X) in (F_t)_{t∈T}, if there is an s ∈ T such that σ(X) ⊆ F_s, σ(Y) ⊄ F_s, and there is a t ∈ T, s < t, such that σ(Y) ⊆ F_t.

X is simultaneous to Y: X is called simultaneous to Y in (F_t)_{t∈T}, if there is a t ∈ T such that σ(X) ⊆ F_t, σ(Y) ⊆ F_t, and no s ∈ T, s < t, such that σ(X) ⊆ F_s or σ(Y) ⊆ F_s.

Global t-covariate of X (t_X ≤ t < t_Y): A random variable denoted C_{X,t} satisfying: (a) σ(X, C_{X,t}) = F_t, (b) the product measure P_X ⊗ P_{C_{X,t}} exists, and (c) t_X ≤ t < t_Y, where t_X ∈ T is defined by σ(X) ⊆ F_{t_X} and σ(X) ⊄ F_t if t < t_X. (t_Y is defined in the same way, replacing X by Y.)

t-covariate of X: A random variable Z_t on (Ω, A, P) with σ(Z_t) ⊆ σ(C_{X,t}).

Global covariate of X: A random variable C_X on (Ω, A, P) such that C_X := C_{X,t_X}.

Covariate of X: A random variable Z on (Ω, A, P) such that σ(Z) ⊆ σ(C_X).

(t_1, t_2)-intermediate variable: A random variable M on (Ω, A, P) such that σ(M) ⊄ F_{t_1} and there exists a t ∈ T, t < t_2, such that σ(M) ⊆ F_t.

Causality space with discrete cause: A quadruple ((Ω, A, P), (F_t)_{t∈T}, X, Y) satisfying: (a) (F_t)_{t∈T} is a filtration in A, (b) X is discrete with values in Ω_X = {0, 1, ..., n}, and P(X=x) > 0 for all x ∈ Ω_X, (c) Y is numerical with finite expectation E(Y), and (d) X is prior to Y in (F_t)_{t∈T}.

Causality Space

Throughout the rest of this article we assume that there is a causality space with discrete cause (see Table 3). Such a causality space provides the formal framework in which causal effects can be defined.

Example 1 continued

In section 2.2, the filtration (F_t)_{t∈T} has been specified for Example 1. Using the definitions displayed in Table 3 yields: U is prior to X and to Y, and X is prior to Y.
Furthermore, 1_{Joe} is simultaneous to U, where 1_{Joe} denotes the indicator variable of the event that Joe is sampled. It takes on the value 1 if Joe is sampled, and 0 otherwise. In this example, U and 1_{Joe} are global covariates of X, and U, 1_{Joe}, and 1_{male} are covariates of X, where 1_{male} is the indicator variable of the event that the sampled person is male. (In this specific example with only one male and one female person, 1_{Joe} = 1_{male}.) If we also wanted to consider intermediate variables, Table 1 would have to be extended to include at least one intermediate variable, such

as quality of life three months after treatment. The filtration (F_t)_{t∈T} would have to be extended correspondingly. Hence, now all components of a causality space ((Ω, A, P), (F_t)_{t∈T}, X, Y) defined in Table 3 have been illustrated.

3 Causal Effects

In this section the definitions of the adjusted conditional expectations and causal effects displayed in Table 4 are explained and illustrated.

3.1 (X=x)-Conditional Probability Measure

A fundamental concept used in the definitions in Table 4 is the (X=x)-conditional probability measure P^{X=x}. Assume P(X=x) > 0 for x ∈ Ω_X. Then the (X=x)-conditional probability measure on (Ω, A) is defined by

P^{X=x}(A) := P(A | X=x),  A ∈ A,  (4)

where x ∈ Ω_X = {0, 1, ..., n}. Hence, for each value x of X there is such a probability measure. The last two columns of Table 1 display the values of P^{X=0} and P^{X=1} for all elementary events {ω} in Example 1. Because distributions, expectations, conditional expectations, etc. all refer to a probability measure, each of these measures defines distributions, expectations, conditional expectations, etc. with respect to it. Hence, P_Y^{X=x} will denote the distribution, E^{X=x}(Y) the expectation, and E^{X=x}(Y | Z) the Z-conditional expectation of Y with respect to the measure P^{X=x}. [Chapter 13 of Steyer, Nagel, et al., in press, provides an extensive presentation of E^{X=x}(Y | Z).]

3.2 True-Outcome Variable With Respect to t

As already mentioned, C_{X,t}, a global t-covariate of X, comprises all variables that are prior or simultaneous to t, except for X. Hence, conditioning on C_{X,t}, all potential confounders of t-direct effects are controlled. Now consider the C_{X,t}-conditional expectation of Y with respect to P^{X=x}. For t ∈ T, we define a version of the true-outcome variable τ_{x,t} with respect to t by

τ_{x,t} := E^{X=x}(Y | C_{X,t}),  x ∈ Ω_X.  (5)

Hence, intuitively speaking, considering such a true-outcome variable τ_{x,t}, conditioning is on X and all other variables that are prior or simultaneous to t. What still varies and may affect Y are measurement errors of Y, but also effects of variables that are in between t and t_Y. If t = t_X and U takes the role of C_{X,t}, then τ_{x,t} is analogous to Rubin's potential outcome (see, e. g., Rubin, 2005).

P^{X=x}-Uniqueness and P-Uniqueness

In general, conditional expectations are not uniquely defined. Hence, there is a set 𝓔^{X=x}(Y | C_{X,t}) of such conditional expectations. However, if τ_{x,t}, τ′_{x,t} ∈ 𝓔^{X=x}(Y | C_{X,t}) are two such versions, then they are P^{X=x}-equivalent, i. e.,

τ_{x,t} =_{P^{X=x}} τ′_{x,t},  (6)

Table 4. Adjusted conditional expectations and t-direct-effect functions

Let ((Ω, A, P), (F_t)_{t∈T}, X, Y) be a causality space with discrete cause, let C_{X,t} be a global t-covariate of X, let W be a random variable on (Ω, A, P), and let x, x′ ∈ Ω_X = {0, 1, ..., n} be two values of X.

𝓔^{X=x}(Y | C_{X,t}): The set of all versions of the C_{X,t}-conditional expectation of Y with respect to P^{X=x}.

E^{X=x}(Y | C_{X,t}): A version of the C_{X,t}-conditional expectation of Y with respect to P^{X=x}. A shortcut for E^{X=x}(Y | C_{X,t}) is τ_{x,t}.

δ_{xx′,t}: A version of the atomic t-direct-effect variable of x vs. x′. Assume: (a) there is a τ_{x,t} ∈ 𝓔^{X=x}(Y | C_{X,t}) with finite expectation E(τ_{x,t}) and a τ_{x′,t} ∈ 𝓔^{X=x′}(Y | C_{X,t}) with finite expectation E(τ_{x′,t}); (b) τ_{x,t} and τ_{x′,t} are P-unique. Assumption (a) implies that there is a finite τ_{x,t} ∈ 𝓔^{X=x}(Y | C_{X,t}) and a finite τ_{x′,t} ∈ 𝓔^{X=x′}(Y | C_{X,t}). Choosing two such finite versions, we define δ_{xx′,t} := τ_{x,t} − τ_{x′,t}. Assumption (b) implies that δ_{xx′,t} is P-unique.

E_{C_{X,t}}(Y | X=x): The C_{X,t}-adjusted (X=x)-conditional expectation of Y. If (a) and (b) hold, we define E_{C_{X,t}}(Y | X=x) := E(τ_{x,t}) and say that it exists.

ADE_{xx′,t}: The average t-direct effect of x vs. x′. Assuming that E_{C_{X,t}}(Y | X=x) and E_{C_{X,t}}(Y | X=x′) exist, we define ADE_{xx′,t} := E_{C_{X,t}}(Y | X=x) − E_{C_{X,t}}(Y | X=x′).

E_{C_{X,t}}(Y | X=x; W): A version of the C_{X,t}-adjusted (X=x, W)-conditional expectation of Y. If (a) and (b) hold, we define E_{C_{X,t}}(Y | X=x; W) := E(τ_{x,t} | W) and say that it exists.

CDE_{xx′,t}(W): A version of the W-conditional t-direct-effect function of x vs. x′. If (a) and (b) hold, then E_{C_{X,t}}(Y | X=x; W) and E_{C_{X,t}}(Y | X=x′; W) are P-unique; furthermore, under (a) and (b), there is a finite version E_{C_{X,t}}(Y | X=x; W) and a finite version E_{C_{X,t}}(Y | X=x′; W). Choosing two such finite versions, we define CDE_{xx′,t}(W) := E_{C_{X,t}}(Y | X=x; W) − E_{C_{X,t}}(Y | X=x′; W).

Note: Proofs of the propositions in this table are found in chapters 13 to 15 of Steyer, Nagel, et al. (in press).

which is a shortcut for

P^{X=x}({ω ∈ Ω: τ_{x,t}(ω) = τ′_{x,t}(ω)}) = 1.

Hence, Equation (6) means that τ_{x,t} and τ′_{x,t} take on identical values with (X=x)-conditional probability 1. In this case, E^{X=x}(Y | C_{X,t}) is said to be P^{X=x}-unique. Hence, P^{X=x}-uniqueness of E^{X=x}(Y | C_{X,t}) means that all versions τ_{x,t} ∈ 𝓔^{X=x}(Y | C_{X,t}) are pairwise P^{X=x}-equivalent. Note that P^{X=x}-uniqueness does not imply P-uniqueness of E^{X=x}(Y | C_{X,t}), i. e., it does not imply P-equivalence of τ_{x,t}, τ′_{x,t} ∈ 𝓔^{X=x}(Y | C_{X,t}), which is defined by

τ_{x,t} =_P τ′_{x,t}.  (7)

Again, (7) is a shortcut for

P({ω ∈ Ω: τ_{x,t}(ω) = τ′_{x,t}(ω)}) = 1.

The assumption that τ_{x,t} is P-unique plays a crucial role not only in the definition but also in the identification of causal effects. It implies that all versions τ_{x,t} ∈ 𝓔^{X=x}(Y | C_{X,t}) have identical distributions, and therefore also identical expectations, variances, and covariances with other random variables. P-uniqueness of τ_{x,t} is equivalent to

P(X=x | C_{X,t}) >_P 0,  (8)

which is defined by

P({ω ∈ Ω: P(X=x | C_{X,t})(ω) > 0}) = 1.

In our examples with Joe and Ann, in which U takes the role of C_{X,t_X}, requiring P(X=x | U) >_P 0 means that all persons must have a nonzero treatment probability, unless the person has a zero probability of being sampled. [See chapter 13 of Steyer, Nagel, et al., in press, for other conditions that are equivalent to P-uniqueness.]

Example 1 continued

In Example 1, U is a global t_X-covariate of X. Using the simplified notation C_{X,t} = C_X for the case t = t_X, we can also say that U is a global covariate of X. Table 1 displays the U-conditional expectations E^{X=0}(Y | U) and E^{X=1}(Y | U), which are identical to the total-effect true-outcome variables τ_0 and τ_1. In this example, these true-outcome variables are uniquely defined, and therefore they are also P-unique. They are random variables on (Ω, A, P), just like U, X, Y, and the other regressions such as E(Y | X), E(Y | X, U), and E(X | U).
Note that, by definition, τ_0 = E^{X=0}(Y | U) and τ_1 = E^{X=1}(Y | U) are measurable with respect to U, i. e., σ(τ_0) ⊆ σ(U) and σ(τ_1) ⊆ σ(U). This implies that there are functions g_0, g_1: {Joe, Ann} → ℝ such that τ_0 = g_0(U) and τ_1 = g_1(U) (see Steyer, Nagel, et al., in press). From a substantive point of view, this means that the values of τ_0 and τ_1 represent properties of the person u sampled in the random experiment considered, namely the conditional expectations E^{X=0}(Y | U=u) and E^{X=1}(Y | U=u).
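These true-outcome variables can be recomputed from the elementary probabilities of Table 1: τ_x evaluated at u is E^{X=x}(Y | U=u), obtained by conditioning within the (X=x, U=u) cell. A sketch:

```python
# Elementary probabilities of Example 1 (Table 1); an outcome is
# (person, selects treatment?, success?).
P = {
    ("Joe", "no", "-"): .144, ("Joe", "no", "+"): .336,
    ("Joe", "yes", "-"): .004, ("Joe", "yes", "+"): .016,
    ("Ann", "no", "-"): .096, ("Ann", "no", "+"): .024,
    ("Ann", "yes", "-"): .228, ("Ann", "yes", "+"): .152,
}
U = lambda w: w[0]
X = lambda w: 1 if w[1] == "yes" else 0
Y = lambda w: 1 if w[2] == "+" else 0

def tau(x, u):
    """tau_x evaluated at u: E^{X=x}(Y | U=u), a function g_x of the person."""
    cell = [w for w in P if X(w) == x and U(w) == u]
    return sum(P[w] * Y(w) for w in cell) / sum(P[w] for w in cell)

# tau_0 and tau_1 as functions of U: .7/.8 for Joe and .2/.4 for Ann.
for u in ("Joe", "Ann"):
    print(u, round(tau(0, u), 2), round(tau(1, u), 2))

# Average total effect E(tau_1) - E(tau_0) versus the unadjusted difference
# E(Y | X=1) - E(Y | X=0):
p_u = {u: sum(P[w] for w in P if U(w) == u) for u in ("Joe", "Ann")}
ate = sum(p_u[u] * (tau(1, u) - tau(0, u)) for u in p_u)

def e_y_given_x(x):
    cell = [w for w in P if X(w) == x]
    return sum(P[w] * Y(w) for w in cell) / sum(P[w] for w in cell)

naive = e_y_given_x(1) - e_y_given_x(0)
print(round(ate, 2), round(naive, 2))   # .15 vs -.18
```

In this example the unadjusted difference is even negative, although the conditional treatment effects are positive for both persons, which is exactly the kind of bias discussed in section 2.1.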

Example 2: No Treatment for Joe The second part of Table 2 displays a random experiment in which the causality space is identical to the one described in Example 1 except for the probability measure P. In this second example, τ_1 = E^{X=1}(Y | U) is not P-unique. The reason is that P(X=1 | U=Joe) = 0, whereas P(U=Joe) > 0. In this case, the value of E^{X=1}(Y | U) is not uniquely defined for ω ∈ {U=Joe}. Hence, E^{X=1}(Y | U=Joe) is an arbitrary real number. [In Table 2, the number 99 has arbitrarily been chosen. Although this number is not a conditional probability, it is fully in line with the general definition of a conditional expected value as the value of a factorization of a regression (see Steyer, Nagel, et al., in press, chapter 9).] The fact that E^{X=1}(Y | U=Joe) is arbitrary is not a problem by itself, because P(X=1 | U=Joe) = 0. However, together with P(U=Joe) > 0, it is a problem: It implies that E^{X=1}(Y | U) is not P-unique, which in turn implies, e. g., that different versions τ_1, τ'_1 ∈ E^{X=1}(Y | U) have different expectations, i. e., E(τ_1) ≠ E(τ'_1). In the same example, τ_0 = E^{X=0}(Y | U) is P-unique; it is even uniquely defined. This implies, e. g., that E(τ_0) is a uniquely defined number. Expectations such as E(τ_0) and E(τ_1) play a crucial role in the definition of average direct and total effects. However, P-uniqueness of the true-outcome variables is also required in the definition of atomic total- and direct-effect variables.

3.3 Atomic t-Direct-Effect Variable

Assumption (a) in Table 4 implies that there is a finite version τ_{x,t} ∈ E^{X=x}(Y | C_{X,t}) and a finite version τ_{x',t} ∈ E^{X=x'}(Y | C_{X,t}). Assuming P-uniqueness of τ_{x,t} and τ_{x',t} [see assumption (b) in that table] is a second prerequisite for the difference τ_{x,t} − τ_{x',t} to be meaningful. This assumption is equivalent to

P(X=x | C_{X,t}) >_P 0 and P(X=x' | C_{X,t}) >_P 0. (9)

It implies that τ_{x,t} − τ_{x',t} is P-unique.
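Example 2's non-uniqueness can be made concrete with a small numerical sketch. The marginal probabilities and the value for Ann below are borrowed from Example 1 for illustration only; they are assumptions, not the exact entries of Table 2.

```python
# Illustrative sketch of Example 2's non-uniqueness (assumed numbers:
# P(U=Joe) = P(U=Ann) = .5 and E^{X=1}(Y | U=Ann) = .40).
p_u = {"Joe": 0.5, "Ann": 0.5}

def expectation_tau1(value_for_joe):
    """E(tau_1) for one version of E^{X=1}(Y | U); the value on {U=Joe}
    is arbitrary, because P(X=1 | U=Joe) = 0."""
    tau1 = {"Joe": value_for_joe, "Ann": 0.40}
    return sum(tau1[u] * p_u[u] for u in p_u)

# Two versions that agree P^{X=1}-almost surely but not P-almost surely:
e1 = expectation_tau1(99.0)   # the arbitrary value 99 from Table 2
e2 = expectation_tau1(0.80)   # another admissible choice
print(e1, e2)                 # different expectations: tau_1 is not P-unique
```

Because P(U=Joe) > 0, the arbitrary value enters E(τ_1) with positive weight, so different admissible versions yield different expectations.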
Assuming (a) and (b) in Table 4, we choose finite versions of τ_{x,t} and τ_{x',t} and define a version of the atomic t-direct-effect variable by

δ_{xx',t} := τ_{x,t} − τ_{x',t}. (10)

This definition implies that δ_{xx',t} is P-unique and finite. If Z_t is a t-covariate of X, then, by definition, σ(Z_t) ⊂ σ(C_{X,t}) ⊂ F_t. Therefore,

E^{X=x}(Y | C_{X,t}) =_{P^{X=x}} E^{X=x}(Y | C_{X,t}, Z_t), for all x ∈ Ω_X.

This means that with C_{X,t} we control for all t-covariates of X. In intuitive terms: With C_{X,t}, all potential confounders of t-direct effects are controlled. In other words, an atomic t-direct-effect variable is defined such that it cannot be biased (cf. sections 2.1 and 4.1).

3.4 Atomic Total-Effect Variable

If t = t_X, we omit the index t, using

τ_x := E^{X=x}(Y | C_X), x ∈ Ω_X, (11)

and

δ_{xx'} := τ_x − τ_{x'}. (12)

The random variable τ_x is called a version of the total-effect true-outcome variable pertaining to x, whereas δ_{xx'} is called a version of the atomic total-effect variable of x vs. x'. Hence, an atomic total-effect variable is an atomic t_X-direct-effect variable. In the example presented in Table 1, the atomic total-effect variable δ_10 is identical to the difference E^{X=1}(Y | U) − E^{X=0}(Y | U), taking the value δ_10(ω) = .10 if ω ∈ {U=Joe} and the value δ_10(ω) = .20 if ω ∈ {U=Ann}. It is a random variable on the probability space (Ω, A, P), it is P-unique, and it is measurable with respect to U, i. e., σ(δ_10) ⊂ σ(U). In Example 2, the atomic total-effect variable δ_10 is not defined, because τ_1 = E^{X=1}(Y | U) is not P-unique.

3.5 Adjusted (X=x)-Conditional Expectation

As explained above, the true-outcome variables and the atomic-effect variables are defined such that they cannot be biased, because, with C_{X,t}, all variables that could induce bias are controlled. In general, in applications, neither the true-outcome variables nor the atomic-effect variables can be observed or estimated. However, expectations and conditional expectations of the true-outcome variables and atomic-effect variables can be estimated, provided that appropriate assumptions can be made (see section 4). Note that, although re-aggregated, these expectations and conditional expectations remain free of bias. In general, the expectations and conditional expectations of the atomic-effect variables just coarsen the effects; they do not introduce bias. The concept of a C_{X,t}-adjusted (X=x)-conditional expectation, denoted E^{C_{X,t}}(Y | X=x), is a good starting point. Under assumptions (a) and (b) in Table 4, it exists and is defined as the expectation E(τ_{x,t}) (see Table 4). Assumptions (a) and (b) in Table 4 imply that E(τ_{x,t}) is uniquely defined and finite, which also means that E(τ_{x,t}) does not depend on the choice of the version τ_{x,t} ∈ E^{X=x}(Y | C_{X,t}).
3.6 Average t-Direct Effect

If E^{C_{X,t}}(Y | X=x) and E^{C_{X,t}}(Y | X=x') exist, then the average t-direct effect of x vs. x' is defined by

ADE_{xx',t} := E^{C_{X,t}}(Y | X=x) − E^{C_{X,t}}(Y | X=x'). (13)

Note that

ADE_{xx',t} = E(δ_{xx',t}) = E(τ_{x,t}) − E(τ_{x',t}). (14)

3.7 Adjusted (X=x, W)-Conditional Expectation

So far, two extremes have been considered: the true-outcome variables and their differences, the atomic t-direct effects, on one side, and their expectations, the adjusted (X=x)-conditional expectations and their differences, the average t-direct effects, on the other side. Conditional t-direct effects are somewhere in between these two extremes. The basic idea is to consider a random variable W and the conditional expectations of the atomic t-direct effects given W. Because W can be multivariate, consisting of several univariate random variables W_1, ..., W_m, the degree of aggregation of the atomic t-direct effects depends on the choice of W. Note that W might also be continuous.

Table 5. Adjusted conditional expectations and total effects

Let ((Ω, A, P), (F_t)_{t∈T}, X, Y) be a causality space with discrete cause, let C_X be a global covariate of X, and let W be a random variable on (Ω, A, P).

E^{X=x}(Y | C_X): The set of all versions of the C_X-conditional expectation of Y with respect to P^{X=x}, where x ∈ Ω_X.
τ_x: A version of the total-effect true-outcome variable. τ_x := E^{X=x}(Y | C_X).
δ_{xx'}: A version of the atomic total-effect variable of x vs. x'. δ_{xx'} := δ_{xx',t_X}.
E^{C_X}(Y | X=x): A version of the C_X-adjusted (X=x)-conditional expectation of Y. E^{C_X}(Y | X=x) := E^{C_{X,t_X}}(Y | X=x).
ATE_{xx'}: The average total effect of x vs. x'. ATE_{xx'} := ADE_{xx',t_X}.
E^{C_X}(Y | X=x; W): A version of the C_X-adjusted (X=x, W)-conditional expectation of Y. E^{C_X}(Y | X=x; W) := E^{C_{X,t_X}}(Y | X=x; W).
CTE_{xx'}(W): A version of the W-conditional total-effect function of x vs. x'. CTE_{xx'}(W) := CDE_{xx',t_X}(W).

Again, begin with a version of the C_{X,t}-adjusted (X=x, W)-conditional expectation of Y. Under assumptions (a) and (b) in Table 4, we define

E^{C_{X,t}}(Y | X=x; W) := E(τ_{x,t} | W), (15)

call it a version of the C_{X,t}-adjusted (X=x, W)-conditional expectation of Y, and say that it exists. Assumptions (a) and (b) in Table 4 imply that there is a finite version E(τ_{x,t} | W), and P-uniqueness of τ_{x,t} implies that E(τ_{x,t} | W) =_P E(τ'_{x,t} | W) if τ_{x,t}, τ'_{x,t} ∈ E^{X=x}(Y | C_{X,t}). Hence, there exists a finite version E^{C_{X,t}}(Y | X=x; W), and it is P-unique.

3.8 W-Conditional t-Direct-Effect Function

Assumptions (a) and (b) in Table 4 imply that there is a finite version E^{C_{X,t}}(Y | X=x; W) and a finite version E^{C_{X,t}}(Y | X=x'; W). Choosing two such finite versions, we define

CDE_{xx',t}(W) := E^{C_{X,t}}(Y | X=x; W) − E^{C_{X,t}}(Y | X=x'; W), (16)

call it a version of the W-conditional t-direct-effect function of x vs. x', and say that it exists. Note that CDE_{xx',t}(W) is P-unique and finite, and that

CDE_{xx',t}(W) =_P E(δ_{xx',t} | W) =_P E(τ_{x,t} | W) − E(τ_{x',t} | W). (17)

3.9 Average and Conditional Total Effects

Remember, C_X := C_{X,t_X}, and the atomic total effect has been defined as a special t-direct effect for t = t_X. Correspondingly, all average and conditional total effects are defined as t_X-direct effects. Table 5 summarizes the various total effects.

Example 1 continued In the example displayed in Table 1, the expectations of the true-outcome variables τ_0 = E^{X=0}(Y | U) and τ_1 = E^{X=1}(Y | U) are

E(τ_0) = .70 · P(U=Joe) + .20 · P(U=Ann) = .70 · .5 + .20 · .5 = .45

and

E(τ_1) = .80 · P(U=Joe) + .40 · P(U=Ann) = .80 · .5 + .40 · .5 = .60.

Hence, the expectation of δ_10 = τ_1 − τ_0 is

E(δ_10) = E(τ_1) − E(τ_0) = .60 − .45 = .15.

In this example, the U-conditional total-effect function

CTE_10(U) =_P E(δ_10 | U) (18)

can also be considered. Because δ_10 is measurable with respect to U, it follows that E(δ_10 | U) = δ_10 [see Rule (vii) of Box 9.2 in Steyer, Nagel, et al., in press]. Later on, other examples are presented in which a Z-conditional total-effect function CTE_10(Z) is considered, where Z denotes the random variable sex (see Table 10). In these examples, CTE_10(Z) ≠ CTE_10(U).

3.10 Indirect Effects

Indirect effects are simply differences between total and direct effects. Suppose that assumptions (a) and (b) in Table 4 hold for t (with global covariate C_{X,t}) and for t_X (with global covariate C_X), where t_X < t < t_Y. Then the difference

δ_{xx'} − δ_{xx',t} (19)

is called a version of the atomic t-indirect-effect variable of x vs. x'. Under the same assumptions, we define

AIE_{xx',t} := ATE_{xx'} − ADE_{xx',t} (20)

and call it the average t-indirect effect. Finally, and again under the same assumptions, we define

CIE_{xx',t}(W) := CTE_{xx'}(W) − CDE_{xx',t}(W) (21)

and call it the W-conditional t-indirect-effect function.
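The Example 1 computations above can be replicated in a few lines. The true-outcome values are those stated in the text; P(U=Joe) = P(U=Ann) = .5 is the weighting implied by the averages .45 and .60.

```python
# Table 1 computations redone numerically (values from the text; the
# marginal probabilities .5/.5 are implied by the stated averages).
p_u = {"Joe": 0.5, "Ann": 0.5}
tau0 = {"Joe": 0.70, "Ann": 0.20}   # E^{X=0}(Y | U=u)
tau1 = {"Joe": 0.80, "Ann": 0.40}   # E^{X=1}(Y | U=u)

E_tau0 = sum(tau0[u] * p_u[u] for u in p_u)      # .45
E_tau1 = sum(tau1[u] * p_u[u] for u in p_u)      # .60
ATE_10 = E_tau1 - E_tau0                         # .15

# CTE_10(U) = E(delta_10 | U) = delta_10, since delta_10 is U-measurable:
delta_10 = {u: tau1[u] - tau0[u] for u in p_u}   # Joe: .10, Ann: .20
print(E_tau0, E_tau1, ATE_10, delta_10)
```

Note how the average total effect .15 coarsens the atomic total effects .10 and .20 without introducing bias, exactly as section 3.5 describes.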

Figure 3. A path diagram representing a causal process with a single mediator M (variables Z, X, M, Y, with residuals ε_M and ε_Y).

Example 3: A Simple Path Model Total, direct, and indirect effects are most easily illustrated by a computer simulation, such as the following one:

(a) Sample a value of a normally distributed random variable Z with expectation 100 and standard deviation 10.
(b) Sample a value of a Bernoulli-distributed random variable X with expectation .5. Ensure that X and Z are independent. (This independence would also be created in a randomized experiment.)
(c) Compute a value of M by M = 60 + 20 X + .3 Z + ε_M, where ε_M is normally distributed with expectation 0 and standard deviation 3. Ensure that ε_M and (X, Z) are independent.
(d) Compute a value of Y by Y = 80 + 10 X + .7 Z + .5 M + ε_Y, where ε_Y is normally distributed with expectation 0 and standard deviation 3. Ensure that ε_Y and (X, Z, M) are independent.

Repeating steps (a) to (d) n times would yield a concrete sample of size n and a data matrix of type n × 4. The dependencies between the four random variables are perfectly described by the two regression equations

E(M | X, Z) =_P 60 + 20 X + .3 Z (22)

and

E(Y | X, M, Z) =_P 80 + 10 X + .7 Z + .5 M, (23)

which, except for the intercepts, can also be represented by the path diagram displayed in Figure 3. For didactic purposes, this example is confined to linear parameterizations of the regressions without interactions. However, the general theory of causal effects, outlined in this article, can accommodate much more complex models. Now let us construct the causality space, in particular the probability space (Ω, A, P) and the filtration (F_t)_{t∈T}. The set of possible outcomes is Ω = ℝ × Ω_X × ℝ × ℝ, where Ω_X = {0, 1}; the σ-algebra on Ω is the product σ-algebra A = B ⊗ P(Ω_X) ⊗ B ⊗ B

(see Steyer, Nagel, et al., in press, chapter 1), and the probability measure P on (Ω, A) is specified by the distributional assumptions described in points (a) to (d) above. Now, X, Y, Z, and M are random variables on (Ω, A, P), and a filtration (F_t)_{t∈T} in A can be specified by: F_1 = σ(Z), F_2 = σ(Z, X), F_3 = σ(Z, X, M), and F_4 = A = σ(Z, X, M, Y). Now total, direct, and indirect effects are specified in this example, starting with the atomic total-effect variable δ_10. In this example, Z is a global covariate of X, because σ(Z, X) = F_2 = F_{t_X}. Therefore, τ_0 := E^{X=0}(Y | C_X) = E^{X=0}(Y | Z), τ_1 := E^{X=1}(Y | C_X) = E^{X=1}(Y | Z), and δ_10 := τ_1 − τ_0 = E^{X=1}(Y | Z) − E^{X=0}(Y | Z). Hence, in order to specify δ_10, the conditional expectations E^{X=0}(Y | Z) and E^{X=1}(Y | Z) have to be computed. As a first step, using the rules of computation for regressions (see Steyer, Nagel, et al., in press, Box 9.2), compute

E(Y | X, Z) =_P E[E(Y | X, M, Z) | X, Z]
=_P E(80 + 10 X + .7 Z + .5 M | X, Z) [(23)]
=_P 80 + 10 X + .7 Z + .5 E(M | X, Z)
=_P 80 + 10 X + .7 Z + .5 (60 + 20 X + .3 Z) [(22)]
=_P 110 + 20 X + .85 Z.

Now, independence of X and Z implies that the regressions E^{X=x}(Y | Z) are P-unique. Hence,

E^{X=0}(Y | Z) =_P 110 + .85 Z,
E^{X=1}(Y | Z) =_P 130 + .85 Z

(see section 13.4 of Steyer, Nagel, et al., in press) and

δ_10 = τ_1 − τ_0 =_P E^{X=1}(Y | Z) − E^{X=0}(Y | Z) =_P 20.

Hence, in this example, the atomic total-effect variable δ_10 is constant, and therefore its expectation, the average total effect, is ATE_10 = E(δ_10) = E(20) = 20. The same applies to the Z-conditional total-effect function CTE_10(Z) =_P E(δ_10 | Z) =_P E(20 | Z) =_P 20. Now consider the atomic t_3 = t_M-direct-effect variable

δ_{10,t_M} = τ_{1,t_M} − τ_{0,t_M} = E^{X=1}(Y | C_{X,t_M}) − E^{X=0}(Y | C_{X,t_M}).

In this example, the bivariate random variable (Z, M) is a global t_M-covariate of X, because σ(Z, M, X) = F_{t_M} = F_3. Furthermore, because P(X=x | Z, M) >_P 0, the regressions E^{X=x}(Y | Z, M) are P-unique (see Steyer, Nagel, et al., in press, chapter 13).
This implies

δ_{10,t_M} = τ_{1,t_M} − τ_{0,t_M} =_P E^{X=1}(Y | Z, M) − E^{X=0}(Y | Z, M).

Equation (23) implies

E^{X=0}(Y | Z, M) =_P 80 + .7 Z + .5 M

and

E^{X=1}(Y | Z, M) =_P 90 + .7 Z + .5 M.

Therefore,

δ_{10,t_M} =_P 90 + .7 Z + .5 M − (80 + .7 Z + .5 M) = 10,

which, in this example, is a constant, too. Hence, the average t_M-direct effect is ADE_{10,t_M} = E(δ_{10,t_M}) = E(10) = 10, and the Z-conditional t_M-direct-effect function is CDE_{10,t_M}(Z) =_P E(δ_{10,t_M} | Z) =_P E(10 | Z) =_P 10. Finally, in this example, the atomic t_M-indirect-effect variable is

δ_10 − δ_{10,t_M} =_P 20 − 10 = 10,

again a constant. Hence, AIE_{10,t_M} = E(δ_10 − δ_{10,t_M}) and CIE_{10,t_M}(Z) = E(δ_10 − δ_{10,t_M} | Z) are equal to 10 as well. Obviously, our results are in line with the well-known rules of computing total, direct, and indirect effects in linear path models (see, e. g., Bollen, 1987). However, while those rules are restricted to linear path models and exclude interactions, our theory applies irrespective of how the regressions involved are parameterized. In this example, two observations are worth mentioning. First, independence of X and Z implies that the total effect of X on Y is Z-unbiased. However, even though X and Z are independent, omitting Z yields a seriously biased direct effect (see Mayer et al., 2012, for a detailed presentation). Second, note that, in this particular example, A = σ(Z, X, M, Y), and the joint distribution of these four random variables determines the probability measure P on (Ω, A). In this sense, our example is a closed system: In this particular example, there are no random variables that are not measurable with respect to σ(Z, X, M, Y). Such a closed system is realistic in computer science and in engineering. In many other empirical sciences, the situation is different: there, σ(Z, X, M, Y) ⊂ A, but not σ(Z, X, M, Y) = A.
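The simulation described in steps (a) to (d) can be run directly. A minimal sketch, using ordinary least squares to estimate the regressions (the concrete sample size and seed are arbitrary choices), recovers the three effects up to sampling error:

```python
import numpy as np

# Monte Carlo sketch of Example 3, steps (a)-(d); the regression
# coefficients of X recover ATE = 20, ADE = 10, and AIE = 20 - 10 = 10.
rng = np.random.default_rng(1)
n = 200_000
Z = rng.normal(100, 10, n)                                  # (a)
X = rng.binomial(1, 0.5, n).astype(float)                   # (b), independent of Z
M = 60 + 20 * X + 0.3 * Z + rng.normal(0, 3, n)             # (c)
Y = 80 + 10 * X + 0.7 * Z + 0.5 * M + rng.normal(0, 3, n)   # (d)

def ols(y, *regressors):
    """Least-squares coefficients of y on an intercept plus the regressors."""
    A = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(A, y, rcond=None)[0]

ate_hat = ols(Y, X, Z)[1]      # coefficient of X given Z: ~20 (total effect)
ade_hat = ols(Y, X, Z, M)[1]   # coefficient of X given Z, M: ~10 (t_M-direct)
aie_hat = ate_hat - ade_hat    # ~10 (t_M-indirect)
print(round(ate_hat, 1), round(ade_hat, 1), round(aie_hat, 1))
```

Because δ_10 and δ_{10,t_M} are constants in this linear example, the two X-coefficients estimate the atomic effects themselves, not just their averages.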
In the theory of causal effects, not only random variables such as X, Y, Z, and M are needed, but also a probability space (Ω, A, P) and a filtration (F_t)_{t∈T}, which are constructed such that all pre-treatment variables, and not only Z, are measurable with respect to A. Similarly, when considering direct effects, F_{t_M} has to be constructed in such a way that all variables that are simultaneous or prior to M are measurable with respect to F_{t_M}. Only with reference to them can the relationship between the included variables, such as X, Y, Z, and M, and omitted variables that may create bias be specified. In other words, in serious empirical applications, (Ω, A, P) and (F_t)_{t∈T} have to be constructed such that they represent the real world. Only then is it possible to investigate whether it is sufficient to consider the variables, such as X, Y, Z, and M, that occur in our regression models. It is exactly the relationship between the included and the omitted variables that is at issue in the definition of unbiasedness and other causality conditions.

4 Causality Conditions and Identification of Causal Effects

So far, the concepts of atomic, average, and conditional total, direct, and indirect effects have been defined and illustrated, confining the presentation to experiments

or quasi-experiments. Now causal inference is treated: How can these causal effects be inferred from empirically estimable quantities? How can the various causal effects and effect functions be identified from empirically estimable quantities? The key is to link the causal effects to estimable quantities by an unbiasedness assumption. Although such an unbiasedness assumption is not itself empirically testable, it is implied by a number of causality conditions, some of which are empirically testable.

4.1 Unbiasedness

Unbiasedness of the Conditional Expectations E(Y | X=x) and E(Y | X) Let τ_{x,t} be a version of the true-outcome variable with respect to t and E^{C_{X,t}}(Y | X=x) a version of the C_{X,t}-adjusted (X=x)-conditional expectation of Y (see Table 4). Then the conditional expectation E(Y | X=x) is called C_{X,t}-unbiased if

E(Y | X=x) = E^{C_{X,t}}(Y | X=x). (24)

Because E^{C_{X,t}}(Y | X=x) = E(τ_{x,t}), it follows: If E^{C_{X,t}}(Y | X=x) exists, then the conditional expectation E(Y | X=x) is C_{X,t}-unbiased if and only if

E(Y | X=x) = E(τ_{x,t}). (25)

Finally, because it is presumed that X is discrete with P(X=x) > 0 for all its values, we can define C_{X,t}-unbiasedness of the conditional expectation E(Y | X) by

E(Y | X=x) = E^{C_{X,t}}(Y | X=x), for all x ∈ Ω_X. (26)

Unbiasedness of the Conditional Expectations E^{X=x}(Y | W) and E(Y | X, W) In Table 4 we defined E^{C_{X,t}}(Y | X=x; W) := E(τ_{x,t} | W), a version of the C_{X,t}-adjusted (X=x, W)-conditional expectation of Y. Referring to this term, E^{X=x}(Y | W) is called (C_{X,t}; W)-unbiased if

E^{X=x}(Y | W) =_P E^{C_{X,t}}(Y | X=x; W). (27)

Again, if E^{C_{X,t}}(Y | X=x; W) exists, we can conclude that E^{X=x}(Y | W) is (C_{X,t}; W)-unbiased if and only if

E^{X=x}(Y | W) =_P E(τ_{x,t} | W). (28)

Finally, because we confine ourselves to the case in which X is discrete with P(X=x) > 0 for all its values, (C_{X,t}; W)-unbiasedness of the conditional expectation E(Y | X, W) can be defined by

E^{X=x}(Y | W) =_P E^{C_{X,t}}(Y | X=x; W), for all x ∈ Ω_X. (29)

Usually, unbiasedness cannot be tested empirically, at least not for all values of X, because it involves the true-outcome variables, which cannot be estimated unless overly strong assumptions are introduced. However, there are a number of conditions implying unbiasedness and identifiability of causal effects. Conditions that imply unbiasedness are called causality conditions, and some of them can be tested empirically. We present two kinds of such testable conditions and a third kind that cannot be tested empirically. In the first kind of these conditions, we consider the relationship between X and C_{X,t}, and in the second, the relationship between Y and C_{X,t}. The third, which is analogous to Rosenbaum and Rubin's strong ignorability (see Rosenbaum & Rubin, 1983), is implied by both kinds of causality conditions.
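A toy illustration of when E(Y | X=x) fails to be C_X-unbiased: consider a hypothetical variant of the Joe/Ann example in which treatment assignment depends on U. The true-outcome values are those of Table 1, but the treatment probabilities .8 and .2 are invented here purely for illustration.

```python
# Hypothetical confounded variant of the Joe/Ann example (treatment
# probabilities .8/.2 are invented; true-outcome values are from Table 1).
p_u = {"Joe": 0.5, "Ann": 0.5}
tau = {0: {"Joe": 0.70, "Ann": 0.20},    # E^{X=0}(Y | U=u)
       1: {"Joe": 0.80, "Ann": 0.40}}    # E^{X=1}(Y | U=u)

def prima_facie(p_x1_given_u):
    """E(Y | X=1) - E(Y | X=0) under treatment probabilities P(X=1 | U=u)."""
    means = []
    for x in (0, 1):
        # joint weights P(U=u, X=x), then condition on X=x
        w = {u: p_u[u] * (p_x1_given_u[u] if x == 1 else 1 - p_x1_given_u[u])
             for u in p_u}
        px = sum(w.values())
        means.append(sum(tau[x][u] * w[u] for u in p_u) / px)
    return means[1] - means[0]

print(prima_facie({"Joe": 0.5, "Ann": 0.5}))  # randomized: .15 = ATE, unbiased
print(prima_facie({"Joe": 0.8, "Ann": 0.2}))  # confounded: .42, biased for ATE
```

Under random assignment the prima facie effect E(Y | X=1) − E(Y | X=0) equals E(τ_1) − E(τ_0) = .15, so Equation (25) holds; once assignment depends on U, it does not.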


More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Equations, Inequalities & Partial Fractions

Equations, Inequalities & Partial Fractions Contents Equations, Inequalities & Partial Fractions.1 Solving Linear Equations 2.2 Solving Quadratic Equations 1. Solving Polynomial Equations 1.4 Solving Simultaneous Linear Equations 42.5 Solving Inequalities

More information

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

More information

PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION

PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION Chin-Diew Lai, Department of Statistics, Massey University, New Zealand John C W Rayner, School of Mathematics and Applied Statistics,

More information

ST 371 (IV): Discrete Random Variables

ST 371 (IV): Discrete Random Variables ST 371 (IV): Discrete Random Variables 1 Random Variables A random variable (rv) is a function that is defined on the sample space of the experiment and that assigns a numerical variable to each possible

More information

Mathematical Induction

Mathematical Induction Mathematical Induction In logic, we often want to prove that every member of an infinite set has some feature. E.g., we would like to show: N 1 : is a number 1 : has the feature Φ ( x)(n 1 x! 1 x) How

More information

Inequality, Mobility and Income Distribution Comparisons

Inequality, Mobility and Income Distribution Comparisons Fiscal Studies (1997) vol. 18, no. 3, pp. 93 30 Inequality, Mobility and Income Distribution Comparisons JOHN CREEDY * Abstract his paper examines the relationship between the cross-sectional and lifetime

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

CURVE FITTING LEAST SQUARES APPROXIMATION

CURVE FITTING LEAST SQUARES APPROXIMATION CURVE FITTING LEAST SQUARES APPROXIMATION Data analysis and curve fitting: Imagine that we are studying a physical system involving two quantities: x and y Also suppose that we expect a linear relationship

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

4.5 Linear Dependence and Linear Independence

4.5 Linear Dependence and Linear Independence 4.5 Linear Dependence and Linear Independence 267 32. {v 1, v 2 }, where v 1, v 2 are collinear vectors in R 3. 33. Prove that if S and S are subsets of a vector space V such that S is a subset of S, then

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

Maximum likelihood estimation of mean reverting processes

Maximum likelihood estimation of mean reverting processes Maximum likelihood estimation of mean reverting processes José Carlos García Franco Onward, Inc. jcpollo@onwardinc.com Abstract Mean reverting processes are frequently used models in real options. For

More information

Factor analysis. Angela Montanari

Factor analysis. Angela Montanari Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information

For example, estimate the population of the United States as 3 times 10⁸ and the

For example, estimate the population of the United States as 3 times 10⁸ and the CCSS: Mathematics The Number System CCSS: Grade 8 8.NS.A. Know that there are numbers that are not rational, and approximate them by rational numbers. 8.NS.A.1. Understand informally that every number

More information

Chapter 1 Introduction. 1.1 Introduction

Chapter 1 Introduction. 1.1 Introduction Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

8 Divisibility and prime numbers

8 Divisibility and prime numbers 8 Divisibility and prime numbers 8.1 Divisibility In this short section we extend the concept of a multiple from the natural numbers to the integers. We also summarize several other terms that express

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

5 Directed acyclic graphs

5 Directed acyclic graphs 5 Directed acyclic graphs (5.1) Introduction In many statistical studies we have prior knowledge about a temporal or causal ordering of the variables. In this chapter we will use directed graphs to incorporate

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

You know from calculus that functions play a fundamental role in mathematics.

You know from calculus that functions play a fundamental role in mathematics. CHPTER 12 Functions You know from calculus that functions play a fundamental role in mathematics. You likely view a function as a kind of formula that describes a relationship between two (or more) quantities.

More information

Qualitative vs Quantitative research & Multilevel methods

Qualitative vs Quantitative research & Multilevel methods Qualitative vs Quantitative research & Multilevel methods How to include context in your research April 2005 Marjolein Deunk Content What is qualitative analysis and how does it differ from quantitative

More information

How To Prove The Dirichlet Unit Theorem

How To Prove The Dirichlet Unit Theorem Chapter 6 The Dirichlet Unit Theorem As usual, we will be working in the ring B of algebraic integers of a number field L. Two factorizations of an element of B are regarded as essentially the same if

More information

Randomization Based Confidence Intervals For Cross Over and Replicate Designs and for the Analysis of Covariance

Randomization Based Confidence Intervals For Cross Over and Replicate Designs and for the Analysis of Covariance Randomization Based Confidence Intervals For Cross Over and Replicate Designs and for the Analysis of Covariance Winston Richards Schering-Plough Research Institute JSM, Aug, 2002 Abstract Randomization

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Analysis of a Production/Inventory System with Multiple Retailers

Analysis of a Production/Inventory System with Multiple Retailers Analysis of a Production/Inventory System with Multiple Retailers Ann M. Noblesse 1, Robert N. Boute 1,2, Marc R. Lambrecht 1, Benny Van Houdt 3 1 Research Center for Operations Management, University

More information

Notes on Determinant

Notes on Determinant ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

More information

In this commentary, I am going to first review my history with mediation. In the second

In this commentary, I am going to first review my history with mediation. In the second Reflections on Mediation David A. Kenny University of Connecticut Organizational Research Methods Volume XX Number X Month XXXX XX-XX Ó XXXX Sage Publications 10.1177/1094428107308978 http://orm.sagepub.com

More information

Empirical Methods in Applied Economics

Empirical Methods in Applied Economics Empirical Methods in Applied Economics Jörn-Ste en Pischke LSE October 2005 1 Observational Studies and Regression 1.1 Conditional Randomization Again When we discussed experiments, we discussed already

More information

1.2 Solving a System of Linear Equations

1.2 Solving a System of Linear Equations 1.. SOLVING A SYSTEM OF LINEAR EQUATIONS 1. Solving a System of Linear Equations 1..1 Simple Systems - Basic De nitions As noticed above, the general form of a linear system of m equations in n variables

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

Test Bias. As we have seen, psychological tests can be well-conceived and well-constructed, but

Test Bias. As we have seen, psychological tests can be well-conceived and well-constructed, but Test Bias As we have seen, psychological tests can be well-conceived and well-constructed, but none are perfect. The reliability of test scores can be compromised by random measurement error (unsystematic

More information

Sensitivity Analysis 3.1 AN EXAMPLE FOR ANALYSIS

Sensitivity Analysis 3.1 AN EXAMPLE FOR ANALYSIS Sensitivity Analysis 3 We have already been introduced to sensitivity analysis in Chapter via the geometry of a simple example. We saw that the values of the decision variables and those of the slack and

More information

The Method of Least Squares

The Method of Least Squares The Method of Least Squares Steven J. Miller Mathematics Department Brown University Providence, RI 0292 Abstract The Method of Least Squares is a procedure to determine the best fit line to data; the

More information

Algebra I Notes Relations and Functions Unit 03a

Algebra I Notes Relations and Functions Unit 03a OBJECTIVES: F.IF.A.1 Understand the concept of a function and use function notation. Understand that a function from one set (called the domain) to another set (called the range) assigns to each element

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

More information

Multiple regression - Matrices

Multiple regression - Matrices Multiple regression - Matrices This handout will present various matrices which are substantively interesting and/or provide useful means of summarizing the data for analytical purposes. As we will see,

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Gröbner Bases and their Applications

Gröbner Bases and their Applications Gröbner Bases and their Applications Kaitlyn Moran July 30, 2008 1 Introduction We know from the Hilbert Basis Theorem that any ideal in a polynomial ring over a field is finitely generated [3]. However,

More information

Physics Lab Report Guidelines

Physics Lab Report Guidelines Physics Lab Report Guidelines Summary The following is an outline of the requirements for a physics lab report. A. Experimental Description 1. Provide a statement of the physical theory or principle observed

More information

Reflections on Probability vs Nonprobability Sampling

Reflections on Probability vs Nonprobability Sampling Official Statistics in Honour of Daniel Thorburn, pp. 29 35 Reflections on Probability vs Nonprobability Sampling Jan Wretman 1 A few fundamental things are briefly discussed. First: What is called probability

More information

LOGNORMAL MODEL FOR STOCK PRICES

LOGNORMAL MODEL FOR STOCK PRICES LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as

More information

Introduction to Algebraic Geometry. Bézout s Theorem and Inflection Points

Introduction to Algebraic Geometry. Bézout s Theorem and Inflection Points Introduction to Algebraic Geometry Bézout s Theorem and Inflection Points 1. The resultant. Let K be a field. Then the polynomial ring K[x] is a unique factorisation domain (UFD). Another example of a

More information

Mathematics Georgia Performance Standards

Mathematics Georgia Performance Standards Mathematics Georgia Performance Standards K-12 Mathematics Introduction The Georgia Mathematics Curriculum focuses on actively engaging the students in the development of mathematical understanding by

More information

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

More information

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint

More information

NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS

NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS TEST DESIGN AND FRAMEWORK September 2014 Authorized for Distribution by the New York State Education Department This test design and framework document

More information