Chapter 1: Introduction to General Causal Effect and Treatment Effect

draft

1 Definition of Causal Effect and Treatment Effect

A causal effect can be defined as the impact that a variable $X$ exerts on another variable $Y$, without being limited to a specific form. In contrast, a treatment effect is a special case of a causal effect. It specifically refers to the impact of receiving or not receiving a certain intervention (treatment), where the treatment variable is typically binary (e.g., treated vs. untreated).

1.1 Individual Treatment Effect (ITE)

According to the potential outcomes model (Rubin, 1974), the potential outcome $Y_i(d)$ represents the outcome for unit $i$ under treatment status $d$. Potential outcomes can be understood from a functional perspective: for each unit, the outcome is a response to a hypothetical treatment level $d$. The function $f: d \mapsto Y_i(d)$ captures this abstract mapping between stimulus and response, which exists independently of time and space and may take an arbitrary form. The value of $Y_i(d)$ may be real-valued. The term “potential” emphasizes the hypothetical and unobservable nature of this outcome.

The individual treatment effect (ITE) is defined as the difference between the potential outcome under treatment and under control:

$$ITE_i = Y_i(1) - Y_i(0)$$
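To make the definition concrete, here is a minimal sketch in Python (the potential-outcome values are purely illustrative, not from any dataset discussed here) that computes ITEs for a handful of hypothetical units:

```python
import numpy as np

# Hypothetical potential outcomes for five units (illustrative values only).
y1 = np.array([5.0, 3.0, 6.0, 2.0, 4.0])   # Y_i(1): outcome if treated
y0 = np.array([4.0, 3.0, 2.0, 2.5, 1.0])   # Y_i(0): outcome if untreated

ite = y1 - y0                               # individual treatment effects
print(ite)         # [ 1.   0.   4.  -0.5  3. ]
print(ite.mean())  # average of the ITEs
```

In real data only one of the two columns is ever observed for each unit, which is exactly why the ITE is not directly computable; the simulation sidesteps that problem by construction.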

1.2 Average Treatment Effect (ATE)

From the perspective of groups, individual treatment effects can be aggregated into group-level averages. For individuals who received the treatment, we define the Average Treatment Effect on the Treated (ATT) as:

$$ATT = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 1]$$

Similarly, for those who did not receive the treatment, we define the Average Treatment Effect on the Untreated (ATU) as:

$$ATU = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 0] = \mathbb{E}[Y_i(1) \mid D_i = 0] - \mathbb{E}[Y_i(0) \mid D_i = 0]$$

When considering all units, regardless of their treatment status, we define the Average Treatment Effect (ATE) as:

$$ATE = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]$$

If the proportion of treated individuals is $w$, then the ATE can be written as a weighted average of ATT and ATU:

$$ATE = w \times ATT + (1 - w) \times ATU$$
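As a quick sanity check of this identity, the following sketch simulates hypothetical potential outcomes and a treatment indicator (all parameter values are made up) and verifies that the sample ATE equals the weighted average of the sample ATT and ATU:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes with heterogeneous effects.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + rng.normal(2.0, 1.0, n)
d = (rng.uniform(size=n) < 0.3).astype(int)   # roughly 30% treated

w   = d.mean()
att = (y1 - y0)[d == 1].mean()
atu = (y1 - y0)[d == 0].mean()
ate = (y1 - y0).mean()

# ATE = w * ATT + (1 - w) * ATU holds exactly in the sample.
print(ate, w * att + (1 - w) * atu)
```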

2 Observed Outcomes and Counterfactuals

Treatment effects are unobservable because we can only observe one potential outcome $Y_i(d)$ for each unit, depending on the realized treatment assignment $D_i \in \{0,1\}$. The other potential outcome, corresponding to the unobserved treatment status, is called the counterfactual.

To connect the theoretical potential outcomes model to empirical data, we define the observed outcome as:

$$Y_i^{\text{obs}} = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)$$

Corresponding to the observed outcome is the counterfactual outcome: the potential outcome $Y_i(1 - d)$ associated with the treatment state $1 - d$ that the unit did not actually experience, given $D_i = d$. The key to estimating the treatment effect lies in estimating this unobservable counterfactual outcome.
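The switching equation and the missing counterfactual can be illustrated with a small simulated example (again with made-up values); in the simulation both potential outcomes exist, but only the observed column would be available in real data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

y0 = rng.normal(0.0, 1.0, n)      # hypothetical Y_i(0)
y1 = y0 + 2.0                     # hypothetical Y_i(1)
d = rng.integers(0, 2, n)         # realized treatment assignment

# Switching equation: the observed outcome picks out one potential outcome per unit.
y_obs = d * y1 + (1 - d) * y0

# The counterfactual Y_i(1 - d) is known here only because we simulated it.
y_cf = d * y0 + (1 - d) * y1
print(np.column_stack([d, y_obs, y_cf]))
```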

Although in any actual observation the researcher has access to at most the realized value of one potential outcome, with the other remaining unobservable, the existence of counterfactual outcomes is a prerequisite for causal effects to be definable within the theoretical framework. In the potential outcomes framework, the individual causal effect is formally defined as the difference between the potential outcomes of the same unit in two treatment states, a definition that logically depends on the conception of counterfactual outcomes and is empirically unidentifiable. To overcome this identification barrier, causal inference typically turns to the estimation of group-averaged treatment effects. (The word “typically” refers to the mainstream identification strategies for average effects; non-mainstream strategies also exist, such as identifying individual effects via Bayesian posterior inference.)

3 Bias

3.1 Bias in estimating ITE

For individuals $i$ and $j$, their respective treatment effects are $\Delta_i = Y_i(1) - Y_i(0)$ and $\Delta_j = Y_j(1) - Y_j(0)$. Given $D_i = 1$ and $D_j = 0$, the observed outcomes are $Y_i^{\text{obs}} = Y_i(1)$ and $Y_j^{\text{obs}} = Y_j(0)$. The difference between their observed outcomes can then be written as:

$$Y_i^{\text{obs}} - Y_j^{\text{obs}} = Y_i(1) - Y_j(0) = Y_i(1) - Y_i(0) + Y_i(0) - Y_j(0) = \Delta_i + \big(Y_i(0) - Y_j(0)\big)$$

The difference between the observed outcomes includes not only the individual treatment effect $\Delta_i$ of individual $i$, but also the difference in their untreated potential outcomes, $Y_i(0) - Y_j(0)$.

A typical textbook example involves drug treatment, but such an example lacks a broader social interpretation. Instead, consider the effect of a university degree on starting salary. Suppose graduates $i$ and $j$ from the same university start working in the same year. Individual $i$ successfully obtains a degree, while individual $j$ does not. If we estimate the treatment effect of the degree from the difference in their starting salaries, the estimate will include not only the effect of the degree, $\Delta_i$, but also the unobservable baseline difference $Y_i(0) - Y_j(0)$.

Although $Y_i(0) - Y_j(0)$ is unobservable, we can hypothesize its possible value. For example, imagine you are a hiring manager comparing two resumes, neither of which lists a degree (say, due to a clerical error), yet you still assign different salaries based on other cues, such as appearance or style. If you subconsciously assign a higher salary to the more attractive candidate (here, individual $i$), then $Y_i(0) - Y_j(0) > 0$, and the degree's treatment effect will be overestimated.

On the other hand, if the less attractive candidate (individual $j$) would have received the higher salary without a degree, then $Y_i(0) - Y_j(0) < 0$, and the treatment effect will be underestimated. I hope this example illustrates that most of us researchers are not the rule-makers (the bosses), but rather somewhat naive observers (the parents).

3.2 Bridge

You may be thinking of numerous other variables or factors, not noted here, that play a role in wage negotiations and that shape the outcomes yet remain unobservable in the scenario set up in this example. Because the counterfactual outcome is unobservable (the Fundamental Problem of Causal Inference), no individual can be in both the treatment and control groups, and thus the ITE is not directly identifiable at the empirical level. Its estimation must rely on modeling assumptions (e.g., the structure of the distribution of potential outcomes, no unobserved confounding) that cannot themselves be verified with observational data, creating a risk of systematic bias in causal inference at the individual level.

In contrast, the average treatment effect is an identifiable estimation target because, under certain assumptions (randomization, strong ignorability, or covariate balancing), the researcher can average out or control for the bias through statistical group comparisons, which in turn makes the estimator identifiable. In other words, at the individual level we face an “unobserved and uncorrectable error term,” whereas at the average level this error term can be explained, adjusted, or canceled out in the aggregate structure.

3.3 Bias in estimating ATE

Let $\mathbb{E}[Y_i(1) \mid D_i = 1] = T_1$ and $\mathbb{E}[Y_i(0) \mid D_i = 1] = T_0$, so that $ATT = T_1 - T_0$; let $\mathbb{E}[Y_i(1) \mid D_i = 0] = C_1$ and $\mathbb{E}[Y_i(0) \mid D_i = 0] = C_0$, so that $ATU = C_1 - C_0$. Then the overall average treatment effect is the weighted average:

$$ATE = w \times ATT + (1 - w) \times ATU$$

In non-experimental data, researchers often estimate the average treatment effect from the observed mean difference between the treatment group and the control group:

$$\widehat{ATE} = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 0] = T_1 - C_0$$

Note that $T_1 - C_0$ can be decomposed as follows:

$$T_1 - C_0 = \underbrace{T_1 - T_0}_{ATT} + (T_0 - C_0)$$

It can also be expressed as:

$$T_1 - C_0 = \underbrace{C_1 - C_0}_{ATU} + (T_1 - C_1)$$

Therefore, taking the weighted combination of these two decompositions with weights $w$ and $1 - w$, it expands as follows:

$$\widehat{ATE} = T_1 - C_0 = w(T_1 - T_0) + (1 - w)(C_1 - C_0) + w(T_0 - C_0) + (1 - w)(T_1 - C_1)$$

With this additive decomposition, we can see clearly that the deviation between the observed mean difference and the true causal effect depends not only on whether the treatment assignment is random, but also on whether the mean structure of the groups is symmetric across the different potential outcomes.

The observed mean difference equals the ATE only if $T_0 = C_0$ and $T_1 = C_1$. This deviation is known as selection bias.
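The sketch below simulates a setting in which units with higher baseline outcomes select into treatment (a hypothetical data-generating process with made-up parameters) and checks the decomposition numerically: the naive mean difference equals the ATE plus the selection-bias terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Units with higher Y(0) are more likely to take the treatment; the true effect is 1 for everyone.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0
d = (y0 + rng.normal(0.0, 1.0, n) > 0.5).astype(int)

T1, T0 = y1[d == 1].mean(), y0[d == 1].mean()
C1, C0 = y1[d == 0].mean(), y0[d == 0].mean()
w = d.mean()

naive = T1 - C0                                      # observed mean difference
ate   = w * (T1 - T0) + (1 - w) * (C1 - C0)          # true ATE (equals 1 here)
selection_bias = w * (T0 - C0) + (1 - w) * (T1 - C1)

print(naive, ate + selection_bias)                   # identical by construction
print(ate)                                           # ≈ 1, well below the naive estimate
```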

4 Independence Assumption

The main challenge in identifying causal effects in observational data stems from the fact that treatment variables are not randomly assigned. Since each unit's treatment status may be affected by its own characteristics, and hence be associated with its potential outcomes, there is often endogeneity between the treatment variable and the potential outcomes. This problem, known as the non-independence of the treatment assignment mechanism, is the root cause of selection bias and estimation error.

Assumption 1: Independence of the Average Untreated Potential Outcome

$$Y(0) \perp D$$

This assumption means that the potential outcome under no treatment is independent of whether the unit receives treatment. In other words, the average untreated potential outcome for the treated group is the same as that for the control group. Formally:

$$\mathbb{E}[Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(0) \mid D_i = 0] \;\Rightarrow\; T_0 = C_0$$

Under this assumption, we can use the observed outcome of the control group, $C_0$, to approximate the counterfactual potential outcome of the treated group, $T_0$. Thus, we can identify:

$$ATT = T_1 - C_0$$

Assumption 2: Independence of the Average Treated Potential Outcome

$$Y(1) \perp D$$

This assumption means that the potential outcome under treatment is independent of whether the unit actually receives treatment. That is, the average treated potential outcome for the treated group is the same as that for the control group. Formally:

$$\mathbb{E}[Y_i(1) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 0] \;\Rightarrow\; T_1 = C_1$$

Under this assumption, we can use the observed outcome of the treated group, $T_1$, to approximate the counterfactual potential outcome of the untreated group, $C_1$. Thus, we can identify:

$$ATU = T_1 - C_0$$

Assumption 3: Full Independence of Potential Outcomes

In the ideal scenario, we assume that the treatment assignment is completely independent of both potential outcomes. That is:

$$(Y(1), Y(0)) \perp D$$

When the following equalities hold,

$$\mathbb{E}[Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(0) \mid D_i = 0], \quad \mathbb{E}[Y_i(1) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 0]$$

we can identify the Average Treatment Effect (ATE) by:

$$ATE = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 0] = T_1 - C_0$$

This assumption is known as the strong independence assumption or unconfoundedness. It means that the treatment assignment is completely determined by an external random mechanism, independent of potential outcomes. Such a condition typically holds in a Randomized Controlled Trial (RCT), where treatment assignment is randomized by design, hence uncorrelated with any latent traits of individuals.
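A minimal simulation of a randomized assignment (hypothetical parameters) shows that, under full independence, the naive mean difference recovers the ATE up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0                       # true ATE = 1
d = rng.integers(0, 2, n)           # assignment independent of (Y(1), Y(0))

y_obs = d * y1 + (1 - d) * y0
naive = y_obs[d == 1].mean() - y_obs[d == 0].mean()
print(naive)                        # ≈ 1.0
```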

5 Conditional Independence Assumption (CIA)

5.1 Form

In observational studies, the strong independence assumption often fails. Therefore, researchers propose a weaker identifying assumption—the Conditional Independence Assumption (CIA), defined as:

$$(Y(1), Y(0)) \perp D \mid X$$

Here, $X$ is a vector of covariates that may influence both treatment assignment and potential outcomes. CIA implies that, conditional on $X$, the treatment assignment is as good as random.

This assumption underlies many common identification strategies such as propensity score matching and covariate adjustment, and allows us to use observed outcomes to estimate causal effects despite not observing counterfactuals.

Under CIA, we can recover the average treatment effect as follows:

$$\mathbb{E}[Y(1)] = \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 1, X]\big], \quad \mathbb{E}[Y(0)] = \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 0, X]\big]$$

Hence:

$$ATE = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 1, X]\big] - \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 0, X]\big]$$
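To illustrate this adjustment formula, the following sketch uses a single binary covariate $X$ that drives both treatment uptake and outcomes (the data-generating process and its parameters are hypothetical). Stratifying on $X$ and averaging the within-stratum differences over the distribution of $X$ recovers the ATE, while the naive difference does not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 100_000

# X influences both the probability of treatment and the outcome level.
x = rng.integers(0, 2, n)
d = (rng.uniform(size=n) < np.where(x == 1, 0.7, 0.2)).astype(int)
y = 1.0 * d + 2.0 * x + rng.normal(0.0, 1.0, n)     # true ATE = 1

df = pd.DataFrame({"x": x, "d": d, "y": y})

# E_X[ E[Y | D=1, X] - E[Y | D=0, X] ]: within-stratum contrasts, averaged over P(X).
cell_means = df.groupby(["x", "d"])["y"].mean().unstack("d")
p_x = df["x"].value_counts(normalize=True)
ate_adjusted = ((cell_means[1] - cell_means[0]) * p_x).sum()

naive = df.loc[df["d"] == 1, "y"].mean() - df.loc[df["d"] == 0, "y"].mean()
print(naive, ate_adjusted)    # naive is biased upward; adjusted ≈ 1
```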

Although CIA is a foundational assumption for identifying causal effects in observational settings, it cannot be empirically verified, because counterfactual outcomes are never observed. That is, we cannot directly test whether $(Y(1), Y(0)) \perp D \mid X$ holds.

Thus, the plausibility of CIA relies heavily on theoretical reasoning, domain knowledge, and the richness of observed covariates.

5.2 Confounding Bias Analysis

If important confounders are unobserved and omitted from $X$, then CIA fails, and the estimated treatment effect becomes biased. This bias is systematic rather than random sampling error, and it does not vanish with large samples.

In the case of a misspecified model (e.g., when CIA is violated), the treatment variable $D_i$ is correlated with omitted variables. Suppose the true model is:

$$Y_i = \tau D_i + \gamma U_i + \varepsilon_i$$

but we estimate:

$$Y_i = \hat{\tau} D_i + \eta_i, \quad \text{where } \eta_i = \gamma U_i + \varepsilon_i$$

Then the bias in the OLS estimate $\hat{\tau}$ arises from the correlation between $D_i$ and the omitted variable $U_i$.

The OLS estimator is:

$$\hat{\tau} = \frac{\text{Cov}(D_i, Y_i)}{\text{Var}(D_i)}$$

Substitute the true model into the formula:

$$Y_i = \tau D_i + \gamma U_i + \varepsilon_i$$

We get:

$$\hat{\tau} = \frac{\text{Cov}(D_i, \tau D_i + \gamma U_i + \varepsilon_i)}{\text{Var}(D_i)} = \frac{\text{Cov}(D_i, \tau D_i)}{\text{Var}(D_i)} + \frac{\text{Cov}(D_i, \gamma U_i)}{\text{Var}(D_i)} + \frac{\text{Cov}(D_i, \varepsilon_i)}{\text{Var}(D_i)}$$

Using the linearity of covariance, noting that $\text{Cov}(D_i, \tau D_i) = \tau \, \text{Var}(D_i)$, and assuming the error term is exogenous:

$$\frac{\text{Cov}(D_i, \tau D_i)}{\text{Var}(D_i)} = \tau, \quad \text{Cov}(D_i, \varepsilon_i) = 0$$

We obtain:

$$\hat{\tau} = \tau + \frac{\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

Therefore, the bias is:

$$\text{Bias}(\hat{\tau}) = \hat{\tau} - \tau = \frac{\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

In the context of causal inference, the bias obtained here is the so-called confounding bias, and $U_i$ is the omitted confounder. Because the argument and derivation above are carried out at the population level, the covariance $\text{Cov}(D_i, U_i)$ and the variance of the treatment variable $\text{Var}(D_i)$ are population parameters, i.e., fixed values rather than random variables. Therefore the expectation of the bias, $\mathbb{E}[\text{Bias}(\hat{\tau})]$, equals the bias itself, and its variance, $\text{Var}(\text{Bias}(\hat{\tau}))$, equals zero.
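The omitted-variable-bias formula can be checked with a Monte Carlo sketch (hypothetical values of $\tau$, $\gamma$, and the data-generating process): regressing $Y$ on $D$ alone yields a coefficient close to $\tau$ plus $\gamma \, \text{Cov}(D, U) / \text{Var}(D)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
tau, gamma = 1.0, 2.0

# Hypothetical confounder U that raises both treatment uptake and the outcome.
u = rng.normal(0.0, 1.0, n)
d = (u + rng.normal(0.0, 1.0, n) > 0).astype(float)
y = tau * d + gamma * u + rng.normal(0.0, 1.0, n)

# Short (misspecified) regression of Y on D alone: slope = Cov(D, Y) / Var(D).
tau_hat = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# Omitted-variable-bias formula evaluated on the same sample.
ovb = gamma * np.cov(d, u)[0, 1] / np.var(d, ddof=1)

print(tau_hat, tau + ovb)    # the two values nearly coincide
```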

In practical estimation, population parameters are unknown, so we must use the sample covariance and sample variance instead of their population counterparts. Because the sample covariance and sample variance are estimators, they are themselves random variables and carry sampling error; the bias expression then becomes a random variable.

Using the sample covariance $\widehat{\text{Cov}}(D_i, U_i)$ and the sample variance $\widehat{\text{Var}}(D_i)$ as estimators, the bias expression becomes:

$$\text{Bias}(\hat{\tau}) = \frac{\gamma \, \widehat{\text{Cov}}(D_i, U_i)}{\widehat{\text{Var}}(D_i)}$$

Since the numerator and denominator are random variables, the expected value of the ratio cannot be decomposed into the ratio of the expected covariance to the expected variance. In other words:

$$\mathbb{E}\left[\frac{\widehat{\text{Cov}}}{\widehat{\text{Var}}}\right] \neq \frac{\mathbb{E}[\widehat{\text{Cov}}]}{\mathbb{E}[\widehat{\text{Var}}]}.$$

If the sample size is large enough and the estimators are consistent, then:

$$\frac{\widehat{\text{Cov}}}{\widehat{\text{Var}}} \approx \frac{\mathbb{E}[\widehat{\text{Cov}}]}{\mathbb{E}[\widehat{\text{Var}}]} + \text{higher-order infinitesimal}$$

$$\mathbb{E}[\text{Bias}(\hat{\tau})] \approx \frac{\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

$$\text{Var}(\text{Bias}(\hat{\tau})) \approx \left( \frac{\gamma}{\text{Var}(D_i)} \right)^2 \text{Var}(\widehat{\text{Cov}}) + \left( \frac{-\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)^2} \right)^2 \text{Var}(\widehat{\text{Var}})$$

where the last line applies the first-order (delta-method) approximation from Appendix 1 and, for simplicity, omits the cross-covariance term between $\widehat{\text{Cov}}$ and $\widehat{\text{Var}}$.

If $\text{Cov}(D_i, U_i) > 0$ and $\gamma > 0$, the bias is positive, leading to an overestimation of the treatment effect.

If $\text{Cov}(D_i, U_i) < 0$ (with $\gamma > 0$), the bias is negative, leading to underestimation.

For example, suppose $U_i$ represents an unobserved variable like “attractiveness.” If more attractive individuals are more likely to be treated and also tend to have better outcomes, then failing to control for $U_i$ results in an upward bias in $\hat{\tau}$.

If only social background and academic credentials are observed, and these covariates are correlated with treatment status but fail to capture underlying personal traits such as motivation or appearance, we may seriously misestimate the true treatment effect.

Appendix 1

In many statistical derivations, we often encounter estimators in the form of ratios, such as:

$$\widehat{\theta} = \frac{X}{Y}$$

Since $X$ and $Y$ are random variables, the expectation of the ratio, $\mathbb{E}[\widehat{\theta}] = \mathbb{E}\left[\frac{X}{Y}\right]$, generally cannot be written as $\frac{\mathbb{E}[X]}{\mathbb{E}[Y]}$. For an approximate calculation, we can use a first-order Taylor expansion:

Let $f(X, Y) = \frac{X}{Y}$ and take $(\mu_X, \mu_Y)$ as the expansion point, where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$.

A first-order Taylor expansion of f(X,Y)f(X, Y) in the neighborhood of (μX,μY)(\mu_X, \mu_Y) yields:

$$f(X, Y) \approx f(\mu_X, \mu_Y) + f_X'(\mu_X, \mu_Y)(X - \mu_X) + f_Y'(\mu_X, \mu_Y)(Y - \mu_Y)$$

The partial derivatives are

$$f_X' = \frac{\partial}{\partial X}\left(\frac{X}{Y}\right) = \frac{1}{Y}, \qquad f_Y' = \frac{\partial}{\partial Y}\left(\frac{X}{Y}\right) = -\frac{X}{Y^2},$$

so that, evaluated at the expansion point,

$$\frac{X}{Y} \approx \frac{\mu_X}{\mu_Y} + \frac{1}{\mu_Y}(X - \mu_X) - \frac{\mu_X}{\mu_Y^2}(Y - \mu_Y).$$

Taking expectations of both sides:

$$\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y} + \frac{1}{\mu_Y}\mathbb{E}[X - \mu_X] - \frac{\mu_X}{\mu_Y^2}\mathbb{E}[Y - \mu_Y]$$

Since $\mathbb{E}[X - \mu_X] = 0$ and $\mathbb{E}[Y - \mu_Y] = 0$,

$$\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y}$$

$$\boxed{\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mathbb{E}[X]}{\mathbb{E}[Y]}}$$

Similarly, starting from the same first-order expansion,

$$f(X, Y) \approx f(\mu_X, \mu_Y) + f_X'(\mu_X, \mu_Y)(X - \mu_X) + f_Y'(\mu_X, \mu_Y)(Y - \mu_Y),$$

taking the variance of both sides gives the delta-method approximation:

$$\text{Var}(f(X, Y)) \approx (f_X')^2 \, \text{Var}(X) + (f_Y')^2 \, \text{Var}(Y) + 2 f_X' f_Y' \, \text{Cov}(X, Y)$$

where the derivatives are evaluated at $(\mu_X, \mu_Y)$.
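A quick Monte Carlo check of these approximations (with arbitrary, made-up means and covariance matrix) compares $\mathbb{E}[X/Y]$ against $\mathbb{E}[X]/\mathbb{E}[Y]$ and the simulated variance of the ratio against the delta-method formula:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical means and covariance; Y is kept far from zero so the ratio is well behaved.
mu = np.array([5.0, 10.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
draws = rng.multivariate_normal(mu, cov, size=1_000_000)
x, y = draws[:, 0], draws[:, 1]

ratio = x / y
print(ratio.mean(), mu[0] / mu[1])        # E[X/Y] vs E[X]/E[Y]: close but not identical

# Delta-method variance, with derivatives evaluated at (mu_X, mu_Y).
f_x, f_y = 1.0 / mu[1], -mu[0] / mu[1] ** 2
var_delta = f_x**2 * cov[0, 0] + f_y**2 * cov[1, 1] + 2 * f_x * f_y * cov[0, 1]
print(ratio.var(), var_delta)
```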
