Chapter 1: Introduction to General Causal Effect and Treatment Effect

draft

1 Definition of Causal Effect and Treatment Effect

A causal effect can be defined as the impact that a variable $X$ exerts on another variable $Y$, without being limited to a specific form. In contrast, a treatment effect is a special case of a causal effect. It specifically refers to the impact of receiving or not receiving a certain intervention (treatment), where the treatment variable is typically binary (e.g., treated vs. untreated).

1.1 Individual Treatment Effect (ITE)

According to the potential outcomes model (Rubin, 1974), the potential outcome $Y_i(d)$ represents the outcome for unit $i$ under treatment status $d$. Potential outcomes can be understood from a functional perspective: for each unit, the outcome is a response to a hypothetical treatment level $d$. The function $f: d \mapsto Y_i(d)$ captures this abstract mapping between stimulus and response, which exists independently of time and space and may take an arbitrary form. The value of $Y_i(d)$ may be real-valued. The term “potential” emphasizes the hypothetical and unobservable nature of this outcome.

The individual treatment effect (ITE) is defined as the difference between the potential outcome under treatment and under control:

$$ITE_i = Y_i(1) - Y_i(0)$$
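To make the definition concrete, here is a minimal sketch in Python (the potential-outcome values are purely illustrative, not from any dataset discussed here) that computes ITEs for a handful of hypothetical units:

```python
import numpy as np

# Hypothetical potential outcomes for five units (illustrative values only).
y1 = np.array([5.0, 3.0, 6.0, 2.0, 4.0])   # Y_i(1): outcome if treated
y0 = np.array([4.0, 3.0, 2.0, 2.5, 1.0])   # Y_i(0): outcome if untreated

ite = y1 - y0                               # individual treatment effects
print(ite)         # [ 1.   0.   4.  -0.5  3. ]
print(ite.mean())  # average of the ITEs
```

In real data only one of the two columns is ever observed for each unit, which is exactly why the ITE is not directly computable; the simulation sidesteps that problem by construction.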

1.2 Average Treatment Effect (ATE)

From the perspective of groups, individual treatment effects can be aggregated into group-level averages. For individuals who received the treatment, we define the Average Treatment Effect on the Treated (ATT) as:

$$ATT = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 1]$$

Similarly, for those who did not receive the treatment, we define the Average Treatment Effect on the Untreated (ATU) as:

$$ATU = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 0] = \mathbb{E}[Y_i(1) \mid D_i = 0] - \mathbb{E}[Y_i(0) \mid D_i = 0]$$

When considering all units, regardless of their treatment status, we define the Average Treatment Effect (ATE) as:

$$ATE = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]$$

If the proportion of treated individuals is $w$, then the ATE can be written as a weighted average of ATT and ATU:

$$ATE = w \times ATT + (1 - w) \times ATU$$
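As a quick sanity check of this identity, the following sketch simulates hypothetical potential outcomes and a treatment indicator (all parameter values are made up) and verifies that the sample ATE equals the weighted average of the sample ATT and ATU:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes with heterogeneous effects.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + rng.normal(2.0, 1.0, n)
d = (rng.uniform(size=n) < 0.3).astype(int)   # roughly 30% treated

w   = d.mean()
att = (y1 - y0)[d == 1].mean()
atu = (y1 - y0)[d == 0].mean()
ate = (y1 - y0).mean()

# ATE = w * ATT + (1 - w) * ATU holds exactly in the sample.
print(ate, w * att + (1 - w) * atu)
```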

2 Observed Outcomes and Counterfactuals

Treatment effects are unobservable because we can only observe one potential outcome $Y_i(d)$ for each unit, depending on the realized treatment assignment $D_i \in \{0,1\}$. The other potential outcome, corresponding to the unobserved treatment status, is called the counterfactual.

To connect the theoretical potential outcomes model to empirical data, we define the observed outcome as:

$$Y_i^{\text{obs}} = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)$$

Corresponding to the observed outcome is the counterfactual outcome: the potential outcome $Y_i(1 - d)$ associated with the treatment state $1 - d$ that the unit did not actually experience, given $D_i = d$. The key to estimating the treatment effect lies in estimating this unobservable counterfactual outcome.
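The switching equation and the missing counterfactual can be illustrated with a small simulated example (again with made-up values); in the simulation both potential outcomes exist, but only the observed column would be available in real data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

y0 = rng.normal(0.0, 1.0, n)      # hypothetical Y_i(0)
y1 = y0 + 2.0                     # hypothetical Y_i(1)
d = rng.integers(0, 2, n)         # realized treatment assignment

# Switching equation: the observed outcome picks out one potential outcome per unit.
y_obs = d * y1 + (1 - d) * y0

# The counterfactual Y_i(1 - d) is known here only because we simulated it.
y_cf = d * y0 + (1 - d) * y1
print(np.column_stack([d, y_obs, y_cf]))
```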

Although in any actual observation the researcher has access to at most the realized value of one potential outcome, with the other remaining unobservable, the existence of counterfactual outcomes is a prerequisite for causal effects to be definable within the theoretical framework. In the potential outcomes framework, the individual causal effect is formally defined as the difference between the potential outcomes of the same unit in two treatment states, a definition that logically depends on the conception of counterfactual outcomes and is empirically unidentifiable. To overcome this identification barrier, causal inference typically turns to the estimation of group-averaged treatment effects. (The word “typically” refers to the mainstream identification strategies for average effects; non-mainstream strategies also exist, such as identifying individual effects via Bayesian posterior inference.)

3 Bias

3.1 Bias in estimating ITE

For individuals $i$ and $j$, their respective treatment effects are $\Delta_i = Y_i(1) - Y_i(0)$ and $\Delta_j = Y_j(1) - Y_j(0)$. Given $D_i = 1$ and $D_j = 0$, the observed outcomes are $Y_i^{\text{obs}} = Y_i(1)$ and $Y_j^{\text{obs}} = Y_j(0)$. The difference between their observed outcomes can then be written as:

$$Y_i^{\text{obs}} - Y_j^{\text{obs}} = Y_i(1) - Y_j(0) = Y_i(1) - Y_i(0) + Y_i(0) - Y_j(0) = \Delta_i + \big(Y_i(0) - Y_j(0)\big)$$

The difference between the observed outcomes includes not only the individual treatment effect $\Delta_i$ of individual $i$, but also the difference in their untreated potential outcomes, $Y_i(0) - Y_j(0)$.

A typical textbook example involves drug treatment, but such an example lacks a broader social interpretation. Instead, consider the effect of a university degree on starting salary. Suppose graduates $i$ and $j$ from the same university start working in the same year. Individual $i$ successfully obtains a degree, while individual $j$ does not. If we estimate the treatment effect of the degree from the difference in their starting salaries, the estimate will include not only the effect of the degree, $\Delta_i$, but also the unobservable baseline difference $Y_i(0) - Y_j(0)$.

Although $Y_i(0) - Y_j(0)$ is unobservable, we can hypothesize its possible value. For example, imagine you are a hiring manager comparing two resumes, neither of which lists a degree (say, due to a clerical error), yet you still assign different salaries based on other cues, such as appearance or style. If you subconsciously assign a higher salary to the more attractive candidate (here, individual $i$), then $Y_i(0) - Y_j(0) > 0$, and the degree's treatment effect will be overestimated.

On the other hand, if the less attractive candidate (individual $j$) would have received the higher salary without a degree, then $Y_i(0) - Y_j(0) < 0$, and the treatment effect will be underestimated. I hope this example illustrates that most of us researchers are not the rule-makers (the bosses), but rather somewhat naive observers (the parents).

3.2 Bridge

You may be thinking of numerous other variables or factors, not noted here, that play a role in wage negotiations and that shape the outcomes yet remain unobservable in the scenario set up in this example. Because the counterfactual outcome is unobservable (the Fundamental Problem of Causal Inference), no individual can be in both the treatment and control groups, and thus the ITE is not directly identifiable at the empirical level. Its estimation must rely on modeling assumptions (e.g., the structure of the distribution of potential outcomes, no unobserved confounding) that cannot themselves be verified with observational data, creating a risk of systematic bias in causal inference at the individual level.

In contrast, the average treatment effect is an identifiable estimation target because, under certain assumptions (randomization, strong ignorability, or covariate balancing), the researcher can average out or control for the bias through statistical group comparisons, which in turn makes the estimator identifiable. In other words, at the individual level we face an “unobserved and uncorrectable error term,” whereas at the average level this error term can be explained, adjusted, or canceled out in the aggregate structure.

3.3 Bias in estimating ATE

Let $\mathbb{E}[Y_i(1) \mid D_i = 1] = T_1$ and $\mathbb{E}[Y_i(0) \mid D_i = 1] = T_0$, so that $ATT = T_1 - T_0$; let $\mathbb{E}[Y_i(1) \mid D_i = 0] = C_1$ and $\mathbb{E}[Y_i(0) \mid D_i = 0] = C_0$, so that $ATU = C_1 - C_0$. Then the overall average treatment effect is the weighted average:

$$ATE = w \times ATT + (1 - w) \times ATU$$

In non-experimental data, researchers often estimate the average treatment effect from the observed mean difference between the treatment group and the control group:

$$\widehat{ATE} = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 0] = T_1 - C_0$$

Note that $T_1 - C_0$ can be decomposed as follows:

$$T_1 - C_0 = \underbrace{T_1 - T_0}_{ATT} + (T_0 - C_0)$$

It can also be expressed as:

$$T_1 - C_0 = \underbrace{C_1 - C_0}_{ATU} + (T_1 - C_1)$$

Therefore, taking the weighted combination of these two decompositions with weights $w$ and $1 - w$, it expands as follows:

$$\widehat{ATE} = T_1 - C_0 = w(T_1 - T_0) + (1 - w)(C_1 - C_0) + w(T_0 - C_0) + (1 - w)(T_1 - C_1)$$

With this additive decomposition, we can see clearly that the deviation between the observed mean difference and the true causal effect depends not only on whether the treatment assignment is random, but also on whether the mean structure of the groups is symmetric across the different potential outcomes.

The observed mean difference equals the ATE only if $T_0 = C_0$ and $T_1 = C_1$. This deviation is known as selection bias.
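The sketch below simulates a setting in which units with higher baseline outcomes select into treatment (a hypothetical data-generating process with made-up parameters) and checks the decomposition numerically: the naive mean difference equals the ATE plus the selection-bias terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Units with higher Y(0) are more likely to take the treatment; the true effect is 1 for everyone.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0
d = (y0 + rng.normal(0.0, 1.0, n) > 0.5).astype(int)

T1, T0 = y1[d == 1].mean(), y0[d == 1].mean()
C1, C0 = y1[d == 0].mean(), y0[d == 0].mean()
w = d.mean()

naive = T1 - C0                                      # observed mean difference
ate   = w * (T1 - T0) + (1 - w) * (C1 - C0)          # true ATE (equals 1 here)
selection_bias = w * (T0 - C0) + (1 - w) * (T1 - C1)

print(naive, ate + selection_bias)                   # identical by construction
print(ate)                                           # ≈ 1, well below the naive estimate
```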

4 Independence Assumption

The main challenge in identifying causal effects in observational data stems from the fact that treatment variables are not randomly assigned. Since each unit's treatment status may be affected by its own characteristics, and hence be associated with its potential outcomes, there is often endogeneity between the treatment variable and the potential outcomes. This problem, known as the non-independence of the treatment assignment mechanism, is the root cause of selection bias and estimation error.

Assumption 1: Independence of the Average Untreated Potential Outcome

$$Y(0) \perp D$$

This assumption means that the potential outcome under no treatment is independent of whether the unit receives treatment. In other words, the average untreated potential outcome for the treated group is the same as that for the control group. Formally:

$$\mathbb{E}[Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(0) \mid D_i = 0] \;\Rightarrow\; T_0 = C_0$$

Under this assumption, we can use the observed outcome of the control group, $C_0$, to approximate the counterfactual potential outcome of the treated group, $T_0$. Thus, we can identify:

$$ATT = T_1 - C_0$$

Assumption 2: Independence of the Average Treated Potential Outcome

$$Y(1) \perp D$$

This assumption means that the potential outcome under treatment is independent of whether the unit actually receives treatment. That is, the average treated potential outcome for the treated group is the same as that for the control group. Formally:

$$\mathbb{E}[Y_i(1) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 0] \;\Rightarrow\; T_1 = C_1$$

Under this assumption, we can use the observed outcome of the treated group, $T_1$, to approximate the counterfactual potential outcome of the untreated group, $C_1$. Thus, we can identify:

$$ATU = T_1 - C_0$$

Assumption 3: Full Independence of Potential Outcomes

In the ideal scenario, we assume that the treatment assignment is completely independent of both potential outcomes. That is:

$$(Y(1), Y(0)) \perp D$$

When the following equalities hold,

$$\mathbb{E}[Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(0) \mid D_i = 0], \quad \mathbb{E}[Y_i(1) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 0]$$

we can identify the Average Treatment Effect (ATE) by:

$$ATE = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 0] = T_1 - C_0$$

This assumption is known as the strong independence assumption or unconfoundedness. It means that the treatment assignment is completely determined by an external random mechanism, independent of potential outcomes. Such a condition typically holds in a Randomized Controlled Trial (RCT), where treatment assignment is randomized by design, hence uncorrelated with any latent traits of individuals.
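A minimal simulation of a randomized assignment (hypothetical parameters) shows that, under full independence, the naive mean difference recovers the ATE up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0                       # true ATE = 1
d = rng.integers(0, 2, n)           # assignment independent of (Y(1), Y(0))

y_obs = d * y1 + (1 - d) * y0
naive = y_obs[d == 1].mean() - y_obs[d == 0].mean()
print(naive)                        # ≈ 1.0
```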

5 Conditional Independence Assumption (CIA)

5.1 Form

In observational studies, the strong independence assumption often fails. Therefore, researchers propose a weaker identifying assumption—the Conditional Independence Assumption (CIA), defined as:

$$(Y(1), Y(0)) \perp D \mid X$$

Here, $X$ is a vector of covariates that may influence both treatment assignment and potential outcomes. CIA implies that, conditional on $X$, the treatment assignment is as good as random.

This assumption underlies many common identification strategies such as propensity score matching and covariate adjustment, and allows us to use observed outcomes to estimate causal effects despite not observing counterfactuals.

Under CIA, we can recover the average treatment effect as follows:

$$\mathbb{E}[Y(1)] = \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 1, X]\big], \quad \mathbb{E}[Y(0)] = \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 0, X]\big]$$

Hence:

$$ATE = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 1, X]\big] - \mathbb{E}_X\big[\mathbb{E}[Y \mid D = 0, X]\big]$$
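To illustrate this adjustment formula, the following sketch uses a single binary covariate $X$ that drives both treatment uptake and outcomes (the data-generating process and its parameters are hypothetical). Stratifying on $X$ and averaging the within-stratum differences over the distribution of $X$ recovers the ATE, while the naive difference does not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 100_000

# X influences both the probability of treatment and the outcome level.
x = rng.integers(0, 2, n)
d = (rng.uniform(size=n) < np.where(x == 1, 0.7, 0.2)).astype(int)
y = 1.0 * d + 2.0 * x + rng.normal(0.0, 1.0, n)     # true ATE = 1

df = pd.DataFrame({"x": x, "d": d, "y": y})

# E_X[ E[Y | D=1, X] - E[Y | D=0, X] ]: within-stratum contrasts, averaged over P(X).
cell_means = df.groupby(["x", "d"])["y"].mean().unstack("d")
p_x = df["x"].value_counts(normalize=True)
ate_adjusted = ((cell_means[1] - cell_means[0]) * p_x).sum()

naive = df.loc[df["d"] == 1, "y"].mean() - df.loc[df["d"] == 0, "y"].mean()
print(naive, ate_adjusted)    # naive is biased upward; adjusted ≈ 1
```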

Although CIA is a foundational assumption for identifying causal effects in observational settings, it cannot be empirically verified, because counterfactual outcomes are never observed. That is, we cannot directly test whether $(Y(1), Y(0)) \perp D \mid X$ holds.

Thus, the plausibility of CIA relies heavily on theoretical reasoning, domain knowledge, and the richness of observed covariates.

5.2 Confounding Bias Analysis

If important confounders are unobserved and omitted from $X$, then CIA fails, and the estimated treatment effect becomes biased. This bias is systematic rather than random sampling error, and it does not vanish with large samples.

In the case of a misspecified model (e.g., when CIA is violated), the treatment variable $D_i$ is correlated with omitted variables. Suppose the true model is:

$$Y_i = \tau D_i + \gamma U_i + \varepsilon_i$$

but we estimate:

$$Y_i = \hat{\tau} D_i + \eta_i, \quad \text{where } \eta_i = \gamma U_i + \varepsilon_i$$

Then the bias in the OLS estimate $\hat{\tau}$ arises from the correlation between $D_i$ and the omitted variable $U_i$.

The OLS estimator is:

$$\hat{\tau} = \frac{\text{Cov}(D_i, Y_i)}{\text{Var}(D_i)}$$

Substitute the true model into the formula:

$$Y_i = \tau D_i + \gamma U_i + \varepsilon_i$$

We get:

$$\hat{\tau} = \frac{\text{Cov}(D_i, \tau D_i + \gamma U_i + \varepsilon_i)}{\text{Var}(D_i)} = \frac{\text{Cov}(D_i, \tau D_i)}{\text{Var}(D_i)} + \frac{\text{Cov}(D_i, \gamma U_i)}{\text{Var}(D_i)} + \frac{\text{Cov}(D_i, \varepsilon_i)}{\text{Var}(D_i)}$$

Using the linearity of covariance, noting that $\text{Cov}(D_i, \tau D_i) = \tau \, \text{Var}(D_i)$, and assuming the error term is exogenous:

$$\frac{\text{Cov}(D_i, \tau D_i)}{\text{Var}(D_i)} = \tau, \quad \text{Cov}(D_i, \varepsilon_i) = 0$$

We obtain:

$$\hat{\tau} = \tau + \frac{\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

Therefore, the bias is:

$$\text{Bias}(\hat{\tau}) = \hat{\tau} - \tau = \frac{\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

In the context of causal inference, the bias obtained here is the so-called confounding bias, and $U_i$ is the omitted confounder. Because the argument and derivation above are carried out at the population level, the covariance $\text{Cov}(D_i, U_i)$ and the variance of the treatment variable $\text{Var}(D_i)$ are population parameters, i.e., fixed values rather than random variables. Therefore the expectation of the bias, $\mathbb{E}[\text{Bias}(\hat{\tau})]$, equals the bias itself, and its variance, $\text{Var}(\text{Bias}(\hat{\tau}))$, equals zero.
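The omitted-variable-bias formula can be checked with a Monte Carlo sketch (hypothetical values of $\tau$, $\gamma$, and the data-generating process): regressing $Y$ on $D$ alone yields a coefficient close to $\tau$ plus $\gamma \, \text{Cov}(D, U) / \text{Var}(D)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
tau, gamma = 1.0, 2.0

# Hypothetical confounder U that raises both treatment uptake and the outcome.
u = rng.normal(0.0, 1.0, n)
d = (u + rng.normal(0.0, 1.0, n) > 0).astype(float)
y = tau * d + gamma * u + rng.normal(0.0, 1.0, n)

# Short (misspecified) regression of Y on D alone: slope = Cov(D, Y) / Var(D).
tau_hat = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# Omitted-variable-bias formula evaluated on the same sample.
ovb = gamma * np.cov(d, u)[0, 1] / np.var(d, ddof=1)

print(tau_hat, tau + ovb)    # the two values nearly coincide
```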

In practical estimation, population parameters are unknown, so we must use the sample covariance and sample variance instead of their population counterparts. Because the sample covariance and sample variance are estimators, they are themselves random variables and carry sampling error; the bias expression then becomes a random variable.

Using the sample covariance $\widehat{\text{Cov}}(D_i, U_i)$ and the sample variance $\widehat{\text{Var}}(D_i)$ as estimators, the bias expression becomes:

$$\text{Bias}(\hat{\tau}) = \frac{\gamma \, \widehat{\text{Cov}}(D_i, U_i)}{\widehat{\text{Var}}(D_i)}$$

Since the numerator and denominator are random variables, the expected value of the ratio cannot be decomposed into the ratio of the expected covariance to the expected variance. In other words:

$$\mathbb{E}\left[\frac{\widehat{\text{Cov}}}{\widehat{\text{Var}}}\right] \neq \frac{\mathbb{E}[\widehat{\text{Cov}}]}{\mathbb{E}[\widehat{\text{Var}}]}.$$

If the sample size is large enough and the estimators are consistent, then:

$$\frac{\widehat{\text{Cov}}}{\widehat{\text{Var}}} \approx \frac{\mathbb{E}[\widehat{\text{Cov}}]}{\mathbb{E}[\widehat{\text{Var}}]} + \text{higher-order infinitesimal}$$

$$\mathbb{E}[\text{Bias}(\hat{\tau})] \approx \frac{\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

$$\text{Var}(\text{Bias}(\hat{\tau})) \approx \left( \frac{\gamma}{\text{Var}(D_i)} \right)^2 \text{Var}(\widehat{\text{Cov}}) + \left( \frac{-\gamma \, \text{Cov}(D_i, U_i)}{\text{Var}(D_i)^2} \right)^2 \text{Var}(\widehat{\text{Var}})$$

where the last line applies the first-order (delta-method) approximation from Appendix 1 and, for simplicity, omits the cross-covariance term between $\widehat{\text{Cov}}$ and $\widehat{\text{Var}}$.

If $\text{Cov}(D_i, U_i) > 0$ and $\gamma > 0$, the bias is positive, leading to an overestimation of the treatment effect.

If $\text{Cov}(D_i, U_i) < 0$ (with $\gamma > 0$), the bias is negative, leading to underestimation.

For example, suppose $U_i$ represents an unobserved variable like “attractiveness.” If more attractive individuals are more likely to be treated and also tend to have better outcomes, then failing to control for $U_i$ results in an upward bias in $\hat{\tau}$.

If only social background and academic credentials are observed, and these covariates are correlated with treatment status but fail to capture underlying personal traits such as motivation or appearance, we may seriously misestimate the true treatment effect.

Appendix 1

In many statistical derivations, we often encounter estimators in the form of ratios, such as:

$$\widehat{\theta} = \frac{X}{Y}$$

Since $X$ and $Y$ are random variables, the expectation of the ratio, $\mathbb{E}[\widehat{\theta}] = \mathbb{E}\left[\frac{X}{Y}\right]$, generally cannot be written as $\frac{\mathbb{E}[X]}{\mathbb{E}[Y]}$. For an approximate calculation, we can use a first-order Taylor expansion:

Let $f(X, Y) = \frac{X}{Y}$ and take $(\mu_X, \mu_Y)$ as the expansion point, where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$.

A first-order Taylor expansion of f(X,Y)f(X, Y) in the neighborhood of (μX,μY)(\mu_X, \mu_Y) yields:

$$f(X, Y) \approx f(\mu_X, \mu_Y) + f_X'(\mu_X, \mu_Y)(X - \mu_X) + f_Y'(\mu_X, \mu_Y)(Y - \mu_Y)$$

The partial derivatives are

$$f_X' = \frac{\partial}{\partial X}\left(\frac{X}{Y}\right) = \frac{1}{Y}, \qquad f_Y' = \frac{\partial}{\partial Y}\left(\frac{X}{Y}\right) = -\frac{X}{Y^2},$$

so that, evaluated at the expansion point,

$$\frac{X}{Y} \approx \frac{\mu_X}{\mu_Y} + \frac{1}{\mu_Y}(X - \mu_X) - \frac{\mu_X}{\mu_Y^2}(Y - \mu_Y).$$

Taking expectations of both sides:

$$\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y} + \frac{1}{\mu_Y}\mathbb{E}[X - \mu_X] - \frac{\mu_X}{\mu_Y^2}\mathbb{E}[Y - \mu_Y]$$

Since $\mathbb{E}[X - \mu_X] = 0$ and $\mathbb{E}[Y - \mu_Y] = 0$,

$$\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y}$$

$$\boxed{\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mathbb{E}[X]}{\mathbb{E}[Y]}}$$

Similarly, starting from the same first-order expansion,

$$f(X, Y) \approx f(\mu_X, \mu_Y) + f_X'(\mu_X, \mu_Y)(X - \mu_X) + f_Y'(\mu_X, \mu_Y)(Y - \mu_Y),$$

taking the variance of both sides gives the delta-method approximation:

$$\text{Var}(f(X, Y)) \approx (f_X')^2 \, \text{Var}(X) + (f_Y')^2 \, \text{Var}(Y) + 2 f_X' f_Y' \, \text{Cov}(X, Y)$$

where the derivatives are evaluated at $(\mu_X, \mu_Y)$.
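A quick Monte Carlo check of these approximations (with arbitrary, made-up means and covariance matrix) compares $\mathbb{E}[X/Y]$ against $\mathbb{E}[X]/\mathbb{E}[Y]$ and the simulated variance of the ratio against the delta-method formula:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical means and covariance; Y is kept far from zero so the ratio is well behaved.
mu = np.array([5.0, 10.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
draws = rng.multivariate_normal(mu, cov, size=1_000_000)
x, y = draws[:, 0], draws[:, 1]

ratio = x / y
print(ratio.mean(), mu[0] / mu[1])        # E[X/Y] vs E[X]/E[Y]: close but not identical

# Delta-method variance, with derivatives evaluated at (mu_X, mu_Y).
f_x, f_y = 1.0 / mu[1], -mu[0] / mu[1] ** 2
var_delta = f_x**2 * cov[0, 0] + f_y**2 * cov[1, 1] + 2 * f_x * f_y * cov[0, 1]
print(ratio.var(), var_delta)
```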
