POLI 572B

Michael Weaver

January 31, 2025

Conditioning

Objectives

  1. \(ACE\) without randomization

  2. Confounding

    • DAGs
    • Dependence in Potential Outcomes
  3. Conditioning

    • how it works (two approaches)
    • what not to do

Example

So far, we have looked at experiments

What happens when we don’t have randomization?

  • some questions can’t be answered with survey experiments
  • field experiments are expensive, hard to do
  • randomization often impossible on important questions (cost/ethics)

In the absence of randomization, how do we get unbiased estimates of the \(ACE\) or other causal estimands?

An Example

Contact Hypothesis:

Teach For America (TFA) is a prominent national service program that integrates top university graduates into low-income communities (often of color) for two years as teachers. Does service in TFA induce participants to harbor lower racial “resentment”?


An Example

Contact Hypothesis:

Imagine we survey and compare TFA participants and non-participants graduating from the same universities.

Measure “racial resentment” using a battery of questions (e.g., “Racial discrimination in the US today limits the chances of individuals from [particular racial group] to get ahead?”).

Score \(1\) to \(5\), with \(5\) being greatest resentment (“strongly disagree”), \(1\) being least (“strongly agree”).

An Example

Participating in Teach For America involves multiple stages of “selection”/“choice”:

  • University graduates choose to apply.
  • Teach For America selects candidates from among the applicants.
  • Selected candidates choose to accept or reject the offer.

How does this relate to the assumptions we made when analyzing experiments?

An Example

We can imagine potential outcomes:

\(i\) indexes individuals

\(TFA_i\) indicates whether person \(i\) participated in TFA (\(1\)) or not (\(0\)); this is our treatment \(D_i\)

\(Y_i(0)\) is “racial resentment” without TFA participation

\(Y_i(1)\) is “racial resentment” with TFA participation

Potential Outcomes: TFA

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\)
1 2 1 1
2 2 1 1
3 3 1 1
4 2 1 0
5 3 1 0
6 3 1 0

What is the true \(ACE\)?

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\)
1 2 1 1
2 2 1 1
3 3 1 1
4 2 1 0
5 3 1 0
6 3 1 0

What is the estimate, \(\widehat{ACE}\), if we just compare means of TFA participants to non-participants?

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\)
1 2 1 1
2 2 1 1
3 3 1 1
4 2 1 0
5 3 1 0
6 3 1 0
library(data.table)
library(magrittr)  # for the %>% pipe

po.table = data.table(i     = 1:6,
                      y_i_0 = c(2, 2, 3, 2, 3, 3),
                      y_i_1 = c(1, 1, 1, 1, 1, 1),
                      d_i   = c(1, 1, 1, 0, 0, 0))

# true ACE: average of the unit-level effects
ace = po.table[, y_i_1 - y_i_0] %>% mean

# naive estimate: compare mean outcomes of treated vs. untreated
ace_hat = po.table[, mean(y_i_1[d_i == 1]) - mean(y_i_0[d_i == 0])]

ace
## [1] -1.5
ace_hat
## [1] -1.666667

Without randomization, we simply observe who did and did not enroll: there are no alternate allocations of TFA enrollment to average over.

Why is the \(\widehat{ACE}\) biased?

Confounding

Confounding

Confounding occurs when, in the cases we observe, the values of treatment \(D\) are not independent of potential outcomes of \(Y\).

If we ignore confounding, then our estimate of the \(ACE\) will be biased.

  • Why might the potential outcomes of \(Y\) (racial resentment) be dependent on the value of treatment (TFA)?

Non-independence of potential outcomes

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\)
1 2 1 1
2 2 1 1
3 3 1 1
4 2 1 0
5 3 1 0
6 3 1 0

Confounding

  • Occurs frequently when we don’t have randomization
    • Note: we reserve \(Z\) for randomized variables and use \(D\) when randomization is not guaranteed.
    • size of bias may vary, depending on extent of confounding
  • Can occur, despite randomization
    • exclusion restriction violation
    • attrition
    • breaking experiment
  • Easy to understand graphically

Directed Acyclic Graphs (DAGs)

DAGs:

Graphs of hypothesized causal relationships between variables

  • nodes/vertices represent variables in causal chain (\(G\)raph)
    • may be observed (filled) / unobserved (empty)
  • \(D\)irected arrows represent the flow of causality (not the magnitude/direction/functional form of the effect)
  • no cycles/\(A\)cyclic: variables cannot cause themselves (even indirectly)

Directed Acyclic Graphs (DAGs)

Using a DAG:

Confounding exists when there is a backdoor path between variable of interest \(D\) and the outcome \(Y\).

A backdoor path is a non-causal path from \(D\) to \(Y\):

  • a path that remains after we remove arrows pointing outward from \(D\)
  • that only changes direction once (backwards from \(D\), forwards to \(Y\))
  • “backdoor” because it follows arrows backwards out of \(D\)

What are the backdoor paths for “Mammogram”?

Directed Acyclic Graphs (DAGs)

The “backdoor” criterion for confounding points to solutions:

  1. Find a causal variable that has no backdoor paths:

    • randomization is the only way to be sure: \(Z\) as an “instrument”

Using the \(CACE\) in an Experiment

Miguel et al. 2004

Exclusion Restriction Violation in \(CACE\): Miguel et al. 2004

Directed Acyclic Graphs (DAGs)

What if we don’t have randomization?

In groups: Create a DAG for the TFA example.

  • what are \(3\) variables that might affect joining TFA?
  • For each of these, identify \(1\) possible cause of that variable
  • For every variable (max 6) you came up with, could it directly/indirectly affect racial resentment?
  • Draw a DAG corresponding to these connections on the white board.

Directed Acyclic Graphs (DAGs)

The “backdoor” criterion for confounding points to solutions:

  1. Find a causal variable that has no backdoor paths:
    • randomization is the only way to be sure
  2. “Condition” on variables to “block” backdoor paths:
    • assuming we block all backdoor paths
    • assuming we don’t unblock new paths
    • assuming we can correctly measure relevant variables

Conditioning:

Conditioning is when we compare values of \(Y_i\) across values of \(D_i\) (e.g., TFA) using information on the values of other variables, \(\mathbf{X_i}\):

  • Where \(\mathbf{X_i}\) is a set of variables such that at least one member of the set appears on every backdoor path from \(D\) to \(Y\).
  • We use values of \(\mathbf{X_i}\) to:
    • impute: compare values of \(Y_i\) across values of \(D_i\) (within strata) where values of \(\mathbf{X_i}\) are the same
    • (and/or) reweight: compare values of \(Y_i\) across values of \(D_i\), weighting \(i\) by probability of “treatment” given \(\mathbf{X_i}\) (\(Pr(D_i | X_i)\))

The result is that backdoor (non-causal) paths from \(D\) to \(Y\) are blocked.

Directed Acyclic Graphs (DAGs)

Condition on variables that

  • have causal paths toward \(D\) and toward \(Y\) or
  • have causal paths toward \(Y\) and backdoor path to \(D\)

Are there variables we should NOT condition on?

If we want to find the effect of ability on earnings (but we don’t directly observe ability, only exam scores): what variables should we condition on?

Should we condition on University admissions? The DAG says it doesn’t affect earnings, but in the real world, we might consider that a possibility…

Consider University Admissions:

Admission might be a function of

  • Motivation
  • Academic Ability (as measured by exam scores)
  • Assume (as in the DAG) that motivation and academic ability are independent of each other

What happens when we examine the relationship between motivation and academic ability among groups where university admissions are held constant?

(board)

Within strata defined by admission status, ability and motivation are no longer independent.
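
A quick simulation makes the induced dependence concrete (a sketch: the sample size, seed, and admissions cutoff are illustrative assumptions):

set.seed(572)
n = 10000
motivation = rnorm(n)
ability    = rnorm(n)                   # independent of motivation by construction
admitted   = motivation + ability > 1   # admission depends on both: a collider

cor(motivation, ability)                       # approximately zero overall
cor(motivation[admitted], ability[admitted])   # clearly negative among the admitted

Among the admitted, a student with low motivation must have had high ability to clear the cutoff; that is what generates the negative association.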

Directed Acyclic Graphs (DAGs)

We may be tempted to condition on as many variables as possible, but this can lead to trouble!

colliders: a variable that is causally influenced by two or more variables (2+ arrows pointing into it). The causal variables influencing the collider are themselves not necessarily associated.

  • conditioning on the collider can INDUCE associations between variables that were not associated.
  • typically (not always) due to conditioning on post-treatment variable

Directed Acyclic Graphs (DAGs)

Condition on all variables that

  • have causal paths toward \(D\) and toward \(Y\) or
  • have causal paths toward \(Y\) and backdoor path to \(D\)

Do NOT condition on variables that

  • are on causal path from \(D\) to \(Y\) (mediators)
  • are “colliders” connected to backdoor paths

Directed Acyclic Graphs (DAGs)

To find the effect of ability (through exam scores) on income, we should condition on:

  • Parent Income (close back door path)
  • NOT University (a collider: conditioning on it opens a backdoor path from motivation to earnings)

Colliders

If both motivation and test scores cause admission (but are independent of each other), conditioning on admission (stratifying by admission) can lead us to see a negative association between the two.

When does this arise?

  • Can happen in a variety of circumstances, but is VERY COMMON for variables that are “post-treatment” (to the board)
  • See Morgan and Winship, Chapter 4

Directed Acyclic Graphs (DAGs)

  1. Condition on variables that
    • have causal paths toward \(D\) and toward \(Y\), or
    • have causal paths toward \(Y\) and a backdoor path toward \(D\)
  2. Do NOT condition on variables that
    • are on the causal path from \(D\) to \(Y\) (mediators)
    • are “colliders”
  3. Best practice:
    • Condition on variables that have causal paths toward \(D\) and \(Y\): “pre-treatment”
    • DAGs are hypothetical; “severity” suggests we consider all plausible DAGs, and results should be robust to these choices

Directed Acyclic Graphs (DAGs)

As a class:

Revisit our TFA DAGs:

Identify variables we need to condition on to close backdoor paths and find effect of TFA on racial resentment

Directed Acyclic Graphs (DAGs)

Pros:

  • Can help us be explicit about our assumptions (what backdoors or colliders can/cannot exist)

Given an assumed DAG…

  • Can tell us when conditioning can/cannot block backdoor paths
  • Can tell us what to condition on/ not to condition on
  • Software can figure out complex situations (see the dagitty sketch after this list)

Cons:

  • DAGs are assumed, never really known
  • How to make logic of DAGs compatible with weak severity?
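
As an illustration of the software point in the pros list: the dagitty R package can enumerate sufficient adjustment sets for an assumed DAG. A minimal sketch, using a deliberately simplified stand-in DAG for the TFA example (the variable name Econ is an assumption):

library(dagitty)

# toy DAG: majoring in economics (Econ) confounds TFA and Resentment
g = dagitty("dag { Econ -> TFA ; Econ -> Resentment ; TFA -> Resentment }")

# which sets of variables block all backdoor paths from TFA to Resentment?
adjustmentSets(g, exposure = "TFA", outcome = "Resentment")
## { Econ }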

Conditioning

In order for conditioning to estimate the \(ACE\) without bias, we must assume

\(1\). Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of \(D\) (i.e., for cases with the same values of \(X\), \(D\) must be as-if random)

  • all ‘backdoor’ paths are blocked
  • no conditioning on colliders to ‘unblock’ backdoor path
  • Connects to how imputation and reweighting work

In order for conditioning to estimate the \(ACE\) without bias, we must assume

\(2\). Positivity/Common Support: for all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)

  • There must be variation in the levels of treatment within every stratum of \(X\)
  • If not, we can get other causal estimands but not \(ACE\)

In order for conditioning to estimate the \(ACE\) without bias, we must assume

\(3\). No Measurement Error: if conditioning variables \(X\) are mismeasured, bias will persist.
  • with substantial measurement error, backdoor paths are not blocked
  • even random measurement error in \(X\) is a problem

Two Ways of Conditioning:

  • imputation/matching: “plug in” unobserved potential outcomes \(\color{red}{Y(1)}\) (or \(\color{red}{Y(0)}\)) using observed potential outcomes \(Y(1)\) (or \(Y(0)\)) from cases with the same/similar values of \(\mathbf{X_i}\).
  • reweighting: re-weight members of “treated” and “untreated” groups based on \(Pr(D_i | \mathbf{X_i})\)

Both implicitly assume that, for cases with the same values of the conditioning variables \(\mathbf{X_i}\), \(D_i\) is effectively “as-if” random

Conditioning by Imputation

Conditioning with imputation contrasts with calculating the naive \(\widehat{ACE}\):

Rather than taking this difference:

\(\widehat{ACE} = E[Y(1) | D = 1] - E[Y(0) | D=0]\)

which may be biased… \(ACE \neq E[\widehat{ACE}]\)

Conditioning by Imputation

Instead, we find the effect of \(D\) on \(Y\) within each subset of the data with the same values of \(X_i\).

\(\widehat{ACE}[X_i = x] = E[Y_i(1) | D_i=1, X_i = x] - E[Y_i(0) | D_i = 0, X_i = x]\)

for every unique value of \(x\) in the data.

Conditioning by Imputation

Under the Conditional Independence Assumption,

\(\overbrace{E[Y_i(1) | D_i=1, X_i=x]}^{\text{among treated cases w/ value of X}} = \underbrace{\color{red}{E[Y_i(1) | D_i=0, X_i=x]}}_{\text{among untreated cases w/ value of X}}\)

and \(E[Y_i(0) | D_i=0, X_i=x] = \color{red}{E[Y_i(0) | D_i=1, X_i=x]}\)

In short, the CIA \(\to\) we assume that, with \(X\) held constant, it is like drawing \(Y_i(1)\) and \(Y_i(0)\) at random (VERY IMPORTANT, re: sampling distribution of conditioning estimators)

We then take a weighted average of all \(\widehat{ACE}[X_i = x]\), weighted by \(Pr(X_i = x)\) (the fraction of cases where \(X_i = x\))

\[\widehat{ACE} = \sum\limits_{x} \widehat{ACE}[X = x]Pr(X=x)\]
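
In code, this stratify-then-average logic is compact. A minimal data.table sketch, assuming a table with observed outcome y_i, treatment d_i, and conditioning variable x_i (as constructed in the Example (1) code below):

# within-stratum differences in means, then average weighted by Pr(X = x)
strata = po.table[, .(ace_x = mean(y_i[d_i == 1]) - mean(y_i[d_i == 0]),
                      n_x   = .N),
                  by = x_i]
strata[, weighted.mean(ace_x, w = n_x)]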

Conditioning by Imputation

Why is this imputation?

We’re imputing/plugging in \(\color{red}{unobserved}\) potential outcomes using \(observed\) potential outcomes.

Conditioning: Example

Imagine we have our binary treatment \(TFA_i\) and we want to know its average effect on \(Y_i\), racial resentment.

But we lack random assignment; potential outcomes of \(Y_i\) are dependent on values of \(TFA_i\).

We think that this dependence is induced by majoring in economics (\(X_i\)), which affects \(TFA_i\) and \(Y_i\).

We solve this problem with conditioning.

Economists

Example (1)

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\) \(X_i\)
1 2 1 1 0
2 2 1 1 0
3 3 1 1 1
4 2 1 0 0
5 3 1 0 1
6 3 1 0 1

Example (1): Let’s “condition” on \(X\)

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\) \(X_i\) \(Y_i\)
4 2 1 0 0 2
1 2 1 1 0 1
2 2 1 1 0 1
5 3 1 0 1 3
6 3 1 0 1 3
3 3 1 1 1 1

Calculate \(\widehat{ACE}\) using conditioning

Example (1): Conditioning in R

# add the conditioning variable x_i and the observed outcome y_i
po.table[, x_i := c(0, 0, 1, 0, 1, 1)]
po.table[, y_i := ifelse(d_i == 1, y_i_1, y_i_0)]

# ACE | X = 0
ace_x0 = po.table[x_i == 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
# ACE | X = 1
ace_x1 = po.table[x_i == 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
# ACE hat: a simple mean works here because Pr(X=0) = Pr(X=1) = 1/2
mean(c(ace_x0, ace_x1))
## [1] -1.5
# true ACE
ace
## [1] -1.5
  • Why does conditioning recover the true \(ACE\)?

Example (2): Let’s “condition” on \(X\)

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\) \(X_i\) \(Y_i\)
4 2 1 0 0 2
1 2 1 1 0 1
2 2 1 1 0 1
5 3 1 0 1 3
6 3 3 0 1 3
3 3 1 1 1 1

Calculate \(\widehat{ACE}\) using conditioning

Example (2): Conditioning in R

# Example (2): unit 6 now has y_i_1 = 3, so the true ACE changes
po.table[i == 6, y_i_1 := 3]
ace = po.table[, mean(y_i_1 - y_i_0)]

ace_x0 = po.table[x_i == 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
ace_x1 = po.table[x_i == 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]

# ACE hat
mean(c(ace_x0, ace_x1))
## [1] -1.5
# true ACE
ace
## [1] -1.166667
  • Why does \(\widehat{ACE} \neq ACE\) in this case?

Example (3)

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\) \(X_i\) \(Y_i\)
4 2 1 0 0 2
1 2 1 1 0 1
2 2 1 1 0 1
5 3 1 0 1 3
3 3 1 1 1 1

Calculate \(\widehat{ACE}\) using conditioning

Example (3): Conditioning in R

# Example (3): unit 6 is no longer in the data
po.table = po.table[i != 6]
ace = po.table[, mean(y_i_1 - y_i_0)]

# ACE | X = 0
ace_x0 = po.table[x_i == 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
# ACE | X = 1
ace_x1 = po.table[x_i == 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
# ACE hat: unweighted mean of the two strata
mean(c(ace_x0, ace_x1))
## [1] -1.5
# true ACE
ace
## [1] -1.4
  • Why does \(\widehat{ACE} \neq ACE\) in this case?

Example (3): Conditioning in R

# ACE | X = 0
ace_x0 = po.table[x_i == 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
# ACE | X = 1
ace_x1 = po.table[x_i == 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
# ACE hat: weighting each stratum by its share of cases, Pr(X = x)
weighted.mean(c(ace_x0, ace_x1), w = c(3, 2))
## [1] -1.4
# true ACE
ace
## [1] -1.4
  • Why does \(\widehat{ACE} = ACE\) in this case?
  • We correctly weight by \(Pr(X = x)\)

Example (4): Condition on \(X\)

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\) \(X_i\) \(Y_i\)
8 1 1 1 -1 1
4 2 1 0 0 2
1 2 1 1 0 1
2 2 1 1 0 1
5 3 1 0 1 3
6 3 1 0 1 3
3 3 1 1 1 1
7 5 4 0 2 5

Can we use conditioning to recover the \(ACE\)?

Example (4)

Why can’t conditioning recover the true \(ACE\)?
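
We can see the failure directly in code. A sketch of the Example (4) table (the name po.table4 is hypothetical): the strata at \(X = -1\) and \(X = 2\) contain only treated or only untreated cases, so no within-stratum contrast exists.

po.table4 = data.table(i   = c(8, 4, 1, 2, 5, 6, 3, 7),
                       y_i = c(1, 2, 1, 1, 3, 3, 1, 5),
                       d_i = c(1, 0, 1, 1, 0, 0, 1, 0),
                       x_i = c(-1, 0, 0, 0, 1, 1, 1, 2))

# mean() of an empty vector is NaN: the contrast is undefined in strata
# where everyone is treated (x = -1) or everyone is untreated (x = 2)
po.table4[, .(ace_x = mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])), by = x_i]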

Conditioning by Imputation

  1. Ignorability/Conditional Independence:
    • violated in Example 2
  2. Positivity/Common Support:
    • violated in Example 4

Conditioning: Limitations

  1. Ignorability/Conditional Independence:
    • an untestable assumption: there may be unobserved variables we should condition on
    • can argue this is plausible using theory/knowledge of cases; or argue for partial identification
  2. Positivity/Common Support:
    • with more conditioning variables, positivity is less likely to hold
    • many unique combinations of \(X\) with little variation in \(D\) (curse of dimensionality)
  3. Measurement Error:
    • This exacerbates problems with assumption (1)

Conditioning: Assumptions

Positivity/Common Support: Without positivity, \(\widehat{ACE}\) may be biased, but we can estimate other causal parameters:

  • \(ATT\): Average Treatment effect on the Treated (sketched after this list):
    • requires positivity only for values of \(x\) held by units in treatment
  • \(ATC\): Average Treatment effect on Control
    • requires positivity only for values of \(x\) held by units in control
  • Treatment effects for strata with positivity
    • requires thinking about interpretation (like \(CACE\), for whom does effect hold?)
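
As a sketch, the \(ATT\) analogue of the stratified estimator reweights the within-stratum contrasts by where the treated cases sit, dropping strata with no contrast (names are illustrative, reusing po.table from the examples above):

# ATT: weight each stratum's contrast by Pr(X = x | D = 1)
strata = po.table[, .(ace_x     = mean(y_i[d_i == 1]) - mean(y_i[d_i == 0]),
                      n_treated = sum(d_i)),
                  by = x_i]
strata[is.finite(ace_x), weighted.mean(ace_x, w = n_treated)]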

Conditioning by Imputation: Approaches

  1. Matching: exact matching (what we have been doing), among other types

  2. Regression: interpolate values of \(Y\) as linear and additive in \(X\) (sketch below)

  3. Machine Learning
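
A minimal sketch of the regression version: the coefficient on d_i stands in for the \(ACE\) only if \(Y\) really is linear and additive in \(X\), a functional-form assumption exact matching does not need.

# regression as imputation: linear, additive approximation in X
lm(y_i ~ d_i + x_i, data = po.table)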

Conditioning by Reweighting

Another way of conditioning focuses on the selection into treatment \(D\), rather than imputing potential outcomes of \(Y\).

Conditioning by Reweighting

If there is confounding, then the observed values of \(Y(1)\) are systematically different from \(E[Y_i(1)]\) across all of the cases (bias). Why? (board)

If we know the probability each case was treated (\(Pr(D_i)\)), then we can re-weight the observed values of \(Y(1)\) to reconstruct \(E[Y(1)]\). The same can be done for \(Y(0)\).

TO THE BOARD

Conditioning by Reweighting

Under the conditional independence assumption, \(Pr(D | X = x)\) is the same for all cases with \(X = x\); after adjusting for the probability of treatment \(D\), potential outcomes of \(Y\) are independent of \(D\)

\(Pr(D | X = x)\) is called the propensity score, which we can estimate.

Example (4): Let’s “condition” on \(X\)

\(i\) \(Y_i(0)\) \(Y_i(1)\) \(TFA_i\) \(X_i\) \(Y_i\)
8 1 1 1 -1 1
4 2 1 0 0 2
1 2 1 1 0 1
2 2 1 1 0 1
5 3 1 0 1 3
6 3 1 0 1 3
3 3 1 1 1 1
7 5 4 0 2 5

Conditioning by Reweighting

This approach leads us to the inverse probability weighting estimator of the \(ACE\):

\[\widehat{ACE} = \frac{1}{N}\sum\limits_{i=1}^N \left[\frac{D_iY_i}{\widehat{Pr}(D_i = 1|\mathbf{X_i})}-\frac{(1-D_i)Y_i}{1-\widehat{Pr}(D_i = 1|\mathbf{X_i})}\right]\]
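
A minimal sketch of this estimator on the Example (1) data, estimating the propensity score as the share treated within each stratum of \(X\) (the table rw.table and column p_i are illustrative names):

# observed outcome, treatment, and covariate from Example (1)
rw.table = data.table(y_i = c(1, 1, 1, 2, 3, 3),
                      d_i = c(1, 1, 1, 0, 0, 0),
                      x_i = c(0, 0, 1, 0, 1, 1))

# propensity score Pr(D = 1 | X = x): here, the share treated within x
rw.table[, p_i := mean(d_i), by = x_i]

# inverse probability weighting estimator of the ACE
rw.table[, mean(d_i * y_i / p_i - (1 - d_i) * y_i / (1 - p_i))]
## [1] -1.5

It recovers \(-1.5\), the true \(ACE\) from Example (1).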

Conditioning by Reweighting: Approaches

  1. Estimating Propensity Scores: known probabilities, logit regression (sketched below), machine learning
  2. Inverse probability weighting (see last slide)
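
With many (or continuous) conditioning variables, exact stratum shares are unavailable, so we model the propensity score instead. A sketch with logistic regression, continuing from rw.table above (with a single binary \(X\), the fitted values simply reproduce the stratum shares):

# estimate propensity scores by logistic regression of treatment on X
ps.fit = glm(d_i ~ x_i, family = binomial(), data = rw.table)
rw.table[, p_hat := predict(ps.fit, type = "response")]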

Conditioning: Why not use both?

We can use both imputation and re-weighting at the same time. This is called “doubly robust” estimation, as it can give us an unbiased estimate of the \(ACE\) if either the imputation model or the propensity score model is correct.
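
A minimal AIPW sketch, combining the stratum-mean imputation model with the propensity scores from the reweighting sketch above (all object names are illustrative):

# imputation models: stratum means of Y among treated and untreated
rw.table[, mu1 := mean(y_i[d_i == 1]), by = x_i]
rw.table[, mu0 := mean(y_i[d_i == 0]), by = x_i]

# AIPW: imputation contrast plus propensity-weighted residual corrections;
# unbiased if either the outcome model or the propensity model is right
rw.table[, mean(mu1 - mu0 +
                d_i * (y_i - mu1) / p_i -
                (1 - d_i) * (y_i - mu0) / (1 - p_i))]
## [1] -1.5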

Conditioning:

In the absence of randomization, we condition (by imputation or reweighting). We always assume:

  1. Ignorability/Conditional Independence
  2. Positivity/Common Support

Different methods of conditioning may make additional assumptions. We need to be aware of these and evaluate their plausibility.