\(ACE\) without randomization
Confounding
Conditioning
So far, we have looked at experiments…
What happens when we don’t have randomization?
In the absence of randomization, how do we get unbiased estimates of \(ACE\) or other causal estimands?
Contact Hypothesis:
Teach For America (TFA) is a prominent national service program that integrates top university graduates into low-income communities (often of color) for two years as teachers. Does service in TFA induce participants to harbor lower racial “resentment”?
Contact Hypothesis:
Imagine we survey and compare TFA participants and non-participants graduating from the same universities.
Measure “racial resentment” using a battery of questions (e.g., “Racial discrimination in the US today limits the chances of individuals from [particular racial group] to get ahead?”).
Score \(1\) to \(5\), with \(5\) being greatest resentment (“strongly disagree”), \(1\) being least (“strongly agree”).
Participating in Teach For America involves multiple stages of “selection”/“choice”:
How does this relate to the assumptions we made when analyzing experiments?
We can imagine potential outcomes:
\(i\) indexes individuals
\(TFA_i\) indicates whether person \(i\) participated in TFA (\(1\)) or not (\(0\)); this is our treatment \(D_i\)
\(Y_i(0)\) is “racial resentment” without TFA participation
\(Y_i(1)\) is “racial resentment” with TFA participation
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) |
---|---|---|---|
1 | 2 | 1 | 1 |
2 | 2 | 1 | 1 |
3 | 3 | 1 | 1 |
4 | 2 | 1 | 0 |
5 | 3 | 1 | 0 |
6 | 3 | 1 | 0 |
What is the true \(ACE\)?
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) |
---|---|---|---|
1 | 2 | 1 | 1 |
2 | 2 | 1 | 1 |
3 | 3 | 1 | 1 |
4 | 2 | 1 | 0 |
5 | 3 | 1 | 0 |
6 | 3 | 1 | 0 |
What is the estimate, \(\widehat{ACE}\), if we just compare the means of TFA participants to non-participants?
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) |
---|---|---|---|
1 | 2 | 1 | 1 |
2 | 2 | 1 | 1 |
3 | 3 | 1 | 1 |
4 | 2 | 1 | 0 |
5 | 3 | 1 | 0 |
6 | 3 | 1 | 0 |
library(data.table)
library(magrittr)  #for the %>% pipe

po.table = data.table(i = 1:6,
                      y_i_0 = c(2, 2, 3, 2, 3, 3),
                      y_i_1 = c(1, 1, 1, 1, 1, 1),
                      d_i   = c(1, 1, 1, 0, 0, 0))
#true ACE: mean of individual-level treatment effects
ace = po.table[, y_i_1 - y_i_0] %>% mean
#naive ACE hat: compare observed means of treated and untreated
ace_hat = po.table[, mean(y_i_1[d_i == 1]) - mean(y_i_0[d_i == 0])]
ace
ace_hat
## [1] -1.5
## [1] -1.666667
Because we simply observe who ends up treated and untreated, without randomization there is no set of alternative allocations of TFA enrollment that could have occurred by design.
Confounding occurs when, in the cases we observe, the values of treatment \(D\) are not independent of potential outcomes of \(Y\).
If we ignore confounding, then our estimate of the \(ACE\) will be biased.
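A minimal simulated sketch of this bias (the variables and data-generating process here are hypothetical, not the TFA data): treatment uptake depends on \(Y_i(0)\), so the naive difference in means misses the true \(ACE\).
#hypothetical simulation: uptake depends on Y_i(0), so D is not independent of potential outcomes
set.seed(1)
n  = 1e5
y0 = rnorm(n)                        #Y_i(0)
y1 = y0 - 1                          #Y_i(1): true ACE = -1
d  = rbinom(n, 1, plogis(y0))        #cases with higher Y(0) are more likely to be treated
mean(y1[d == 1]) - mean(y0[d == 0])  #naive estimate: well above -1 (biased)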
Non-independence of potential outcomes
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) |
---|---|---|---|
1 | 2 | 1 | 1 |
2 | 2 | 1 | 1 |
3 | 3 | 1 | 1 |
4 | 2 | 1 | 0 |
5 | 3 | 1 | 0 |
6 | 3 | 1 | 0 |
DAGs:
Graphs of hypothesized causal relationships between variables
Using a DAG:
Confounding exists when there is a backdoor path between the variable of interest \(D\) and the outcome \(Y\).
A backdoor path is a non-causal path from \(D\) to \(Y\):
What are the backdoor paths for “Mammogram”?
The “backdoor” criterion for confounding points to solutions:
Find a causal variable that has no backdoor paths:
Using the \(CACE\) in an Experiment
Exclusion Restriction Violation in \(CACE\): Miguel et al. (2004)
What if we don’t have randomization?
In groups: Create a DAG for the TFA.
The “backdoor” criterion for confounding points to solutions:
Conditioning is when we compare values of \(Y_i\) across values of \(D_i\) (e.g., TFA) using information on the values of other variables \(\mathbf{X_i}\).
The result is that we block backdoor (non-causal) paths from \(D\) to \(Y\).
Condition on variables that
Are there variables we should NOT condition on?
If we want to find the effect of ability on earnings (but we don’t directly observe ability, only exam scores): what variables should we condition on?
Should we condition on University admissions? DAG says it doesn’t affect earnings, but in the real world, we might consider that a possibility…
Admission might be a function of
What happens when we examine the relationship between motivation and academic ability among groups where university admissions are held constant?
(board)
Within strata defined by admission status, ability and motivation are no longer independent.
We may be tempted to condition on as many variables as possible, but this can lead to trouble!
Colliders: a variable that is causally influenced by two or more variables (2+ arrows pointing into it). The causal variables influencing the collider are not necessarily associated with each other.
Condition on all variables that
Do NOT condition on variables that
To find effect of ability (through exam scores) on income, should condition on:
If both motivation and test scores cause admission (but are independent of each other), conditioning on admission (stratifying by admission) can lead us to see negative association between the two.
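A minimal simulated sketch of this collider problem (hypothetical variables, not real admissions data): motivation and test scores are generated independently, but within the admitted group they become negatively associated.
#hypothetical collider simulation: motivation and scores are independent,
#but both raise the chance of admission (the collider)
set.seed(2)
n          = 1e5
motivation = rnorm(n)
scores     = rnorm(n)                              #independent of motivation
admitted   = (motivation + scores + rnorm(n)) > 0  #admission depends on both
cor(motivation, scores)                            #about 0 overall
cor(motivation[admitted], scores[admitted])        #negative among the admitted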
When does this arise?
As a class:
Revisit our TFA DAGs:
Identify variables we need to condition on to close backdoor paths and find effect of TFA on racial resentment
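Once you have a candidate DAG, one option is to let software enumerate the backdoor paths and valid conditioning sets. A minimal sketch using the dagitty package with a deliberately simple, hypothetical TFA DAG (only an economics-major confounder); your group's DAG will likely differ.
library(dagitty)
#hypothetical DAG: majoring in economics affects both TFA participation and resentment
g = dagitty("dag {
  Econ -> TFA
  Econ -> Resentment
  TFA  -> Resentment
}")
adjustmentSets(g, exposure = "TFA", outcome = "Resentment")  #returns { Econ }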
Given an assumed DAG…
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(1\). Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of \(D\) (i.e., for cases with the same values of \(X\), \(D\) must be as-if random)
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(2\). Positivity/Common Support: For all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
In order for conditioning to estimate the \(ACE\) without bias, we must assume
Both implicitly assume that, for cases with the same values of the conditioning variables \(\mathbf{X_i}\), \(D_i\) is effectively “as-if” random
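A quick way to eyeball positivity in data (a sketch, assuming a data.table dt with treatment d and conditioning variable x; the names are illustrative): compute the share treated within each stratum of \(X\) and confirm it lies strictly between 0 and 1.
#share treated within each stratum of X: positivity requires 0 < pr_treated < 1 everywhere
dt[, .(pr_treated = mean(d), n = .N), by = x]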
Conditioning with imputation contrasts with calculating the naive \(\widehat{ACE}\):
Rather than taking this difference:
\(\widehat{ACE} = E[Y(1) | D = 1] - E[Y(0) | D=0]\)
which may be biased… \(ACE \neq E[\widehat{ACE}]\)
Instead, we find the effect of \(D\) on \(Y\) within each subset of the data with the same values of \(X_i\).
\(\widehat{ACE}[X_i = x] = E[Y_i(1) | D_i=1, X_i = x] - E[Y_i(0) | D_i = 0, X_i = x]\)
for every unique value of \(x\) in the data.
Under the Conditional Independence Assumption,
\(\overbrace{E[Y_i(1) | D_i=1, X_i=x]}^{\text{among treated cases w/ value of X}} = \underbrace{\color{red}{E[Y_i(1) | D_i=0, X_i=x]}}_{\text{among untreated cases w/ value of X}}\)
and \(E[Y_i(0) | D_i=0, X_i=x] = \color{red}{E[Y_i(0) | D_i=1, X_i=x]}\)
In short, CIA \(\to\) we assume that, with \(X\) held constant, it is as if \(Y_i(1)\) and \(Y_i(0)\) are drawn at random (VERY IMPORTANT for the sampling distribution of conditioning estimators)
Then take a weighted average of all \(\widehat{ACE}[X_i = x]\), weighted by \(Pr(X_i = x)\) (the fraction of cases where \(X_i = x\))
\[\widehat{ACE} = \sum\limits_{x} \widehat{ACE}[X = x]Pr(X=x)\]
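As a general recipe in R (a sketch, assuming a data.table dt with observed outcome y, treatment d, and conditioning variable x; the column names are illustrative):
#stratum-specific differences in means, then weight by Pr(X = x)
strata = dt[, .(ace_x = mean(y[d == 1]) - mean(y[d == 0]), n = .N), by = x]
strata[, weighted.mean(ace_x, w = n)]  #conditioning estimate of the ACE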
Why is this imputation?
We’re imputing/plugging in \(\color{red}{unobserved}\) potential outcomes using \(observed\) potential outcomes.
Imagine we have our binary treatment \(TFA_i\) and we want to know its average effect on \(Y_i\), racial resentment.
But we lack random assignment; potential outcomes of \(Y_i\) are dependent on values of \(TFA_i\).
We think that this dependence is induced by majoring in economics (\(X_i\)), which affects \(TFA_i\) and \(Y_i\).
We solve this problem with conditioning.
Economists
Example (1)
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) | \(X_i\) |
---|---|---|---|---|
1 | 2 | 1 | 1 | 0 |
2 | 2 | 1 | 1 | 0 |
3 | 3 | 1 | 1 | 1 |
4 | 2 | 1 | 0 | 0 |
5 | 3 | 1 | 0 | 1 |
6 | 3 | 1 | 0 | 1 |
Example (1): Let’s “condition” on \(X\)
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) | \(X_i\) | \(Y_i\) |
---|---|---|---|---|---|
4 | 2 | 1 | 0 | 0 | 2 |
1 | 2 | 1 | 1 | 0 | 1 |
2 | 2 | 1 | 1 | 0 | 1 |
5 | 3 | 1 | 0 | 1 | 3 |
6 | 3 | 1 | 0 | 1 | 3 |
3 | 3 | 1 | 1 | 1 | 1 |
Calculate \(\widehat{ACE}\) using conditioning
Example (1): Conditioning in R
#add covariate x_i and observed outcome y_i to po.table (to match the table above)
po.table[, x_i := c(0, 0, 1, 0, 1, 1)]
po.table[, y_i := ifelse(d_i == 1, y_i_1, y_i_0)]
#ACE | X = 0
ace_x0 = po.table[x_i %in% 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE | X = 1
ace_x1 = po.table[x_i %in% 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE hat: simple average across strata of X (equal stratum sizes here)
mean(c(ace_x0, ace_x1))
#true ACE, for comparison
po.table[, mean(y_i_1 - y_i_0)]
## [1] -1.5
## [1] -1.5
Example (2) Let’s “condition” on \(X\)
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) | \(X_i\) | \(Y_i\) |
---|---|---|---|---|---|
4 | 2 | 1 | 0 | 0 | 2 |
1 | 2 | 1 | 1 | 0 | 1 |
2 | 2 | 1 | 1 | 0 | 1 |
5 | 3 | 1 | 0 | 1 | 3 |
6 | 3 | 3 | 0 | 1 | 3 |
3 | 3 | 1 | 1 | 1 | 1 |
Calculate \(\widehat{ACE}\) using conditioning
Example (2): Conditioning in R
#update po.table to match the table above: y_i_1 for i = 6 is now 3
po.table[i == 6, y_i_1 := 3]
ace_x0 = po.table[x_i %in% 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
ace_x1 = po.table[x_i %in% 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE hat
mean(c(ace_x0, ace_x1))
#true ACE, for comparison
po.table[, mean(y_i_1 - y_i_0)]
## [1] -1.5
## [1] -1.166667
Example (3)
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) | \(X_i\) | \(Y_i\) |
---|---|---|---|---|---|
4 | 2 | 1 | 0 | 0 | 2 |
1 | 2 | 1 | 1 | 0 | 1 |
2 | 2 | 1 | 1 | 0 | 1 |
5 | 3 | 1 | 0 | 1 | 3 |
3 | 3 | 1 | 1 | 1 | 1 |
Calculate \(\widehat{ACE}\) using conditioning
Example (3): Conditioning in R
#drop case i = 6 to match the table above
po.table = po.table[i != 6]
#ACE | X = 0
ace_x0 = po.table[x_i %in% 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE | X = 1
ace_x1 = po.table[x_i %in% 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE hat: simple (unweighted) average across strata of X
mean(c(ace_x0, ace_x1))
#true ACE, for comparison
po.table[, mean(y_i_1 - y_i_0)]
## [1] -1.5
## [1] -1.4
Example (3): Conditioning in R
#ACE | X = 0
ace_x0 = po.table[x_i %in% 0, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE | X = 1
ace_x1 = po.table[x_i %in% 1, mean(y_i[d_i == 1]) - mean(y_i[d_i == 0])]
#ACE hat: weighting by the size of each group in X (3 cases with X = 0, 2 with X = 1)
weighted.mean(c(ace_x0, ace_x1), w = c(3, 2))
#true ACE, for comparison
po.table[, mean(y_i_1 - y_i_0)]
## [1] -1.4
## [1] -1.4
Example (4): Condition on \(X\)
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) | \(X_i\) | \(Y_i\) |
---|---|---|---|---|---|
8 | 1 | 1 | 1 | -1 | 1 |
4 | 2 | 1 | 0 | 0 | 2 |
1 | 2 | 1 | 1 | 0 | 1 |
2 | 2 | 1 | 1 | 0 | 1 |
5 | 3 | 1 | 0 | 1 | 3 |
6 | 3 | 1 | 0 | 1 | 3 |
3 | 3 | 1 | 1 | 1 | 1 |
7 | 5 | 4 | 0 | 2 | 5 |
Can we use conditioning to recover the \(ACE\)?
Why can’t conditioning recover the true \(ACE\)?
Positivity/Common Support: Without positivity, \(\widehat{ACE}\) may be biased, but we can still estimate other causal parameters:
Matching: (exact matching is what we’ve been doing, other types)
Regression: (interpolate values of \(Y\) as linear and additive in \(X\); a minimal sketch follows this list)
Machine Learning:
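For the regression approach above, a minimal sketch (assuming the same hypothetical data.table dt with y, d, and x): under CIA plus the linearity/additivity assumption, the coefficient on d is the conditioning estimate of the \(ACE\).
#regression adjustment: model Y as linear and additive in D and X
fit = lm(y ~ d + x, data = dt)
coef(fit)["d"]  #estimate of the ACE under CIA + linear/additive functional form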
Another way of conditioning focuses on the selection into treatment \(D\), rather than imputing potential outcomes of \(Y\).
If there is confounding, then the observed values of \(Y(1)\) are systematically different from \(E[Y_i(1)]\) across all of the cases (bias). Why? (board)
If we know the probability that each case was treated (\(Pr(D_i)\)), then we can re-weight the observed values of \(Y(1)\) so that we reconstruct \(E[Y(1)]\). The same can be done for \(Y(0)\).
TO THE BOARD
Under the conditional independence assumption, \(Pr(D | X = x)\) is the same for all cases with \(X = x\), and, adjusting for the probability of treatment \(D\), the potential outcomes of \(Y\) are independent of \(D\)
\(Pr(D | X = x)\) is called the propensity score, which we can estimate.
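In practice, the propensity score is typically estimated with a model of treatment given \(\mathbf{X}\); a minimal logistic-regression sketch (same hypothetical data.table dt with treatment d and covariate x):
#estimate the propensity score Pr(D = 1 | X) with logistic regression
ps_model = glm(d ~ x, data = dt, family = binomial)
dt[, p_hat := predict(ps_model, type = "response")]  #estimated propensity score for each case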
Example (1): Let’s “condition” on \(X\)
\(i\) | \(Y_i(0)\) | \(Y_i(1)\) | \(TFA_i\) | \(X_i\) | \(Y_i\) |
---|---|---|---|---|---|
8 | 1 | 1 | 1 | -1 | 1 |
4 | 2 | 1 | 0 | 0 | 2 |
1 | 2 | 1 | 1 | 0 | 1 |
2 | 2 | 1 | 1 | 0 | 1 |
5 | 3 | 1 | 0 | 1 | 3 |
6 | 3 | 1 | 0 | 1 | 3 |
3 | 3 | 1 | 1 | 1 | 1 |
7 | 5 | 4 | 0 | 2 | 5 |
This approach leads us to the inverse probability weighting estimator of the \(ACE\):
\[\widehat{ACE} = \frac{1}{N}\sum\limits_{i=1}^N \frac{D_iY_i}{\widehat{Pr}(D_i|\mathbf{X_i})}-\frac{(1-D_i)Y_i}{1-\widehat{Pr}(D_i|\mathbf{X_i})}\]
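A minimal sketch of this estimator (same hypothetical dt, using the p_hat column estimated in the propensity-score sketch above):
#inverse probability weighting: re-weight observed outcomes by estimated propensity scores
dt[, mean(d * y / p_hat - (1 - d) * y / (1 - p_hat))]  #IPW estimate of the ACE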
We can use both imputation and re-weighting at the same time. This is called “doubly robust” estimation, as it can give us an unbiased estimate of the \(ACE\) if either the imputation model or the propensity score model is correct.
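One common version is the augmented IPW (AIPW) estimator; a minimal sketch with the same hypothetical dt, combining outcome-imputation models with the propensity score p_hat from above (a sketch of the general idea, not the only doubly robust estimator):
#imputation models for Y(1) and Y(0), fit on treated and untreated cases separately
m1 = lm(y ~ x, data = dt[d == 1])
m0 = lm(y ~ x, data = dt[d == 0])
dt[, mu1 := predict(m1, newdata = dt)]  #imputed Y(1) for every case
dt[, mu0 := predict(m0, newdata = dt)]  #imputed Y(0) for every case
#AIPW: imputation plus a propensity-score correction using observed outcomes
dt[, mean((d * (y - mu1) / p_hat + mu1) - ((1 - d) * (y - mu0) / (1 - p_hat) + mu0))]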
In the absence of randomization, we condition (by imputation or reweighting). We always assume:
Different methods of conditioning may make additional assumptions. We need to be aware and evaluate their plausibility.