We never observe unit causal effects, but we can estimate the average causal effect by filling in the missing (\(\color{red}{\text{counterfactual}}\)) potential outcomes.
\[\begin{equation}\begin{split}ACE &= \{E[Y_i(1)|Z_i = 1]\pi_1 + \overbrace{\color{red}{E[Y_i(1)|Z_i = 0]}}^{\text{Mean Y(1) for untreated}}\pi_0\} \\ & \phantom{=}\ - \{\underbrace{\color{red}{E[Y_i(0)|Z_i = 1]}}_{\text{Mean Y(0) for treated}}\pi_1 + E[Y_i(0)|Z_i = 0]\pi_0\} \end{split} \end{equation}\]
Given assumptions, we fill in the maximum/minimum logically possible values for missing potential outcomes to get bounds. Bounds must contain the true causal effect if the assumptions are true.
\[\begin{equation}\begin{split}ACE &= \{E[Y_i(1)|Z_i = 1]\pi_1 + \overbrace{\color{red}{E[Y_i(1)|Z_i = 0]}}^{\text{plug in } Y(1) \text{ for untreated}}\pi_0\} \\ & \phantom{=}\ - \{\underbrace{\color{red}{E[Y_i(0)|Z_i = 1]}}_{\text{plug in } Y(0) \text{ for treated}}\pi_1 + E[Y_i(0)|Z_i = 0]\pi_0\} \end{split} \end{equation}\]
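For example, a sketch with an outcome bounded on \([1, 4]\) (all numbers below are illustrative, not from any study): plugging the maximum \(4\) in for the missing \(E[Y_i(1)|Z_i = 0]\) and the minimum \(1\) in for the missing \(E[Y_i(0)|Z_i = 1]\) gives the upper bound, and vice versa for the lower bound.

#illustrative observed quantities (hypothetical numbers)
mean_y1_treated = 3.0   #E[Y(1) | Z = 1]
mean_y0_control = 2.5   #E[Y(0) | Z = 0]
pi_1 = 0.5; pi_0 = 0.5  #shares treated and untreated
y_min = 1; y_max = 4    #logical bounds of the outcome
upper = (mean_y1_treated*pi_1 + y_max*pi_0) - (y_min*pi_1 + mean_y0_control*pi_0)
lower = (mean_y1_treated*pi_1 + y_min*pi_0) - (y_max*pi_1 + mean_y0_control*pi_0)
c(lower, upper)
## [1] -1.25  1.75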
… random assignment does not guarantee that any particular randomization gives us an exact estimate of unobserved potential outcomes
different realizations of random process \(\to\) sampling variability
The extent of sampling variability depends on the nature of the random process that generates observations.
In any given randomization, treatment mean and control mean are likely \(\neq\) the true means of \(Y(1)\) and \(Y(0)\)…
Given this, we want to know…
Error probabilities: if we use this estimator to estimate the ACE, how likely are we to conclude there is a causal effect when in fact there is not?
What is the range of causal effect sizes we can rule out, with known error probabilities? (If these were the true effects, we would be highly unlikely to observe our estimated effect.)
If we run an experiment (implement a set of randomization procedures that dictate treatment assignment)…
“You give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution.” (Mayo 2018)
What are the other (counterfactual) outcomes that we could have observed through implementing this procedure/test?
Sampling distributions can \(\to\) severe tests of claims
We evaluate hypotheses (claims) about some estimand of interest (e.g. \(ACE\)).
\(H_0: ACE \leq 0;\ H_a: ACE > 0\) (one-sided)
\(H_0: ACE \geq 0;\ H_a: ACE < 0\) (one-sided)
\(H_0: ACE = 0;\ H_a: ACE \neq 0\) (two-sided)
We then use the sampling distribution of the estimate (usually a test statistic), assuming the null hypothesis is true, to see how surprising our observed estimate would be under the null.
The \(p\) value tells us how likely we are to observe data at least as extreme as ours, assuming the null hypothesis is correct.
We want to either reject the null hypothesis as false or fail to reject it.
Since we cannot know for sure whether it is true or false, we choose some probability threshold below which we decide the null hypothesis can be rejected.
\(\alpha\) is that threshold: it sets the significance level and the decision rule for rejecting the null.
Because the null hypothesis is always either True or False, our decision to reject the null based on some probability threshold (\(\alpha\)) may be correct or an error.
With a single test, the \(p\) value can be interpreted as the false positive rate of this test procedure using that value as \(\alpha\). If we use \(\alpha=.05\), we are saying that we are comfortable with our testing procedure making false positive errors in no more than 5% of tests.
From the severe testing perspective, we judge the alternative hypothesis to be “severely tested” when we observe evidence that would only occur 5% of the time if the null hypothesis were true.
If we say, “The result is significantly different from the hypothesized value of zero (\(p=.001\))! We reject that hypothesis!” when the truth is zero, we are making a false positive error (claiming to detect a signal when there is only noise).
If we say, “We cannot distinguish this result from zero (\(p=.3\)). We cannot reject the hypothesis of zero.” when the truth is not zero, we are making a false negative error.
Hypothesis tests can be useful for learning, if used appropriately.
Confidence intervals summarize a group of hypothesis tests.
So a 95% confidence interval contains all null values \(b\) that we fail to reject with \(\alpha = 0.05\).
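For instance, a sketch of this inversion, assuming a normal sampling distribution and illustrative numbers for the estimate and its standard error:

est = 0.67 #hypothetical estimate
se = 1.05  #hypothetical standard error
b_grid = seq(-5, 5, by = 0.001)
#two-sided p value for each candidate null value b
p_vals = 2 * pnorm(abs(est - b_grid)/se, lower.tail = FALSE)
#the 95% CI: all b we fail to reject at alpha = 0.05
range(b_grid[p_vals > 0.05]) #matches est +/- 1.96*se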
Because confidence intervals invert hypothesis tests, they are a product of the random assignment process: any given interval either does or does not include the true parameter value.
The correct way to interpret them is:
Our procedure severely tests the observed \(\widehat{ACE}\) against the possibility that the true \(ACE\) lies outside the confidence interval: if the true \(ACE\) were outside the interval, we would have observed an \(\widehat{ACE}\) smaller (larger) than the one we saw with probability at least \(1-\alpha\).
But hypothesis tests and confidence intervals require that we have access to the sampling distribution of \(\widehat{ACE}\)… how do we get that?
parameter/estimand: unknown attribute of random variable (e.g., the difference in treatment/control group means) that we want to know
estimator: rule/procedure for estimating the parameter/estimand given observed data
bias: estimator is biased if, on average, the estimator yields a value different from the parameter
So \(\widehat{ACE}\) is unbiased if:
\[E(\widehat{ACE}) - ACE = 0\]
Let’s work through an actual experiment: following evidence of the effects of a radio soap opera in Rwanda (Paluck 2009), we consider a variation on the Rwandan study, run in Eastern DRC.
Intolerance: “I would not like that group to belong to my community association”; (1 = totally disagree; 4 = totally agree)
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | 2 |
2 | 4 | 4 |
3 | 4 | 2 |
4 | 2 | 3 |
5 | 2 | 4 |
6 | 4 | 1 |
We assign 3 regions to treatment (soap opera + talk show)
We assign 3 regions to control (soap opera only)
How many possible random assignments are there? (\(\binom{6}{3} = 20\))
What are all possible random assignments (to treatment and control)?
(Hint: use the `combn` function in R.)

1 | 2 | 3 |
1 | 2 | 4 |
1 | 2 | 5 |
1 | 2 | 6 |
1 | 3 | 4 |
1 | 3 | 5 |
1 | 3 | 6 |
1 | 4 | 5 |
1 | 4 | 6 |
1 | 5 | 6 |
2 | 3 | 4 |
2 | 3 | 5 |
2 | 3 | 6 |
2 | 4 | 5 |
2 | 4 | 6 |
2 | 5 | 6 |
3 | 4 | 5 |
3 | 4 | 6 |
3 | 5 | 6 |
4 | 5 | 6 |
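One way to generate this list in R (a minimal sketch):

choose(6, 3) #number of possible assignments
## [1] 20
#each row: one possible set of 3 treated regions
t(combn(6, 3))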
For each randomization, calculate the estimated \(\widehat{ACE}\) (hint, calculate in R, or express this in fractions \(\frac{x}{3}\))
What is the mean \(\widehat{ACE}\)?
How does it compare to the \(ACE\)?
Are there any randomizations where \(\widehat{ACE} = ACE\)?
Let’s check our work:
require(data.table) #load the data.table package
p_o_table = data.table(region_i = 1:6,
                       y_i_1 = c(3,4,4,2,2,4),
                       y_i_0 = c(2,4,2,3,4,1)
                       )
#unit-level treatment effects
p_o_table$tau_i = p_o_table$y_i_1 - p_o_table$y_i_0 #base R syntax
p_o_table[, tau_i := y_i_1 - y_i_0] #equivalent data.table syntax
#ACE: mean of the unit-level effects
ace = mean(p_o_table$tau_i)
ace
## [1] 0.5
Let’s check our work:
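A sketch of the computation behind the printed `t_means` and `c_means` below, assuming assignments are ordered as the columns of `combn(6, 3)`:

assignments = combn(6, 3) #each column: one possible set of treated regions
#treatment-group mean of Y(1) under each assignment
t_means = apply(assignments, 2, function(tr) mean(p_o_table$y_i_1[tr]))
#control-group mean of Y(0) under each assignment
c_means = apply(assignments, 2, function(tr) mean(p_o_table$y_i_0[-tr]))
t_means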
## [1] 3.666667 3.000000 3.000000 3.666667 3.000000 3.000000 3.666667 2.333333
## [9] 3.000000 3.000000 3.333333 3.333333 4.000000 2.666667 3.333333 3.333333
## [17] 2.666667 3.333333 3.333333 2.666667
c_means
## [1] 2.666667 2.333333 2.000000 3.000000 3.000000 2.666667 3.666667 2.333333
## [9] 3.333333 3.000000 2.333333 2.000000 3.000000 1.666667 2.666667 2.333333
## [17] 2.333333 3.333333 3.000000 2.666667
Let’s check our work:
#Average Causal Effect estimate under each assignment
ace_hats = t_means - c_means
#Expected value of the ACE (hat): mean over all 20 assignments
e_ace_hat = mean(ace_hats)
e_ace_hat
## [1] 0.5
#identical to the true ACE
ace
## [1] 0.5
Let’s check our work:
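One way to draw this exact sampling distribution (a sketch using the `ace_hats` vector computed above):

#histogram of all 20 possible ACE-hat estimates
hist(ace_hats, breaks = 10,
     main = "Exact sampling distribution",
     xlab = "ACE-hat")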
The sample difference in means is unbiased.
The histogram is the exact sampling distribution of the \(\widehat{ACE}\) in this experiment.
This sampling distribution could tell us everything we need for inference (e.g., the error probabilities of our tests).
But we never observe this histogram in practice, so we approximate it, using one of three approaches:
Analytic/Asymptotic approach
Randomization inference
Bootstrap: later in the course
All of these approaches involve approximating the sampling distribution we cannot directly observe.
The analytic/asymptotic approach takes three steps:
We assume a known shape for the sampling distribution of the \(\widehat{ACE}\): approximately Normal (or \(t\))
We estimate the variance (or standard deviation) of this sampling distribution
We estimate probability of observing \(\widehat{ACE}\) if \(ACE = 0\) using this distribution
Per the Central Limit Theorem, the sampling distribution of a sum of random variables (and, by extension, of a mean) approaches the normal distribution as \(N \rightarrow \infty\).
Using this fact, we approximate the sampling distribution with a normal distribution, plugging in the estimated sample mean and the estimated variance of the sample mean.
This approximation performs well, but depends on sample size and population distribution.
If the population looks like this: (figure: population distribution)
…predict the shape of the sampling distribution of the sample mean for \(n = 5\).
The shape of the sampling distribution of the sample mean for \(n = 5\): (figure)
If the population looks like this: (figure: a different population distribution)
…predict the shape of the sampling distribution of the sample mean for \(n = 5\).
The shape of the sampling distribution of the sample mean for \(n = 5\): (figure)
The sampling distribution of the sample mean for \(n = 25\): (figure)
The sampling distribution of the sample mean for \(n = 100\): (figure)
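A quick simulation (a sketch with a hypothetical skewed population, not the figures from the slides) reproduces this pattern:

set.seed(1234)
#hypothetical right-skewed population
pop = rexp(1e5, rate = 1)
#simulate the sampling distribution of the sample mean for a given n
sample_means = function(n, reps = 5000) replicate(reps, mean(sample(pop, n)))
par(mfrow = c(1, 3))
for (n in c(5, 25, 100)) {
  hist(sample_means(n), main = paste("n =", n), xlab = "sample mean")
}

The distribution is visibly skewed at \(n = 5\) and close to normal by \(n = 100\).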
Does normality hold in our experiment? (Figures: the exact sampling distribution of \(\widehat{ACE}\) compared against the normal approximation.)
Then: we want to get variance of \(\widehat{ACE}\)
\[Var[X - Y] = Var[X] + Var[Y] - 2 \cdot Cov[X,Y]\]
What is \(Var[Y^T - Y^C] = Var[\widehat{ACE}]\)?
Variances of Treatment/Control Group Means
If we assume independent and identically distributed draws from the study group:
\[Var[Y^T] = \frac{Var[Y_i(1)]}{m}\]
The variance of the sampling distribution of the treatment-group mean is the variance of potential outcomes under treatment for all cases, divided by the treatment-group size \(m\).
Variance of potential outcomes under treatment:
\[Var[Y_i(1)] = \frac{1}{N}\sum\limits_{i=1}^{N} \left( Y_i(1) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(1)}{N}}^{mean \ Y_i(1)} \right) ^2\]
This is a parameter, often denoted \(\sigma^2\)
\[Var[Y^T] = \frac{\sigma^2}{m}\]
We don’t know \(\sigma^2\), so we need to estimate it from our sample.
Like the sample mean, the sample variance (with the \(m-1\) correction below) is an unbiased estimator of the population variance:
\[\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{\color{green}{m-1}}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\]
Why is sample variance biased if we divide by \(m\) (instead of \(m-1\))?
the mean is the value that minimizes the sum of squared errors
If the sample mean \(\hat\mu\) \(\neq\) population mean \(\mu\), then \(\left[ \sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2 \right] < \left[ \sum\limits_{i = 1}^{m} [x_i - \mu]^2 \right]\)
The uncorrected sample variance \(\frac{1}{m} \sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2\) measures spread around \(\hat\mu\) rather than \(\mu\).
So whenever \(\hat\mu \neq \mu\), it is too small, and in expectation it underestimates \(\sigma^2\); dividing by \(m-1\) corrects this.
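A short simulation (a sketch, not from the slides) shows the downward bias of the uncorrected estimator:

set.seed(42)
m = 5
#uncorrected sample variance (divide by m) across many samples from N(0, 1)
v_uncorrected = replicate(1e5, {
  x = rnorm(m)
  mean((x - mean(x))^2)
})
mean(v_uncorrected) #about (m - 1)/m = 0.8, below the true variance of 1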
Using this approach:
\[\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{m-1}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\]
\[\widehat{Var[Y^T]} = \frac{\widehat{\sigma^2}}{m}\]
we can estimate \(Var(Y^T)\) and \(Var(Y^C)\).
What else do we need to estimate \(Var[\widehat{ACE}]\)?
We still need \(Cov(Y^T,Y^C)\) to get variance of \(\widehat{ACE}\), because
\(Var[\widehat{ACE}] = Var[Y^T] + Var[Y^C] - 2 Cov[Y^T, Y^C]\)
\[Cov(Y^T,Y^C) = -\frac{1}{N(N-1)}\sum\limits_{i=1}^{N} \left( Y_i(1) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(1)}{N}}^{mean \ Y_i(1)} \right) \cdot \left(Y_i(0) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(0)}{N}}^{mean \ Y_i(0)} \right)\]
Can’t estimate the covariance because we don’t see both potential outcomes for each case!
We can safely ignore the covariance: dropping the covariance term makes our estimated variance conservative (how does this relate to severity?).
Variances we obtain with \(\widehat{Var}[\widehat{ACE}]\) are going to be at least as large, in expectation, as the true sampling variance.
So…
\[\widehat{Var}[\widehat{ACE}] = \widehat{Var}[Y^T] + \widehat{Var}[Y^C]\]
Variance alone is not what we want… we want to use it to test hypotheses. Let’s work through the example.
Let’s assume that regions 1, 2, and 4 are treated, and the others are untreated
Let’s calculate \(\widehat{ACE}\)…
region_i | y_i_1 | y_i_0 | tau_i |
---|---|---|---|
1 | 3 | 2 | 1 |
2 | 4 | 4 | 0 |
3 | 4 | 2 | 2 |
4 | 2 | 3 | -1 |
5 | 2 | 4 | -2 |
6 | 4 | 1 | 3 |
#observed Y(1) for the treated regions (1, 2, 4)
y1s = p_o_table[region_i %in% c(1,2,4), y_i_1]
#observed Y(0) for the control regions (3, 5, 6)
y0s = p_o_table[!(region_i %in% c(1,2,4)), y_i_0]
ace_hat = mean(y1s) - mean(y0s)
ace_hat
## [1] 0.6666667
Now, let’s calculate the estimated variance of the \(\widehat{ACE}\)
\(\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{m-1}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\)
#unbiased sample variance of the potential outcomes
var_y_10 = function(y) {
  m = length(y)
  mean_y = mean(y)
  1/(m-1)*sum((y-mean_y)^2)
}
#Variance Y(1)
var_y_10(y1s)
## [1] 1
#Variance Y(0)
var_y_10(y0s)
## [1] 2.333333
Now, let’s calculate the estimated variance of the \(\widehat{ACE}\)
\(\widehat{Var[Y^T]} = \frac{\widehat{\sigma^2}}{m}\)
#variance of the sampling distribution of a group mean
var_y_tc = function(y) {
  var_y_hat = var_y_10(y)
  var_y_hat/length(y)
}
#Var Y_T (treatment group sample mean)
var_y_tc(y1s)
## [1] 0.3333333
#Var Y_C (control group sample mean)
var_y_tc(y0s)
## [1] 0.7777778
Now, let’s calculate the estimated variance of the \(\widehat{ACE}\)
\(\widehat{Var}[\widehat{ACE}] = \widehat{Var}[Y^T] + \widehat{Var}[Y^C]\)
var_ace_hat = function(y1, y0) {
  var_y_tc(y1) + var_y_tc(y0)
}
#Variance ACE hat (our estimate)
var_ace_hat(y1s, y0s)
## [1] 1.111111
Now, let’s use this to do a one-sided hypothesis test: \(H_a: ACE > 0\); \(H_0: ACE \leq 0\)
First, we need get the \(t\) statistic (relative to the null):
\[t = \frac{\widehat{ACE} - ACE_{H_0}}{\widehat{SE}(\widehat{ACE})}\]
\(\widehat{SE}\) is just the square root of the estimated variance:

se_hat = sqrt(var_ace_hat(y1s, y0s))
#t statistic against the null value of 0
t = ace_hat / se_hat
t
## [1] 0.6324555
Next we need to calculate the \(p\) value, which requires knowing the distribution. The \(t\) distribution depends on degrees of freedom; here the Welch–Satterthwaite approximation gives 3.4482759.
We want to know the probability of observing \(t =\) 0.6324555 or larger:
#we look at the UPPER TAIL: the probability of observing t or larger under the null
df = 3.4482759 #Welch-Satterthwaite degrees of freedom
pt(t, df = df, lower.tail = F)
## [1] 0.2832836
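As a check (a sketch, not in the original slides), R’s built-in Welch \(t\) test should reproduce these numbers:

#one-sided Welch t test; t.test does not pool variances by default
t.test(y1s, y0s, alternative = "greater")
#should report t = 0.63246, df = 3.4483, p-value = 0.2833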
Compare assumed \(t\) distribution against actual sampling distribution (shifted to be centered at 0)
If distributional assumptions are wrong, hypothesis test will not be correct
Unlike the analytic approach, randomization inference tests a different null hypothesis.
Usually null hypothesis is that the average effect is \(0\) (some units could have positive or negative effects).
\[\frac{1}{N}\sum\limits_{i=1}^{N} \tau_i = ACE = 0\]
Randomization inference instead tests the sharp null hypothesis
\[\tau_i = 0 \quad \forall \ i \in \{1, \dots, N\}\]
that every unit treatment effect is \(0\).
Advantages:
Disadvantages:
In practice:
We run Paluck’s experiment and see this:
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | ? |
2 | 4 | ? |
3 | ? | 2 |
4 | 2 | ? |
5 | ? | 4 |
6 | ? | 1 |
Under the sharp null, what are the values that are “?”?
Under the sharp null, this would be true:
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | \(\color{red}{3}\) |
2 | 4 | \(\color{red}{4}\) |
3 | \(\color{red}{2}\) | 2 |
4 | 2 | \(\color{red}{2}\) |
5 | \(\color{red}{4}\) | 4 |
6 | \(\color{red}{1}\) | 1 |
Once we have this response schedule under the sharp null, we: recalculate \(\widehat{ACE}\) for every possible random assignment, then take the \(p\) value to be the share of assignments that yield an estimate at least as extreme as the one we observed.
If \(\widehat{ACE}=\) 0.6666667, then the one-sided \(p\) value (the share of the 20 possible assignments yielding \(\widehat{ACE} \geq\) 0.6666667 under the sharp null) is 0.4.
In R:
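A minimal sketch of this computation (variable names are illustrative):

#observed outcomes under the actual assignment (treated regions: 1, 2, 4)
y_obs = c(3, 4, 2, 2, 4, 1)
treated_obs = c(1, 2, 4)
obs_ace = mean(y_obs[treated_obs]) - mean(y_obs[-treated_obs])
#under the sharp null, Y(1) = Y(0) = y_obs for every region,
#so we can recompute ACE-hat under every possible assignment
assignments = combn(6, 3)
ri_aces = apply(assignments, 2, function(tr) mean(y_obs[tr]) - mean(y_obs[-tr]))
#one-sided p value: share of assignments at least as large as observed
mean(ri_aces >= obs_ace)
## [1] 0.4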