Michael Weaver

January 19, 2024

Potential Outcomes Model and Sampling Variability

Plan for Today

  • Review \(ACE\) (and unbiasedness of \(\widehat{ACE}\))
  • Sampling variability of \(\widehat{ACE}\)
  • Three ways to estimate variance of \(\widehat{ACE}\)
    • asymptotic
    • randomization inference
    • bootstrap


Key ideas from last week

  • Potential Outcomes Model (and Neyman Causal Model)
  • Partial and Point Identification
  • Reviewed random variables and their expectations

Potential Outcomes and Causal effects

We never observe unit causal effects, but can estimate average causal effect, by filling in missing (\(\color{red}{\text{counterfactual}}\)) potential outcomes.

\[\begin{equation}\begin{split}ACE &= \{E[Y_i(1)|Z_i = 1]\pi_1 + \overbrace{\color{red}{E[Y_i(1)|Z_i = 0]}}^{\text{Mean Y(1) for untreated}}\pi_0\} \\ & \phantom{=}\ - \{\underbrace{\color{red}{E[Y_i(0)|Z_i = 1]}}_{\text{Mean Y(0) for treated}}\pi_1 + E[Y_i(0)|Z_i = 0]\pi_0\} \end{split} \end{equation}\]

Partial Identification

Given assumptions, we fill in the maximum/minimum logically possible values for missing potential outcomes to get bounds. Bounds must contain the true causal effect if the assumptions are true.

Random Variables

random variable: a chance procedure for generating a number

observed value (realization): value of a particular draw of a random variable.

Random Variables

  1. Arithmetic operations on random variables are new random variables (e.g., sum and mean)

  2. Expected value of a random variable \(X\) is the mean of all possible realizations of \(X\)

  3. Independence and Dependence: random variables \(X,Y\) are independent if knowing value of \(X\) does not yield information about value of \(Y\).

  4. Mean of \(n\) realizations (sample) from random variable \(X\) is also a random variable, with mean same as \(E[X]\). (intuition for proof?)

Point Identification

  • Random assignment of \(Z\) lets us plug in sample means for unobserved potential outcomes: Why?

\[\begin{equation}\begin{split}ACE &= \{E[Y_i(1)|Z_i = 1]\pi_1 + \overbrace{\color{red}{E[Y_i(1)|Z_i = 0]}}^{\text{plug in Y(1) for treated}}\pi_0\} \\ & \phantom{=}\ - \{\underbrace{\color{red}{E[Y_i(0)|Z_i = 1]}}_{\text{plug in Y(0) for untreated}}\pi_1 + E[Y_i(0)|Z_i = 0]\pi_0\} \end{split} \end{equation}\]


  • … random assignment does not guarantee that any particular randomization gives us exact estimate of unobserved potential outcomes

  • different realizations of random process \(\to\) sampling variability

The extent of sampling variability depends on the nature of the random process that generates observations.

Sampling variability

In any given randomization, treatment mean and control mean are likely \(\neq\) the true means of \(Y(1)\) and \(Y(0)\)

We want to know:

  1. By how much is the difference in sample means likely to deviate from the true average causal effect?
  2. Given \(\widehat{ACE}\), how confident can we be that \(ACE \neq 0\)?

Key idea

Sampling Distribution of the Mean

  • What is it?
  • We’ll do an example

Bonus idea

Unbiasedness of \(\widehat{ACE}\)

parameter/estimand: unknown attribute of random variable (e.g., the mean) that we want to know

estimator: rule/procedure for estimating the parameter/estimand given observed data

bias: estimator is biased if, on average, the estimator yields a value different from the parameter

So \(\widehat{ACE}\) is unbiased if:

\[E(\widehat{ACE}) - ACE = 0\]


Paluck (2010)

Following evidence of effects of soap opera in Rwanda (2009):

Paluck (2010)

Variation on Rwandan study, in Eastern DRC.

Paluck (2010)

Intolerance: “I would not like that group to belong to my community association”; (1 = totally disagree; 4 = totally agree)

\(Region_i\) \(Y_i(1)\) \(Y_i(0)\)
1 3 2
2 4 4
3 4 2
4 2 3
5 2 4
6 4 1

What are the unit treatment effects? (\(\tau_i\))

\(Region_i\) \(Y_i(1)\) \(Y_i(0)\)
1 3 2
2 4 4
3 4 2
4 2 3
5 2 4
6 4 1

What is the Average Causal Effect (\(ACE\))?

\(Region_i\) \(Y_i(1)\) \(Y_i(0)\)
1 3 2
2 4 4
3 4 2
4 2 3
5 2 4
6 4 1

Imagine we run this experiment

  • We set 3 regions in treatment (soap opera + talk show)

  • We set 3 regions in control (soap opera only)

  • How many possible random assignments are there?

  • What are all possible random assignments (to treatment and control)?

    • on paper/in R

All possible treatment groups

1 2 3
1 2 4
1 2 5
1 2 6
1 3 4
1 3 5
1 3 6
1 4 5
1 4 6
1 5 6
2 3 4
2 3 5
2 3 6
2 4 5
2 4 6
2 5 6
3 4 5
3 4 6
3 5 6
4 5 6

For each randomization, calculate the \(\widehat{ACE}\) (hint, express this in fractions \(\frac{x}{3}\))

\(Region_i\) \(Y_i(1)\) \(Y_i(0)\)
1 3 2
2 4 4
3 4 2
4 2 3
5 2 4
6 4 1

Across all randomizations…

  • What is the mean \(\widehat{ACE}\)?

  • How does it compare to the \(ACE\)?

  • Are there any \(\widehat{ACE} = ACE\)

Let’s check our work:

require(data.table) # provides data.table() and the := operator

# Full potential outcomes table (known only in this toy example)
p_o_table = data.table(region_i = 1:6,
                       y_i_1 = c(3,4,4,2,2,4),
                       y_i_0 = c(2,4,2,3,4,1))

# Unit treatment effects
p_o_table[, tau_i := y_i_1 - y_i_0]

# Average causal effect
ace = mean(p_o_table$tau_i)
ace
## [1] 0.5

Let’s check our work:

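require(magrittr) # %>% is from this package
# combn() ships with base R (package utils), so no extra package is needed

# Each row is one possible treatment group of 3 regions
randomizations = combn(6, 3, simplify = TRUE) %>% t

# Mean of Y(1) among the treated regions, for each randomization
t_means = apply(randomizations, 1, function(x)
                  mean(p_o_table[region_i %in% x, y_i_1]))

# Mean of Y(0) among the control regions, for each randomization
c_means = apply(randomizations, 1, function(x)
                  mean(p_o_table[!(region_i %in% x), y_i_0]))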

Let’s check our work:

##  [1] 3.666667 3.000000 3.000000 3.666667 3.000000 3.000000 3.666667 2.333333
##  [9] 3.000000 3.000000 3.333333 3.333333 4.000000 2.666667 3.333333 3.333333
## [17] 2.666667 3.333333 3.333333 2.666667
##  [1] 2.666667 2.333333 2.000000 3.000000 3.000000 2.666667 3.666667 2.333333
##  [9] 3.333333 3.000000 2.333333 2.000000 3.000000 1.666667 2.666667 2.333333
## [17] 2.333333 3.333333 3.000000 2.666667

Let’s check our work:

#Average Causal Effects (hat)
ace_hats = t_means - c_means

#Expected value of the ACE (hat)
e_ace_hat = mean(ace_hats)
e_ace_hat
## [1] 0.5
ace
## [1] 0.5

Let’s check our work:


  1. Sample difference in means is unbiased

    • mean of the sampling distribution of sample \(\widehat{ACE}\) is same as population \(ACE\)
    • yet none of the samples estimate population \(ACE\) exactly.
  2. Histogram is the exact sampling distribution of the \(\widehat{ACE}\) in this experiment

    • shows us how likely it is we observe sample \(\widehat{ACE}\) by chance (using this randomization scheme)
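The exact sampling distribution described above can be tabulated directly from the 20 randomizations. This is a minimal sketch in base R that re-enters the potential outcomes table; the names `y1`, `y0`, `groups`, and `samp_dist` are illustrative and do not come from the original code.

```r
# Potential outcomes for the 6 regions (from the table above)
y1 <- c(3, 4, 4, 2, 2, 4)
y0 <- c(2, 4, 2, 3, 4, 1)

# All choose(6, 3) = 20 possible treatment groups, one per column
groups <- combn(6, 3)

# Difference in means for each possible randomization
# (y0[-g] selects the regions NOT in treatment group g)
ace_hats <- apply(groups, 2, function(g) mean(y1[g]) - mean(y0[-g]))

# Exact sampling distribution: probability of each possible estimate
samp_dist <- table(round(ace_hats, 3)) / ncol(groups)
print(samp_dist)

# Mean of the sampling distribution, to compare with the ACE of 0.5
mean(ace_hats)
```

Calling `barplot(samp_dist)` on the result would draw the distribution as a bar chart.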


This sampling distribution could tell us

  • how likely we are to observe \(\widehat{ACE}\) by chance
  • variability in estimate \(\widehat{ACE}\)
  • also the true \(ACE\)

But we never observe this histogram

  • we need a way to estimate the sampling distribution of \(\widehat{ACE}\).

Variance of \(\widehat{ACE}\)

Three approaches

  1. Analytic/Asymptotic approach

  2. Randomization inference

  3. Bootstrap

Three approaches

All approaches involve:

  • assuming a particular process of randomization by which realizations from random variable are created
  • need to answer:
  • what do we want to make inferences about? the study cases or a broader population (randomness in which cases we see)? counterfactuals (randomness in which potential outcomes we see)?
  • what process generates the randomness? (sometimes this is clear, sometimes not at all)

Analytic Approach

First: we want to get variance of \(\widehat{ACE}\)

  • \(\widehat{ACE}\) is a difference of random variables. Why?
  • Need rules for calculating variance of difference between random variables

\[Var[X - Y] = Var[X] + Var[Y] - 2 \cdot Cov[X,Y]\]

Analytic Approach

  • study group of size \(N\) (\(N\) units are assigned)
  • \(m\) units assigned to T; \(N - m = n\) units assigned to C
  • \(Y^T = \frac{1}{m}\sum\limits_{i=1}^{m}[\overbrace{Y_i(1) | Z_i = 1}^{\mathrm{observed \ Y \ for \ Treated}}]\)
  • \(Y^C = \frac{1}{n}\sum\limits_{i=m+1}^{N}[\underbrace{Y_i(0) | Z_i= 0}_{\mathrm{observed \ Y \ for \ Control}}]\)
  • \(\widehat{ACE} = Y^T - Y^C\)

What is \(Var[Y^T - Y^C] = Var[\widehat{ACE}]\)?

  • \(Var[Y^T - Y^C] = Var[Y^T] + Var[Y^C] - 2 Cov[Y^T, Y^C]\)
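In the six-region example every possible randomization can be enumerated, so this decomposition can be checked numerically. The sketch below is only an illustration; the helper names `pvar` and `pcov` are hypothetical and compute variance and covariance over the 20 equally likely randomizations (dividing by 20, not 19).

```r
# Potential outcomes for the 6 regions (toy example)
y1 <- c(3, 4, 4, 2, 2, 4)
y0 <- c(2, 4, 2, 3, 4, 1)
groups <- combn(6, 3)  # each column is one possible treatment group

y_t <- apply(groups, 2, function(g) mean(y1[g]))   # treatment-group means
y_c <- apply(groups, 2, function(g) mean(y0[-g]))  # control-group means

# Variance and covariance over the 20 equally likely randomizations
pvar <- function(x) mean((x - mean(x))^2)
pcov <- function(x, y) mean((x - mean(x)) * (y - mean(y)))

lhs <- pvar(y_t - y_c)                             # Var[Y^T - Y^C]
rhs <- pvar(y_t) + pvar(y_c) - 2 * pcov(y_t, y_c)  # the decomposition

c(lhs = lhs, rhs = rhs)
```

Both sides should agree up to floating point error.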

Analytic Approach

Variances of Treatment/Control Group Means

if we assume independent and identically distributed draws from the study group

  • (is this correct?)

\[Var[Y^T] = \frac{Var[Y_i(1)]}{m}\]

Variance of sampling distribution of the treatment-group mean is variance of potential outcomes under treatment for all cases divided by the treatment group size

Analytic Approach

Variance of potential outcomes under treatment:

\[Var[Y_i(1)] = \frac{1}{N}\sum\limits_{i=1}^{N} \left( Y_i(1) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(1)}{N}}^{mean \ Y_i(1)} \right) ^2\]

This is a parameter, often denoted \(\sigma^2\)

\[Var[Y^T] = \frac{\sigma^2}{m}\]

  • Why is \(\sigma^2\) a parameter?

Analytic Approach

We don’t know \(\sigma^2\), we need to estimate it from our sample.

Like sample mean, sample variance is an unbiased estimator of population variance:

  • if we divide by \(m - 1\) not \(m\)

\[\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{\color{red}{m-1}}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\]


Why is sample variance biased if we divide by \(m\) (instead of \(m-1\))?

  • the mean is the value that minimizes the sum of squared errors

  • If the sample mean \(\hat\mu\) \(\neq\) population mean \(\mu\), then \(\left[ \sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2 \right] < \left[ \sum\limits_{i = 1}^{m} [x_i - \mu]^2 \right]\)

  • Uncorrected sample variance \(\widehat{\sigma^2}\) is \(\frac{1}{m} \sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2\).

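  • Then \(\frac{1}{m}\sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2 < \frac{1}{m}\sum\limits_{i = 1}^{m} [x_i - \mu]^2\) unless the sample mean equals the population mean. The right-hand side equals \(\sigma^2\) on average, so the uncorrected \(\widehat{\sigma^2}\) is too small on average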
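The argument above can be illustrated by enumerating every possible sample. This sketch assumes i.i.d. draws (sampling with replacement) from a small hypothetical population `pop`, which is not data from the lecture.

```r
# Hypothetical population: each value equally likely on every draw
pop <- c(1, 2, 3, 4)
sigma2 <- mean((pop - mean(pop))^2)  # true variance of one draw

m <- 3
# Every possible ordered sample of size m, drawn with replacement
samples <- expand.grid(rep(list(pop), m))

# Sum of squared deviations from each sample's own mean
ss <- apply(samples, 1, function(x) sum((x - mean(x))^2))

mean_uncorrected <- mean(ss / m)        # dividing by m
mean_corrected   <- mean(ss / (m - 1))  # dividing by m - 1

c(sigma2 = sigma2,
  uncorrected = mean_uncorrected,
  corrected = mean_corrected)
```

Averaged over all samples, the version dividing by \(m - 1\) should match `sigma2`, while the version dividing by \(m\) should fall short by the factor \((m-1)/m\).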


Analytic Approach

Using this approach:

\[\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{m-1}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\]

\[\widehat{Var[Y^T]} = \frac{\widehat{\sigma^2}}{m}\]

we can estimate \(Var(Y^T)\) and \(Var(Y^C)\).

What else do we need to estimate \(Var[\widehat{ACE}]\)?

Analytic Approach

We still need \(Cov(Y^T,Y^C)\) to get variance of \(\widehat{ACE}\), because

\(Var[\widehat{ACE}] = Var[Y^T] + Var[Y^C] - 2 Cov[Y^T, Y^C]\)

\[Cov(Y^T,Y^C) = -\frac{1}{N(N-1)}\sum\limits_{i=1}^{N} \left( Y_i(1) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(1)}{N}}^{mean \ Y_i(1)} \right) \cdot \left(Y_i(0) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(0)}{N}}^{mean \ Y_i(0)} \right)\]

  • Can we estimate this?

Analytic Approach

Can’t estimate the covariance because we don’t see both potential outcomes for each case!
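In the toy example, where the full potential outcomes table is known by construction, the covariance formula can still be checked against brute-force enumeration. This is a sketch for intuition only; it is not something that can be done with real experimental data.

```r
# Potential outcomes for the 6 regions (toy example)
y1 <- c(3, 4, 4, 2, 2, 4)
y0 <- c(2, 4, 2, 3, 4, 1)
N <- length(y1)
groups <- combn(N, 3)  # all 20 possible treatment groups

y_t <- apply(groups, 2, function(g) mean(y1[g]))   # treatment-group means
y_c <- apply(groups, 2, function(g) mean(y0[-g]))  # control-group means

# Covariance of the two group means across the 20 randomizations
cov_enumerated <- mean((y_t - mean(y_t)) * (y_c - mean(y_c)))

# Covariance from the closed-form expression
cov_formula <- -sum((y1 - mean(y1)) * (y0 - mean(y0))) / (N * (N - 1))

c(enumerated = cov_enumerated, formula = cov_formula)
```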

  • What can we do?

Analytic Approach

We can safely ignore the covariance, because the resulting variance estimate is, on average, never too small (it is conservative).

  1. Estimator \(\widehat{Var[Y^T]}\) ignores that we sample without replacement from finite population (\(\to\) estimated variance too large)
  2. This either exactly or more than offsets the amount by which ignoring \(Cov(Y^T,Y^C)\) makes our estimate \(\widehat{Var}(\widehat{ACE})\) too small

Analytic Approach

Variances we obtain with \(\widehat{Var}[\widehat{ACE}]\) are going to be:

  • exactly correct (if \(\tau_i\) is the same for all \(i\); effect is same for all cases)
  • TOO LARGE, conservative in all other cases.
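The conservativeness claim can be checked in the six-region example by comparing the true variance of \(\widehat{ACE}\) with the average of the estimated variance across all randomizations. A sketch under the assumption that the estimator is the sum of the two estimated group-mean variances, with the covariance term dropped:

```r
# Potential outcomes for the 6 regions (toy example)
y1 <- c(3, 4, 4, 2, 2, 4)
y0 <- c(2, 4, 2, 3, 4, 1)
groups <- combn(6, 3)
m <- 3  # treated units
n <- 3  # control units

# True variance of ACE-hat over the 20 equally likely randomizations
ace_hats <- apply(groups, 2, function(g) mean(y1[g]) - mean(y0[-g]))
true_var <- mean((ace_hats - mean(ace_hats))^2)

# Estimated variance in each randomization, ignoring the covariance;
# var() divides by (group size - 1)
var_hats <- apply(groups, 2, function(g) var(y1[g]) / m + var(y0[-g]) / n)

c(true_var = true_var, mean_estimate = mean(var_hats))
```

Unit effects differ across regions here, so the average estimate should exceed the true variance.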

Analytic Approach

We’ve been trying to estimate the variance of the \(\widehat{ACE}\).

Variance is not usually what we want

  • units are squared.
  • The standard error (standard deviation of the sampling distribution) of \(\widehat{ACE}\) is more helpful. It is the square-root of the variance.
  • Usually, though, we want to do a hypothesis test or create confidence intervals

Hypothesis Testing

Normal approximation:

Per the Central Limit Theorem: the sampling distributions of sums of random variables (and by extension, their means) approach the normal distribution as \(N \rightarrow\infty\).

Using this fact, together with the estimated sample mean and the estimated variance of the sample mean:

  • Use the area under the normal (usually \(t\): why?) curve to compute the probability of observing a sample mean at least this extreme, given a null hypothesis.

This approximation performs well, but depends on sample size and population distribution.
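A small simulation can show how the quality of the approximation depends on sample size. This sketch assumes a hypothetical skewed population (exponential with mean 1), which is not the population used in the lecture figures; the seed and number of replications are arbitrary.

```r
set.seed(2024)  # arbitrary seed, for reproducibility

# Simulated sampling distribution of the mean for samples of size n,
# drawn from a hypothetical exponential population
draw_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Sample skewness: close to 0 for a symmetric (e.g. normal) distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(5, 25, 100)
results <- t(sapply(sizes, function(n) {
  means <- draw_means(n)
  c(n = n, sd = sd(means), skewness = skewness(means))
}))

print(round(results, 3))
```

As \(n\) grows, the spread of the sample means should shrink and the skewness should move toward zero, which is what the normal approximation requires.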

Normal approximation:

If the population looks like this:

Normal approximation:

Predict the shape of sampling distribution of sample mean for (\(n = 5\))

Normal approximation:

The shape of sampling distribution of sample mean for (\(n = 5\))

Normal approximation:

If the population looks like this:

Normal approximation:

Predict the shape of sampling distribution of sample mean for (\(n = 5\))

Normal approximation:

The shape of sampling distribution of sample mean for (\(n = 5\))

Normal approximation:

The sampling distribution of sample mean for (\(n = 25\)) is:

Normal approximation:

The sampling distribution of sample mean for (\(n = 100\)) is:

Normal approximation:

Does normality hold in our experiment?

Hypothesis Tests

We run an experiment on our 6 regions, and observe \(\widehat{ACE} = 0.667\)

The hypothesis test investigates: what is the probability of observing a value this large or larger if the true \(ACE = 0\)?

  • Null hypothesis \(H_0\) is \(ACE = 0\)
  • alternative hypotheses are \(ACE > 0\) or \(ACE < 0\) or \(ACE \neq 0\).

Hypothesis Tests

  • choose an \(\alpha\): arbitrary cutoff for rejecting null hypothesis
  • We need to estimate the standard error of \(\widehat{ACE}\): \(\widehat{SE}(\widehat{ACE})\)
  • Divide \(\widehat{ACE}\) by \(\widehat{SE}\) (get \(t\) statistic).
  • Compare test-statistic against \(t\) distribution with appropriate degrees of freedom, calculate \(p\) value (interpretation depends on the alternative hypothesis)
  • and reject/fail to reject null (given choice of \(\alpha\): the significance level)
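The steps above can be sketched in base R using the observed outcomes from the randomization in which regions 1, 2, and 4 are treated (the same data shown in the randomization inference example later in these notes). The choice of Welch degrees of freedom via `t.test(..., var.equal = FALSE)` is an assumption; the slides do not specify which degrees of freedom to use.

```r
# Observed outcomes from one randomization (regions 1, 2, 4 treated)
y_t <- c(3, 4, 2)  # observed Y(1) for treated regions
y_c <- c(2, 4, 1)  # observed Y(0) for control regions

# Point estimate and estimated standard error (covariance ignored)
ace_hat <- mean(y_t) - mean(y_c)
se_hat  <- sqrt(var(y_t) / length(y_t) + var(y_c) / length(y_c))

# t statistic: estimate divided by its estimated standard error
t_stat <- ace_hat / se_hat

# Same test via base R; two-sided alternative ACE != 0, Welch df
test <- t.test(y_t, y_c, alternative = "two.sided", var.equal = FALSE)

c(ace_hat = ace_hat, se_hat = se_hat, t_stat = t_stat,
  p_value = test$p.value)
```

With only three units per group, the distributional assumptions behind this \(p\) value are hard to defend, which motivates the limitations discussed next.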

Limitations of Analytical Hypothesis Tests

  • \(t\)-tests assume normality in potential outcomes
  • or assume asymptotic normality of sampling distribution of sample mean. This may be incorrect with small \(N\) or complex designs. (What about in our example?)

If distributional assumptions are wrong, hypothesis test will not be correct

  • need to investigate plausibility of assumption:
    • is \(N\) large?
    • Are values of \(Y(1) | Z=1\) and \(Y(0) | Z =0\) approximately normal?


randomization inference

Randomization Inference

Unlike analytical approach:

  • randomization inference does not assume any asymptotic distribution
  • can be applied to many different estimators (not just, e.g., the \(\widehat{ACE}\))

Randomization Inference

Tests a different null hypothesis

Usually null hypothesis is that the average effect is \(0\) (some units could have positive or negative effects).

\[\frac{1}{N}\sum\limits_{i=1}^{N} \tau_i = ACE = 0\]

Randomization inference tests the sharp null hypothesis

\[\tau_i = 0 \ \ \ \forall \ \ i \in \{1, \ldots, N\}\]

that every unit treatment effect is \(0\).

Randomization Inference


Advantages:

  • No assumptions about distributions (no assumption of normality, no asymptotic convergence)
  • directly flows from potential outcomes model
  • any test statistic (median, ranks, whatever)

Limitations:

  • only tests sharp null which may be less interesting\(^*\)
  • Confidence intervals only for constant effects
  • \(p\) values limited by possible combinations (too few, and they are coarse; too many to enumerate, and they must be approximated by sampling)

Randomization Inference

In practice:

  • Null hypothesis of sharp null lets us complete the potential outcomes table/response schedule based on the observed data
  • Why?
  • Sharp null implies that \(Y_i(1) = Y_i(0)\) for all units \(i\).

Randomization Inference

We run Paluck’s experiment and see this:

\(Region_i\) \(Y_i(1)\) \(Y_i(0)\)
1 3 ?
2 4 ?
3 ? 2
4 2 ?
5 ? 4
6 ? 1

Under the sharp null, what are the values that are “?”?

Randomization Inference

Under the sharp null, this would be true:

\(Region_i\) \(Y_i(1)\) \(Y_i(0)\)
1 3 \(\color{red}{3}\)
2 4 \(\color{red}{4}\)
3 \(\color{red}{2}\) 2
4 2 \(\color{red}{2}\)
5 \(\color{red}{4}\) 4
6 \(\color{red}{1}\) 1

Randomization Inference

Once we have this response schedule under the sharp null, we:

  1. Create all possible permutations of randomizations (or sample them, if this number is large)
  2. Calculate the difference in means (or other statistic) for these different randomizations.
  3. Compare our observed statistic against this null distribution (distribution of \(\widehat{ACE}\) that could have occurred by chance assuming the sharp null hypothesis is true)
  4. Calculate \(p\) value based on fraction of outcomes in null distribution at least as extreme as the observed outcome.
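The four steps above can be sketched in base R for the observed data in this example (regions 1, 2, and 4 treated). The names `y_obs`, `diff_means`, and `null_dist` are illustrative; the small tolerance in the comparison guards against floating point ties and is an implementation choice, not part of the method.

```r
# Observed outcomes for regions 1..6; under the sharp null these are
# also each region's outcome under the other treatment condition
y_obs   <- c(3, 4, 2, 2, 4, 1)
treated <- c(1, 2, 4)

# Test statistic: difference in means for a given treatment group g
diff_means <- function(g) mean(y_obs[g]) - mean(y_obs[-g])
obs_stat <- diff_means(treated)

# Step 1: all possible randomizations (20 treatment groups)
groups <- combn(6, 3)

# Step 2: the statistic under each randomization, outcomes held fixed
null_dist <- apply(groups, 2, diff_means)

# Steps 3-4: two-sided p value, the share of randomizations with a
# statistic at least as extreme (in absolute value) as the observed one
p_value <- mean(abs(null_dist) >= abs(obs_stat) - 1e-9)

c(obs_stat = obs_stat, p_value = p_value)
```

With more units, `groups` would be replaced by a random sample of assignments, for example repeated calls to `sample(N, m)`.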

Randomization Inference

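If \(\widehat{ACE} = 0.667\), then the two-sided \(p\) value under the sharp null is 0.8: 16 of the 20 possible randomizations yield a difference in means at least as large in absolute value.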

In R:
