We never observe unit causal effects, but we can estimate the average causal effect by filling in the missing (\(\color{red}{\text{counterfactual}}\)) potential outcomes.
\[\begin{equation}\begin{split}ACE &= \{E[Y_i(1)|Z_i = 1]\pi_1 + \overbrace{\color{red}{E[Y_i(1)|Z_i = 0]}}^{\text{Mean Y(1) for untreated}}\pi_0\} \\ & \phantom{=}\ - \{\underbrace{\color{red}{E[Y_i(0)|Z_i = 1]}}_{\text{Mean Y(0) for treated}}\pi_1 + E[Y_i(0)|Z_i = 0]\pi_0\} \end{split} \end{equation}\]
Given assumptions, we fill in the maximum/minimum logically possible values for missing potential outcomes to get bounds. Bounds must contain the true causal effect if the assumptions are true.
\[\begin{equation}\begin{split}ACE &= \{E[Y_i(1)|Z_i = 1]\pi_1 + \overbrace{\color{red}{E[Y_i(1)|Z_i = 0]}}^{\text{plug in } Y(1) \text{ for untreated}}\pi_0\} \\ & \phantom{=}\ - \{\underbrace{\color{red}{E[Y_i(0)|Z_i = 1]}}_{\text{plug in } Y(0) \text{ for treated}}\pi_1 + E[Y_i(0)|Z_i = 0]\pi_0\} \end{split} \end{equation}\]
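For example, a sketch with an outcome bounded on \([1, 4]\) (all numbers below are illustrative, not from any study): plugging the maximum \(4\) in for the missing \(E[Y_i(1)|Z_i = 0]\) and the minimum \(1\) in for the missing \(E[Y_i(0)|Z_i = 1]\) gives the upper bound, and vice versa for the lower bound.

#illustrative observed quantities (hypothetical numbers)
mean_y1_treated = 3.0   #E[Y(1) | Z = 1]
mean_y0_control = 2.5   #E[Y(0) | Z = 0]
pi_1 = 0.5; pi_0 = 0.5  #shares treated and untreated
y_min = 1; y_max = 4    #logical bounds of the outcome
upper = (mean_y1_treated*pi_1 + y_max*pi_0) - (y_min*pi_1 + mean_y0_control*pi_0)
lower = (mean_y1_treated*pi_1 + y_min*pi_0) - (y_max*pi_1 + mean_y0_control*pi_0)
c(lower, upper)
## [1] -1.25  1.75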
… random assignment does not guarantee that any particular randomization gives us an exact estimate of unobserved potential outcomes
different realizations of random process \(\to\) sampling variability
The extent of sampling variability depends on the nature of the random process that generates observations.
In any given randomization, treatment mean and control mean are likely \(\neq\) the true means of \(Y(1)\) and \(Y(0)\)…
Given this, we want to know…
Error probabilities: if we use this estimator to estimate the ACE, how likely are we to conclude there is a causal effect when in fact there is not?
What is the range of causal effect sizes we can rule out, with known error probabilities? (If these were the true effects, we would be highly unlikely to observe our estimated effect.)
If we run an experiment (implement a set of randomization procedures that dictate treatment assignment)…
“You give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution.” (Mayo 2018)
What are the other (counterfactual) outcomes that we could have observed through implementing this procedure/test?
Sampling distributions can \(\to\) severe tests of claims
We evaluate hypotheses (claims) about some estimand of interest (e.g. \(ACE\)).
\(H_0: ACE \leq 0;\ H_a: ACE > 0\) (one-sided)
\(H_0: ACE \geq 0;\ H_a: ACE < 0\) (one-sided)
\(H_0: ACE = 0;\ H_a: ACE \neq 0\) (two-sided)
We then use the sampling distribution of the estimate (usually a test statistic), assuming the null hypothesis is true, to see how surprising our observed estimate would be under the null.
The \(p\) value tells us how likely we are to observe data at least as extreme as ours, assuming the null hypothesis is correct.
We want to either reject the null hypothesis as false or fail to reject it.
Since we cannot know for sure whether it is true or false, we choose some probability threshold below which we decide the null hypothesis can be rejected.
\(\alpha\) is that threshold: it sets the significance level and the decision rule for rejecting the null.
Because the null hypothesis is always either True or False, our decision to reject the null based on some probability threshold (\(\alpha\)) may be correct or an error.
With a single test, the \(p\) value can be interpreted as the false positive rate of this test procedure using that value as \(\alpha\). If we use \(\alpha=.05\), we are saying that we are comfortable with our testing procedure making false positive errors in no more than 5% of tests.
From the severe testing perspective, we judge the alternative hypothesis to be “severely tested” when we observe evidence that would only occur 5% of the time if the null hypothesis were true.
If we say, “The result is significantly different from the hypothesized value of zero (\(p=.001\))! We reject that hypothesis!” when the truth is zero, we are making a false positive error (claiming to detect a signal when there is only noise).
If we say, “We cannot distinguish this result from zero (\(p=.3\)). We cannot reject the hypothesis of zero.” when the truth is not zero, we are making a false negative error.
Hypothesis tests can be useful for learning, if used appropriately.
Confidence intervals summarize a group of hypothesis tests.
So a 95% confidence interval contains all null values \(b\) that we fail to reject with \(\alpha = 0.05\).
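For instance, a sketch of this inversion, assuming a normal sampling distribution and illustrative numbers for the estimate and its standard error:

est = 0.67 #hypothetical estimate
se = 1.05  #hypothetical standard error
b_grid = seq(-5, 5, by = 0.001)
#two-sided p value for each candidate null value b
p_vals = 2 * pnorm(abs(est - b_grid)/se, lower.tail = FALSE)
#the 95% CI: all b we fail to reject at alpha = 0.05
range(b_grid[p_vals > 0.05]) #matches est +/- 1.96*se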
Because confidence intervals invert hypothesis tests, they are a product of the random assignment process: any given interval either does or does not include the true parameter value.
The correct way to interpret them is:
Our procedure severely tests the observed \(\widehat{ACE}\) against the possibility that the true \(ACE\) lies outside the confidence interval: if the true \(ACE\) were outside the interval, we would have observed an \(\widehat{ACE}\) smaller (larger) than the one we saw with probability at least \(1-\alpha\).
But hypothesis tests and confidence intervals require that we have access to the sampling distribution of \(\widehat{ACE}\)… how do we get that?
parameter/estimand: unknown attribute of random variable (e.g., the difference in treatment/control group means) that we want to know
estimator: rule/procedure for estimating the parameter/estimand given observed data
bias: estimator is biased if, on average, the estimator yields a value different from the parameter
So \(\widehat{ACE}\) is unbiased if:
\[E(\widehat{ACE}) - ACE = 0\]
Let’s work through an actual experiment: following evidence of the effects of a radio soap opera in Rwanda (Paluck 2009), we consider a variation on the Rwandan study, run in Eastern DRC.
Intolerance: “I would not like that group to belong to my community association”; (1 = totally disagree; 4 = totally agree)
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | 2 |
2 | 4 | 4 |
3 | 4 | 2 |
4 | 2 | 3 |
5 | 2 | 4 |
6 | 4 | 1 |
We assign 3 regions to treatment (soap opera + talk show)
We assign 3 regions to control (soap opera only)
How many possible random assignments are there? (\(\binom{6}{3} = 20\))
What are all possible random assignments (to treatment and control)?
(Hint: use the `combn` function in R.)

1 | 2 | 3 |
1 | 2 | 4 |
1 | 2 | 5 |
1 | 2 | 6 |
1 | 3 | 4 |
1 | 3 | 5 |
1 | 3 | 6 |
1 | 4 | 5 |
1 | 4 | 6 |
1 | 5 | 6 |
2 | 3 | 4 |
2 | 3 | 5 |
2 | 3 | 6 |
2 | 4 | 5 |
2 | 4 | 6 |
2 | 5 | 6 |
3 | 4 | 5 |
3 | 4 | 6 |
3 | 5 | 6 |
4 | 5 | 6 |
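One way to generate this list in R (a minimal sketch):

choose(6, 3) #number of possible assignments
## [1] 20
#each row: one possible set of 3 treated regions
t(combn(6, 3))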
For each randomization, calculate the estimated \(\widehat{ACE}\) (hint, calculate in R, or express this in fractions \(\frac{x}{3}\))
What is the mean \(\widehat{ACE}\)?
How does it compare to the \(ACE\)?
Are there any randomizations where \(\widehat{ACE} = ACE\)?
Let’s check our work:
require(data.table) #load the data.table package
p_o_table = data.table(region_i = 1:6,
                       y_i_1 = c(3,4,4,2,2,4),
                       y_i_0 = c(2,4,2,3,4,1)
                       )
#unit-level treatment effects
p_o_table$tau_i = p_o_table$y_i_1 - p_o_table$y_i_0 #base R syntax
p_o_table[, tau_i := y_i_1 - y_i_0] #equivalent data.table syntax
#ACE: mean of the unit-level effects
ace = mean(p_o_table$tau_i)
ace
## [1] 0.5
Let’s check our work:
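A sketch of the computation behind the printed `t_means` and `c_means` below, assuming assignments are ordered as the columns of `combn(6, 3)`:

assignments = combn(6, 3) #each column: one possible set of treated regions
#treatment-group mean of Y(1) under each assignment
t_means = apply(assignments, 2, function(tr) mean(p_o_table$y_i_1[tr]))
#control-group mean of Y(0) under each assignment
c_means = apply(assignments, 2, function(tr) mean(p_o_table$y_i_0[-tr]))
t_means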
## [1] 3.666667 3.000000 3.000000 3.666667 3.000000 3.000000 3.666667 2.333333
## [9] 3.000000 3.000000 3.333333 3.333333 4.000000 2.666667 3.333333 3.333333
## [17] 2.666667 3.333333 3.333333 2.666667
c_means
## [1] 2.666667 2.333333 2.000000 3.000000 3.000000 2.666667 3.666667 2.333333
## [9] 3.333333 3.000000 2.333333 2.000000 3.000000 1.666667 2.666667 2.333333
## [17] 2.333333 3.333333 3.000000 2.666667
Let’s check our work:
#Average Causal Effect estimate under each assignment
ace_hats = t_means - c_means
#Expected value of the ACE (hat): mean over all 20 assignments
e_ace_hat = mean(ace_hats)
e_ace_hat
## [1] 0.5
#identical to the true ACE
ace
## [1] 0.5
Let’s check our work:
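One way to draw this exact sampling distribution (a sketch using the `ace_hats` vector computed above):

#histogram of all 20 possible ACE-hat estimates
hist(ace_hats, breaks = 10,
     main = "Exact sampling distribution",
     xlab = "ACE-hat")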
The sample difference in means is unbiased.
The histogram is the exact sampling distribution of the \(\widehat{ACE}\) in this experiment.
This sampling distribution could tell us everything we need for inference (e.g., the error probabilities of our tests).
But we never observe this histogram in practice, so we approximate it, using one of three approaches:
Analytic/Asymptotic approach
Randomization inference
Bootstrap: later in the course
All of these approaches involve approximating the sampling distribution we cannot directly observe.
The analytic/asymptotic approach takes three steps:
We assume a known shape for the sampling distribution of the \(\widehat{ACE}\): approximately Normal (or \(t\))
We estimate the variance (or standard deviation) of this sampling distribution
We estimate probability of observing \(\widehat{ACE}\) if \(ACE = 0\) using this distribution
Per the Central Limit Theorem, the sampling distribution of a sum of random variables (and, by extension, of a mean) approaches the normal distribution as \(N \rightarrow \infty\).
Using this fact, we approximate the sampling distribution with a normal distribution, plugging in the estimated sample mean and the estimated variance of the sample mean.
This approximation performs well, but depends on sample size and population distribution.
If the population looks like this: (figure: population distribution)
…predict the shape of the sampling distribution of the sample mean for \(n = 5\).
The shape of the sampling distribution of the sample mean for \(n = 5\): (figure)
If the population looks like this: (figure: a different population distribution)
…predict the shape of the sampling distribution of the sample mean for \(n = 5\).
The shape of the sampling distribution of the sample mean for \(n = 5\): (figure)
The sampling distribution of the sample mean for \(n = 25\): (figure)
The sampling distribution of the sample mean for \(n = 100\): (figure)
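A quick simulation (a sketch with a hypothetical skewed population, not the figures from the slides) reproduces this pattern:

set.seed(1234)
#hypothetical right-skewed population
pop = rexp(1e5, rate = 1)
#simulate the sampling distribution of the sample mean for a given n
sample_means = function(n, reps = 5000) replicate(reps, mean(sample(pop, n)))
par(mfrow = c(1, 3))
for (n in c(5, 25, 100)) {
  hist(sample_means(n), main = paste("n =", n), xlab = "sample mean")
}

The distribution is visibly skewed at \(n = 5\) and close to normal by \(n = 100\).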
Does normality hold in our experiment? (Figures: the exact sampling distribution of \(\widehat{ACE}\) compared against the normal approximation.)
Then: we want to get variance of \(\widehat{ACE}\)
\[Var[X - Y] = Var[X] + Var[Y] - 2 \cdot Cov[X,Y]\]
What is \(Var[Y^T - Y^C] = Var[\widehat{ACE}]\)?
Variances of Treatment/Control Group Means
If we assume independent and identically distributed draws from the study group:
\[Var[Y^T] = \frac{Var[Y_i(1)]}{m}\]
The variance of the sampling distribution of the treatment-group mean is the variance of potential outcomes under treatment for all cases, divided by the treatment-group size \(m\).
Variance of potential outcomes under treatment:
\[Var[Y_i(1)] = \frac{1}{N}\sum\limits_{i=1}^{N} \left( Y_i(1) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(1)}{N}}^{mean \ Y_i(1)} \right) ^2\]
This is a parameter, often denoted \(\sigma^2\)
\[Var[Y^T] = \frac{\sigma^2}{m}\]
We don’t know \(\sigma^2\), so we need to estimate it from our sample.
Like the sample mean, the sample variance (with the \(m-1\) correction below) is an unbiased estimator of the population variance:
\[\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{\color{green}{m-1}}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\]
Why is sample variance biased if we divide by \(m\) (instead of \(m-1\))?
the mean is the value that minimizes the sum of squared errors
If the sample mean \(\hat\mu\) \(\neq\) population mean \(\mu\), then \(\left[ \sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2 \right] < \left[ \sum\limits_{i = 1}^{m} [x_i - \mu]^2 \right]\)
The uncorrected sample variance \(\frac{1}{m} \sum\limits_{i = 1}^{m} [x_i - \hat\mu]^2\) measures spread around \(\hat\mu\) rather than \(\mu\).
So whenever \(\hat\mu \neq \mu\), it is too small, and in expectation it underestimates \(\sigma^2\); dividing by \(m-1\) corrects this.
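A short simulation (a sketch, not from the slides) shows the downward bias of the uncorrected estimator:

set.seed(42)
m = 5
#uncorrected sample variance (divide by m) across many samples from N(0, 1)
v_uncorrected = replicate(1e5, {
  x = rnorm(m)
  mean((x - mean(x))^2)
})
mean(v_uncorrected) #about (m - 1)/m = 0.8, below the true variance of 1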
Using this approach:
\[\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{m-1}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\]
\[\widehat{Var[Y^T]} = \frac{\widehat{\sigma^2}}{m}\]
we can estimate \(Var(Y^T)\) and \(Var(Y^C)\).
What else do we need to estimate \(Var[\widehat{ACE}]\)?
We still need \(Cov(Y^T,Y^C)\) to get variance of \(\widehat{ACE}\), because
\(Var[\widehat{ACE}] = Var[Y^T] + Var[Y^C] - 2 Cov[Y^T, Y^C]\)
\[Cov(Y^T,Y^C) = -\frac{1}{N(N-1)}\sum\limits_{i=1}^{N} \left( Y_i(1) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(1)}{N}}^{mean \ Y_i(1)} \right) \cdot \left(Y_i(0) - \overbrace{\frac{\sum\limits_{i=1}^{N} Y_i(0)}{N}}^{mean \ Y_i(0)} \right)\]
Can’t estimate the covariance because we don’t see both potential outcomes for each case!
We can safely ignore the covariance: dropping the covariance term makes our estimated variance conservative (how does this relate to severity?).
Variances we obtain with \(\widehat{Var}[\widehat{ACE}]\) are going to be at least as large, in expectation, as the true sampling variance.
So…
\[\widehat{Var}[\widehat{ACE}] = \widehat{Var}[Y^T] + \widehat{Var}[Y^C]\]
Variance alone is not what we want… we want to use it to test hypotheses. Let’s work through the example.
Let’s assume that regions 1, 2, and 4 are treated, and the others are untreated
Let’s calculate \(\widehat{ACE}\)…
region_i | y_i_1 | y_i_0 | tau_i |
---|---|---|---|
1 | 3 | 2 | 1 |
2 | 4 | 4 | 0 |
3 | 4 | 2 | 2 |
4 | 2 | 3 | -1 |
5 | 2 | 4 | -2 |
6 | 4 | 1 | 3 |
#observed Y(1) for the treated regions (1, 2, 4)
y1s = p_o_table[region_i %in% c(1,2,4), y_i_1]
#observed Y(0) for the control regions (3, 5, 6)
y0s = p_o_table[!(region_i %in% c(1,2,4)), y_i_0]
ace_hat = mean(y1s) - mean(y0s)
ace_hat
## [1] 0.6666667
Now, let’s calculate the estimated variance of the \(\widehat{ACE}\)
\(\widehat{Var[Y_i(1)]} = \widehat{\sigma^2} = \frac{1}{m-1}\sum\limits_{i=1}^{m}\left([Y_i(1) | Z_i = 1] - Y^T\right)^2\)
#unbiased sample variance of the potential outcomes
var_y_10 = function(y) {
  m = length(y)
  mean_y = mean(y)
  1/(m-1)*sum((y-mean_y)^2)
}
#Variance Y(1)
var_y_10(y1s)
## [1] 1
#Variance Y(0)
var_y_10(y0s)
## [1] 2.333333
Now, let’s calculate the estimated variance of the \(\widehat{ACE}\)
\(\widehat{Var[Y^T]} = \frac{\widehat{\sigma^2}}{m}\)
#variance of the sampling distribution of a group mean
var_y_tc = function(y) {
  var_y_hat = var_y_10(y)
  var_y_hat/length(y)
}
#Var Y_T (treatment group sample mean)
var_y_tc(y1s)
## [1] 0.3333333
#Var Y_C (control group sample mean)
var_y_tc(y0s)
## [1] 0.7777778
Now, let’s calculate the estimated variance of the \(\widehat{ACE}\)
\(\widehat{Var}[\widehat{ACE}] = \widehat{Var}[Y^T] + \widehat{Var}[Y^C]\)
var_ace_hat = function(y1, y0) {
  var_y_tc(y1) + var_y_tc(y0)
}
#Variance ACE hat (our estimate)
var_ace_hat(y1s, y0s)
## [1] 1.111111
Now, let’s use this to do a one-sided hypothesis test: \(H_a: ACE > 0\); \(H_0: ACE \leq 0\)
First, we need get the \(t\) statistic (relative to the null):
\[t = \frac{\widehat{ACE} - ACE_{H_0}}{\widehat{SE}(\widehat{ACE})}\]
\(\widehat{SE}\) is just the square root of the estimated variance:

se_hat = sqrt(var_ace_hat(y1s, y0s))
#t statistic against the null value of 0
t = ace_hat / se_hat
t
## [1] 0.6324555
Next we need to calculate the \(p\) value, which requires knowing the distribution. The \(t\) distribution depends on degrees of freedom; here the Welch–Satterthwaite approximation gives 3.4482759.
We want to know the probability of observing \(t =\) 0.6324555 or larger:
#we look at the UPPER TAIL: the probability of observing t or larger under the null
df = 3.4482759 #Welch-Satterthwaite degrees of freedom
pt(t, df = df, lower.tail = F)
## [1] 0.2832836
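As a check (a sketch, not in the original slides), R’s built-in Welch \(t\) test should reproduce these numbers:

#one-sided Welch t test; t.test does not pool variances by default
t.test(y1s, y0s, alternative = "greater")
#should report t = 0.63246, df = 3.4483, p-value = 0.2833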
Compare assumed \(t\) distribution against actual sampling distribution (shifted to be centered at 0)
If distributional assumptions are wrong, hypothesis test will not be correct
Unlike the analytic approach, randomization inference tests a different null hypothesis.
Usually null hypothesis is that the average effect is \(0\) (some units could have positive or negative effects).
\[\frac{1}{N}\sum\limits_{i=1}^{N} \tau_i = ACE = 0\]
Randomization inference instead tests the sharp null hypothesis
\[\tau_i = 0 \quad \forall \ i \in \{1, \dots, N\}\]
that every unit treatment effect is \(0\).
Advantages:
Disadvantages:
In practice:
We run Paluck’s experiment and see this:
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | ? |
2 | 4 | ? |
3 | ? | 2 |
4 | 2 | ? |
5 | ? | 4 |
6 | ? | 1 |
Under the sharp null, what are the values that are “?”?
Under the sharp null, this would be true:
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | \(\color{red}{3}\) |
2 | 4 | \(\color{red}{4}\) |
3 | \(\color{red}{2}\) | 2 |
4 | 2 | \(\color{red}{2}\) |
5 | \(\color{red}{4}\) | 4 |
6 | \(\color{red}{1}\) | 1 |
Once we have this response schedule under the sharp null, we: recalculate \(\widehat{ACE}\) for every possible random assignment, then take the \(p\) value to be the share of assignments that yield an estimate at least as extreme as the one we observed.
If \(\widehat{ACE}=\) 0.6666667, then the one-sided \(p\) value (the share of the 20 possible assignments yielding \(\widehat{ACE} \geq\) 0.6666667 under the sharp null) is 0.4.
In R:
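A minimal sketch of this computation (variable names are illustrative):

#observed outcomes under the actual assignment (treated regions: 1, 2, 4)
y_obs = c(3, 4, 2, 2, 4, 1)
treated_obs = c(1, 2, 4)
obs_ace = mean(y_obs[treated_obs]) - mean(y_obs[-treated_obs])
#under the sharp null, Y(1) = Y(0) = y_obs for every region,
#so we can recompute ACE-hat under every possible assignment
assignments = combn(6, 3)
ri_aces = apply(assignments, 2, function(tr) mean(y_obs[tr]) - mean(y_obs[-tr]))
#one-sided p value: share of assignments at least as large as observed
mean(ri_aces >= obs_ace)
## [1] 0.4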