So far…
We have covered the assumptions required for regression to give us an unbiased estimate of a causal effect.
We have considered the ways this can go wrong \(\to\) the choices we can make in our analyses to see whether our conclusion of a causal effect is wrong.
But we have not considered whether we could have seen the observed effects by chance.
To see what could have happened by chance, we need “sampling distributions”
“You give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution.” (Mayo 2018)
What are the other (counterfactual) outcomes that we could have observed through implementing this procedure/test?
Unbiased causal effect estimates and correctly described sampling distributions
both require correctly describing counterfactuals
Sampling Distributions (quickly)
Intolerance: “I would not like that group to belong to my community association”; (1 = totally disagree; 4 = totally agree)
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | 2 |
2 | 4 | 4 |
3 | 4 | 2 |
4 | 2 | 3 |
5 | 2 | 4 |
6 | 4 | 1 |
We set 3 regions in treatment (soap opera + talk show)
We set 3 regions in control (soap opera only)
How many possible random assignments are there?
What are all possible random assignments (to treatment and control)?
For each randomization, calculate the \(\widehat{ACE}\) (hint: express it in fractions, \(\frac{x}{3}\))
\(Region_i\) | \(Y_i(1)\) | \(Y_i(0)\) |
---|---|---|
1 | 3 | 2 |
2 | 4 | 4 |
3 | 4 | 2 |
4 | 2 | 3 |
5 | 2 | 4 |
6 | 4 | 1 |
Let’s check our work:
This histogram is the exact sampling distribution of the \(\widehat{ACE}\) in this experiment
“You give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution.” (Mayo 2018)
What are the other (counterfactual) outcomes that we could have observed through implementing this procedure/test?
What process produces the sampling distribution?
We want to compare what we observe in our test against what we might have observed counterfactually, under the assumed data-generating process.
Hypothesis tests, confidence intervals, and equivalence tests involve:
But, we never know the true sampling distribution.
We can only estimate them, and we can use these estimated sampling distributions…
if they emerge from some random/stochastic process
We either know (experiment) or must make assumptions about (any other research design) that random process.
What kinds of assumptions do we make?

1. What do we want to make inferences about? \(\to\) different SOURCES of randomness
2. At what level of analysis does randomness occur? (at what level are we taking independent draws?) \(\to\) different sampling distributions
3. What are the assumptions/properties of the estimator (for our sampling distribution)?
4. What are the relevant hypotheses to consider?
Abadie et al (2020) describe three possible types of inferences.
\(N\) is the number of observed cases from a population of size \(n\). \(Z\) is either \(1,0\). \(Y(z)\) indicates a potential outcome. \(R\) is \(1,0\) and indicates that case \(i\) is observed.
Descriptive: difference in mean of \(Y\) across \(Z\): \(E[Y|Z=1] - E[Y|Z=0]\) for the population \(n\).
Causal: mean difference in potential outcomes of \(Y\) across \(Z\): \(E[Y(1) - Y(0)]\) for population \(n\)
Causal-Sample: mean difference in potential outcomes of \(Y\) across \(Z\): \(E[Y(1) - Y(0)]\) for observed sample \(N\)
Details (optional):
\[\theta^{descriptive} = \frac{1}{n_1}\sum\limits_i^n Z_iY_i - \frac{1}{n_0}\sum\limits_i^n (1-Z_i)Y_i\]
\[\theta^{causal} = \frac{1}{n}\sum\limits_i^n Y_i(1) - Y_i(0)\]
\[\theta^{causal,sample} = \frac{1}{N}\sum\limits_i^n R_i(Y_i(1) - Y_i(0))\]
Imagine we want to examine whether contact with refugees increases support for increasing refugee admission to Canada.
To investigate this, we conduct a random sample survey of Canadian adults where we:
If we want to infer descriptively the difference in mean support for increasing refugee admission rates among those with a refugee neighbor and those without …
If we want to infer causally the effect of a refugee neighbor on mean support for increasing refugee admission rates…
Some randomness comes from a random process of sampling cases from a population. As the sample approaches the full population, this variability goes to 0.
Some randomness comes from a random process of realizing some potential outcomes versus other potential outcomes. As the sample approaches the full population, this variability does not approach 0.
In practice…
we have one realization of the data generation process. We attempt to estimate the sampling distribution of what—counterfactually—would have happened in other applications of this process.
… we claim/argue that the data are produced by a random process. In the absence of an experiment or an actual random sample, we must be careful in making this argument.
… absent random-assignment, we implicitly invoke as-if random assignment (given conditional independence).
We are going to assume for now that assignment to treatment is (as-if/conditional on covariates) randomly assigned at the individual level.
Good news: there are good default choices for all three scenarios that we can use to estimate standard errors (standard deviation of sampling distribution)
Bad news: these estimated standard errors are conservative (too big)
If you want to do better than the default choices, the estimators can get more complicated, and there are many possible choices.
When we estimate sampling distributions, ask:
To understand SEs in least squares, we need to introduce a statistical model.
\[Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \epsilon_i\]
This is the population model:
if we are making a descriptive inference, it tells us the coefficients we would get if we could run least squares on the entire population of interest.
if we are making a causal inference, it tells us the coefficients we would get if we could observe all potential outcomes for all cases (in the sample, or population, depending).
\({n.b.:}\) We do not need to assume that the true relationship between \(D\) and \(Y\) is linear. \(\beta_1\) is just the average partial derivative/best linear approximation of the (possibly non-linear) CEF.
\[Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \epsilon_i\]
Typically, regression standard errors are taught by thinking of \(\epsilon_i\) as a random error pulled out of a box.
The statistical model assumes that this linear equation describes the process generating the data.
\[Y_i = \beta_0 + \beta_1 D_i + \epsilon_i\]
deterministic (not random) \(\beta_0 + \beta_1 D_i\)
stochastic (random error) (independent and identically distributed) \(\epsilon_i\)
This approach is in many textbooks; it derives standard errors as model-based errors.
Better to think of \(\epsilon\) as design (treatment assignment) or sampling based errors.
We start with (both) potential outcomes in a “switching” equation:
\[Y_i = (1-D_i)Y_i(0) + D_iY_i(1)\]
\[Y_i = Y_i(0) - D_iY_i(0) + D_iY_i(1)\]
\[Y_i = Y_i(0) + D_i[Y_i(1) - Y_i(0)]\]
\[Y_i = Y_i(0) + D_i\underbrace{\tau_i}_{\text{unit causal effect}}\]
\[Y_i = E[Y(0)] + D_i\tau_i + \overbrace{(Y_i(0)-E[Y(0)])}^{i\text{'s deviation from mean }Y(0)}\]
\[Y_i = E[Y(0)] + D_iE[\tau] + \nu_i + D_i\overbrace{(\tau_i - E[\tau])}^{i\text{'s deviation from mean }\tau}\]
where \(\nu_i \equiv Y_i(0)-E[Y(0)]\) (from the previous line) and \(\eta_i \equiv \tau_i - E[\tau]\), so that:
\[Y_i = \beta_0 + D_i\beta_1 + \overbrace{\nu_i + D_i\eta_i}^{i\text{'s deviation from }E[Y|D]}\]
with \(\beta_0 = E[Y(0)]\) and \(\beta_1 = E[\tau]\).
\[Y_i = \beta_0 + D_i\beta_1 + \epsilon_i\]
\(\epsilon_i\) is a random error in the following sense:
Additionally:
If we are conditioning as a research design…
Assuming conditional independence is necessary both for estimating causal effects without bias…
…and for estimating the sampling distribution.
“Conditional independence” assumes “as-if” random assignment to treatment, conditional on covariates. How does this “as-if” random process take place?
\[Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \epsilon_i\]
When we see errors arising from sampling: \(\beta_0\), \(\beta_1\), \(\beta_2\) are the coefficients we would observe if we ran this regression on the whole population.
\(\epsilon_i\) is the prediction error for individual \(i\) from this population model. It is a random variable in the sense that only \(\epsilon_i\) varies from case to case, and by randomly sampling individuals, we randomly sample \(\epsilon_i\).
Whether we have random variability induced by design or sampling, we can’t directly observe our vector of coefficients \(\pmb{\beta}\).
But we can estimate it:
\[\pmb{\widehat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
\(Y_i = \pmb{X_i\beta} + \epsilon_i\)
\(\epsilon_i\) is a random variable.
If we were to repeat the drawing of \(\epsilon\) again (a different sample, a different (as-if) randomization), we would get different estimates \(\pmb{\widehat{\beta}}\).
This gives rise to a sampling distribution for \(\pmb{\widehat{\beta}}\).
Variance-Covariance Matrix
\[\scriptsize{\begin{pmatrix} Var(\widehat{\beta}_1) & Cov(\widehat{\beta}_1,\widehat{\beta}_2) & Cov(\widehat{\beta}_1,\widehat{\beta}_3) &\ldots & Cov(\widehat{\beta}_1,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_2,\widehat{\beta}_1) & Var(\widehat{\beta}_2) & Cov(\widehat{\beta}_2,\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_2,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_3,\widehat{\beta}_1) & Cov(\widehat{\beta}_3,\widehat{\beta}_2) & Var(\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_3,\widehat{\beta}_p) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\widehat{\beta}_p,\widehat{\beta}_1) & Cov(\widehat{\beta}_p,\widehat{\beta}_2) & Cov(\widehat{\beta}_p, \widehat{\beta}_3) & \ldots & Var(\widehat{\beta}_p)\end{pmatrix}}\]
How do we use the variance-covariance matrix?
The square-root of diagonal elements (variances) gives standard error for each estimate in \(\pmb{\widehat{\beta}}\) (hypothesis testing, confidence intervals)
The off-diagonal elements help answer questions like: is \(\beta_2 + \beta_3 \neq 0\)? We need \(Cov(\widehat{\beta}_2, \widehat{\beta}_3)\) to get \(Var(\widehat{\beta}_2 + \widehat{\beta}_3) = Var(\widehat{\beta}_2) + Var(\widehat{\beta}_3) + 2Cov(\widehat{\beta}_2, \widehat{\beta}_3)\). (complex hypothesis testing, e.g., interaction effects)
Because we only observe one realization of the data generating process, we cannot observe the sampling distribution(s) that are described by this variance-covariance matrix.
We must estimate it.
Commonly we use an analytic approach (an equation we can use to estimate it).
A derivation (only the endpoint is necessary)
\(\pmb{\widehat{\beta}} = \pmb{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:
\[cov(\pmb{\widehat{\beta}}|X) = E((\pmb{\widehat{\beta}} - \pmb{\beta})(\pmb{\widehat{\beta}} - \pmb{\beta})' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
What really matters here is: \(E(\epsilon\epsilon'|X)\)
All analytic variance-covariance matrix estimators are “sandwiches”:
\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the “bread”; \(E(\epsilon\epsilon'|X)\) is the “meat”.
The bread is always the same. As we change assumptions about the random process, we make different choices of “meat”. (Insert your own charcuterie analogy here)
In groups: suppose \(\epsilon\) is an \(n \times 1\) vector with elements \((\epsilon_1, \ldots, \epsilon_n)\).
What are the dimensions of the result of this calculation: \(\epsilon\epsilon'\)?
What are the elements on the diagonal? (What value do they take, in terms of \(\epsilon_i\)?)
What is the expected value of the elements on the off-diagonal? (We have assumed that the \(\epsilon_i\) are independent of each other, and by design, in the population model, \(E[\epsilon_i] = 0\).)
Because we don’t observe \(\epsilon_i\), we have to estimate them by plugging in residuals.
\[\hat{\epsilon_i} = e_i = Y_i - \mathbf{X_i\widehat{\pmb\beta}}\] where \(\mathbf{X_i}\) is the full design matrix, including \(D_i\)
If we plug \(e_i^2\) in for the diagonal of \(E(\epsilon\epsilon'|X)\) (and \(0\) for the off-diagonal), we get:
Eicker-Huber-White SE estimator: often called the “robust” standard errors. They are “robust” in the sense that they make no assumption about the \(\epsilon_i\) other than that they are independent.
(Return to matrix on Board)
These should be our default standard errors (see Aronow and Miller pp. 153-154). But `lm` and most software do not give you these standard errors by default.

What are the assumptions/properties of our estimators?
The \(ehw\) SE estimator generates estimates, not miracles.
The “robust” SE estimators come in other flavors, with different asymptotic properties
In “ordinary” least squares (OLS) we assume:
\(\epsilon_i\) are independent and identically distributed. That is, they come from a single distribution with the same variance.
These are the default standard errors in `lm` and most software.
In `R`:

- `sandwich` package (to get robust errors)
- `lmtest` package
- `fixest` package

In `R`:

1. Estimate the model using `lm`.
2. Estimate the robust variance-covariance matrix (`vcovHC` from the `sandwich` package).
3a. Take the `sqrt` of the diagonal, manually.
3b. Get a “pretty” format using `coeftest` from the `lmtest` package, or in `modelsummary` tables.
```r
suffrage = read.csv("https://www.dropbox.com/scl/fi/y5wvmmhhknzbpdufaql9p/referenda_vote.csv?rlkey=1z2hromullprdzumkrawu5n21&dl=1")
```

1. Create `enlist_rate = veterans/mil_age`.
2. Regress `suff65_yes` on `enlist_rate`, `suff57_yes`, and `state` using `lm`.
3. Use the `vcovHC` function in the `sandwich` package to get the HC0 vcov matrix (call it `vcov_hc0`).
4. Use the `vcovHC` function in the `sandwich` package to get the HC3 vcov matrix (call it `vcov_hc3`).
5. Use `diag` and `sqrt` to get the robust standard errors for `enlist_rate`.

```r
require(sandwich)
require(lmtest)

# create the enlistment-rate variable
suffrage$enlist_rate = suffrage$veterans / suffrage$mil_age

# 1: Estimate OLS / conventional standard errors
lm_suff = lm(suff65_yes ~ enlist_rate + suff57_yes + state,
             data = suffrage)

# 2: Estimate robust variance-covariance matrix
vcov_robust = vcovHC(lm_suff, type = "HC3")

# 3: Robust SEs = square root of the diagonal
sqrt(diag(vcov_robust))
## (Intercept) enlist_rate  suff57_yes stateWISCONSIN
##  0.03195070  0.08571651  0.08612007     0.02838870
```
Or, you can make nice looking tables more easily…
|                | (1)       | (2)       | (3)       | (4)       | (5)       |
|----------------|-----------|-----------|-----------|-----------|-----------|
| (Intercept)    | 0.405***  | 0.405***  | 0.405***  | 0.405***  | 0.405***  |
|                | (0.027)   | (0.029)   | (0.030)   | (0.030)   | (0.032)   |
| enlist_rate    | 0.274***  | 0.274***  | 0.274***  | 0.274***  | 0.274**   |
|                | (0.070)   | (0.077)   | (0.078)   | (0.081)   | (0.086)   |
| suff57_yes     | 0.561***  | 0.561***  | 0.561***  | 0.561***  | 0.561***  |
|                | (0.056)   | (0.080)   | (0.081)   | (0.083)   | (0.086)   |
| stateWISCONSIN | -0.270*** | -0.270*** | -0.270*** | -0.270*** | -0.270*** |
|                | (0.024)   | (0.027)   | (0.027)   | (0.028)   | (0.028)   |
| Num.Obs.       | 130       | 130       | 130       | 130       | 130       |
| R2             | 0.532     | 0.532     | 0.532     | 0.532     | 0.532     |
| Std.Errors     | IID       | HC0       | HC1       | HC2       | HC3       |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
We should default to them.
We obtain \(ehw\) standard errors when potential outcomes/sampling processes are random and independent.
What if there is dependence in the \(\epsilon_i\)?
Rather: what if treatment is not randomly assigned to individuals, but to groups?
There are estimators for standard errors for situations in which errors are not independent of each other across observations:
Ignoring dependence of errors usually underestimates sampling variability of an estimator. Why?
Extends robust standard errors:
Allows for arbitrary dependence of errors within groups…
…by permitting off-diagonal covariance \(\epsilon_i\epsilon_j\) to be non-zero for \(i \neq j\) but \(i\) and \(j\) are part of the same “cluster”. (to the board)
Similar to \(ehw\) standard errors…
Unlike \(ehw\) standard errors…
In `R`: many options:

- `felm` in `lfe` extends `lm` for panel models and includes clustering options
- `feols` in `fixest` includes clustering options
- `sandwich`: (finally!) has clustered errors options
- `multiwayvcov`
|             | (1)      | (2)      | (3)       |
|-------------|----------|----------|-----------|
| enlist_rate | 0.274*** | 0.274*** | 0.274+    |
|             | (0.070)  | (0.077)  | (0.027)   |
| suff57_yes  | 0.561*** | 0.561*** | 0.561     |
|             | (0.056)  | (0.080)  | (0.090)   |
| Num.Obs.    | 130      | 130      | 130       |
| R2          | 0.532    | 0.532    | 0.532     |
| Std.Errors  | IID      | HC0      | by: state |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard Errors are way smaller (but, with 2 clusters, not reliable!)
Many specific features of these estimators to pay attention to:
When should we use clustering? Abadie et al (2023)
Alternatively:
When should you cluster? Abadie et al (2023)
What if treatment \(D\) is a function of spatial distance to something (e.g., Kumbh Mela)? Is it possible there are spatial trends in your data? Is it possible there is spatial dependence?
What do we do?
State of the art: More details
All sandwich-based estimates of standard errors appeal to asymptotic consistency and normality to approximate the sampling distribution of \(\widehat{\beta}\).
The bootstrap takes a different approach:
Simulate sampling distribution
1. Set `k = 1000` and create a container: `bs_out = data.frame(i = rep(NA, k), beta_hat = rep(NA, k))`
2. Write a `for` loop from `1:k`; in each iteration:
3. `sample` with replacement from `1:nrow(suffrage)` to create `bs_idx`
4. `data_bs = suffrage[bs_idx, ]`
5. `m = lm(suff65_yes ~ enlist_rate + suff57_yes + state, data = data_bs)`
6. `bs_out[i, ] = c(i, coef(m)[2])`
7. Take the `sd` of `bs_out$beta_hat`.
8. `quantile(bs_out$beta_hat, probs = c(0.025, 0.975))` gives a 95% percentile interval.
9. `hist(bs_out$beta_hat)` shows the simulated sampling distribution.
There are many choices to make, and best practices are constantly updated.
Keep in mind:
You may need to look up the best standard error option for your application.
Sometimes we need to specify different null hypotheses:
In these cases, we want to find evidence that the true difference is \(0\).
If we use the conventional hypothesis test where \(H_0: \beta = 0\), the \(p\) values indicate the false positive rate of rejecting the null when in truth there is no difference.
But in this instance, we are concerned about the false negatives.
We don’t want a test that stacks the deck in favor of our hypothesis.
Analogous to this situation:
We want a COVID test that we plan to use as evidence that we don’t have COVID and so can safely spend time with immunocompromised people.
But the COVID test we use has been designed to minimize false positives.
What could go wrong?
To solve this problem and get useful \(p\) values, we can transform this into an equivalence test. We transform the null hypothesis.
Let us assume that there is some level of imbalance that we consider negligible; let's call it \(\delta\).
Our new null hypothesis is:
\(H_{01}: \beta \leq -\delta\) OR \(H_{02}: \beta \geq \delta\)
That is, two one-sided tests (TOST).
TOST:
If the probability of observing \(\hat{\beta}\) under both null hypotheses is less than \(\alpha\), we can reject the null and then accept the alternative:
\(H_1: -\delta < \beta < \delta\): the true parameter is within some acceptable distance to \(0\).
TOST visualization
These tests can be conducted in `R` using:

- the `TOSTER` package
- `equivalence_test` in the `parameters` package

These tests can be inverted to get confidence intervals (the range of values for \(\delta\) which cannot be rejected at \(\alpha\)).
These tests require, in addition to everything else:
Need to know how likely we are to observe an effect, even if there is no true effect.
This requires:
The variance (the \(p\)th diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is
\[\widehat{Var}(\hat{\beta}_p) = \frac{\hat{\sigma}^2}{n\,Var(X_p^*)}\]
where \(X_p^*\) is the residual from regressing \(X_p\) on the other columns of the design matrix.
The numerator will shift as a function of the “meat” in the variance-covariance estimator we choose. But the denominator is a function of the “bread”.
This has implications for “multicollinearity” as a problem.
Least squares requires linear independence of columns in design matrix.
When columns in \(\pmb{X}\) approach linear dependence (they are nearly perfectly linearly correlated) there can be two “problems” that arise.
\(1.\) As the correlation grows, \(Var(X_p^*)\) (the variation in \(X_p\) not explained by the other columns) shrinks, so the sampling variance of \(\widehat{\beta}_p\) inflates: estimates remain unbiased but become much less precise.

\(2.\) If the correlation is very close to \(1,-1\), then there can be numerical instability in computer calculations of coefficients. This is a problem, but it occurs only at very high correlations, and usually R will let you know.