Regression and Conditioning
How regression “conditions”
Potential Problems
Recall:
When there is possible confounding, we want to “block” the backdoor paths using conditioning
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(1\). Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of cause \(D\) (i.e. within values of \(X\), \(D\) must be as-if random)
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(2\). Positivity/Common Support: For all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
Given these assumptions, conditioning estimates the effect of \(D\) on \(Y\) within each subset of the data uniquely defined by values of \(X_i\):
\(\widehat{ACE}[X = x] = E[Y(1) | D=1, X = x] - E[Y(0) | D=0, X = x]\)
for each value of \(x\) in the data.
Under the Conditional Independence Assumption, \(E[Y(1) | D=1, X=x] = \color{red}{E[Y(1) | D=0, X=x]}\) and \(E[Y(0) | D=0, X=x] = \color{red}{E[Y(0) | D=1, X=x]}\)
We impute the missing potential outcomes… with the expected value of the outcome of observed cases with same values of \(x\), different values of \(D\).
\(E[Y(0) | D=0, X=x]\) is a conditional expectation function; something we can estimate using regression
To understand how we can use regression to estimate causal effects, we need to describe the kinds of causal estimands we might be interested in.
In the context of experiments, each observation has potential outcomes corresponding to their behavior under different treatments. If treatment is dichotomous, we might generally be interested in:
\[ACE = E[Y(1)] - E[Y(0)]\]
In regression, where levels of treatment might be continuous, we generalize this idea to the “response schedule”:
average causal response function: the average potential outcome \(E[Y_i(d)]\), viewed as a function of the treatment level \(d\)
We typically assume that we are only interested in the \(ACRF\) for the cases in our study; otherwise need to make assumptions about the sampling from a population.
average partial derivative (of the \(ACRF\)): \(E\left[\frac{\partial Y_i(d)}{\partial d}\right]\), the average slope of the response function
(Return to board)
We can use regression to estimate the linear approximation of the average causal response function:
\[Y_i(D_i = d) = \beta_0 + \beta_1 D_i + \epsilon_i\]
Here \(Y_i(D_i = d)\) is the potential outcome of case \(i\) for a value of \(D = d\).
If we don’t know parameters \(\beta_0, \beta_1\), what do we need to assume to obtain an estimate \(\widehat{\beta}_1\) that we can give a causal interpretation? (On average, change in \(D\) causes \(\widehat{\beta}_1\) change in \(Y\))
We must assume that \(D_i\) is independent of \(\epsilon_i\) (equivalently, of the potential outcomes), i.e. that \(D\) is as-if random.
In this scenario, if \(D\) were binary and we had randomization, this is equivalent to estimating the \(ACE\) for an experiment.
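A quick simulated check of this claim (hypothetical data; the true effect of 2 is chosen arbitrarily):

set.seed(1)
n = 1000
d = rbinom(n, 1, 0.5)          # randomized binary treatment
y0 = rnorm(n)                  # potential outcome under control
y1 = y0 + 2                    # potential outcome under treatment: true effect = 2
y = ifelse(d == 1, y1, y0)     # observed outcome
coef(lm(y ~ d))["d"]           # close to 2, the ACE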
If we want to use regression for conditioning, then the model would look different:
\[E[Y_i(D_i = d, X_i = x)] = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X}\]
\[Y_i(D_i = d, X_i = x) = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Given what we have learned about regression so far…
Classic “matching” to impute…
\(\widehat{ACE} = \sum\limits_{x \in X} \underbrace{Pr[X = x]}_{\text{Fraction } X = x}\overbrace{(E[Y_i(1) | D_i=1, X_i = x] - E[Y_i(0) | D_i = 0, X_i = x])}^{\text{Treated - Untreated, within } x}\)
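A minimal sketch of this stratified estimator on simulated data with one discrete confounder (variable names are hypothetical):

library(data.table)
set.seed(2)
dt = data.table(x = sample(1:3, 500, replace = TRUE))
dt[, d := rbinom(.N, 1, x / 4)]                  # treatment more likely at higher x (confounding)
dt[, y := 2 * d + x + rnorm(.N)]                 # true effect of d is 2
strata = dt[, .(diff = mean(y[d == 1]) - mean(y[d == 0]), n = .N), by = x]
strata[, sum(diff * n / sum(n))]                 # weight each stratum by Pr[X = x]; close to 2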
How does Regression differ?
Positivity requirement speaks to what subsets of the data contribute to our causal estimate:
If regression returns weighted average partial derivative of \(ACRF\)…
Aronow and Samii (2016) discuss how regression weights cases:
In a regression context, cases contribute to estimating \(\beta_D\) (causal effect of \(D\)) as a weighted average:
\[\beta_D = \frac{1}{\sum\limits_i^n w_i}\sum\limits_i^n \tau_iw_i\] where \(w_i = [D_i - E(D_i | X_i)]^2\): the squared residual of \(D_i\) from the regression of \(D\) on \(X\).
\(\tau_i = Y_i(1) - Y_i(0)\) (the unit causal effect)
These weights are lower when \(D\) is well predicted by \(X\).
These weights are higher when \(D\) is less well predicted by \(X\).
Thus, for cases where values of \(X\) nearly perfectly predict \(D\), positivity nearly fails and the weights are close to zero:
Regression weights may behave usefully (less weight where we have less variation in \(D\), so less information), but they change the causal estimand:
If \(D\) is continuous, regression also places more or less weight on different values of \(d\):
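A sketch of how one might compute and inspect these weights on simulated data (each case's weight is its squared residual from a regression of \(D\) on \(X\)):

set.seed(3)
x = rnorm(300)
d = rbinom(300, 1, plogis(3 * x))        # D strongly predicted by X at extreme values of x
w = residuals(lm(d ~ x))^2               # implied regression weight for each case
plot(x, w, ylab = "weight")              # weights shrink where X nearly determines D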
How it works:
Imagine we want to know the efficacy of UN Peacekeeping operations (Doyle & Sambanis 2000) after civil wars:
We can compare the post-conflict outcomes of countries with and without UN Peacekeeping operations.
To address concern about confounding, we condition on war type (non/ethnic), war deaths, war duration, number of factions, economic assistance, energy consumption, natural resource dependence, and whether the civil war ended in a treaty.
122 conflicts… can we find exact matches?
Using just treaty (2 values), decade (6 values), factnum (8 values), and wardur (43 values), we have only one exact match…
Without perfect matches on possible confounders, we don’t have cases without a Peacekeeping operation that we can use to substitute for the counterfactual outcome in conflicts with a Peacekeeping force.
We can use regression to linearly approximate the conditional expectation function \(E[Y(d) | D = d, X = x]\) to plug in the missing values.
\[Y_i = \beta_0 + \beta_D D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Download the data.

\(1\). Regress success on untype4 (\(D\)) and logcost, wardur, factnum, trnsfcap, treaty, develop, exp, decade, using lm(); save this as m.

\(2\). Copy ds2000 to ds2000_cf, and flip the value of untype4 (0 to 1, 1 to 0).

\(3\). Use predict(m, newdata = ds2000_cf) to add a new column called y_hat to ds2000.

Next:

\(4\). Create y1, which equals success for cases with untype4 == 1, and y_hat for cases with untype4 == 0.

\(5\). Create y0, which equals success for cases with untype4 == 0, and y_hat for cases with untype4 == 1.

\(6\). Create tau_i as the difference between y1 and y0. Then calculate the mean of tau_i. Compare it to the coefficient on untype4 in your regression results.

library(data.table)   # for copy() and the := syntax below
# (1) regress success on peacekeeping (untype4) and the conditioning variables
m = lm(success ~ untype4 + treaty + decade +
  factnum + logcost + wardur + trnsfcap + develop + exp,
  data = ds2000)
# (2) counterfactual data: flip the treatment indicator
ds2000_cf = copy(ds2000)
ds2000_cf$untype4 = 1*!(ds2000_cf$untype4)
# (3) impute each case's unobserved potential outcome
ds2000[, y_hat := predict(m, newdata = ds2000_cf)]
# (4)-(5) observed outcome where treatment matches; imputed value otherwise
ds2000[, y1 := ifelse(untype4 %in% 1, success, y_hat)]
ds2000[, y0 := ifelse(untype4 %in% 0, success, y_hat)]
# (6) unit-level effects and their mean
ds2000[, tau := y1 - y0]
mean(ds2000[, tau])
## [1] 0.4876439
 | Model 1 |
---|---|
(Intercept) | 1.484*** (0.207) |
untype4 | 0.488** (0.174) |
treaty | 0.331*** (0.096) |
decade | -0.050+ (0.027) |
factnum | -0.067* (0.027) |
logcost | -0.071*** (0.017) |
wardur | 0.000 (0.000) |
trnsfcap | 0.000 (0.000) |
develop | 0.000 (0.000) |
exp | -0.812+ (0.453) |
Num.Obs. | 122 |
R2 | 0.378 |
RMSE | 0.38 |
\(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
---|---|---|---|---|
111 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
112 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
113 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
114 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
115 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
116 | 1 | 0.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
117 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
118 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
119 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
120 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
121 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
122 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
\(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
---|---|---|---|---|
111 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
112 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
113 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
114 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
115 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
116 | 1 | 0.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
117 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
118 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
119 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
120 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
121 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
122 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
\(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
---|---|---|---|---|
111 | 0 | 0.78 | 0.00 | \(\color{red}{0.78}\) |
112 | 0 | 0.85 | 0.00 | \(\color{red}{0.85}\) |
113 | 0 | 0.57 | 0.00 | \(\color{red}{0.57}\) |
114 | 0 | 1.01 | 1.00 | \(\color{red}{0.01}\) |
115 | 0 | 1.10 | 1.00 | \(\color{red}{0.10}\) |
116 | 1 | 0.00 | 0.17 | \(\color{red}{-0.17}\) |
117 | 1 | 1.00 | 0.26 | \(\color{red}{0.74}\) |
118 | 1 | 1.00 | 0.19 | \(\color{red}{0.81}\) |
119 | 1 | 1.00 | 0.63 | \(\color{red}{0.37}\) |
120 | 1 | 1.00 | 0.58 | \(\color{red}{0.42}\) |
121 | 1 | 1.00 | 0.44 | \(\color{red}{0.56}\) |
122 | 1 | 1.00 | 0.32 | \(\color{red}{0.68}\) |
How it works:
Assumptions
Let’s turn to a different example: does working more hours increase earnings? Here we are interested in the causal effect of hours worked.
Let’s say that, to block all backdoor paths, we estimate this model:
\[\begin{eqnarray}Y_i = \beta_0 + \beta_1 Hours_i + \beta_2 Female_i + \\ \beta_3 Age_i + \beta_4 Law_i + \epsilon_i\end{eqnarray}\]
And we want to estimate \(\beta_1\) with \(\widehat{\beta_1}\).
If we are imputing missing potential outcomes of earnings for different hours worked using this model…
We need to ask…
Assuming additivity and linearity:
 | Linear/Additive |
---|---|
Hours Worked | 1076*** (115) |
Male | 37223*** (2800) |
Age (Years) | 3453*** (125) |
Law | -48920*** (2675) |
Intercept | -26140** (9017) |
Num.Obs. | 10000 |
R2 | 0.146 |
RMSE | 127723.58 |
How should we causally interpret coefficient for “Hours Worked”?
Can we give a causal interpretation to, e.g., the coefficient for “Law”?
We generally cannot give a clear causal interpretation to the coefficients on the variables \(\mathbf{X}\) in our regression model.
We have chosen \(X\) to block backdoor paths out of \(D\). At a minimum…
Non-linear dependence between residual \(D^*\) and X, despite regression
Assuming linearity, not additivity: the linear relationship between Age and hours (\(D\)) / earnings (\(Y\)) is allowed to differ by gender and profession
 | Linear/Additive | Linear/Interactive |
---|---|---|
Hours Worked | 1076*** (115) | 1175*** (114) |
Male | 37223*** (2800) | 51887** (18118) |
Age (Years) | 3453*** (125) | 4985*** (341) |
Law | -48920*** (2675) | 105972*** (19357) |
Num.Obs. | 10000 | 10000 |
R2 | 0.146 | 0.158 |
RMSE | 127723.58 | 126847.51 |
Non-linear dependence between D and X, despite regression
Assuming neither linearity nor additivity: fit an intercept for every combination of gender, profession, and age in years (linear, but not by assumption)
 | Linear/Additive | Linear/Interactive | Nonlinear/Interactive |
---|---|---|---|
Hours Worked | 1076*** (115) | 1175*** (114) | 1305*** (113) |
Male | 37223*** (2800) | 51887** (18118) | |
Age (Years) | 3453*** (125) | 4985*** (341) | |
Law | -48920*** (2675) | 105972*** (19357) | -46702*** (2614) |
Num.Obs. | 10000 | 10000 | 10000 |
R2 | 0.146 | 0.158 | 0.199 |
RMSE | 127723.58 | 126847.51 | 123733.07 |
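In R, the three specifications in the table might be written roughly as follows (the data frame earn and the variable names are hypothetical placeholders for the earnings data):

# Linear/Additive: one slope for every variable
m1 = lm(earnings ~ hours + male + age + law, data = earn)
# Linear/Interactive: the Age slope is allowed to differ by gender and profession
m2 = lm(earnings ~ hours + male * age * law, data = earn)
# Nonlinear/Interactive: an intercept for every gender x profession x age-in-years cell
m3 = lm(earnings ~ hours + interaction(male, law, factor(age)), data = earn)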
No dependence between \(D\) and \(X\), after regression
Even if we include all variables on the backdoor paths between \(D\) and \(Y\), regression may still produce a biased estimate:
This bias can take two forms.
Typically: we approximate the relationship between variables in \(X\) and \(D\) to be additive and linear. If this approximation is wrong, we can have bias.
What is wrong with interpolation bias?
Transparent triangles indicate \(\color{red}{\widehat{Y}}\) imputed by regression.
Actual (black) vs Regression (red) weights
 | (1) | (2) |
---|---|---|
(Intercept) | 12.004*** (1.142) | 1.606*** (0.289) |
d | -6.528*** (1.354) | 1.536*** (0.293) |
x | -0.153 (0.287) | -0.568*** (0.053) |
x2 | | 0.999*** (0.019) |
Num.Obs. | 100 | 100 |
R2 | 0.210 | 0.973 |
RMSE | 5.91 | 1.08 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
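A hedged sketch of the kind of data-generating process behind a comparison like this: treatment depends on \(x\), the outcome depends on \(x^2\), and a linear-in-\(x\) adjustment mis-imputes the counterfactuals (the numbers below will not reproduce the table exactly):

set.seed(4)
n = 100
x = runif(n, -3, 3)
d = rbinom(n, 1, ifelse(abs(x) < 1, 0.8, 0.2))   # treatment concentrated at middling x
y = 1.5 * d + x^2 + rnorm(n)                     # true effect of d is 1.5; x enters quadratically
coef(lm(y ~ d + x))["d"]                         # linear-in-x control: typically far from 1.5
coef(lm(y ~ d + x + I(x^2)))["d"]                # with the quadratic term: close to 1.5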
Actual (black) vs Regression (red) weights
When we approximate the relationship of confounding variables \(X\) with \(D\) and \(Y\) through linearity (or any imposed functional form) and additivity, we may generate interpolation bias even if all confounders are “blocked”:
In the regression context, Conditional Independence involves not only including the right variables \(X\), but also correctly specifying how they relate to \(D\) and \(Y\).
If we specify the model incorrectly, regression’s linear and additive approximation of the true CEF may lead us astray.
\(2\). Positivity/Common Support: For all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
Regression seems to solve the problem of positivity
Does being a democracy cause a country to provide better public goods?
We decide to estimate the following regression model. Let’s assume that log GDP per capita is the only confounder we need to block.
\(Public \ Goods_i = \beta_0 + \beta_1 Democracy_i +\) \(\beta_2 ln(Per \ Capita \ GDP_i) + \epsilon_i\)
Download (simulated) data from here:
\(1\). Estimate \(Public \ Goods_i = \beta_0 + \beta_1 Democracy_i + \beta_2 ln(Per \ Capita \ GDP_i) + \epsilon_i\) using the lm function. What is the ‘effect’ of democracy on public goods?
\(2\). What does this model assume about the relationship between per capita GDP and public goods for democracies? For non-democracies?
\(3\). Plot public goods on GDP per capita, grouping by democracy
require(ggplot2)
ggplot(data_9, aes(x = l_GDP_per_capita, y = public_goods, colour = as.factor(Democracy))) + geom_point()
\(4\). What is the true functional form of the relationship between GDP per capita and public goods for Democracies? For non-democracies where GDP per capita is \(> 0\)?
\(5\). Create a new variable indicating whether l_GDP_per_capita is \(< 0\).
\(6\). Repeat the regression from (1) for cases where l_GDP_per_capita \(< 0\). Why is the estimate of \(\widehat{\beta_1}\) different?
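One way the main steps might look (a sketch; it assumes the downloaded data frame is called data_9, as in the plotting code above):

# (1) full-sample regression
m_all = lm(public_goods ~ Democracy + l_GDP_per_capita, data = data_9)
coef(m_all)["Democracy"]
# (5)-(6) restrict to the region where both democracies and non-democracies are observed
data_9$low_gdp = data_9$l_GDP_per_capita < 0
m_low = lm(public_goods ~ Democracy + l_GDP_per_capita, data = subset(data_9, low_gdp))
coef(m_low)["Democracy"]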
We observe no non-democracies with GDP per capita as high as some democracies, but want to condition on GDP per capita.
or
Typically: we approximate the relationship between variables in \(X\) and \(D\) to be additive and linear.
If \(D = 1\) never occurs for certain values of \(X\) (e.g. \(X > 0\)), the regression model will use linearity (or ANY functional form\(^*\)) to extrapolate a prediction of what outcomes under \(D = 1\) would look like when \(X > 0\).
\(^*\): Linearity might not even be the worst for extrapolation.
Regression seems to solve the problem of positivity:
Is there common support here? Let’s just condition on these two variables
\[Success_i = b_0 + b_1 Peacekeeping_i + b_2 Treaty_i + b_3 Development_i\]
We assume each variable contributes additively and linearly to successful post-conflict situations
In the linear context, the estimated average treatment effect of peacekeeping is 0.419
What if we permit a more flexible model: we model the effect of peacekeeping as possibly different across treaty and economic development (we permit treaty and development to interact with peacekeeping)?
Then the estimated average treatment effect is: 0.233
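A sketch of the two specifications in R (assuming the same ds2000 variables as above; in the flexible model the average treatment effect is the mean of unit-level predicted differences rather than a single coefficient):

# additive, linear model: ATE read off the peacekeeping coefficient
m_lin = lm(success ~ untype4 + treaty + develop, data = ds2000)
coef(m_lin)["untype4"]
# flexible model: peacekeeping interacted with treaty and development
m_int = lm(success ~ untype4 * (treaty + develop), data = ds2000)
d1 = transform(ds2000, untype4 = 1)
d0 = transform(ds2000, untype4 = 0)
mean(predict(m_int, d1) - predict(m_int, d0))    # average of unit-level predicted differences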
Why the big discrepancy?
(Allowing effects of peacekeeping to vary across treaty and development)
Regression will extrapolate coefficients beyond the support of the data to “plug in” missing counter-factuals.
what to do about…

interpolation bias:
- flexible functional forms… polynomials not a great choice (show slide from last week)
- natural splines:
- splines: fit different slopes for different segments of the data…
- basics: cubic polynomials that are “piece-wise”, but must be continuous at cutpoints; linear at ends
- example… hours worked
- choice of knots…
- how do we choose? cross-validation
- cross-validation explained

extrapolation bias:
- convex hull?
- histogram of covariates by treatment group (what to do if treatment is continuous?… not clear)

saturated regression…
- what is it?
- how does it work?
- limits extrapolation
Before running a regression, examine scatterplots and histograms of your variables, looking for possible non-linearities and unexpectedly large/small values.
ggpairs in the GGally package is nice:
library(GGally)
ggpairs(ds2000)
We saw that we can use polynomial expansions.
Splines: https://bookdown.org/ssjackson300/Machine-Learning-Lecture-Notes/splines.html
Instead of fitting a single slope in X…
We can fit slopes, piecewise…
Better still is to fit polynomials, piecewise…
Choices to be made:
Common choice: “natural cubic splines”
Replicate the regression of success on untype4 and the other variables, now using splines::ns() for all other continuous variables.
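A sketch of one way to set this up (df = 3 for each splined variable is an arbitrary choice; which covariates to treat as continuous is a judgment call):

library(splines)
m_ns = lm(success ~ untype4 + treaty +
            ns(logcost, 3) + ns(wardur, 3) + ns(factnum, 3) +
            ns(trnsfcap, 3) + ns(develop, 3) + ns(exp, 3) + ns(decade, 3),
          data = ds2000)
summary(m_ns)$coefficients["untype4", ]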
If treatment is binary: compare histograms of the covariates across treatment groups to check for overlap.
If treatment is continuous: overlap is harder to assess (e.g. examine how covariate distributions change across levels of the treatment).
One solution to extrapolation and interpolation bias is saturated regression
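A sketch of what a saturated specification can look like (hypothetical data frame and covariate names; every covariate is treated as discrete and fully crossed into strata):

# one dummy per unique combination of covariate values
dat$stratum = interaction(dat$age_group, dat$education, dat$ethnicity,
                          dat$religion, dat$hh_size, dat$occupation, drop = TRUE)
m_sat = lm(prejudice ~ income + stratum, data = dat)
coef(m_sat)["income"]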
Pros: no functional form is assumed within covariate cells, which limits interpolation and extrapolation.
What is the effect of poverty on prejudice?
Courtin, Nellis, and Weaver examine the determinants of anti-Muslim prejudice in Myanmar just before the 2017 genocide of the Rohingya.
Using survey-reported income and prejudice, examine the estimated effect of income on prejudice.
 | Naive | Linear | Saturated |
---|---|---|---|
Individual covariates: | None | Linear | Saturated |
Township covariates: | None | Linear | Saturated |
Income level (1-4) | -0.050*** (0.004) | -0.030*** (0.003) | -0.032*** (0.005) |
\(N\) | 20,695 | 20,515 | 20,695 |
\(R^2\) | 0.04 | 0.08 | 0.78 |
x |
---|
0:18-27:Graduate:Shan:Buddhist:1:Agriculture |
1:18-27:High:Mixed ancestry:Christian:1:Staff |
1:28-37:Graduate:Kayah:Christian:2:Staff |
1:18-27:High:Kayah:Christian:1:Agriculture |
1:18-27:Graduate:Shan:Buddhist:1:Day Labour |
1:28-37:High:Kayah:Buddhist:4:Staff |
1:18-27:High:Kayah:Christian:2:Day Labour |
1:18-27:High:Kayah:Christian:1:Day Labour |
0:18-27:High:Kayah:Christian:3:Day Labour |
1:18-27:High:Shan:Christian:2:Staff |
We can use regression for conditioning without interpolation bias or extrapolation bias, but with many covariates most strata contain only a handful of cases, often only treated or only untreated units, so much of the data contributes little to the estimate.