Regression and Conditioning
How regression “conditions”
Potential Problems
Recall:
When there is possible confounding, we want to “block” the backdoor paths using conditioning
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(1\). Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of cause \(D\) (i.e. within values of \(X\), \(D\) must be as-if random)
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(2\). Positivity/Common Support: For all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
Given these assumptions, conditioning estimates the effect of \(D\) on \(Y\) within each subset of the data uniquely defined by values of \(X_i\):
\(\widehat{ACE}[X = x] = E[Y(1) | D=1, X = x] - E[Y(0) | D=0, X = x]\)
for each value of \(x\) in the data.
Under the Conditional Independence Assumption, \(E[Y(1) | D=1, X=x] = \color{red}{E[Y(1) | D=0, X=x]}\) and \(E[Y(0) | D=0, X=x] = \color{red}{E[Y(0) | D=1, X=x]}\)
We impute the missing potential outcomes… with the expected value of the outcome of observed cases with same values of \(x\), different values of \(D\).
\(E[Y(0) | D=0, X=x]\) is a conditional expectation function; something we can estimate using regression
To understand how we can use regression to estimate causal effects, we need to describe the kinds of causal estimands we might be interested in.
In the context of experiments, each observation has potential outcomes corresponding to their behavior under different treatments. If treatment is dichotomous, we might generally be interested in:
\[ACE = E[Y(1)] - E[Y(0)]\]
In regression, where levels of treatment might be continuous, we generalize this idea to the “response schedule”:
average causal response function: the average potential outcome \(E[Y_i(d)]\), viewed as a function of the treatment level \(d\)
We typically assume that we are only interested in the \(ACRF\) for the cases in our study; otherwise need to make assumptions about the sampling from a population.
average partial derivative (of the \(ACRF\)): \(E\left[\frac{\partial Y_i(d)}{\partial d}\right]\), the average slope of the response function
(Return to board)
We can use regression to estimate the linear approximation of the average causal response function:
\[Y_i(D_i = d) = \beta_0 + \beta_1 D_i + \epsilon_i\]
Here \(Y_i(D_i = d)\) is the potential outcome of case \(i\) for a value of \(D = d\).
If we don’t know parameters \(\beta_0, \beta_1\), what do we need to assume to obtain an estimate \(\widehat{\beta}_1\) that we can give a causal interpretation? (On average, change in \(D\) causes \(\widehat{\beta}_1\) change in \(Y\))
We must assume that \(D_i\) is independent of \(\epsilon_i\) (equivalently, of the potential outcomes), i.e. that \(D\) is as-if random.
In this scenario, if \(D\) were binary and we had randomization, this is equivalent to estimating the \(ACE\) for an experiment.
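A quick simulated check of this claim (hypothetical data; the true effect of 2 is chosen arbitrarily):

set.seed(1)
n = 1000
d = rbinom(n, 1, 0.5)          # randomized binary treatment
y0 = rnorm(n)                  # potential outcome under control
y1 = y0 + 2                    # potential outcome under treatment: true effect = 2
y = ifelse(d == 1, y1, y0)     # observed outcome
coef(lm(y ~ d))["d"]           # close to 2, the ACE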
If we want to use regression for conditioning, then the model would look different:
\[E[Y_i(D_i = d, X_i = x)] = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X}\]
\[Y_i(D_i = d, X_i = x) = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Given what we have learned about regression so far…
Classic “matching” to impute…
\(\widehat{ACE} = \sum\limits_{x \in X} \underbrace{Pr[X = x]}_{\text{Fraction } X = x}\overbrace{(E[Y_i(1) | D_i=1, X_i = x] - E[Y_i(0) | D_i = 0, X_i = x])}^{\text{Treated - Untreated, within } x}\)
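A minimal sketch of this stratified estimator on simulated data with one discrete confounder (variable names are hypothetical):

library(data.table)
set.seed(2)
dt = data.table(x = sample(1:3, 500, replace = TRUE))
dt[, d := rbinom(.N, 1, x / 4)]                  # treatment more likely at higher x (confounding)
dt[, y := 2 * d + x + rnorm(.N)]                 # true effect of d is 2
strata = dt[, .(diff = mean(y[d == 1]) - mean(y[d == 0]), n = .N), by = x]
strata[, sum(diff * n / sum(n))]                 # weight each stratum by Pr[X = x]; close to 2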
How does Regression differ?
Positivity requirement speaks to what subsets of the data contribute to our causal estimate:
If regression returns weighted average partial derivative of \(ACRF\)…
Aronow and Samii (2016) discuss how regression weights cases:
In a regression context, cases contribute to estimating \(\beta_D\) (causal effect of \(D\)) as a weighted average:
\[\beta_D = \frac{1}{\sum\limits_i^n w_i}\sum\limits_i^n \tau_iw_i\] where \(w_i = [D_i - E(D_i | X_i)]^2\): the squared residual of \(D_i\) from the regression of \(D\) on \(X\).
\(\tau_i = Y_i(1) - Y_i(0)\) (the unit causal effect)
These weights are lower when \(D\) is well predicted by \(X\).
These weights are higher when \(D\) is less well predicted by \(X\).
Thus, for cases where values of \(X\) nearly perfectly predict \(D\), positivity nearly fails and the weights are close to zero:
Regression weights may behave usefully (less weight where we have less variation in \(D\), so less information), but they change the causal estimand:
If \(D\) is continuous, regression also places more or less weight on different values of \(d\):
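A sketch of how one might compute and inspect these weights on simulated data (each case's weight is its squared residual from a regression of \(D\) on \(X\)):

set.seed(3)
x = rnorm(300)
d = rbinom(300, 1, plogis(3 * x))        # D strongly predicted by X at extreme values of x
w = residuals(lm(d ~ x))^2               # implied regression weight for each case
plot(x, w, ylab = "weight")              # weights shrink where X nearly determines D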
How it works:
Imagine we want to know the efficacy of UN Peacekeeping operations (Doyle & Sambanis 2000) after civil wars:
We can compare the post-conflict outcomes of countries with and without UN Peacekeeping operations.
To address concern about confounding, we condition on war type (non/ethnic), war deaths, war duration, number of factions, economic assistance, energy consumption, natural resource dependence, and whether the civil war ended in a treaty.
122 conflicts… can we find exact matches?
Using just treaty (2 values), decade (6 values), factnum (8 values), and wardur (43 values), we have only one exact match…
Without perfect matches on possible confounders, we don’t have cases without a Peacekeeping operation that we can use to substitute for the counterfactual outcome in conflicts with a Peacekeeping force.
We can use regression to linearly approximate the conditional expectation function \(E[Y(d) | D = d, X = x]\) to plug in the missing values.
\[Y_i = \beta_0 + \beta_D D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Download the data.

\(1\). Regress success on untype4 (\(D\)) and logcost, wardur, factnum, trnsfcap, treaty, develop, exp, decade, using lm(); save this as m.

\(2\). Copy ds2000 to ds2000_cf, and flip the value of untype4 (0 to 1, 1 to 0).

\(3\). Use predict(m, newdata = ds2000_cf) to add a new column called y_hat to ds2000.

Next:

\(4\). Create y1, which equals success for cases with untype4 == 1, and y_hat for cases with untype4 == 0.

\(5\). Create y0, which equals success for cases with untype4 == 0, and y_hat for cases with untype4 == 1.

\(6\). Create tau_i as the difference between y1 and y0. Then calculate the mean of tau_i. Compare it to the coefficient on untype4 in your regression results.

library(data.table)   # for copy() and the := syntax below
# (1) regress success on peacekeeping (untype4) and the conditioning variables
m = lm(success ~ untype4 + treaty + decade +
  factnum + logcost + wardur + trnsfcap + develop + exp,
  data = ds2000)
# (2) counterfactual data: flip the treatment indicator
ds2000_cf = copy(ds2000)
ds2000_cf$untype4 = 1*!(ds2000_cf$untype4)
# (3) impute each case's unobserved potential outcome
ds2000[, y_hat := predict(m, newdata = ds2000_cf)]
# (4)-(5) observed outcome where treatment matches; imputed value otherwise
ds2000[, y1 := ifelse(untype4 %in% 1, success, y_hat)]
ds2000[, y0 := ifelse(untype4 %in% 0, success, y_hat)]
# (6) unit-level effects and their mean
ds2000[, tau := y1 - y0]
mean(ds2000[, tau])
## [1] 0.4876439
 | Model 1 |
---|---|
(Intercept) | 1.484*** (0.207) |
untype4 | 0.488** (0.174) |
treaty | 0.331*** (0.096) |
decade | -0.050+ (0.027) |
factnum | -0.067* (0.027) |
logcost | -0.071*** (0.017) |
wardur | 0.000 (0.000) |
trnsfcap | 0.000 (0.000) |
develop | 0.000 (0.000) |
exp | -0.812+ (0.453) |
Num.Obs. | 122 |
R2 | 0.378 |
RMSE | 0.38 |
\(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
---|---|---|---|---|
111 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
112 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
113 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
114 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
115 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
116 | 1 | 0.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
117 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
118 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
119 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
120 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
121 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
122 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
\(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
---|---|---|---|---|
111 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
112 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
113 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
114 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
115 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
116 | 1 | 0.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
117 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
118 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
119 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
120 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
121 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
122 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
\(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
---|---|---|---|---|
111 | 0 | 0.78 | 0.00 | \(\color{red}{0.78}\) |
112 | 0 | 0.85 | 0.00 | \(\color{red}{0.85}\) |
113 | 0 | 0.57 | 0.00 | \(\color{red}{0.57}\) |
114 | 0 | 1.01 | 1.00 | \(\color{red}{0.01}\) |
115 | 0 | 1.10 | 1.00 | \(\color{red}{0.10}\) |
116 | 1 | 0.00 | 0.17 | \(\color{red}{-0.17}\) |
117 | 1 | 1.00 | 0.26 | \(\color{red}{0.74}\) |
118 | 1 | 1.00 | 0.19 | \(\color{red}{0.81}\) |
119 | 1 | 1.00 | 0.63 | \(\color{red}{0.37}\) |
120 | 1 | 1.00 | 0.58 | \(\color{red}{0.42}\) |
121 | 1 | 1.00 | 0.44 | \(\color{red}{0.56}\) |
122 | 1 | 1.00 | 0.32 | \(\color{red}{0.68}\) |
How it works:
Assumptions
Let’s turn to a different example: does working more hours increase earnings? Here we are interested in the causal effect of hours worked.
Let’s say that, to block all backdoor paths, we estimate this model:
\[\begin{eqnarray}Y_i = \beta_0 + \beta_1 Hours_i + \beta_2 Female_i + \\ \beta_3 Age_i + \beta_4 Law_i + \epsilon_i\end{eqnarray}\]
And we want to estimate \(\beta_1\) with \(\widehat{\beta_1}\).
If we are imputing missing potential outcomes of earnings for different hours worked using this model…
We need to ask…
Assuming additivity and linearity:
 | Linear/Additive |
---|---|
Hours Worked | 1076*** (115) |
Male | 37223*** (2800) |
Age (Years) | 3453*** (125) |
Law | -48920*** (2675) |
Intercept | -26140** (9017) |
Num.Obs. | 10000 |
R2 | 0.146 |
RMSE | 127723.58 |
How should we causally interpret coefficient for “Hours Worked”?
Can we give a causal interpretation to, e.g., the coefficient for “Law”?
We generally cannot give a clear causal interpretation to the coefficients on the variables \(\mathbf{X}\) in our regression model.
We have chosen \(X\) to block backdoor paths out of \(D\). At a minimum…
Non-linear dependence between residual \(D^*\) and X, despite regression
Assuming linearity, not additivity: the linear relationship between Age and hours (\(D\)) / earnings (\(Y\)) is allowed to differ by gender and profession
 | Linear/Additive | Linear/Interactive |
---|---|---|
Hours Worked | 1076*** (115) | 1175*** (114) |
Male | 37223*** (2800) | 51887** (18118) |
Age (Years) | 3453*** (125) | 4985*** (341) |
Law | -48920*** (2675) | 105972*** (19357) |
Num.Obs. | 10000 | 10000 |
R2 | 0.146 | 0.158 |
RMSE | 127723.58 | 126847.51 |
Non-linear dependence between D and X, despite regression
Assuming neither linearity nor additivity: fit an intercept for every combination of gender, profession, and age in years (linear, but not by assumption)
 | Linear/Additive | Linear/Interactive | Nonlinear/Interactive |
---|---|---|---|
Hours Worked | 1076*** (115) | 1175*** (114) | 1305*** (113) |
Male | 37223*** (2800) | 51887** (18118) | |
Age (Years) | 3453*** (125) | 4985*** (341) | |
Law | -48920*** (2675) | 105972*** (19357) | -46702*** (2614) |
Num.Obs. | 10000 | 10000 | 10000 |
R2 | 0.146 | 0.158 | 0.199 |
RMSE | 127723.58 | 126847.51 | 123733.07 |
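In R, the three specifications in the table might be written roughly as follows (the data frame earn and the variable names are hypothetical placeholders for the earnings data):

# Linear/Additive: one slope for every variable
m1 = lm(earnings ~ hours + male + age + law, data = earn)
# Linear/Interactive: the Age slope is allowed to differ by gender and profession
m2 = lm(earnings ~ hours + male * age * law, data = earn)
# Nonlinear/Interactive: an intercept for every gender x profession x age-in-years cell
m3 = lm(earnings ~ hours + interaction(male, law, factor(age)), data = earn)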
No dependence between \(D\) and \(X\), after regression
Even if we include all variables on the backdoor paths between \(D\) and \(Y\), regression may still produce a biased estimate:
This bias can take two forms.
Typically: we approximate the relationship between variables in \(X\) and \(D\) to be additive and linear. If this approximation is wrong, we can have bias.
What is wrong with interpolation bias?
Transparent triangles indicate \(\color{red}{\widehat{Y}}\) imputed by regression.
Actual (black) vs Regression (red) weights
 | (1) | (2) |
---|---|---|
(Intercept) | 12.004*** (1.142) | 1.606*** (0.289) |
d | -6.528*** (1.354) | 1.536*** (0.293) |
x | -0.153 (0.287) | -0.568*** (0.053) |
x2 | | 0.999*** (0.019) |
Num.Obs. | 100 | 100 |
R2 | 0.210 | 0.973 |
RMSE | 5.91 | 1.08 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
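A hedged sketch of the kind of data-generating process behind a comparison like this: treatment depends on \(x\), the outcome depends on \(x^2\), and a linear-in-\(x\) adjustment mis-imputes the counterfactuals (the numbers below will not reproduce the table exactly):

set.seed(4)
n = 100
x = runif(n, -3, 3)
d = rbinom(n, 1, ifelse(abs(x) < 1, 0.8, 0.2))   # treatment concentrated at middling x
y = 1.5 * d + x^2 + rnorm(n)                     # true effect of d is 1.5; x enters quadratically
coef(lm(y ~ d + x))["d"]                         # linear-in-x control: typically far from 1.5
coef(lm(y ~ d + x + I(x^2)))["d"]                # with the quadratic term: close to 1.5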
Actual (black) vs Regression (red) weights
When we approximate the relationship of confounding variables \(X\) with \(D\) and \(Y\) through linearity (or any imposed functional form) and additivity, we may generate interpolation bias even if all confounders are “blocked”:
In the regression context, Conditional Independence involves not only including the right variables \(X\), but also correctly specifying how they relate to \(D\) and \(Y\).
If we specify the model incorrectly, regression’s linear and additive approximation of the true CEF may lead us astray.
\(2\). Positivity/Common Support: For all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
Regression seems to solve the problem of positivity
Does being a democracy cause a country to provide better public goods?
We decide to estimate the following regression model. Let’s assume that log GDP per capita is the only confounder we need to block.
\(Public \ Goods_i = \beta_0 + \beta_1 Democracy_i +\) \(\beta_2 ln(Per \ Capita \ GDP_i) + \epsilon_i\)
Download (simulated) data from here:
\(1\). Estimate \(Public \ Goods_i = \beta_0 + \beta_1 Democracy_i + \beta_2 ln(Per \ Capita \ GDP_i) + \epsilon_i\) using the lm function. What is the ‘effect’ of democracy on public goods?
\(2\). What does this model assume about the relationship between per capita GDP and public goods for democracies? For non-democracies?
\(3\). Plot public goods on GDP per capita, grouping by democracy
require(ggplot2)
ggplot(data_9, aes(x = l_GDP_per_capita, y = public_goods, colour = as.factor(Democracy))) + geom_point()
\(4\). What is the true functional form of the relationship between GDP per capita and public goods for Democracies? For non-democracies where GDP per capita is \(> 0\)?
\(5\). Create a new variable indicating whether l_GDP_per_capita is \(< 0\).
\(6\). Repeat the regression from (1) for cases where l_GDP_per_capita \(< 0\). Why is the estimate of \(\widehat{\beta_1}\) different?
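One way the main steps might look (a sketch; it assumes the downloaded data frame is called data_9, as in the plotting code above):

# (1) full-sample regression
m_all = lm(public_goods ~ Democracy + l_GDP_per_capita, data = data_9)
coef(m_all)["Democracy"]
# (5)-(6) restrict to the region where both democracies and non-democracies are observed
data_9$low_gdp = data_9$l_GDP_per_capita < 0
m_low = lm(public_goods ~ Democracy + l_GDP_per_capita, data = subset(data_9, low_gdp))
coef(m_low)["Democracy"]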
We observe no non-democracies with GDP per capita as high as some democracies, but want to condition on GDP per capita.
or
Typically: we approximate the relationship between variables in \(X\) and \(D\) to be additive and linear.
If \(D = 1\) never occurs for certain values of \(X\) (e.g. \(X > 0\)), the regression model will use linearity (or ANY functional form\(^*\)) to extrapolate a prediction of what outcomes under \(D = 1\) would look like when \(X > 0\).
\(^*\): Linearity might not even be the worst for extrapolation.
Regression seems to solve the problem of positivity:
Is there common support here? Let’s just condition on these two variables
\[Success_i = b_0 + b_1 Peacekeeping_i + b_2 Treaty_i + b_3 Development_i\]
We assume each variable contributes additively and linearly to successful post-conflict situations
In the linear context, the estimated average treatment effect of peacekeeping is 0.419
What if we permit a more flexible model: we model the effect of peacekeeping as possibly different across treaty and economic development (we permit treaty and development to interact with peacekeeping)?
Then the estimated average treatment effect is: 0.233
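A sketch of the two specifications in R (assuming the same ds2000 variables as above; in the flexible model the average treatment effect is the mean of unit-level predicted differences rather than a single coefficient):

# additive, linear model: ATE read off the peacekeeping coefficient
m_lin = lm(success ~ untype4 + treaty + develop, data = ds2000)
coef(m_lin)["untype4"]
# flexible model: peacekeeping interacted with treaty and development
m_int = lm(success ~ untype4 * (treaty + develop), data = ds2000)
d1 = transform(ds2000, untype4 = 1)
d0 = transform(ds2000, untype4 = 0)
mean(predict(m_int, d1) - predict(m_int, d0))    # average of unit-level predicted differences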
Why the big discrepancy?
(Allowing effects of peacekeeping to vary across treaty and development)
Regression will extrapolate coefficients beyond the support of the data to “plug in” missing counter-factuals.
what to do about…

interpolation bias:
- flexible functional forms… polynomials not a great choice (show slide from last week)
- natural splines:
- splines: fit different slopes for different segments of the data…
- basics: cubic polynomials that are “piece-wise”, but must be continuous at cutpoints; linear at ends
- example… hours worked
- choice of knots…
- how do we choose? cross-validation
- cross-validation explained

extrapolation bias:
- convex hull?
- histogram of covariates by treatment group (what to do if treatment is continuous?… not clear)

saturated regression…
- what is it?
- how does it work?
- limits extrapolation
Before running a regression, examine scatterplots and histograms of your variables, looking for possible non-linearities and unexpectedly large/small values.
ggpairs in the GGally package is nice:
library(GGally)
ggpairs(ds2000)
We saw that we can use polynomial expansions.
Splines: https://bookdown.org/ssjackson300/Machine-Learning-Lecture-Notes/splines.html
Instead of fitting a single slope in X…
We can fit slopes, piecewise…
Better still is to fit polynomials, piecewise…
Choices to be made:
Common choice: “natural cubic splines”
Replicate the regression of success on untype4 and the other variables, now using splines::ns() for all other continuous variables.
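A sketch of one way to set this up (df = 3 for each splined variable is an arbitrary choice; which covariates to treat as continuous is a judgment call):

library(splines)
m_ns = lm(success ~ untype4 + treaty +
            ns(logcost, 3) + ns(wardur, 3) + ns(factnum, 3) +
            ns(trnsfcap, 3) + ns(develop, 3) + ns(exp, 3) + ns(decade, 3),
          data = ds2000)
summary(m_ns)$coefficients["untype4", ]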
If treatment is binary: compare histograms of the covariates across treatment groups to check for overlap.
If treatment is continuous: overlap is harder to assess (e.g. examine how covariate distributions change across levels of the treatment).
One solution to extrapolation and interpolation bias is saturated regression
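A sketch of what a saturated specification can look like (hypothetical data frame and covariate names; every covariate is treated as discrete and fully crossed into strata):

# one dummy per unique combination of covariate values
dat$stratum = interaction(dat$age_group, dat$education, dat$ethnicity,
                          dat$religion, dat$hh_size, dat$occupation, drop = TRUE)
m_sat = lm(prejudice ~ income + stratum, data = dat)
coef(m_sat)["income"]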
Pros: no functional form is assumed within covariate cells, which limits interpolation and extrapolation.
What is the effect of poverty on prejudice?
Courtin, Nellis, and Weaver examine the determinants of anti-Muslim prejudice in Myanmar just before the 2017 genocide of the Rohingya.
Using survey-reported income and prejudice, examine the estimated effect of income on prejudice.
 | Naive | Linear | Saturated |
---|---|---|---|
Individual covariates: | None | Linear | Saturated |
Township covariates: | None | Linear | Saturated |
Income level (1-4) | -0.050*** (0.004) | -0.030*** (0.003) | -0.032*** (0.005) |
\(N\) | 20,695 | 20,515 | 20,695 |
\(R^2\) | 0.04 | 0.08 | 0.78 |
x |
---|
0:18-27:Graduate:Shan:Buddhist:1:Agriculture |
1:18-27:High:Mixed ancestry:Christian:1:Staff |
1:28-37:Graduate:Kayah:Christian:2:Staff |
1:18-27:High:Kayah:Christian:1:Agriculture |
1:18-27:Graduate:Shan:Buddhist:1:Day Labour |
1:28-37:High:Kayah:Buddhist:4:Staff |
1:18-27:High:Kayah:Christian:2:Day Labour |
1:18-27:High:Kayah:Christian:1:Day Labour |
0:18-27:High:Kayah:Christian:3:Day Labour |
1:18-27:High:Shan:Christian:2:Staff |
We can use regression for conditioning without interpolation bias or extrapolation bias, but with many covariates most strata contain only a handful of cases, often only treated or only untreated units, so much of the data contributes little to the estimate.