Regression and Conditioning
How regression “conditions”
Potential Problems
Recall:
When there is possible confounding, we want to “block” these causal paths using conditioning
In order for conditioning to estimate the \(ACE\) without bias, we must assume
\(1\). Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of cause \(D\) (i.e. within values of \(X\), \(D\) must be as-if random)
\(2\). Positivity/Common Support: for all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
We find the effect of \(D\) on \(Y\) within each subset of the data uniquely defined by values of \(X_i\).
\(\widehat{ACE}[X = x] = E[Y(1) | D=1, X = x] - E[Y(0) | D=0, X = x]\)
for each value of \(x\) in the data.
Under the Conditional Independence Assumption, \(E[Y(1) | D=1, X=x] = \color{red}{E[Y(1) | D=0, X=x]}\) and \(E[Y(0) | D=0, X=x] = \color{red}{E[Y(0) | D=1, X=x]}\)
We impute the missing potential outcomes… with the expected value of the outcome of observed cases with same values of \(x\), different values of \(D\).
\(E[Y(0) | D=0, X=x]\) is a conditional expectation function; something we can estimate using regression
How does regression relate to our causal estimands?
Curse of dimensionality \(\to\) regression:
To understand how we can use regression to estimate causal effects, we need to describe the kinds of causal estimands we might be interested in.
In the context of experiments, each observation has potential outcomes corresponding to their behavior under different treatments. If treatment is dichotomous, we might generally be interested in:
\[ACE = E[Y(1)] - E[Y(0)]\]
In regression, where levels of treatment might be continuous, we generalize this idea to the “response schedule”:
average causal response function:
We typically assume that we are only interested in the \(ACRF\) for the cases in our study; otherwise we need to make assumptions about sampling from a population.
average partial derivative (of the \(ACRF\)):
(Return to board)
We can use regression to estimate the linear approximation of the average causal response function:
\[Y_i(D_i = d) = \beta_0 + \beta_1 D_i + \epsilon_i\]
Here \(Y_i(D_i = d)\) is the potential outcome of case \(i\) for a value of \(D = d\).
If we want to use regression for conditioning, then the model would look different:
\[E[Y_i(D_i = d, X_i = x)] = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X}\]
\[Y_i(D_i = d, X_i = x) = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Given what we have learned about regression so far…
Classic “matching” to impute…
\[\begin{aligned} \widehat{ACE} = \sum\limits_{x \in X} \overbrace{Pr[X = x]}^{\text{Fraction } X = x}&(E[Y_i(1) | D_i=1, X_i = x] - \\ & E[Y_i(0) | D_i = 0, X_i = x]) \end{aligned}\]
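The stratified estimator above can be sketched directly. Here is a minimal simulation (all numbers invented for illustration): with one binary confounder, we take the difference in mean outcomes within each stratum of \(X\) and average, weighting by \(Pr[X = x]\).

```r
# Hypothetical simulation: stratified estimate of the ACE with one binary
# confounder x. True unit effect is 2; x confounds d.
set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.5)
d <- rbinom(n, 1, 0.3 + 0.4 * x)   # treatment more likely when x = 1
y <- 2 * d + x + rnorm(n)

# Within-stratum difference in means, weighted by Pr[X = x]
ace_hat <- sum(sapply(unique(x), function(s) {
  mean(x == s) * (mean(y[d == 1 & x == s]) - mean(y[d == 0 & x == s]))
}))
ace_hat   # close to the true ACE of 2
```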
How does Regression differ?
Positivity requirement speaks to what subsets of the data contribute to our causal estimate:
Aronow and Samii (2016) discuss how regression weights cases, with a binary treatment.
Cases contribute to estimating \(\beta_D\) (causal effect of \(D\)) as a weighted average:
\[\beta_D = \frac{1}{\sum\limits_i^n w_i}\sum\limits_i^n \tau_iw_i\] where \(w_i = [D_i - E(D_i | X_i)]^2\): the squared residual of \(D_i\) from \(\hat{D}_i\), the value predicted by \(X\).
\(\tau_i = Y(D_i = 1) - Y(D_i = 0)\) (unit causal effect)
These weights are lower for values \(x\) where \(D\) is well predicted by \(X\).
These weights are higher for values \(x\) where \(D\) is less well predicted by \(X\).
Thus, for cases where values of \(X\) nearly perfectly predict \(D\), there is close to zero positivity, and weights are closer to zero:
Regression weights may behave usefully (less weight where we have less variation in \(D\), so less information), but they change causal estimand:
Ask: which cases, defined in terms of \(X\), get more or less weight?
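A small simulation may help fix ideas (the data-generating process is invented for illustration): each case's implied weight is the squared residual from predicting \(D\) with \(X\), so cases whose treatment status is nearly determined by \(X\) contribute almost nothing.

```r
# Hypothetical simulation of Aronow & Samii (2016)-style regression weights.
set.seed(2)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(2 * x))   # Pr(D = 1 | X) near 0/1 for extreme x
w <- (d - fitted(lm(d ~ x)))^2     # w_i = [D_i - E(D_i | X_i)]^2 (linear approx.)

# Weights are largest where D is least predictable from X:
mean(w[abs(x) < 0.5])   # high: Pr(D = 1 | X) near 0.5
mean(w[abs(x) > 1.5])   # low: D nearly determined by X
```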
If \(D\) is continuous, regression also places more or less weight on different values of \(d\):
We want to know… how do conditional independence and common support translate to regression context?
What new problems could arise? (We’ll assume that we have the right variables to block backdoor path)
How it works:
Imagine we want to know the efficacy of UN Peacekeeping operations (Doyle & Sambanis 2000) after civil wars:
We can compare the post-conflict outcomes of countries with and without UN Peacekeeping operations.
To address concern about confounding, we condition on war type (non/ethnic), war deaths, war duration, number of factions, economic assistance, energy consumption, natural resource dependence, and whether the civil war ended in a treaty.
122 conflicts… can we find exact matches?
Using just treaty (2 values), decade (6 values), factnum (8 values), and wardur (43 values), we have only one exact match…
Without perfect matches on possible confounders, we don’t have cases without a Peacekeeping operation that we can use to substitute for the counterfactual outcome in conflicts with a Peacekeeping force.
We can use regression to linearly approximate the conditional expectation function \(E[Y(d) | D = d, X = x]\) to plug in the missing values.
\[Y_i = \beta_0 + \beta_D D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Download the data. Then:

1. Regress success on untype4 (\(D\)) and logcost, wardur, factnum, trnsfcap, treaty, develop, exp, and decade, using lm(); save this as m.
2. Copy ds2000, call it ds2000_cf, and flip the value of untype4 (0 to 1, 1 to 0).
3. Use predict(m, newdata = ds2000_cf) to add a new column called y_hat to ds2000.
4. Create y1, which equals success for cases with untype4 == 1, and y_hat for cases with untype4 == 0.
5. Create y0, which equals success for cases with untype4 == 0, and y_hat for cases with untype4 == 1.
6. Create tau_i as the difference between y1 and y0. Then calculate the mean tau_i.
7. Compare this mean to the coefficient on untype4 in your regression results.

```r
m = lm(success ~ untype4 + treaty + decade +
         factnum + logcost + wardur + trnsfcap + develop + exp,
       data = ds2000)
ds2000_cf = ds2000 %>% copy
ds2000_cf$untype4 = 1 * !(ds2000_cf$untype4)
ds2000[, y_hat := predict(m, newdata = ds2000_cf)]
ds2000[, y1 := ifelse(untype4 %in% 1, success, y_hat)]
ds2000[, y0 := ifelse(untype4 %in% 0, success, y_hat)]
ds2000[, tau := y1 - y0]
ds2000[, tau] %>% mean()
## [1] 0.4876439
```
| | Model 1 |
|---|---|
| (Intercept) | 1.484*** (0.207) |
| untype4 | 0.488** (0.174) |
| treaty | 0.331*** (0.096) |
| decade | -0.050+ (0.027) |
| factnum | -0.067* (0.027) |
| logcost | -0.071*** (0.017) |
| wardur | 0.000 (0.000) |
| trnsfcap | 0.000 (0.000) |
| develop | 0.000 (0.000) |
| exp | -0.812+ (0.453) |
| Num.Obs. | 122 |
| R2 | 0.378 |
| RMSE | 0.38 |
| \(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
|---|---|---|---|---|
| 111 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
| 112 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
| 113 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
| 114 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
| 115 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
| 116 | 1 | 0.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 117 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 118 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 119 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 120 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 121 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 122 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| \(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
|---|---|---|---|---|
| 111 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
| 112 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
| 113 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
| 114 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
| 115 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
| 116 | 1 | 0.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 117 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 118 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 119 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 120 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 121 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 122 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| \(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
|---|---|---|---|---|
| 111 | 0 | 0.78 | 0.00 | \(\color{red}{0.78}\) |
| 112 | 0 | 0.85 | 0.00 | \(\color{red}{0.85}\) |
| 113 | 0 | 0.57 | 0.00 | \(\color{red}{0.57}\) |
| 114 | 0 | 1.01 | 1.00 | \(\color{red}{0.01}\) |
| 115 | 0 | 1.10 | 1.00 | \(\color{red}{0.10}\) |
| 116 | 1 | 0.00 | 0.17 | \(\color{red}{-0.17}\) |
| 117 | 1 | 1.00 | 0.26 | \(\color{red}{0.74}\) |
| 118 | 1 | 1.00 | 0.19 | \(\color{red}{0.81}\) |
| 119 | 1 | 1.00 | 0.63 | \(\color{red}{0.37}\) |
| 120 | 1 | 1.00 | 0.58 | \(\color{red}{0.42}\) |
| 121 | 1 | 1.00 | 0.44 | \(\color{red}{0.56}\) |
| 122 | 1 | 1.00 | 0.32 | \(\color{red}{0.68}\) |
How it works:
Assumptions
Let’s turn to a different example: does working more hours increase earnings? Here we are interested in the causal effect of hours worked.
Let’s say that, to block all backdoor paths, we estimate this model:
\[\begin{eqnarray}Y_i = \beta_0 + \beta_1 Hours_i + \beta_2 Female_i + \\ \beta_3 Age_i + \beta_4 Law_i + \epsilon_i\end{eqnarray}\]
And we want to find \(\beta_1\) with \(\widehat{\beta_1}\)
If we are imputing missing potential outcomes of earnings for different hours worked using this model…
We need to ask…
Assuming additivity and linearity:
| | Linear/Additive |
|---|---|
| Hours Worked | 1076*** (115) |
| Male | 37223*** (2800) |
| Age (Years) | 3453*** (125) |
| Law | -48920*** (2675) |
| Intercept | -26140** (9017) |
| Num.Obs. | 10000 |
| R2 | 0.146 |
| RMSE | 127723.58 |
How should we causally interpret the coefficient on “Hours Worked”?
Can we give a causal interpretation to, e.g., the coefficient on “Law”?
We generally cannot give a clear causal interpretation to the coefficients on the variables \(\mathbf{X}\) in our regression model.
We have chosen \(X\) to block backdoor paths out of \(D\). At a minimum…
Don’t interpret coefficients other than for \(D\)
Non-linear dependence between residual \(D^*\) and X, despite regression
Assuming linearity, not additivity: the linear relationship between Age and hours (\(D\)) / earnings (\(Y\)) is allowed to differ by gender and profession
| | Linear/Additive | Linear/Interactive |
|---|---|---|
| Hours Worked | 1076*** (115) | 1175*** (114) |
| Male | 37223*** (2800) | 51887** (18118) |
| Age (Years) | 3453*** (125) | 4985*** (341) |
| Law | -48920*** (2675) | 105972*** (19357) |
| Num.Obs. | 10000 | 10000 |
| R2 | 0.146 | 0.158 |
| RMSE | 127723.58 | 126847.51 |
Non-linear dependence between D and X, despite regression
Assuming neither linearity nor additivity: fit an intercept for every unique combination of gender, profession, and age in years. (Linear by specification, not by assumption.)
| | Linear/Additive | Linear/Interactive | Nonlinear/Interactive |
|---|---|---|---|
| Hours Worked | 1076*** (115) | 1175*** (114) | 1305*** (113) |
| Male | 37223*** (2800) | 51887** (18118) | |
| Age (Years) | 3453*** (125) | 4985*** (341) | |
| Law | -48920*** (2675) | 105972*** (19357) | -46702*** (2614) |
| Num.Obs. | 10000 | 10000 | 10000 |
| R2 | 0.146 | 0.158 | 0.199 |
| RMSE | 127723.58 | 126847.51 | 123733.07 |
No dependence between \(D\) and \(X\), after regression
Even if we include all variables on backdoor paths between \(D\) and \(Y\), regression may still produce a biased estimate:
This bias can take two forms.
Bias due to model dependence:
Interpolation bias: model dependence \(\to\) failure of conditional independence, even if backdoor paths “blocked”
Extrapolation bias: model dependence + lack of common support \(\to\) bias
Typically: to impute missing potential outcomes of \(Y\) and remove dependence between \(D\) and \(X\), we assume relationships are additive and linear. If this approximation is wrong, we can have bias.
What is wrong with interpolation bias?
Transparent triangles indicate \(\color{red}{\widehat{Y}}\) imputed by regression.
Actual (black) vs Regression (red) weights; mean of \(D\) by \(X\) (blue)
| | (1) | (2) |
|---|---|---|
| (Intercept) | 12.004*** | 1.606*** |
| | (1.142) | (0.289) |
| d | -6.528*** | 1.536*** |
| | (1.354) | (0.293) |
| x | -0.153 | -0.568*** |
| | (0.287) | (0.053) |
| x2 | | 0.999*** |
| | | (0.019) |
| Num.Obs. | 100 | 100 |
| R2 | 0.210 | 0.973 |
| RMSE | 5.91 | 1.08 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Actual (black) vs Regression (red) weights
When we approximate the relationships of confounding variables \(X\) with \(D\) and \(Y\) through linearity (or any imposed functional form) and additivity, we may generate interpolation bias even if all backdoor paths are “blocked”:
In the regression context, Conditional Independence involves
If we specify the model incorrectly, regression’s linear and additive approximation of the true CEF may lead us astray.
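As a sketch of how this failure looks (a made-up data-generating process, not the example above): if the true CEF of \(Y\) in \(x\) is quadratic, controlling for \(x\) only linearly leaves confounding in the coefficient on \(D\), while adding \(x^2\) recovers the true effect.

```r
# Hypothetical simulation of interpolation bias from a misspecified model.
set.seed(3)
n <- 1000
x <- runif(n, 0, 3)
d <- rbinom(n, 1, plogis(2 * (x - 2)))   # x confounds d, non-linearly
y <- 1.5 * d + x^2 + rnorm(n)            # true effect of d is 1.5

b_linear <- coef(lm(y ~ d + x))["d"]           # biased: x enters only linearly
b_quad   <- coef(lm(y ~ d + x + I(x^2)))["d"]  # matches the true CEF
c(linear = b_linear, quadratic = b_quad)
```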
\(2\). Positivity/Common Support: for all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)
Regression seems to solve the problem of positivity
Does being a democracy cause a country to provide better public goods?
We decide to estimate the following regression model. Let’s assume that log GDP per capita is the only confounder we need to block.
\(Public \ Goods_i = \beta_0 + \beta_1 Democracy_i +\) \(\beta_2 ln(Per \ Capita \ GDP_i) + \epsilon_i\)
Download (simulated) data from here:
\(1\). Estimate \(Public \ Goods_i = \beta_0 + \beta_1 Democracy_i + \beta_2 ln(Per \ Capita \ GDP_i) + \epsilon_i\) using the lm function. What is the ‘effect’ of democracy on public goods?
\(2\). What does this model assume about the relationship between per capita GDP and public goods for democracies? For non-democracies?
\(3\). Plot public goods on GDP per capita, grouping by democracy
require(ggplot2)
ggplot(data_9, aes(x = l_GDP_per_capita, y = public_goods, colour = Democracy %>% as.factor)) + geom_point()
\(4\). What is the true functional form of the relationship between GDP per capita and public goods for Democracies? For non-democracies where GDP per capita is \(> 0\)?
\(5\). Create a new variable indicating whether l_GDP_per_capita is \(< 0\).
\(6\). Repeat the regression from (1) for cases where l_GDP_per_capita \(< 0\). Why is the estimate of \(\widehat{\beta_1}\) different?
We observe no non-democracies with GDP per capita as high as some democracies, but want to condition on GDP per capita.
or
Typically: we approximate the relationship between variables in \(X\) and \(D\) to be additive and linear.
If \(D = 1\) never occurs for certain values of \(X\) (e.g. \(X > 0\)), the regression model will use linearity (or ANY functional form\(^*\)) to extrapolate a predicted value of what \(Y\) under \(D = 1\) would look like when \(X > 0\).
\(^*\): Linearity might not even be the worst for extrapolation.
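A toy illustration of the point (simulated; names and numbers invented): when no treated case exists above some value of \(X\), the imputed \(Y(1)\) there comes entirely from the fitted functional form, not from any comparable observation.

```r
# Hypothetical simulation: treated cases exist only where x < 1.
set.seed(4)
n <- 500
x <- runif(n, 0, 3)
d <- as.integer(x < 1) * rbinom(n, 1, 0.5)   # no treated cases with x >= 1
y <- d + log(1 + x) + rnorm(n, sd = 0.2)     # true CEF is non-linear in x

m <- lm(y ~ d + x)
sum(d == 1 & x >= 1)                              # 0: no support for D = 1 there
predict(m, newdata = data.frame(d = 1, x = 2.9))  # pure extrapolation
```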
Regression seems to solve the problem of positivity:
Is there common support here? Let’s just condition on these two variables
\[Success_i = b_0 + b_1 Peacekeeping_i + b_2 Treaty_i + b_3 Development_i\]
We assume each variable contributes additively and linearly to successful post-conflict situations
In the linear context, the estimated average treatment effect of peacekeeping is 0.419
What if we permit a more flexible model: we model effect of peacekeeping as possibly different across treaty, economic development (we permit treaty and development to interact with peacekeeping)?
\[Success_i = b_0 + b_1 Peacekeeping_i + b_2 Treaty_i + b_3 Development_i + b_4 P_i \times Dev_i + b_5 P_i \times T_i + b_6 P_i \times T_i \times Dev_i\]
Then the estimated average treatment effect is: 0.233
Why the big discrepancy?
(Allowing effects of peacekeeping to vary across treaty and development)
Regression will extrapolate coefficients beyond the support of the data to “plug in” missing counterfactuals.
Regression can permit lack of exact common support, but need overlapping values of \(X\) and \(D\)
flexible functional forms… polynomials not a great choice (show slide from last week)
natural splines:
convex hull?
histogram of covariates by treatment group (what to do if treatment is continuous?… not clear)
saturated regression…
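For a binary treatment, the histogram check can be sketched like this (simulated data; in practice substitute your treatment and each confounder):

```r
# Hypothetical overlap diagnostic: confounder distribution by treatment group.
library(ggplot2)
set.seed(5)
dat <- data.frame(d = rbinom(300, 1, 0.5))
dat$x <- rnorm(300, mean = 2 * dat$d)   # groups overlap only partially

p <- ggplot(dat, aes(x = x, fill = factor(d))) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(fill = "D")
p
```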
Before running a regression, make scatterplots and histograms of your variables, looking for possible non-linearities and unexpectedly large or small values.
ggpairs in the GGally package is nice.
ggpairs(ds2000)
We saw we can use polynomial expansions
Splines https://bookdown.org/ssjackson300/Machine-Learning-Lecture-Notes/splines.html
Instead of fitting a single slope in X…
We can fit slopes, piecewise…
Better still is to fit polynomials, piecewise…
Choices to be made:
Common choice: “natural cubic splines”
Replicate the regression of success on untype4 and the other variables, now using splines::ns() for all other continuous variables.
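As a self-contained illustration of what ns() buys you (simulated data, since ds2000 is not reproduced here): a single slope cannot track a non-linear CEF, while a natural cubic spline fits piecewise cubics that join smoothly at the knots.

```r
# Hypothetical comparison: single slope vs. natural cubic spline.
library(splines)
set.seed(6)
x <- runif(400, 0, 10)
y <- sin(x) + rnorm(400, sd = 0.3)

m_lin <- lm(y ~ x)              # one slope in x
m_ns  <- lm(y ~ ns(x, df = 5))  # piecewise cubic; df = 5 chosen for illustration
c(linear = summary(m_lin)$r.squared, spline = summary(m_ns)$r.squared)
```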
If treatment is binary:
If treatment is continuous:
One solution to extrapolation and interpolation bias is saturated regression
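A minimal sketch of the idea (simulated data; variable names are invented): with discrete covariates, fully interacting them fits a separate intercept for every stratum, so no functional-form assumption is imposed on \(X\); only the effect of \(D\) is restricted.

```r
# Hypothetical saturated regression: factor(g1) * factor(g2) gives one
# intercept per (g1, g2) cell; only the effect of d is restricted (constant).
set.seed(7)
n <- 2000
g1 <- sample(c("a", "b"), n, replace = TRUE)
g2 <- sample(c("u", "v", "w"), n, replace = TRUE)
d  <- rbinom(n, 1, ifelse(g1 == "a", 0.2, 0.7))   # g1 confounds d
y  <- 2 * d + 3 * (g1 == "a") + 2 * (g1 == "b") * (g2 == "v") + rnorm(n)

m_sat <- lm(y ~ d + factor(g1) * factor(g2))
coef(m_sat)["d"]   # close to the true effect of 2
```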
Pros:
What is the effect of poverty on prejudice?
Courtin, Nellis, and Weaver examine the determinants of anti-Muslim prejudice in Myanmar just before the 2017 genocide of the Rohingya.
Using survey-reported income and prejudice, they examine the estimated effect of income on prejudice.
| | Naive | Linear | Saturated |
|---|---|---|---|
| Individual covariates: | None | Linear | Saturated |
| Township covariates: | None | Linear | Saturated |
| Income level (1-4) | -0.050*** | -0.030*** | -0.032*** |
| | (0.004) | (0.003) | (0.005) |
| \(N\) | 20,695 | 20,515 | 9,241 |
| \(R^2\) | 0.04 | 0.08 | 0.48 |
| x |
|---|
| 0:18-27:Graduate:Shan:Buddhist:1:Agriculture |
| 1:18-27:High:Mixed ancestry:Christian:1:Staff |
| 1:28-37:Graduate:Kayah:Christian:2:Staff |
| 1:18-27:High:Kayah:Christian:1:Agriculture |
| 1:18-27:Graduate:Shan:Buddhist:1:Day Labour |
| 1:28-37:High:Kayah:Buddhist:4:Staff |
| 1:18-27:High:Kayah:Christian:2:Day Labour |
| 1:18-27:High:Kayah:Christian:1:Day Labour |
| 0:18-27:High:Kayah:Christian:3:Day Labour |
| 1:18-27:High:Shan:Christian:2:Staff |
We can use regression for conditioning without interpolation or extrapolation bias, but…