1. Fundamental Problem of Causal Inference
- independent/dependent variables
- solutions
2. Correlation
- what is it?
- problems with correlation
- Random association
November 4, 2024
Criminal cases and campaign claims hinge on whether Trump’s statements affect real-world actions.
Do Trump’s statements this year increase the probability of violence after the election?
Did Trump’s election fraud claims in 2020 increase the probability of violence (like January 6th)?
Causal claims are about counterfactuals and can be expressed in terms of potential outcomes:
“Trump’s election fraud claims caused January 6th” \(\xrightarrow{implies}\)
\[\text{Capitol Attacked}_{2021} (\text{Trump Claims Fraud}) = Yes \\ \color{red}{\text{Capitol Attacked}_{2021} (\text{Trump Doesn't Claim Fraud}) = No} \]
To provide scientific evidence (i.e., evidence that meets weak severity) regarding causal claims, we need to find ways around the Fundamental Problem of Causal Inference (FPCI):
What behaviors cause a person to become wealthy?
Can we learn anything from this evidence?
We cannot say anything about causality if:
We focus on effects of causes, so we need to see variation in exposure to the cause
We make causal claims testable by translating them into statements about potential outcomes: the relationship between independent (cause) and dependent (outcome) variables.
For instance:
We then have to compare cases with different values of \(X\).
Independent variable:
The variable capturing the alleged cause in a causal claim.
Dependent variable:
The variable capturing the alleged outcome (what is affected) in a causal claim.
Potential outcomes are the values of the dependent variable a case would take if exposed to different values of the independent variable.
We can’t easily find a counterfactual US where Trump didn’t claim the election was stolen…
But we can examine whether other statements he has made increased violence.
In February 2019, Donald Trump held a rally in El Paso, TX, where he argued that migrants were dangerous.
In August 2019, an armed man killed 22 people at a Walmart in El Paso, TX. In advance of his attack, he issued a manifesto that stated he was motivated in response to an alleged “Hispanic invasion of Texas.”
A causal claim:
“Trump’s rally in El Paso increased the likelihood of hate crimes against immigrants.”
“Trump rallies in a community increase the likelihood of hate crimes against immigrants.”
What could be an independent variable used to test this causal claim?
What could be a dependent variable used to test this causal claim?
Causal claim implies…
Counterfactual claim:
“If Trump had not held a rally in El Paso (in 2019), then there would have been fewer hate crimes against immigrants.”
Potential Outcomes:
\(\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{Rally}) >\) \(\color{red}{\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{No \ Rally})}\)
\(\mathrm{Black}\) is factual; \(\color{red}{\mathrm{Red}}\) is counterfactual
If Trump’s rally caused hate crimes to increase in El Paso, we would expect to see this:
\(\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{Rally}) >\) \(\color{red}{\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{No \ Rally})}\)
The value in \(\mathrm{Black}\) is factual; the value in \(\color{red}{\mathrm{Red}}\) is counterfactual and can never be observed.
\(\mathrm{City}_i\) | \(\mathrm{Rally}_i\) | \(\mathrm{Hate \ Crimes}_i(\mathrm{Rally})\) | \(\mathrm{Hate \ Crimes}_i(\mathrm{No \ Rally})\) |
---|---|---|---|
El Paso | Yes | > 1 | ? |
How would we find the “\(?\)”?
We cannot observe: \(\color{red}{\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{No \ Rally})}\)
But we can observe, e.g.: \(\mathrm{Hate \ Crimes}_{Austin}(\mathrm{No \ Rally})\)
If we assume: \(\mathrm{Hate \ Crimes}_{Austin}(\mathrm{No \ Rally})\) \(=\) \(\color{red}{\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{No \ Rally})}\)
Then, we can empirically test our causal claim, to see if:
\(\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{Rally}) >\) \(\mathrm{Hate \ Crimes}_{Austin}(\mathrm{No \ Rally})\)
\(\mathrm{City}_i\) | \(\mathrm{Rally}_i\) | \(\mathrm{Hate \ Crimes}_i(\mathrm{Rally})\) | \(\mathrm{Hate \ Crimes}_i(\mathrm{No \ Rally})\) |
---|---|---|---|
El Paso | Yes | \(\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{Rally})\) | \(\color{red}{\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{No \ Rally})}\) |
| | | | \(\mathbf{\Uparrow}\) |
Austin | No | \(\color{red}{\mathrm{Hate \ Crimes}_{Austin}(\mathrm{Rally})}\) | \(\boxed{\mathrm{Hate \ Crimes}_{Austin}(\mathrm{No \ Rally})}\) |
\(\mathrm{City}_i\) | \(\mathrm{Rally}_i\) | \(\mathrm{Hate \ Crimes}_i(\mathrm{Rally})\) | \(\mathrm{Hate \ Crimes}_i(\mathrm{No \ Rally})\) |
---|---|---|---|
El Paso | Yes | \(\mathrm{Hate \ Crimes}_{El \ Paso}(\mathrm{Rally})\) | \(\boxed{\mathrm{Hate \ Crimes}_{Austin}(\mathrm{No \ Rally})}\) |
| | | | \(\mathbf{\Uparrow}\) |
Austin | No | \(\color{red}{\mathrm{Hate \ Crimes}_{Austin}(\mathrm{Rally})}\) | \(\mathrm{Hate \ Crimes}_{Austin}(\mathrm{No \ Rally})\) |
Every solution to the FPCI involves:
Comparing the observed values of outcome \(Y\) in cases that actually have different values of cause \(X\)
Making the assumption that factual (observed) potential outcomes from one case are equivalent to counterfactual (unobserved) potential outcomes of another case.
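The two steps above can be sketched in a few lines of Python. This is only an illustration: the `City` class, the `estimated_effect` helper, and the hate crime counts are all hypothetical and invented for this example, not taken from any data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class City:
    name: str
    rally: bool
    # Potential outcomes: hate crimes with and without a rally.
    # Only the one matching `rally` is observed; the other stays None.
    crimes_if_rally: Optional[int]
    crimes_if_no_rally: Optional[int]

    def observed(self) -> int:
        """Return the factual (observed) potential outcome."""
        value = self.crimes_if_rally if self.rally else self.crimes_if_no_rally
        if value is None:
            raise ValueError(f"no observed outcome recorded for {self.name}")
        return value


# Hypothetical counts, invented for illustration only.
el_paso = City("El Paso", rally=True, crimes_if_rally=5, crimes_if_no_rally=None)
austin = City("Austin", rally=False, crimes_if_rally=None, crimes_if_no_rally=3)


def estimated_effect(treated: City, comparison: City) -> int:
    """Compare observed outcomes across cases with different values of X.

    Assumption: the comparison case's observed outcome equals the treated
    case's unobserved counterfactual outcome.
    """
    if not treated.rally or comparison.rally:
        raise ValueError("need one case with a rally and one without")
    return treated.observed() - comparison.observed()


effect = estimated_effect(el_paso, austin)
print(effect)
```

The subtraction is trivial; all of the work is done by the assumption that Austin's observed outcome can stand in for El Paso's counterfactual. If that assumption is implausible, the number is not evidence about the causal claim.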
Correlation is the association/relationship between the observed values of \(X\) (the independent variable) and \(Y\) (the dependent variable)
All empirical evidence for causal claims relies on correlation between the independent and dependent variables.
But, you’ve all heard this:
How do we turn correlation into evidence of causation?
Many different ways of assessing correlation.
Many ways of examining correlations:
data from this paper
common mathematical definition: correlation is the degree of linear association between \(X\) and \(Y\)
negative correlation: (\(< 0\)) values of \(X\) and \(Y\) move in opposite directions:
positive correlation: (\(> 0\)) values of \(X\) and \(Y\) move in the same direction:
It is possible to see perfect correlation but small change in \(Y\) across \(X\)
It is possible to see weak correlation but large change in \(Y\) across \(X\)
It is possible to see perfect nonlinear relationship between \(X\) and \(Y\) with \(0\) correlation
weak correlation: values for \(X\) and \(Y\) do not cluster along line
strong correlation: values for \(X\) and \(Y\) cluster strongly along a line
strength of correlation does not determine the slope of line describing \(X,Y\) relationship
magnitude: this is the slope of the line describing the \(X,Y\) relationship. The larger the effect, the steeper the slope
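A small Python sketch can make the difference between strength of correlation and magnitude concrete. It assumes the common definitions (Pearson correlation for strength, least-squares slope for magnitude); the helper names `pearson` and `slope` and all data values are hypothetical and chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: degree of linear association between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope(xs, ys):
    """Least-squares slope: change in y per one-unit change in x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


xs = [-2, -1, 0, 1, 2]

# Points exactly on a line with a tiny slope: perfect correlation, small magnitude.
flat_line = [0.01 * x for x in xs]

# Scattered points with a steep trend: weaker correlation, large magnitude.
noisy_steep = [-60, 40, 0, -40, 80]

# Points exactly on a parabola: perfect nonlinear relationship, zero correlation.
parabola = [x ** 2 for x in xs]

for label, ys in [("flat line", flat_line), ("noisy steep", noisy_steep), ("parabola", parabola)]:
    print(f"{label}: correlation={pearson(xs, ys):.2f}, slope={slope(xs, ys):.2f}")
```

The flat line has the strongest correlation but the smallest slope, while the parabola shows that a correlation of \(0\) does not rule out a relationship between \(X\) and \(Y\).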
Correlation: \(0.08\), Magnitude: \(0.22\). Does this correlation prove that Trump rallies caused hate crimes? Why or why not?
Correlation: 0.67, Magnitude: 5.82. Does this correlation prove that Nick Cage caused drownings? Why or why not?
random association: correlations between \(X\) and \(Y\) occur by chance and do not reflect any systematic relationship between \(X\) and \(Y\). (In the extreme, absolutely no relationship between \(X\) and \(Y\))
bias (spurious correlation, confounding): \(X\) and \(Y\) are correlated but the correlation does not result from a causal relationship between those variables
Solving these problems involves making assumptions: what are those assumptions? how plausible are they?
Arbitrary processes can make seemingly-strong patterns.
If you look long enough at pure chaos, you might find a strong correlation
To see that random patterns can emerge, I use random number generators to
We can imagine these are the observed \(X\) and \(Y\) for \(5\) cases.
How easy is it to find a strong correlation (even though \(X\) and \(Y\) are totally unrelated)?
\(\#\) Tries to get correlation \(> 0.9\): 58
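A count like this can be reproduced with a short simulation. The sketch below is one possible version, not the code behind the slides: it assumes \(X\) and \(Y\) are independent uniform random draws, and the function names `pearson` and `tries_until_strong` and the seed are hypothetical choices. The number of tries depends on the seed and generator, so it will not match the figure above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def tries_until_strong(n_cases, threshold, rng):
    """Draw unrelated X and Y until their correlation exceeds the threshold.

    Returns the number of tries needed and the correlation that was found.
    """
    tries = 0
    while True:
        tries += 1
        xs = [rng.random() for _ in range(n_cases)]
        ys = [rng.random() for _ in range(n_cases)]  # generated independently of xs
        r = pearson(xs, ys)
        if r > threshold:
            return tries, r


rng = random.Random(0)  # fixed seed so the run is repeatable
tries, r = tries_until_strong(n_cases=5, threshold=0.9, rng=rng)
print(f"Tries to get correlation > 0.9: {tries} (found r = {r:.3f})")
```

Raising `n_cases` while keeping the threshold fixed makes the loop run much longer, because strong correlations between unrelated variables become rarer as the number of cases grows.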
What do we do about this problem?
Field of statistics investigates properties of chance events:
This procedure works…
Tries to get correlation \(> 0.9\): 1377
Tries to get correlation \(> 0.9\): 905248
Tries to get correlation \(> 0.45\): 30
statistical significance:
An indication of how likely it is that the correlation we observe could have happened purely by chance.
A higher degree of statistical significance indicates that the correlation is unlikely to have happened by chance.
\(p\) value:
A numerical measure of statistical significance. It puts a number on how likely it is that a correlation at least as strong as the one observed would have occurred by chance, assuming we know the chance procedure and assuming the true correlation is \(0\).
It is a probability, so it is between \(0\) and \(1\).
Lower \(p\)-values indicate greater statistical significance
\(p < 0.05\) often used as threshold for “significant” result.
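One way to see what a \(p\)-value measures is to simulate the chance procedure directly. The sketch below uses a permutation test, which is an assumption made here for illustration and not necessarily the procedure used in the slides: shuffling \(Y\) breaks any link to \(X\), so the shuffled correlations show what chance alone produces when the true correlation is \(0\). The function names and the data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_shuffles, rng):
    """Share of shuffles whose correlation is at least as strong as observed.

    Strength is measured by absolute value, so this is a two-sided test.
    """
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)  # copy so the caller's data is left untouched
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical data for 10 cases, invented for illustration only.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]

p = permutation_p_value(xs, ys, n_shuffles=2000, rng=random.Random(0))
print(f"correlation = {pearson(xs, ys):.2f}, p = {p}")
```

A small `p` here says only that shuffling rarely produces a correlation this strong. It does not address bias: a correlation can be very unlikely to arise by chance and still not reflect a causal relationship.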
An advertised promise about how often chance alone would produce a correlation this strong when the “true correlation” is actually \(0\). It indicates how likely the correlation is to lead us astray due to random association.
Be wary of “\(p\)-hacking”/“snooping”, “data dredging”
Statistical Significance | \(p\)-value | By Chance? | Why? | “Real”? |
---|---|---|---|---|
Low | High \((p > 0.05)\) | Likely | small \(N\), weak correlation | Probably not |
High | Low \((p < 0.05)\) | Unlikely | large \(N\), strong correlation | Possibly |
Correlation: \(0.08\), Magnitude: \(0.22\), \(p = 0.00001\)
Did Trump rallies cause hate crimes?
1. Correlation as “solution” to the Fundamental Problem of Causal Inference
2. Correlation suffers from two problems: