**1. Correlation: Review**

- definition
- attributes
- problems

**2. Problems with Correlation**

- Random correlation
- Bias in correlation (**confounding**)

March 18, 2021

We solve the fundamental problem of causal inference by:

- comparing the values of the outcome (\(Y\)) across cases we can observe with different values of the cause (\(X\))
- assuming that cases with different values of \(X\) can stand in for the **same case** in the counterfactual world where its value of \(X\) is different.

**Correlation** is the degree of association/relationship between the **observed** values of \(X\) (the independent variable) and \(Y\) (the dependent variable)

- There are formal mathematical definitions.
- We use the term loosely to describe the observed relationship between \(X\) and \(Y\).
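One common formal definition is the Pearson correlation coefficient. The slides do not say which definition they have in mind, so treat this as an illustrative choice. For \(n\) cases with observed values \((x_i, y_i)\) and sample means \(\bar{x}\) and \(\bar{y}\):

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

By construction \(-1 \le r \le 1\): the sign gives the direction of the association and the absolute value gives its strength.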

**All empirical evidence for causal claims** relies on **correlation** between the independent and dependent variables.

But you've all heard this:

POLL

Correlations have

- **direction**:
  - positive: as \(X\) increases, \(Y\) increases
  - negative: as \(X\) increases, \(Y\) decreases
- **strength** (has nothing to do with **size of effect**):
  - **strong**: \(X\) and \(Y\) almost **always** move together (correlation near \(1\) or \(-1\))
  - **weak**: \(X\) and \(Y\) do not move together very much (correlation near \(0\))
- **slope/effect size**:
  - this is how much \(Y\) changes with \(X\)
  - the larger the effect of \(X\) on \(Y\), the steeper the slope
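To make the difference between strength and slope concrete, here is a minimal Python sketch on made-up data. The helper names `pearson_r` and `ols_slope`, the seed, and the noise levels are assumptions chosen for illustration; they do not come from the lecture.

```python
import random

def pearson_r(x, y):
    """Pearson correlation: strength and direction of a linear association."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ols_slope(x, y):
    """Least-squares slope: how much Y changes per unit change in X."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the illustration is repeatable
x = [rng.uniform(0, 10) for _ in range(200)]

# Strong correlation, small effect: Y tracks X exactly but barely moves.
y_shallow = [0.1 * a for a in x]
# Weak correlation, large effect: Y moves a lot with X, but noise dominates.
y_steep = [5.0 * a + rng.gauss(0, 40) for a in x]

r_shallow, slope_shallow = pearson_r(x, y_shallow), ols_slope(x, y_shallow)
r_steep, slope_steep = pearson_r(x, y_steep), ols_slope(x, y_steep)

print(f"shallow line: r = {r_shallow:.2f}, slope = {slope_shallow:.2f}")
print(f"steep, noisy: r = {r_steep:.2f}, slope = {slope_steep:.2f}")
```

The first dataset has a near-perfect correlation with a small slope; the second has a much steeper slope but a weaker correlation, because the noise keeps \(X\) and \(Y\) from moving together reliably.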

What do we need to assume to use correlation as evidence of causation?

- **random association**: correlations between \(X\) and \(Y\) occur **by chance** and do not reflect a causal relationship.
- **bias** (spurious correlation, **confounding**): \(X\) and \(Y\) are correlated, but the correlation does not result from a **causal relationship** between those variables.

How do we know a correlation is **systematic**?

- How do we know that it is not simply a pattern by random chance?
- Apparent patterns can be produced by pure randomness

If you look at enough possible sets of variables, you might find a strong correlation

- But it could have happened by chance!
- So a correlation might not be meaningful (e.g. Nick Cage)

[Arbitrary Correlations](http://www.tylervigen.com/spurious-correlations)

To see that random patterns can emerge, I use random number generators to:

- **randomly** pick \(5\) values of \(X\)
- **randomly** pick \(5\) values of \(Y\)

We can imagine these are the observed \(X\) and \(Y\) for \(5\) cases.

How easy is it to find a strong correlation?

Tries to get correlation \(> 0.9\): 1
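A minimal Python sketch of this kind of experiment follows. It is not the code used to produce the counts on these slides: the function name `tries_until_strong`, the uniform random draws, and the seed are all assumptions made for illustration.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def tries_until_strong(threshold=0.9, n_cases=5, seed=None):
    """Count random draws of (X, Y) until the correlation exceeds threshold."""
    rng = random.Random(seed)
    tries = 0
    while True:
        tries += 1
        x = [rng.random() for _ in range(n_cases)]
        y = [rng.random() for _ in range(n_cases)]  # drawn independently of x
        if pearson_r(x, y) > threshold:
            return tries

print("Tries to get correlation > 0.9:", tries_until_strong(0.9, n_cases=5, seed=1))
```

Because \(X\) and \(Y\) are generated independently, any strong correlation the loop finds is random association by construction. Changing `seed`, `n_cases`, or `threshold` changes how many tries are needed.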

The field of statistics investigates the properties of **chance** events (stochastic processes):

- Probability theory tells us how likely events are to happen, given chance
- Can tell us how likely correlation of some value is to happen by chance

- Compute the correlation of \(X\) and \(Y\)
- How **strong** is the correlation?
  - Patterns that are **stronger** are **less likely** to happen by chance
- How many **cases** do we have?
  - Patterns with **many cases** are **less likely** to happen by chance
- Assign a probability that the correlation we see would have happened by chance
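One way to carry out these steps is a permutation test: shuffle \(Y\) so that any link to \(X\) is broken, and see how often chance alone produces a correlation at least as strong as the observed one. This is a sketch of one possible chance procedure, not necessarily the one the lecture has in mind. The function name `permutation_p_value` and the example data are assumptions.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Share of shuffles whose |correlation| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any real link between X and Y
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    # Count the observed ordering itself so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: Y follows X plus noise.
rng = random.Random(42)
x_many = list(range(30))
y_many = [a + rng.gauss(0, 3) for a in x_many]
x_few, y_few = x_many[:5], y_many[:5]  # same process, only 5 cases

p_many = permutation_p_value(x_many, y_many)
p_few = permutation_p_value(x_few, y_few)
print(f"30 cases: r = {pearson_r(x_many, y_many):.2f}, p = {p_many:.4f}")
print(f" 5 cases: r = {pearson_r(x_few, y_few):.2f}, p = {p_few:.4f}")
```

With only \(5\) cases there are just \(120\) possible orderings of \(Y\), so even a strong pattern is matched by chance fairly often; with \(30\) cases a similarly strong pattern is almost never reproduced by shuffling.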

This procedure works… if we know the chance processes that might affect this correlation.

Tries to get correlation \(> 0.9\): 397

Tries to get correlation \(> 0.9\): 63963

Tries to get correlation \(> 0.45\): 76

**statistical significance**:

An indication of how likely it is that the correlation we observe could have happened purely by chance.

A higher degree of statistical significance indicates that the correlation is unlikely to have happened by chance.

\(p\) **value**:

A numerical measure of **statistical significance**. It puts a number on how likely a correlation at least as strong as the one we observe would be to occur by chance, **assuming** we know the chance procedure and the truth is a \(0\) correlation.

It is a probability, so it is between \(0\) and \(1\).

**Lower** \(p\)-values indicate **greater** statistical significance.

\(p < 0.05\) often used as threshold for “significant” result.

- but it is not a magic number
- Even when the true correlation is \(0\), we observe \(p < 0.05\) by chance about \(\frac{1}{20}\) of the time

\(p\) **value**:

Be wary of “\(p\)-hacking”

- \(p\) values become meaningless if we look at many associations, then only report the ones that are “significant”.

- low \(p\)-values occur by chance when we look at lots of associations
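The sketch below simulates looking at many associations in which nothing is going on. It assumes normally distributed variables and uses the standard \(t\)-test for a correlation, \(t = r\sqrt{(n-2)/(1-r^2)}\), with the tabulated two-sided \(5\%\) critical value for \(18\) degrees of freedom. The number of cases, the number of associations, and the seed are arbitrary choices for illustration.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

N_CASES = 20
T_CRIT = 2.101  # two-sided 5% critical value of Student's t, 18 df (table value)

def is_significant(r, n=N_CASES, t_crit=T_CRIT):
    """True if the correlation r would be reported as p < 0.05."""
    t = r * ((n - 2) / (1 - r * r)) ** 0.5
    return abs(t) > t_crit

rng = random.Random(2021)
n_associations = 1000
significant = []
for _ in range(n_associations):
    x = [rng.gauss(0, 1) for _ in range(N_CASES)]
    y = [rng.gauss(0, 1) for _ in range(N_CASES)]  # unrelated to x by construction
    r = pearson_r(x, y)
    if is_significant(r):
        significant.append(r)

share = len(significant) / n_associations
print(f"{len(significant)} of {n_associations} unrelated pairs have p < 0.05 ({share:.1%})")
print(f"strongest 'finding': r = {max(significant, key=abs):.2f}")
```

Every pair here is unrelated, yet roughly one in twenty clears the threshold. Reporting only those pairs would present pure chance as a set of "significant" results.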