We have seen several approaches to estimating causal effects.
Estimating these causal effects involves comparing averages.
We need a tool that lets us plug in and compare averages.
In order to use least squares in causal inference, we need to understand what the math of regression is doing on its own terms. For now:
No causality
No parameters/estimands
Regression/Least Squares as Algorithm
Mechanics of Bivariate Regression
Why do we use the mean to summarize values?
Why are we always squaring differences?
Variance
\(\frac{1}{n}\sum\limits_{i = 1}^{n} (x_i - \bar{x})^2\)
Covariance
\(\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Mean Squared Error
\(\frac{1}{n}\sum\limits_{i=1}^n(\hat{y_i} - y_i)^2\)
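These three formulas are easy to compute directly. A minimal R sketch, using made-up toy vectors; note that when the prediction is just the mean, the mean squared error is exactly the variance:

```r
x <- c(2, 4, 6, 8, 10)                # toy data, made up for illustration
y <- c(1, 3, 2, 5, 9)

mean((x - mean(x))^2)                 # variance of x (1/n version)
mean((x - mean(x)) * (y - mean(y)))   # covariance of x and y (1/n version)

y_hat <- rep(mean(y), length(y))      # predict every y_i with the mean of y
mean((y_hat - y)^2)                   # mean squared error = variance of y
```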
Squaring differences is linked to distance.
What is the distance between two points?
In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)
\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)
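A quick R sketch of this distance calculation (the points are arbitrary, chosen for illustration):

```r
# Euclidean distance between two points of the same dimension
euclid_dist <- function(p, q) {
  sqrt(sum((p - q)^2))
}

euclid_dist(c(1, 2), c(4, 6))          # 2 dimensional example
euclid_dist(c(1, 2, 3), c(4, 6, 3))    # 3 dimensional example
```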
What is the distance between these two points?
\(p = (3,0); q = (0,4)\)
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
\[d(p,q) = \sqrt{(3 - 0)^2 + (0 - 4)^2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5\]
One way of thinking about the mean is as a prediction.
We observe many values in \(Y\): \(y_1 \dots y_n\). We want to choose a single value, \(\hat{y}\), that is the best prediction for all the values in \(Y\).
If we say that the “best” prediction is one with the shortest distance between the prediction \(\hat{y}\) and the actual values of \(y_1 \dots y_n\)…
then the algorithm that gives us this best prediction is the mean.
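A small numeric check of this claim, sketched in R with a made-up vector of observations: searching over candidate predictions, the \(\hat{y}\) that minimizes the sum of squared prediction errors is the sample mean.

```r
y <- c(3, 5, 7, 10)                    # toy observations

# total squared distance between one prediction y_hat and all the y's
sq_loss <- function(y_hat, y) sum((y - y_hat)^2)

# numerically search for the y_hat with the smallest squared loss
best <- optimize(sq_loss, interval = range(y), y = y)
best$minimum                           # approximately 6.25
mean(y)                                # 6.25: the mean is the minimizer
```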
Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.
\[Y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]
More generally, we imagine our observed data \(Y\) is a vector with one dimension for each data point (observation)
a vector is an array of numbers of length \(n\).
can be portrayed graphically as an arrow from the origin (the point \((0,0)\) or \((0,0,\dots,0)\)) to a point in \(n\) dimensional space
vectors can be added, and a vector can be multiplied by a number to stretch or shrink it (each element is multiplied by the same number)
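These operations are just element-wise arithmetic; a tiny R illustration:

```r
v <- c(3, 5)     # a vector in 2 dimensional space
w <- c(1, 1)

v + w            # vector addition: 4 6
2 * w            # multiplying by a number stretches the vector: 2 2
```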
We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(Y\).
This is equivalent to doing this:
\[Y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \begin{pmatrix}\hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]
Multiplying a \((1,1)\) vector by a constant.
Choose \(\hat{y}\) on the line through \(\begin{pmatrix}1 \\ 1 \end{pmatrix}\) at the point that minimizes the distance to \(y\).
\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\) can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):
\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)
and another vector \(\mathbf{e}\), which is the difference between the vector of observations and the prediction vector (the residual or prediction error):
\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)
This means our goal is to minimize the length of \(\mathbf{e}\), the vector of residuals.
How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:
\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]
When is the length of \(\mathbf{e}\) minimized?
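One way to see the answer: minimize the squared length (the minimizer is the same), take the derivative with respect to \(\hat{y}\), and set it to zero:

\[\frac{d}{d\hat{y}}\left[(3-\hat{y})^2 + (5 - \hat{y})^2\right] = -2(3-\hat{y}) - 2(5-\hat{y}) = 0 \quad \Rightarrow \quad \hat{y} = \frac{3 + 5}{2} = 4\]

The length of \(\mathbf{e}\) is minimized when \(\hat{y}\) is the mean of the observations.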
The values of \(Y\) in a sample of size \(n\) are represented by a vector in \(n\) dimensional space. \(\hat{y}\) is a prediction of \(y\).
We choose a value \(\hat{y}\) in a one dimensional sub-space (the line through the vector of ones, \(\begin{pmatrix} 1 & 1 & \ldots & 1 \end{pmatrix}\))
such that the difference between \(\hat{y}\) and \(Y\) (the residual \(\mathbf{e}\)) is as short as possible.
What if we have more information about each case \(i\)? Let’s call this information \(X\), with \(x_1 \dots x_n\).
How can we use this to improve our predictions, \(\hat{y}\), of \(Y\)?
We could choose a value \(\hat{y}\) in a two dimensional plane (the plane that passes through the vectors \(\begin{pmatrix} 1 & 1 & \ldots & 1 \end{pmatrix}\) and \(\begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}\))
such that the difference between \(\hat{y}\) and \(y\) (the residual \(\mathbf{e}\)) is as short as possible.
If we are thinking about how knowing the value of one variable informs us about the value of another… we might want to think about correlation.
How are these variables associated?
Covariance
\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]
\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]
Variance
Variance is also the covariance of a variable with itself:
\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]
\[Var(X) = \overline{x^2} - \bar{x}^2\]
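A quick R check, on made-up toy vectors, that the “shortcut” forms match the definitional forms (note the \(1/n\) convention here; R's built-in var() and cov() divide by \(n-1\) instead):

```r
x <- c(2, 4, 6, 8)      # toy data
y <- c(1, 3, 2, 6)

# covariance: definition vs shortcut
mean((x - mean(x)) * (y - mean(y)))
mean(x * y) - mean(x) * mean(y)

# variance: definition vs shortcut
mean((x - mean(x))^2)
mean(x^2) - mean(x)^2
```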
Covariance of military aged males and suffrage vote:
## [1] -61.77359
Covariance of enlistment rate and suffrage vote:
## [1] 0.002691231
Why is \(Cov(MilAge,Suffrage)\) so much larger in magnitude than \(Cov(Enlist,Suffrage)\)?
Why is the \(Cov(Width,Volume)\) larger than \(Cov(Width,Height)\)?
Scale of covariance reflects scale of the variables.
We can’t directly compare the two covariances.
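A small simulated illustration in R (the variables are made up): rescaling one variable, for example moving from a rate to a count, rescales the covariance by the same factor, even though the underlying relationship is unchanged.

```r
set.seed(1)
x <- rnorm(100)            # imagine a rate, on a small scale
y <- 0.5 * x + rnorm(100)  # an outcome related to x

cov(x, y)                  # covariance on the original scale
cov(1000 * x, y)           # same relationship in different units: 1000 times larger
```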
Covariance
\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Pearson Correlation
\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)
Dividing by the product of the standard deviations rescales the covariance (see the sketch after this list)
\(|Cov(X,Y)| \leq \sqrt{Var(X) \cdot Var(Y)}\)
The correlation coefficient must be between \(-1\) and \(1\)
At \(-1\) or \(1\), all points are on a straight line
A negative value implies that an increase in \(X\) is associated with a decrease in \(Y\).
If the correlation is \(0\), what must the covariance be?
If \(Var(X)=0\), then \(Cor(X,Y) = ?\)
Correlation of \((x,y)\) is same as correlation of \((y,x)\)
Values closer to -1 or 1 imply “stronger” association
Correlations cannot be interpreted as ratios (e.g., a correlation of \(0.8\) is not “twice as strong” as \(0.4\))
Pearson correlation is agnostic about outliers and nonlinearities: it only measures linear association
Correlation is not causation
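A brief R sketch with simulated variables (assumed for illustration): rescaling a variable changes the covariance but leaves the Pearson correlation unchanged, and the correlation is just the covariance divided by the product of the standard deviations.

```r
set.seed(2)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

cov(x, y); cov(1000 * x, y)    # covariance depends on the scale of x
cor(x, y); cor(1000 * x, y)    # correlation does not

all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))   # TRUE: the definition above
```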
Practice is two-fold: understand the concept and write functions in R.
#name your function: my_mean
#function(x); x is an argument, we can add other arguments
my_mean = function(x) {
n = length(x) #we'll plug in whatever value of x we give to the function
s = sum(x)
return(s/n) #what value(s) should the function return?
}
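For example (results shown as comments):

```r
my_mean(c(3, 5))   # 4
my_mean(1:10)      # 5.5
```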
Get data_1 here: https://pastebin.com/LeSpqKKk
Without using the cor(), cov(), var(), or sd() functions in R, compute the correlation between the variables in data_1 (hint: generate your OWN functions).
The mean is useful…
… but often we want to generate better predictions of \(Y\), using additional information \(X\).
… and often we want to know whether the mean of something \(Y\) is different across different values of something else \(X\).
To put it another way: the mean of \(Y\) is \(E[Y]\) (if we are talking about random variables). Sometimes we want to know \(E[Y | X]\).
We are interested in finding the conditional expectation function (Angrist and Pischke)
expectation: because it is about the mean: \(E[Y]\)
conditional: because it is conditional on values of \(X\) … \(E[Y |X]\)
function: because \(E[Y | X] = f(X)\), there is some relationship we can look at between values of \(X\) and \(E[Y]\).
\[E[Y | X = x]\]
Another way of thinking about what the conditional expectation function is:
\[E[Y | X = x] = \hat{y} | X = x\]
What is the predicted value \(\hat{y}\), where \(X = x\), such that \(\hat{y}\) has the smallest distance to \(Y\)?
These points emphasize the conditional and expectation part of the CEF.
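One concrete way to see the “conditional” and “expectation” parts together, sketched in R with a made-up binary \(X\): the estimated CEF at each value \(x\) is just the mean of \(Y\) among the observations with \(X = x\).

```r
set.seed(3)
x <- rep(c(0, 1), each = 50)    # a made-up binary conditioning variable
y <- 2 + 3 * x + rnorm(100)     # Y whose mean depends on X

tapply(y, x, mean)              # estimates of E[Y | X = 0] and E[Y | X = 1]
mean(y[x == 1])                 # the same conditional mean for X = 1, written out
```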
The difficulty is: how do we find the function?
There are many ways to define and estimate the conditional expectation function. One very common convention is to assume it is a linear function of \(X\):
\[E[Y | X] = a + b\cdot X\]
or, in terms of predicted values:
\[\hat{y} = a + b\cdot X\]
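A minimal R sketch with simulated data (assumed for illustration): least squares picks the \(a\) and \(b\) that minimize the sum of squared residuals, and in the bivariate case the fitted slope and intercept match the formulas \(b = Cov(X,Y)/Var(X)\) and \(a = \bar{y} - b\bar{x}\).

```r
set.seed(4)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- lm(y ~ x)            # least squares fit of y on x
coef(fit)                   # estimated intercept (a) and slope (b)

b <- cov(x, y) / var(x)     # bivariate least squares slope, computed directly
a <- mean(y) - b * mean(x)  # intercept passes through the point of means
c(a, b)                     # matches coef(fit)
```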