Seen several approaches to estimating causal effects
all causal estimands we have seen are average effects: ACE, ATT, etc.
all estimands involve comparing average outcomes of \(Y\) across different values of \(D\) or \(Z\)
conditioning involves plugging in average outcome of \(Y\) at values of \(D,X\).
the conditional expectation function (Angrist and Pischke)
expectation: because it is about the mean: \(E[Y]\)
conditional: because it is conditional on values of \(X\) … \(E[Y |X]\)
function: because \(E[Y | X] = f(X)\), there is some mathematical mapping between values of \(X\) and \(E[Y]\).
\[E[Y | X = x] = f(x)\]
The difficulty is: how do we find the function?
Another way of thinking about what the conditional expectation function is:
\[E[Y | X = x] = \hat{y} | X = x\]
What is our predicted value \(\hat{y}\) at \(X = x\) such that our prediction of \(Y\) has the least error?
In order to use least squares in causal inference, we need to understand what the math of regression is doing. For now:
No causality
No parameters/estimands
Regression/Least Squares as Algorithm
Why do we use the mean to summarize values?
Why are we always squaring differences?
Variance
\(\frac{1}{n}\sum\limits_{i = 1}^{n} (x_i - \bar{x})^2\)
Covariance
\(\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Mean Squared Error
\(\frac{1}{n}\sum\limits_{i=1}^n(\hat{y_i} - y_i)^2\)
Squaring differences is linked to distance.
What is the distance between two points, \(p\) and \(q\)?
In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)
\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)
What is the distance between these two points?
\(p = (3,0); q = (0,4)\)
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
\[d(p,q) = \sqrt{(3 - 0)^2 + (0 - 4)^2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5\]
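As a quick check, here is a minimal R sketch of this calculation (the points p and q are the ones from the example above):

p = c(3, 0)
q = c(0, 4)
#Euclidean distance: square the element-wise differences, sum them, take the square root
sqrt(sum((p - q)^2)) #returns 5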
One way of thinking about the mean is as a prediction.
We observe many values in \(Y\): \(y_1 \dots y_n\). We want to choose a single value, \(\hat{y}\), that is the best prediction for all the values in \(Y\).
If we say that the “best” prediction is one with the shortest distance between the prediction \(\hat{y}\) and the actual values of \(y_1 \dots y_n\)…
then the algorithm that gives us this best prediction is the mean.
“Geometric Interpretation” is helpful:
To understand this “geometric” interpretation, we need linear algebra.
We assume you have watched this series, Chapters 1-9.
Ask for clarification where required.
\[v = \begin{pmatrix}3 \\ 5 \end{pmatrix} = \begin{pmatrix}x \\ y \end{pmatrix}\]
Vectors can be added:
vectors of the same dimensions can be added element-by-element.
For example:
\[\begin{pmatrix}1 \\ 1 \end{pmatrix} + \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix}-1 \\ 4 \end{pmatrix}\] Equivalent to putting the second vector’s tail at the tip of the first vector and following it to the end.
Vectors can be multiplied by a number, element-by-element.
\[2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}\] Equivalent to stretching out this vector by factor of \(2.5\)
\[0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\] Equivalent to squishing this vector by factor of \(0.5\)
\(a = 2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}; \ b = 0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\)
The span of a vector is the set of points which it could reach by scaling it up/down by some factor.
For the vector \(\begin{pmatrix}1 \\ 1 \end{pmatrix}\), the span is the straight line stretching from \(\begin{pmatrix}-\infty \\ -\infty \end{pmatrix}\) to \(\begin{pmatrix}\infty \\ \infty \end{pmatrix}\)
The span of any vector always goes through the origin: Why?
We can think of any vector as being decomposed into movement along each of the basis vectors - unit-length (length \(1\)) vectors along each of the dimensions of the space (e.g. \(x, y, z\), etc.).
For instance, \(\begin{pmatrix}3 \\ 4 \\ 5 \end{pmatrix}\) can be achieved by adding up: \(3\begin{pmatrix}1 \\ 0 \\ 0 \end{pmatrix} + 4\begin{pmatrix}0 \\ 1 \\ 0 \end{pmatrix} + 5\begin{pmatrix}0 \\ 0 \\ 1 \end{pmatrix}\)
Note this kind of addition is only possible because \(x,y,z\) are perpendicular or orthogonal to each other
We can change the space a vector lives in by giving it new basis vectors.
Matrices are 2-dimensional arrays of numbers (\(m \times n\)), essentially multiple \(m \times 1\) column vectors stuck side by side.
Matrices can be multiplied if their inner dimensions match (\(m \times n\) times \(n \times p\) gives \(m \times p\)).
\[\begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & -1 \\ 1 & 2 \end{pmatrix} =\]
\[\begin{pmatrix} (1 \cdot 1)+(1\cdot1) & (1\cdot-1) + (1 \cdot 2) \\ (-1\cdot1) + (2\cdot1) & (-1\cdot-1) + (2\cdot2) \end{pmatrix} = \]
\[\begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}\]
transposition: flip a matrix/vector so that columns turn into rows and vice versa:
\[u = \begin{pmatrix} 1 \\ 5 \\ -2 \end{pmatrix}\]
\[u^T = u' = \begin{pmatrix} 1 & 5 & -2 \end{pmatrix}\]
Matrix Example (first row to first column, second row to second column, etc.)
We can think about how much one vector \(w\) can be captured by moving along another vector \(v\):
If \(u\) and \(v\) are \(n \times 1\) vectors, the inner product or dot product is \(u \bullet v = u'v\), where \(u'\) is the transpose of \(u\).
\[u = \begin{pmatrix} 1 \\ -2 \end{pmatrix}; \ v =\begin{pmatrix} 4 \\ 2 \end{pmatrix} \]
\[u \cdot v = \begin{pmatrix} 1 & -2 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = 0\]
The dot product is equal to the length of the projection of \(u\) on \(v\) multiplied by the length of \(v\) (length is distance from origin to vector tip)
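A minimal R sketch of this dot product, using the \(u\) and \(v\) from the example above:

u = c(1, -2)
v = c(4, 2)
#dot product: sum of element-by-element products
sum(u * v) #returns 0: u and v are orthogonal
#equivalent matrix version: u' times v (a 1 x 1 matrix)
t(u) %*% v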
In addition to multiplying matrices (applying a linear transformation), we can “undo” this multiplication by multiplying by the inverse of a matrix: this is like division.
\[A \times A^{-1} = A^{-1} \times A = I_{3 \times 3} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\]
This is an identity matrix with 1s on diagonal, 0s everywhere else. \(A_{p \times p} \times I_{p \times p} = A\)
If a matrix multiplied by its inverse gives Identity matrix…
How does this relate to orthogonality?
We keep talking about orthogonality because it is key to understanding what least squares does
We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(y\).
This is equivalent to doing this:
\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]
Choose \(\hat{y}\) so that the point \(\hat{y}\begin{pmatrix}1 \\ 1 \end{pmatrix}\) on the blue line (the span of \(\begin{pmatrix}1 \\ 1 \end{pmatrix}\)) minimizes the distance to \(y\).
\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)
can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):
\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)
and another vector \(\mathbf{e}\), which is the difference between the vector of observations and the prediction vector:
\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)
This means our goal is to minimize the length of \(\mathbf{e}\).
How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:
\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]
When is the length of \(\mathbf{e}\) minimized?
We know that two vectors are orthogonal (\(\perp\)) when their dot product is \(0\), so we can create the following equality and solve for \(\hat{y}\).
\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix}3 & 5 \end{pmatrix} - \begin{pmatrix} \hat{y} & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix}3 & 5 \end{pmatrix} - \hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix} 3 & 5 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) - (\hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) = 0\)
\(8 - 2\hat{y} = 0\)
\(8 = 2\hat{y}\)
\(\hat{y} = 4\)
\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \begin{pmatrix} \hat{y} & \ldots & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \hat{y}\begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)
\((\sum\limits_{i=1}^{n} y_i\cdot1) - \hat{y} \sum\limits_{i=1}^{n} 1 = 0\)
\(\sum\limits_{i=1}^{n} y_i = \hat{y} n\)
\(\frac{1}{n}\sum\limits_{i=1}^{n} y_i = \hat{y}\)
What is the mean of our residuals \(\mathbf{e}\)?
We say the mean is a way of choosing some \(\hat{y}\) as a prediction of values of \(y\), where \(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) are prediction errors or residuals.
we choose \(\hat{y}\) such that this vector of predictions has the closest possible distance to the vector \(y\) (the length of \(\mathbf{e}\) is minimized by being orthogonal to \(\hat{y}\cdot\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\))
This gives us the name “least squares”: the squared errors (squared residuals) are minimized, because the length of the vector of prediction errors (a distance) is minimized.
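A small R sketch (using a made-up vector y, purely for illustration) that the mean is the least-squares prediction: the sum of squared errors is smallest at \(\hat{y} = \bar{y}\), and the residuals are orthogonal to the vector of \(1\)s.

y = c(3, 5, 8, 10) #made-up data for illustration
sse = function(yhat) sum((y - yhat)^2) #sum of squared prediction errors
#numerically minimize the SSE over a single prediction yhat
optimize(sse, interval = range(y))$minimum #approximately mean(y) = 6.5
#residuals from predicting with the mean are orthogonal to the 1s vector
e = y - mean(y)
sum(e * rep(1, length(y))) #0 (up to floating point)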
Bivariate regression chooses \(\hat{y}\) as closest possible prediction of \(y\) with the form
\[\mathbf{\hat{y}} = b_0 + b_1\cdot \mathbf{x}\]
An intercept \(b_0\) and coefficient \(b_1\) multiplied by \(\mathbf{x}\)
The red line above is the prediction using least squares.
It closely approximates the conditional mean of son’s height (\(y\)) across values of father’s height (\(x\)).
How do we obtain this line mathematically? (proof/derivation here)
If we choose slope \(b_1\) and intercept \(b_0\) such that they minimize \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), we can obtain these formulae:
\[b_1 = \frac{Cov(X,Y)}{Var(X)}\]
Covariance (does this look familiar?)
\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
\(Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\)
Pearson Correlation
\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)
\(b_1 = \frac{Cov(X,Y)}{SD(X)SD(Y)}\frac{SD(Y)}{SD(X)} = \frac{Cov(X,Y)}{Var(X)}\)
\[b_0 = \overline{y} - \overline{x}\cdot b_1\]
Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
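A minimal R sketch (simulated x and y, purely for illustration) verifying these formulas against lm(). R's cov() and var() divide by \(n-1\) rather than \(n\), but that factor cancels in the ratio.

set.seed(1)
x = rnorm(100)
y = 2 + 3 * x + rnorm(100)
b1 = cov(x, y) / var(x) #slope: Cov(X,Y)/Var(X)
b0 = mean(y) - b1 * mean(x) #intercept: ybar - b1 * xbar
c(b0, b1)
coef(lm(y ~ x)) #same two numbers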
But to go beyond bivariate regression, need to derive more general solution:
Rather than project the \(n \times 1\)-dimensional vector \(\mathbf{y}\) onto one dimension (as we did with the mean), we project it into \(p\) (number of coefficients/parameters) dimensional subspace. Hard to visualize, but we still end up minimizing the distance between our \(n\) dimensional vector \(\mathbf{\hat{y}}\) and the vector \(\mathbf{y}\).
Given \(\mathbf{y}\), an \(n \times 1\) dimensional vector of all values \(y\) for \(n\) observations
and \(\mathbf{X}\), an \(n \times 2\) dimensional matrix (\(2\) columns, \(n\) observations). We call this the design matrix. Its columns are a vector of \(\mathbf{1}\)s (for an intercept) and a vector \(x\) for our other variable.
\(\mathbf{\hat{y}}\) is an \(n \times 1\) dimensional vector of predicted values (for the mean of \(Y\) conditional on \(X\)) computed by \(\mathbf{X\beta}\). \(\mathbf{\beta}\) is a \(p \times 1\) vector of coefficients that we multiply by \(\mathbf{X}\).
We’ll assume there are only two coefficients in \(\mathbf{\beta}\): \((b_0,b_1)\) so that \(\hat{y_i} = b_0 + b_1 \cdot x_i\), so \(p = 2\)
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\]
\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]
\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]
\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 \cdot b_0 + x_1\cdot b_1 \\ \vdots \\ 1\cdot b_0 + x_n \cdot b_1 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \mathbf{\widehat{y}}\]
\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) gives us the residuals (prediction errors).
We want to choose \(\mathbf{\beta}\) or \(b_0,b_1\) such that the distance between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\) is minimized.
Like before, the distance is minimized when the vector of residuals \(\mathbf{y} - \mathbf{\hat{y}} = \mathbf{e}\) is orthogonal or \(\perp\) to \(\mathbf{X}\)
\(\mathbf{X}'_{p\times n}\mathbf{e}_{n\times1} = \begin{pmatrix} 0_1 \\ \vdots \\ 0_p \end{pmatrix} = \mathbf{0}_{p \times 1}\)
\(\mathbf{X}'(\mathbf{Y} - \mathbf{\hat{Y}}) = \mathbf{0}_{p \times 1}\)
\(\mathbf{X}'(\mathbf{Y} - \mathbf{X\beta}) = \mathbf{0}_{p \times 1}\)
\(\mathbf{X}'\mathbf{Y} - \mathbf{X}'\mathbf{X{\beta}} = \mathbf{0}_{p \times 1}\)
\(\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X{\beta}}\)
\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\]
This is the matrix formula for least squares regression.
If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)
If \(X\) is a column of \(1\)s and a column of \(x\)s, it is bivariate regression. (algebraic proof showing equivalence here)
We can now add \(p > 2\): more variables
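A minimal R sketch (simulated data with two predictors, names made up) of the matrix formula, compared against lm():

set.seed(2)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X = cbind(1, x1, x2) #design matrix: a column of 1s plus the predictors
beta = solve(t(X) %*% X) %*% t(X) %*% y #(X'X)^{-1} X'Y
beta
coef(lm(y ~ x1 + x2)) #same coefficients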
The mathematical procedures we use in regression ensure that:
\(1\). The mean of the residuals is always zero (if we include an intercept \(b_0\)): the regression line goes through the point of averages, so \(\overline{e} = 0\). This is also true of the residuals from the mean.
Why is the mean of the residuals always zero?
We choose \(\begin{pmatrix}b_0 \\ b_1 \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).
\[\mathbf{1}'e = 0\]
And if this is true, then \(\sum e_i = 0\), so \(\frac{1}{n}\sum e_i = \overline{e} = 0\).
The mathematical procedures we use in regression ensure that:
\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.
Recall that \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e}\)
We chose \(\beta\) (i.e. \(b_0, b_1\)) such that \(X'e = 0\), so they would be orthogonal.
\(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\);
And, from above, we know that \(\overline{e}=0\);
so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).
This also means that residuals \(e\) are always perfectly uncorrelated (Pearson correlation) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.
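A small R sketch (simulated data, for illustration) confirming both properties of least-squares residuals:

set.seed(3)
x = rnorm(100)
z = rnorm(100)
y = 1 + 2 * x - z + rnorm(100)
fit = lm(y ~ x + z)
e = resid(fit)
mean(e) #essentially 0
cov(x, e) #essentially 0
cov(z, e) #essentially 0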
Previously we predicted \(Y\) as a linear function of \(x\):
\[\hat{y_i} = b_0 + b_1 \cdot x_i\]
Now, we can imagine predicting \(y\) as a linear function of many variables:
\[\hat{y_i} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k\]
What is “projecting onto \(p\) dimensions”?
When we project into two dimensions, these dimensions are precisely like the \(x\) and \(y\) axes on a graph: perpendicular/orthogonal to each other.
In multivariate regression, we project \(y\) onto the \(p\) dimensions spanned by the columns of \(\mathbf{X}\) (\((\mathbf{X}'\mathbf{X})^{-1}\) transforms to an orthogonal basis).
Examples: Linear Dependence?
\[\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}\]
Examples: Linear Dependence?
\[\begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}\]
Examples: Linear Dependence?
\[\begin{pmatrix} 1 & 0.50 & 0.25 & 0.25 \\ 1 & 0.25 & 0.50 & 0.25 \\ 1 & 0.25 & 0.25 & 0.50 \end{pmatrix}\]
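A minimal R sketch, assuming a design matrix with the same structure as the first example (an intercept plus three mutually exclusive dummies), showing why linear dependence is a problem: the rank falls short of the number of columns, so \(\mathbf{X}'\mathbf{X}\) cannot be inverted.

#30 hypothetical observations: an intercept column plus three dummies that sum to 1 in every row
X = cbind(1, diag(3)[rep(1:3, each = 10), ])
qr(X)$rank #3, but X has 4 columns: the columns are linearly dependent
det(t(X) %*% X) #0 (up to floating point), so (X'X)^{-1} does not exist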
When we include more than one variable in the equation, we cannot calculate slopes using simple algebraic expressions like \(\frac{Cov(X,Y)}{Var(X)}\).
We calculate least squares using the same matrix equation (\((X'X)^{-1}X'Y\)) as in bivariate regression, but what is the math doing in the multivariate case?
When fitting the equation:
\(\hat{y_i} = b_0 + b_1x_i + b_2z_i\)
the slopes are \(b_1 = \frac{Cov(x^*, y)}{Var(x^*)}\) and \(b_2 = \frac{Cov(z^*, y)}{Var(z^*)}\),
where \(x^* = x - \hat{x}\) from the regression \(\hat{x} = c_0 + c_1 z\),
and \(z^* = z - \hat{z}\) from the regression \(\hat{z} = d_0 + d_1 x\).
Does anything look familiar here?
More generally:
\[\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]
\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)
where \(X_k^* = X_k - \hat{X_k}\) obtained from the regression:
\(\hat{X}_{k} = c_0 + c_1 X_{1} + \ldots + c_{j} X_{j}\) using all \(X_j\) with \(j \neq k\)
\(X_k^*\) is the residual from regressing \(X_k\) on all other \(\mathbf{X_{j \neq k}}\)
How do we make sense of \(X_k^*\), the residual of \(X_k\) after regressing it on all other \(\mathbf{X_{j \neq k}}\)?
How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), if \(X_k^*\) is the residual of \(X_k\) after regressing it on all other \(\mathbf{X_{j \neq k}}\)?
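A small R sketch of this “partialling out” logic (simulated x, z, y, for illustration): the coefficient on x from the full regression equals \(Cov(x^*, y)/Var(x^*)\), where \(x^*\) is the residual from regressing x on z.

set.seed(4)
z = rnorm(200)
x = 0.5 * z + rnorm(200)
y = 1 + 2 * x - 3 * z + rnorm(200)
x_star = resid(lm(x ~ z)) #x*: the part of x not explained by z
cov(x_star, y) / var(x_star) #slope on x via partialling out
coef(lm(y ~ x + z))["x"] #same number from the full regression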
There are additional implications of defining the slope \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\):
Now we can see why columns of \(X\) must be linearly independent:
Bivariate regression is not guaranteed to uncover the true CEF if it is not linear:
With the mean, \(\hat{y}\) is a projection of \(n\) dimensional \(y\) onto a line.
With bivariate regression, \(\hat{y}\) is a projection of \(y\) onto a plane - which is 2d:
We want to choose \(a, b\) such that \(\mathbf{\hat{y}} = a + b\cdot \mathbf{x}\) has minimum distance to \(\mathbf{y}\)
Another way of thinking of this is in terms of residuals, or the difference between true and predicted values using the equation of the line. (Prediction errors)
\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\)
(Proof itself is not something you need to memorize)
We need to solve this equation:
\[\min_{a,b} \sum\limits_i^n (y_i - a - b x_i)^2\] Choose \(a\) and \(b\) to minimize this value, given \(x_i\) and \(y_i\)
We can do this with calculus: solve for where the first derivatives are \(0\) (since this is where the distance is at its minimum).
First, we take the derivative with respect to \(a\) and set it to \(0\):
\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) \right] = 0\)
\(\sum\limits_i^n y_i - \sum\limits_i^n a - \sum\limits_i^n b x_i = 0\)
\(-\sum\limits_i^n a = -\sum\limits_i^n y_i + \sum\limits_i^n b x_i\)
\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)
Dividing both sides by \(n\), we get:
\(a = \bar{y} - b\bar{x}\)
Where \(\bar{y}\) is mean of \(y\) and \(\bar{x}\) is mean of \(x\).
Implication: regression line goes through the point of averages \(\bar{y} = a + b \bar{x}\)
Next, we take the derivative with respect to \(b\) and set it to \(0\):
\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) x_i\right] = 0\)
Substituting \(a = \bar{y} - b\bar{x}\):
\(\sum\limits_i^n (y_i - (\bar{y} - b\bar{x}) - b x_i) x_i = 0\)
\(\sum\limits_i^n \left(y_ix_i - \bar{y}x_i + b\bar{x}x_i - b x_i^2\right) = 0\)
\(\sum\limits_i^n (y_i - \bar{y})x_i = b\sum\limits_i^n (x_i - \bar{x})x_i\)
Dividing both sides by \(n\) gives us:
\(\frac{1}{n}\sum\limits_i^n (y_ix_i - \bar{y}x_i) = b\,\frac{1}{n}\sum\limits_i^n (x_i^2 - \bar{x}x_i)\)
\(\overline{yx} - \bar{y}\bar{x} = b\left(\overline{x^2} - \bar{x}^2\right)\)
\(Cov(y,x) = b \cdot Var(x)\)
\(\frac{Cov(y,x)}{Var(x)} = b\)
But we also want to know more intuitively what these matrix operations are doing! It isn’t magic.
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]
\[\mathbf{X}'\mathbf{X} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\]
\[= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]
\[\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\] \[= \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} = n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]
How do we get \(^{-1}\)? This is inverting a matrix.
\[A \times A^{-1} = A^{-1} \times A = I_{p \times p} = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & \ddots & \ldots & 0 \\ 0 & \ldots & \ddots & 0 \\ 0 & \ldots & 0 & 1 \end{pmatrix}\]
This is an identity matrix with 1s on diagonal, 0s everywhere else.
We need to get the determinant
For the sake of ease, we will show this for a scalar and for a \(2 \times 2\) matrix:
\[det(a) = a\]
\[det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - cb\]
Then we need to get the adjoint. It is the transpose of the matrix of cofactors (don’t ask me why):
\[adj(a) = 1\]
\[adj\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
The inverse of \(A\) is \(adj(A)/det(A)\)
\[A^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
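A minimal R sketch (using an arbitrary invertible \(2 \times 2\) matrix) checking the adjoint-over-determinant formula against R's solve():

A = matrix(c(2, 1, 1, 3), nrow = 2) #an arbitrary invertible 2 x 2 matrix
det_A = A[1, 1] * A[2, 2] - A[2, 1] * A[1, 2] #ad - cb
adj_A = matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2) #transpose of the matrix of cofactors
adj_A / det_A #the inverse by the formula
solve(A) #R's built-in inverse: the same matrix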
\[(\mathbf{X}'\mathbf{X}) = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]
\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{n}{n^2(\overline{x^2} - \overline{x}^2)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]
\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]
We can put it together to get: \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix} n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x}\ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]
The slope:
\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]
\[b = \frac{\overline{xy} - \overline{x} \ \overline{y}}{Var(x)} = \frac{Cov(x,y)}{Var(x)}\]
\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]
The Intercept:
\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]
\[a = \frac{\overline{x^2}\overline{y} -\overline{x} \ \overline{xy}}{Var(x)} = \frac{(Var(x) + \overline{x}^2)\overline{y} - \overline{x}(Cov(x,y) + \overline{x}\overline{y})}{Var(x)}\]
\[= \frac{Var(x)\overline{y} + \overline{x}^2\overline{y} - \overline{x}^2\overline{y} - \overline{x}Cov(x,y)}{Var(x)}\]
\[= \overline{y} - \overline{x}\frac{Cov(x,y)}{Var(x)}\]
\[a = \overline{y} - \overline{x}\cdot b\]
The Intercept:
\[a = \overline{y} - \overline{x}\cdot b\]
Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
If we are thinking about how knowing the value of one variable informs us about the value of another… might want to think about correlation.
How are these variables associated?
Covariance
\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]
\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]
Variance
Variance is also the covariance of a variable with itself:
\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]
\[Var(X) = \overline{x^2} - \bar{x}^2\]
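A small R sketch (made-up x and y) checking these shortcut formulas. Note the division by \(n\): R's built-in cov() and var() divide by \(n - 1\), so they differ slightly from these versions.

x = c(1, 4, 6, 9)
y = c(2, 3, 7, 8)
n = length(x)
mean(x * y) - mean(x) * mean(y) #Cov(X,Y) with the 1/n convention: 7
sum((x - mean(x)) * (y - mean(y))) / n #same number
mean(x^2) - mean(x)^2 #Var(X) with the 1/n convention: 8.5
sum((x - mean(x))^2) / n #same number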
Covariance of military aged males and suffrage vote:
## [1] -61.77359
Covariance of enlistment rate and suffrage vote:
## [1] 0.002691231
Why is the \(Cov(MilAge,Suffrage)\) larger than \(Cov(Enlist,Suffrage)\)?
Why is the \(Cov(Width,Volume)\) larger than \(Cov(Width,Height)\)?
Scale of covariance reflects scale of the variables.
Can’t directly compare the two covariances
Covariance
\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Pearson Correlation
\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)
Dividing by the product of the standard deviations scales the covariance:
\(|Cov(X,Y)| \leq \sqrt{Var(X) \cdot Var(Y)}\)
Correlation coefficient must be between \(-1, 1\)
At \(-1\) or \(1\), all points are on a straight line
A negative value implies an increase in \(X\) is associated with a decrease in \(Y\).
If correlation is \(0\), the covariance must be?
If \(Var(X)=0\), then \(Cor(X,Y) = ?\)
Correlation of \((x,y)\) is same as correlation of \((y,x)\)
Values closer to -1 or 1 imply “stronger” association
Correlations cannot be compared as ratios: \(r = 0.8\) is not “twice as strong” as \(r = 0.4\).
Pearson correlation is agnostic about outliers/nonlinearities: the same \(r\) can arise from very different patterns in the data.
Correlation is not causation
Practice is two-fold: understand the concept and write functions in R.
#name your function: my_mean
#function(x); x is an argument, we can add other arguments
my_mean = function(x) {
n = length(x) #we'll plug in whatever value of x we give to the function
s = sum(x)
return(s/n) #what value(s) should the function return?
}
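For example, a quick check that my_mean() matches R's built-in mean():

my_mean(c(2, 4, 9)) #returns 5
mean(c(2, 4, 9)) #returns 5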
Get data_1 here: https://pastebin.com/LeSpqKKk
Without using the cor(), cov(), var(), or sd() functions in R
hint: generate your OWN functions