POLI 572B

Michael Weaver

February 14, 2025

Introduction

Introduction

Seen several approaches to estimating causal effects

  • all causal estimands we have seen are average effects: ACE, ATT, etc.

  • all estimands involve comparing average outcomes of \(Y\) across different values of \(D\) or \(Z\)

  • conditioning involves plugging in average outcome of \(Y\) at values of \(D,X\).


  • We need a tool that lets us plug in and compare averages of \(Y\) across values of \(Z,D,X\).

Conditional Expectation Function

the conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean: \(E[Y]\)

conditional: because it is conditional on values of \(X\): \(E[Y | X]\)

function: because \(E[Y | X] = f(X)\), there is some mathematical mapping between values of \(X\) and \(E[Y]\).

\[E[Y | X = x] = f(x)\]
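For intuition, here is a minimal R sketch (with made-up data) of what the CEF is when \(X\) is discrete: \(E[Y | X = x]\) is just the average of \(Y\) among observations with \(X = x\).

#made-up data: y is the outcome, x takes on three discrete values
df = data.frame(x = rep(c(0, 1, 2), each = 100),
                y = rnorm(300, mean = rep(c(1, 3, 4), each = 100)))

#E[Y | X = x]: the average of y within each value of x
tapply(df$y, df$x, mean)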

Conditional Expectation Function

The difficulty is: how do we find the function?

  • a function takes some value of \(X\) and uniquely maps it to some value of \(E[Y]\)
  • we need to “learn” this function from the data
    • how do we learn without “overfitting” the data?
  • depending on how much data we have, we have to make some choice of how to interpolate/extrapolate when “learning” this function

Conditional Expectation Function

Another way of thinking about what the conditional expectation function is:

\[E[Y | X = x] = \hat{y} | X = x\]

What predicted value \(\hat{y}\) at \(X = x\) makes our prediction of \(Y\) have the least error?

  • With the CEF, we choose a particular definition of what it means to have the “least error”: minimum (Euclidean) distance

Conditional Expectation Function

The difficulty is: how do we find the function?

  • a function takes some value of \(X = x\) and uniquely maps it to some value of \(E[Y]\)
  • we need to “learn” this function from the data
  • depending on how much data we have, we have to make some choice of how to interpolate/extrapolate when “learning” this function


  • This naturally leads us to the “imputation” approach to conditioning

Conditional Expectation Function

  • one easy choice is to approximate the CEF as linear.
  • That is to say \(E[Y]\) is linear in \(X\).
  • The function takes the form of an equation of a line.
  • This leads us to least squares

Why least squares?

  • We need averages \(\to\) LS is a generalization of the mean
  • A “simple”, yet flexible tool to plug in these values
  • Ubiquitous, often “good enough”


  • As we move from design-based toward model-based inference, we need to be in control of the tool we use.

Why least squares?

In order to use least squares for causal inference, we need to understand what the math of regression is doing, as it pertains to:

  • given our need for averages: how is regression related to the mean?
  • given our need to “impute”: how does regression “plug-in” values?
  • how does regression “control” for variables? (conditional independence)
  • how does assumption of “common support” translate to regression?
  • how does math of regression relate back to the causal estimands of interest? (how are they similar/different?)

Why least squares math?

No causality

  • no effort to prove causality; no assumptions about causal model.

No parameters/estimands

  • no statistical model; no random variables; no potential outcomes

Regression/Least Squares as Algorithm

  • algorithm with mathematical properties
  • without further assumption, plug in data, generate predictions
  • limited interpretation

The Mean: Revisited

Why do we use the mean to summarize values?

  • For random variables, it is “best guess” of what value we will draw
  • It is a kind of prediction
  • In what sense is it the “best”?

Squared Deviations

Why are we always squaring differences?

Variance

\(\frac{1}{n}\sum\limits_{i = 1}^{n} (x_i - \bar{x})^2\)

Covariance

\(\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Mean Squared Error

\(\frac{1}{n}\sum\limits_{i=1}^n(\hat{y_i} - y_i)^2\)

Squared Deviations

It is linked to distance

What is the distance between two points, \(p\) and \(q\)?

In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]


In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)

\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)

What is the distance between these two points?

\(p = (3,0); q = (0,4)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]

\[d(p,q) = \sqrt{(3 - 0)^2 + (0 - 4)^2}\] \[d(p,q) = \sqrt{3^2 + (-4)^2} = \ ?\]
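(A quick check of this arithmetic in R:)

p = c(3, 0)
q = c(0, 4)
sqrt(sum((p - q)^2)) #Euclidean distance
## [1] 5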

What is the distance between these two points?

Remember Pythagoras?

The Mean

One way of thinking about the mean is as a prediction.

We observe many values in \(Y\): \(y_1 \dots y_n\). We want to choose a single value, \(\hat{y}\), that is the best prediction for all the values in \(Y\).

If we say that the “best” prediction is one with with the shortest distance between the prediction \(\hat{y}\) and the actual values of \(y_1 \dots y_n\)

then the algorithm that gives us this best prediction is the mean.
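A small numerical illustration (not a proof), with a made-up vector \(y\): among all candidate single-number predictions, the sum of squared prediction errors is smallest at the mean.

y = c(2, 4, 9, 5) #made-up observations
sse = function(yhat) sum((y - yhat)^2) #squared distance between prediction and data

optimize(sse, interval = range(y))$minimum #numerical search for the best single prediction
mean(y) #both are (approximately) 5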

The Mean

“Geometric Interpretation” is helpful:

  • since distances are (square roots of) sums of squared differences, this is why the mean minimizes the variance of prediction errors (\(y_i - \hat{y}\)).
  • key to understanding how least squares is related to the mean
  • key to understanding how least squares works

To understand this “geometric” interpretation, we need linear algebra.

Linear Algebra

Quick Review

We assume you have watched this series, Chapters 1-9.

Ask for clarification where required.

What is a vector?

  • a vector is an \(n \times 1\) or \(1 \times n\) array of numbers.
  • these values together represent a point that lives in \(n\) dimensional space
  • each number corresponds to a dimension in that space:
    • e.g. can think of a vector as going \(x\) units along X axis and \(y\) units along Y axis and drawing arrow from origin \(\begin{pmatrix}0 \\ 0\end{pmatrix}\) to that point

\[v = \begin{pmatrix}3 \\ 5 \end{pmatrix} = \begin{pmatrix}x \\ y \end{pmatrix}\]

What is a vector?

Vectors can be added:

vectors of the same dimensions can be added element-by-element.

For example:

\[\begin{pmatrix}1 \\ 1 \end{pmatrix} + \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix}-1 \\ 4 \end{pmatrix}\] Equivalent to putting the second vector’s tail at the tip of the first vector and following it to its end.

Vectors Can be Added

Vectors can be Scaled

Vectors can be multiplied by a number, element-by-element.

\[2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}\] Equivalent to stretching out this vector by factor of \(2.5\)

\[0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\] Equivalent to squishing this vector by factor of \(0.5\)

\(a = 2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}; \ b = 0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\)
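(These operations are element-by-element in R as well; checking the examples above:)

c(1, 1) + c(-2, 3) #vector addition
## [1] -1  4
2.5 * c(1, 1) #scaling up
## [1] 2.5 2.5
0.5 * c(-2, 3) #squishing down
## [1] -1.0  1.5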

Vectors have a span

The span of a vector is the set of points which it could reach by scaling it up/down by some factor.

For the vector \(\begin{pmatrix}1 \\ 1 \end{pmatrix}\), the span is the straight line stretching from \(\begin{pmatrix}-\infty \\ -\infty \end{pmatrix}\) to \(\begin{pmatrix}\infty \\ \infty \end{pmatrix}\)

The span of any vector always goes through the origin: Why?

Basis Vectors

We can think of any vector as being decomposed into movement along each of the basis vectors - unit-length (length \(1\)) vectors along each of the dimensions of the space (e.g. \(x, y, z\), etc.).

Basis Vectors

For instance, \(\begin{pmatrix}3 \\ 4 \\ 5 \end{pmatrix}\) can be achieved by adding up:

  • \(3 \times \begin{pmatrix}1 \\ 0 \\ 0 \end{pmatrix}\) (basis vector in \(x\) axis)
  • \(4 \times \begin{pmatrix}0 \\ 1 \\ 0 \end{pmatrix}\) (basis vector in \(y\) axis)
  • \(5 \times \begin{pmatrix}0 \\ 0 \\ 1 \end{pmatrix}\) (basis vector in \(z\) axis)

Note this kind of addition is only possible because \(x,y,z\) are perpendicular or orthogonal to each other

We can change the space a vector lives in by giving it new basis vectors.

Matrices

Matrices are 2-dimensional arrays of numbers \(m \times n\), essentially multiple \(n \times 1\) vectors stuck side by side.

  • better understood as a linear transformation: a set of instructions that transforms a vector by setting new locations for the basis vectors
  • each column in the matrix indicates the new position of the corresponding basis vector (in terms of the current coordinate space)
  • if the matrix is not square, it transforms between spaces with different dimensions

Matrices can be Multiplied

Matrices can be multiplied if their inner dimensions match:

  • \(m \times n\) matrix \(A\) can be multiplied by matrix \(B\) if \(B\) is \(n \times p\).
  • Matrix \(AB\) is \(m \times p\)
  • Rows of \(A\) multiplied with columns of \(B\) and then summed
  • Multiplication is not commutative: \(BA\) is not even defined unless \(p = m\)
  • Multiplication is not commutative: \(B'A' = (AB)' \neq AB\) in general

Matrices can be Multiplied

\[\begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & -1 \\ 1 & 2 \end{pmatrix} =\]

\[\begin{pmatrix} (1 \cdot 1)+(1\cdot1) & (1\cdot-1) + (1 \cdot 2) \\ (-1\cdot1) + (2\cdot1) & (-1\cdot-1) + (2\cdot2) \end{pmatrix} = \]

\[ = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}\]
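The same product in R, using `%*%` (the matrices are entered by row so they match the display above):

A = matrix(c(1, 1, -1, 2), nrow = 2, byrow = TRUE)
B = matrix(c(1, -1, 1, 2), nrow = 2, byrow = TRUE)
A %*% B
##      [,1] [,2]
## [1,]    2    1
## [2,]    1    5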

Vectors and Matrices can be transposed:

transposition: rotate a matrix/vector so that columns turn into rows and vice versa:

\[u = \begin{pmatrix} 1 \\ 5 \\ -2 \end{pmatrix}\]

\[u^T = u' = \begin{pmatrix} 1 & 5 & -2 \end{pmatrix}\]

Matrix Example (first row to first column, second row to second column, etc.)

Orthogonal Projection

We can think about how much one vector \(w\) can be captured by moving along another vector \(v\):

  • If we imagine the sun shining directly down onto \(v\) (perpendicular to \(v\)), the shadow cast by \(w\) on \(v\) is the orthogonal projection of \(w\) on \(v\). (does this seem familiar??)

Dot Products

If \(u\) and \(v\) are \(n \times 1\) vectors, the inner product or dot product is \(u \bullet v = u' \times v\), where \(u'\) is the transpose of \(u\).

\[u = \begin{pmatrix} 1 \\ -2 \end{pmatrix}; \ v =\begin{pmatrix} 4 \\ 2 \end{pmatrix} \]

\[u \cdot v = \begin{pmatrix} 1 & -2 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = 0\]

Dot Products

The dot product is equal to the length of the projection of \(u\) on \(v\) multiplied by the length of \(v\) (length is distance from origin to vector tip)

  • if the dot product is equal to \(0\), then \(u\) and \(v\) are orthogonal or perpendicular (no projection, no “shadow”)
  • have we seen anything that looks like this dot product before?
  • matrix multiplication, covariance!
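(Checking the example above in R: the dot product is the transpose of one vector times the other, and a value of \(0\) means the vectors are orthogonal.)

u = c(1, -2)
v = c(4, 2)
t(u) %*% v #1*4 + (-2)*2 = 0, so u and v are orthogonal
sum(u * v) #equivalent way to compute the dot product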

Matrix Inversion

In addition to multiplying matrices (applying a linear transformation), we can “undo” this multiplication by multiplying by the inverse of a matrix: this is like division.

  • Inverse \(A^{-1}\) of matrix \(A\) that is square \(3\times 3\) (generally, \(p\times p\)) has the property:

\[A \times A^{-1} = A^{-1} \times A = I_{3 \times 3} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\]

This is an identity matrix with 1s on diagonal, 0s everywhere else. \(A_{p \times p} \times I_{p \times p} = A\)

  • identity matrix is matrix equivalent of 1: \(A \times I = A\)
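In R, `solve(A)` gives \(A^{-1}\) for an invertible square matrix; multiplying back returns the identity (the matrix here is arbitrary):

A = matrix(c(2, 1, 1, 3), nrow = 2) #an arbitrary invertible matrix
A_inv = solve(A) #A^{-1}
round(A %*% A_inv, 10) #identity matrix, up to floating-point rounding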

Matrix Inversion

If a matrix multiplied by its inverse gives Identity matrix…

How does this relate to orthogonality?

  • Row \(i\) of the inverse is orthogonal to column \(j\) of the matrix for \(i \neq j\)
  • Because row \(i\) and column \(j\) have a dot product of \(0\)
  • Multiplying a matrix by its inverse transforms the vectors to be orthogonal

We keep talking about orthogonality because it is key to understanding what least squares does

Deriving the mean:

We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(y\).

This is equivalent to doing this:

\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]

Choose \(\hat{y}\) on the blue line at point that minimizes the distance to \(y\).

Deriving the mean:

\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)

can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):

\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)

and another vector \(\mathbf{e}\), which is the difference between the prediction vector and the vector of observations:

\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)

Deriving the mean:

This means our goal is to minimize \(\mathbf{e}\).

How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:

\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]

When is the length of \(\mathbf{e}\) minimized?

  • when angle between \(\hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\) and \(\mathbf{e}\) is \(90^{\circ}\).

Deriving the mean:

Deriving the mean:

We know that two vectors are orthogonal (\(\perp\)) when their dot product is \(0\), so we can create the following equality and solve for \(\hat{y}\).

\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix}3 & 5 \end{pmatrix} - \begin{pmatrix} \hat{y} & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix}3 & 5 \end{pmatrix} - \hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

Deriving the mean:

\((\begin{pmatrix} 3 & 5 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) - (\hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) = 0\)

\((8) - (\hat{y} 2) = 0\)

\(8 = \hat{y} 2\)

\(4 = \hat{y}\)

Deriving the mean:

More generally:

\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \begin{pmatrix} \hat{y} & \ldots & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \hat{y}\begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

More generally:

\((\sum\limits_{i=1}^{n} y_i\cdot1) - \hat{y} \sum\limits_{i=1}^{n} 1 = 0\)

\(\sum\limits_{i=1}^{n} y_i = \hat{y} n\)

\(\frac{1}{n}\sum\limits_{i=1}^{n} y_i = \hat{y}\)
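We can check this numerically in R with a made-up \(y\): solving the orthogonality condition returns the mean, and the resulting residuals are orthogonal to the vector of \(1\)s.

y = c(3, 5, 10) #made-up observations
ones = rep(1, length(y))

yhat = sum(y * ones) / sum(ones * ones) #orthogonality condition solved for yhat
yhat #6, the same as mean(y)

e = y - yhat * ones
sum(e * ones) #dot product with the 1s vector is 0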

The Mean

What is the mean of our residuals \(\mathbf{e}\)

  • If \(e = y - \hat{y}\), and \(\hat{y}\) is mean of \(y\), mean of \(e\) must be \(0\)
  • We choose \(e\) orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\), so their dot product is \(0\). That means sum of \(e\) must be \(0\), so mean of \(e\) must be \(0\).

Beyond the Mean

We saw that the mean is a way of choosing some \(\hat{y}\) as a prediction of values of \(y\), where \(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) are prediction errors or residuals.

  • we choose \(\hat{y}\) such that this vector of points has closest possible distance to vector \(y\) (length of \(\mathbf{e}\) minimized by being orthogonal to \(\hat{y}\cdot\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\))

  • this gives us the name “least squares”: the “squared errors” or “squared residuals” are minimized, because the length of the vector of prediction errors (a distance) is minimized

Beyond the Mean

Bivariate regression chooses \(\hat{y}\) as closest possible prediction of \(y\) with the form

\[\mathbf{\hat{y}} = b_0 + b_1\cdot \mathbf{x}\]

An intercept \(b_0\) and coefficient \(b_1\) multiplied by \(\mathbf{x}\)

Which line?

Graph of Averages

Which line?

The red line above is the prediction using least squares.

It closely approximates the conditional mean of son’s height (\(y\)) across values of father’s height (\(x\)).

How do we obtain this line mathematically? (proof/derivation here)

  • we don’t strictly need the geometric intuition for this, or for the mean

Bivariate Regression

If we choose slope \(b_1\) and intercept \(b_0\) such that they minimize \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), we can obtain these formulae:

The slope:

\[b_1 = \frac{Cov(X,Y)}{Var(X)}\]

  • Expresses how much mean of \(y\) changes for a 1-unit change in \(x\)
  • When expressed as a function of the correlation coefficient \(r\), we see the slope as \(r\) times the rise (\(SD_y\)) over the run (\(SD_x\)): \(b_1 = r\frac{SD_y}{SD_x}\)

Covariance (does this look familiar?)

\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

\(Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\)

Pearson Correlation

\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)

\(b_1 = \frac{Cov(X,Y)}{SD(X)SD(Y)}\frac{SD(Y)}{SD(X)} = \frac{Cov(X,Y)}{Var(X)}\)

(details on covariance and correlation)

Bivariate Regression

The Intercept:

\[b_0 = \overline{y} - \overline{x}\cdot b_1\]

Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
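A quick sketch in R, with simulated data (variable names are made up), checking these formulae against `lm()`. (R’s `cov()` and `var()` divide by \(n-1\), but the ratio is unaffected.)

set.seed(1)
x = rnorm(100)
y = 2 + 0.5 * x + rnorm(100) #simulated data

b1 = cov(x, y) / var(x) #slope
b0 = mean(y) - mean(x) * b1 #intercept
c(b0, b1)

coef(lm(y ~ x)) #should match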

Deriving Least Squares

But to go beyond bivariate regression, need to derive more general solution:

Rather than project the \(n \times 1\)-dimensional vector \(\mathbf{y}\) onto one dimension (as we did with the mean), we project it into \(p\) (number of coefficients/parameters) dimensional subspace. Hard to visualize, but we still end up minimizing the distance between our \(n\) dimensional vector \(\mathbf{\hat{y}}\) and the vector \(\mathbf{y}\).

  • If we have \(n = 3\) and a bivariate regression, we find a \(\mathbf{\hat{y}}\) in \(2\) dimensions that is nearest \(\mathbf{y}\). (one dimension for \(b_0\), one dimension for \(b_1\))

Deriving Least Squares

Given \(\mathbf{y}\), an \(n \times 1\) dimensional vector of all values \(y\) for \(n\) observations

and \(\mathbf{X}\), an \(n \times 2\) dimensional matrix (\(2\) columns, \(n\) observations). We call this the design matrix: a column of \(1\)s (for an intercept) and a column \(x\) for our other variable.

\(\mathbf{\hat{y}}\) is an \(n \times 1\) dimensional vector of predicted values (for the mean of Y conditional on X) computed by \(\mathbf{X\beta}\). \(\mathbf{\beta}\) is a vector \(p\times 1\) of (coefficients) that we multiply by \(\mathbf{X}\).

We’ll assume there are only two coefficients in \(\mathbf{\beta}\): \((b_0,b_1)\) so that \(\hat{y_i} = b_0 + b_1 \cdot x_i\), so \(p = 2\)

Deriving Least Squares

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\]

Deriving Least Squares

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]

\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 \cdot b_0 + x_1\cdot b_1 \\ \vdots \\ 1\cdot b_0 + x_n \cdot b_1 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \mathbf{\widehat{y}}\]

\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) gives us the residuals (prediction errors).

Deriving Least Squares

We want to choose \(\mathbf{\beta}\) or \(b_0,b_1\) such that the distance between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\) is minimized.

Like before, the distance is minimized when the vector of residuals \(\mathbf{y} - \mathbf{\hat{y}} = \mathbf{e}\) is orthogonal or \(\perp\) to \(\mathbf{X}\)

Deriving Least Squares

\(\mathbf{X}'_{p\times n}\mathbf{e}_{n\times1} = \begin{pmatrix} 0_1 \\ \vdots \\ 0_p \end{pmatrix} = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'(\mathbf{Y} - \mathbf{\hat{Y}}) = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'(\mathbf{Y} - \mathbf{X\beta}) = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'\mathbf{Y} - \mathbf{X}'\mathbf{X{\beta}} = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X{\beta}}\)

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

Deriving Least Squares

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\]

This is the matrix formula for least squares regression.

If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)

If \(X\) is a column of \(1\)s and a column of \(x\)s, it is bivariate regression. (algebraic proof showing equivalence here)

We can now add \(p > 2\): more variables
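A sketch of this matrix formula in R with simulated data (names are illustrative): it reproduces `lm()`, and with only the column of \(1\)s it reproduces the mean.

set.seed(2)
n = 100
x = rnorm(n)
y = 1 + 2 * x + rnorm(n) #simulated data

X = cbind(1, x) #design matrix: a column of 1s and a column of x
solve(t(X) %*% X) %*% t(X) %*% y #same as coef(lm(y ~ x))

ones = matrix(1, nrow = n, ncol = 1)
solve(t(ones) %*% ones) %*% t(ones) %*% y #same as mean(y)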

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(1\). the mean of the residuals is always zero (if we include an intercept). Because we included an intercept (\(b_0\)), and the regression line goes through the point of averages, the mean of the residuals is always 0. \(\overline{e} = 0\). This is also true of residuals of the mean.

Why?

the mean of the residuals is always zero.

We choose \(\begin{pmatrix}b_0 \\ b_1 \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).

\[\mathbf{1}'e = 0\]

And if this is true, then \(\sum e_i = 0\), so \(\frac{1}{n}\sum e_i = 0\).

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

Recall that \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e}\)

We chose \(\beta\) (\(b_0, b_1\)) such that \(X'e = 0\), so they would be orthogonal.

\(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\);

And, from above, we know that \(\overline{e}=0\);

so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

This also means that residuals \(e\) are always perfectly uncorrelated (Pearson correlation) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.
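Both facts are easy to verify numerically; a sketch with simulated data and `lm()`:

set.seed(3)
x = rnorm(100)
z = rnorm(100)
y = 1 + x - z + rnorm(100) #simulated data

e = resid(lm(y ~ x + z)) #residuals from the regression

mean(e) #zero, up to floating-point error
cov(e, x) #zero
cov(e, z) #zero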

Multivariate Least Squares

Multivariate Least Squares:

Previously we predicted \(Y\) as a linear function of \(x\):

\[\hat{y_i} = b_0 + b_1 \cdot x_i\]

Now, we can imagine predicting \(y\) as a linear function of many variables:

\[\hat{y_i} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k\]

Multivariate Least Squares:

  • When we calculated the mean using matrix algebra, we projected the \(n\) dimensional vector \(Y\) onto a point on a one-dimensional line.
  • When we calculated the bivariate regression line, we projected the \(n\) dimensional vector \(Y\) onto a \(2\)-dimensional space (one for \(b_0\) and one for \(b_1\))
  • When we use multi-variate regression, we project the \(n\) dimensional vector \(Y\) onto a \(p\) dimensional space (one for each parameter/coefficient)

Multivariate Least Squares:

What is “projecting onto \(p\) dimensions”?

When we project into two dimensions, these dimensions are precisely like the \(x\) and \(y\) axes on a graph: perpendicular/orthogonal to each other.

In multivariate regression, we project \(y\) onto \(p\) orthogonal dimensions in \(\mathbf{X}\). (\((\mathbf{X}'\mathbf{X})^{-1}\) transforms to an orthogonal basis)

  • “under the hood”, regression creates a new version of \(\mathbf{X}\) where each column is orthogonal to the others

Mathematical Requirements:

  1. Matrix \(X\) has “full rank”
  • This means that all of the columns of \(\mathbf{X}\) are linearly independent.
    • cannot have two identical columns
    • cannot have a column that is a linear combination of the other columns (e.g. a set of columns that sums to another column multiplied by a scalar)
  • If \(\mathbf{X}\) is not full rank, it cannot be inverted, and we cannot do least squares.
    • but we’ll see more on the intuition for this later.
  2. \(n \geq p\): we need to have at least as many data points as variables in our equation
  • no longer trivial with multiple regression

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 0.50 & 0.25 & 0.25 \\ 1 & 0.25 & 0.50 & 0.25 \\ 1 & 0.25 & 0.25 & 0.50 \end{pmatrix}\]

Multivariate Least Squares:

When we include more than one variable in the equation, we cannot calculate slopes using simple algebraic expressions like \(\frac{Cov(X,Y)}{Var(X)}\).

  • Must use matrix algebra (this is why I introduced it)

We calculate least squares using same matrix equation (\((X'X)^{-1}X'Y\)) as in bivariate regression, but what is the math doing in the multivariate case?

Multivariate Least Squares:

When fitting the equation:

\(\hat{y_i} = b_0 + b_1x_i + b_2z_i\)

  1. \(b_1 = \frac{Cov(x^*, Y)}{Var(x^*)}\)

Where \(x^* = x - \hat{x}\), with \(\hat{x}\) from the regression \(\hat{x} = c_0 + c_1 z\).

  2. \(b_2 = \frac{Cov(z^*, Y)}{Var(z^*)}\)

Where \(z^* = z - \hat{z}\), with \(\hat{z}\) from the regression \(\hat{z} = d_0 + d_1 x\)

Does anything look familiar here?
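This “partialling out” result (sometimes called the Frisch-Waugh-Lovell theorem) can be checked in R with simulated data:

set.seed(4)
z = rnorm(200)
x = 0.6 * z + rnorm(200) #x and z are correlated
y = 1 + 2 * x - z + rnorm(200) #simulated outcome

x_star = resid(lm(x ~ z)) #the part of x orthogonal to z
cov(x_star, y) / var(x_star) #b_1 computed by hand

coef(lm(y ~ x + z))["x"] #matches the multivariate regression coefficient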

Multivariate Least Squares:

More generally:

\[\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

where \(X_k^* = X_k - \hat{X_k}\) obtained from the regression:

\(\hat{X}_{k} = c_0 + c_1 X_{1} + \ldots + c_{j} X_{j}\) (including all \(X_{j \neq k}\))

\(X_k^*\) is the residual from regressing \(X_k\) on all other \(\mathbf{X_{j \neq k}}\)

Multivariate Least Squares:

How do we make sense of \(X_k^*\) as residual \(X_k\) after regressing on all other \(\mathbf{X_{j \neq k}}\)?

  • It is the residual in same way as \(e\): \(X_k^*\) is orthogonal to all other variables in \(\mathbf{X_{j \neq k}}\).
    • it is “perpendicular” to other variables, as are axes on a graph.
    • It is perfectly uncorrelated (in the linear sense) with all other variables in the regression.

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\) (if \(X_k^*\) as residual \(X_k\) after regressing on all other \(\mathbf{X_{j \neq k}}\)?)

  • The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

  • Sometimes people say “the slope of \(X_k\) controlling for variables \(\mathbf{X_{j \neq k}}\)”.
    • is it “holding other factors constant”/ceteris paribus? Not quite.
    • better to think of it as “partialling out” the relationship with other variables in \(\mathbf{X}\). The \(X_k\) that does not co-vary with other variables
    • better to think of it as variation in \(X_k\) residual on the mean \(X_k\) predicted by all other variables in \(X\)
    • this residual variation has implications for how least squares weights observations

Multivariate Least Squares:

There are additional implications of defining the slope \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\):

Now we can see why columns of \(X\) must be linearly independent:

  • e.g. if \(X_1\) were linearly dependent on \(X_2\) and \(X_3\), then \(X_2\) and \(X_3\) perfectly predict \(X_1\).
  • If \(X_1\) is perfectly predicted by \(X_2\) and \(X_3\), then the residuals \(X_1^*\) will all be \(0\).
  • If \(X_1^*\) are all \(0\)s, then \(Var(X_1^*) = 0\), and \(b_1\) is undefined (see the sketch below).
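A sketch of what this looks like in R (simulated data): with a linearly dependent column, \(\mathbf{X}\) is not full rank, \(\mathbf{X}'\mathbf{X}\) cannot be inverted, and `lm()` reports an `NA` for the redundant coefficient.

set.seed(5)
x2 = rnorm(50)
x3 = rnorm(50)
x1 = x2 + x3 #x1 is a linear combination of x2 and x3
y = rnorm(50) #simulated outcome

coef(lm(y ~ x1 + x2 + x3)) #one coefficient is NA: X is not full rank

X = cbind(1, x1, x2, x3)
#solve(t(X) %*% X) #would throw an error: the matrix is (computationally) singular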

Key insights about regression

Bivariate regression not guaranteed to uncover true CEF if it is not linear:

  • but it is the best linear approximation of CEF, minimizing the same distance metric as defines the mean.

Key insights about regression

With the mean, \(\hat{y}\) is a projection of \(n\) dimensional \(y\) onto a line.

With bivariate regression, \(\hat{y}\) is a projection of \(y\) onto a plane - which is 2d:

  • in order to choose the vector on this 2d surface of \(X\), we need to have new basis vectors that are orthogonal.
  • \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) transforms \(X\) such that each column is orthogonal to the other columns in \(X\)
  • This means \(x\) is transformed to be orthogonal to \(\mathbf{1}\): it is de-meaned to have a mean of \(0\).

Derivation of Bivariate Regression

Deriving Bivariate Regression Formula

We want to choose \(a, b\) such that \(\mathbf{\hat{y}} = a + b\cdot \mathbf{x}\) has minimum distance to \(\mathbf{y}\)

Another way of thinking of this is in terms of residuals, or the difference between true and predicted values using the equation of the line. (Prediction errors)

\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\)

Minimizing the distance also means minimizing the sum of squared residuals or length of vector \(\mathbf{e}\)

Minimizing the Distance

(Proof itself is not something you need to memorize)

We need to solve this equation:

\[\min_{a,b} \sum\limits_i^n (y_i - a - b x_i)^2\] Choose \(a\) and \(b\) to minimize this value, given \(x_i\) and \(y_i\)

We can do this with calculus: solve for when first derivative is \(0\) (since this means distance will be at its minimum)

Minimizing the Distance

First, we take derivative with respect to \(a\): yields:

\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) \right] = 0\)

\(\sum\limits_i^n y_i - \sum\limits_i^n a - \sum\limits_i^n b x_i = 0\)

\(-\sum\limits_i^n a = -\sum\limits_i^n y_i + \sum\limits_i^n b x_i\)

\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)

Minimizing the Distance

\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)

Dividing both sides by \(n\), we get:

\(a = \bar{y} - b\bar{x}\)

Where \(\bar{y}\) is mean of \(y\) and \(\bar{x}\) is mean of \(x\).

Implication: regression line goes through the point of averages \(\bar{y} = a + b \bar{x}\)

Minimizing the Distance

Next, we take derivative with respect to \(b\):

\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) x_i\right] = 0\)

\(\sum\limits_i^n (y_i - (\bar{y} - b\bar{x}) - b x_i) x_i = 0\)

\(\sum\limits_i^n (y_ix_i - \bar{y}x_i + b\bar{x}x_i - b x_ix_i) = 0\)

Minimizing the Distance

\(\sum\limits_i^n (y_i - \bar{y})x_i = b\sum\limits_i^n (x_i - \bar{x})x_i\)

Dividing both sides by \(n\) gives us:

\(\frac{1}{n}\sum\limits_i^n (y_ix_i - \bar{y}x_i) = b\frac{1}{n}\sum\limits_i^n (x_i^2 - \bar{x}x_i)\)

\(\overline{yx} - \bar{y}\bar{x} = b\,(\overline{x^2} - \bar{x}\bar{x})\)

\(Cov(y,x) = b \cdot Var(x)\)

\(\frac{Cov(y,x)}{Var(x)} = b\)

Proof LS Solution is the Same

Just what are the matrices doing?

But we also want to know more intuitively what these matrix operations are doing! It isn’t magic.

  • We will walk through what exactly the matrix calculations do for us.

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}'\mathbf{X} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\]

\[= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\] \[= \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} = n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]

Inverting Matrices

How do we get \(^{-1}\)? This is inverting a matrix.

  • Inverse \(A^{-1}\) of matrix \(A\) that is square or \(p\times p\) has the property:

\[A \times A^{-1} = A^{-1} \times A = I_{p \times p} = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & \ddots & \ldots & 0 \\ 0 & \ldots & \ddots & 0 \\ 0 & \ldots & 0 & 1 \end{pmatrix}\]

This is an identity matrix with 1s on diagonal, 0s everywhere else.

Inverting Matrices

We need to get the determinant

For sake of ease, will show for a scalar and for a \(2 \times 2\) matrix:

\[det(a) = a\]

\[det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - cb\]

Inverting Matrices

Then we need to get the adjoint. It is the transpose of the matrix of cofactors (don’t ask me why):

\[adj(a) = 1\]

\[adj\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]

Inverting Matrices

The inverse of \(A\) is \(adj(A)/det(A)\)

\[A^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
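Checking the \(2 \times 2\) formula against R’s `solve()`, for an arbitrary invertible matrix:

A = matrix(c(2, 1, 4, 3), nrow = 2, byrow = TRUE) #a = 2, b = 1, c = 4, d = 3

det_A = 2 * 3 - 4 * 1 #ad - cb
adj_A = matrix(c(3, -1, -4, 2), nrow = 2, byrow = TRUE) #adjoint

adj_A / det_A #same as solve(A)
solve(A)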

Deriving Least Squares

\[A^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X}) = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{n}{n^2(\overline{x^2} - \overline{x}^2)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]

Deriving Least Squares

We can put it together to get: \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix} \frac{n}{1} \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x}\ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

Deriving Least Squares

The slope:

\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

\[b = \frac{\overline{xy} - \overline{x} \ \overline{y}}{Var(x)} = \frac{Cov(x,y)}{Var(x)}\]

\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]

Deriving Least Squares

The slope:

  • Expresses how much mean of \(Y\) changes for a 1-unit change in \(X\)

The Intercept:

\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

\[a = \frac{\overline{x^2}\overline{y} -\overline{x} \ \overline{xy}}{Var(x)} = \frac{(Var(x) + \overline{x}^2)\overline{y} - \overline{x}(Cov(x,y) + \overline{x}\overline{y})}{Var(x)}\]

\[= \frac{Var(x)\overline{y} + \overline{x}^2\overline{y} - \overline{x}^2\overline{y} - \overline{x}Cov(x,y)}{Var(x)}\]

\[= \overline{y} - \overline{x}\frac{Cov(x,y)}{Var(x)}\]

\[a = \overline{y} - \overline{x}\cdot b\]

Deriving Least Squares

The Intercept:

\[a = \overline{y} - \overline{x}\cdot b\]

Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.

Covariance/Correlation

If we are thinking about how knowing the value of one variable informs us about the value of another… might want to think about correlation.

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

Covariance

\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]

\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]

  • Divide by \(n-1\) for sample covariance.

Covariance and Correlation

Variance

Variance is also the covariance of a variable with itself:

\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]

\[Var(X) = \overline{x^2} - \bar{x}^2\]

Covariance of military aged males and suffrage vote:

x1 = dat$mil_age
y1 = dat$suffrage_diff
mean(x1*y1) - (mean(x1)*mean(y1))
## [1] -61.77359

Covariance of enlistment rate and suffrage vote:

x2 = dat$enlist_rate
y2 = dat$suffrage_diff
mean(x2*y2) - (mean(x2)*mean(y2))
## [1] 0.002691231

Why is the \(Cov(MilAge,Suffrage)\) larger than \(Cov(Enlist,Suffrage)\)?

Why is the \(Cov(Width,Volume)\) larger than \(Cov(Width,Height)\)?

  • Scale of covariance reflects scale of the variables.

  • Can’t directly compare the two covariances

Covariance: Intuition


Correlation

Correlation puts covariance on a standard scale

Covariance

\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Pearson Correlation

\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)

  • Dividing by product of standard deviations scales the covariance

  • \(|Cov(X,Y)| \leq \sqrt{Var(X) \cdot Var(Y)}\)

Correlation

  • Correlation coefficient must be between \(-1, 1\)

  • At \(-1\) or \(1\), all points are on a straight line

  • Negative value implies increase in \(X\) associated with decrease \(Y\).

  • If correlation is \(0\), the covariance must be?

  • If \(Var(X)=0\), then \(Cor(X,Y) = ?\)

Correlation: Interpretation

  • Correlation of \((x,y)\) is same as correlation of \((y,x)\)

  • Values closer to -1 or 1 imply “stronger” association

  • Correlations cannot be understood using ratios.

    • Correlation of \(0.8\) is not “twice” as correlated as \(0.4\).
  • Pearson correlation can be misleading in the presence of outliers/nonlinearities

  • Correlation is not causation

Practice:

Practice is two-fold: understand the concept and write functions in R

#name your function: my_mean
#function(x); x is an argument, we can add other arguments
my_mean = function(x) {
  n = length(x) #we'll plug in whatever value of x we give to the function 
  s = sum(x)
  return(s/n) #what value(s) should the function return?
}
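For example (with made-up values):

my_mean(c(2, 4, 9))
## [1] 5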

Practice:

Get data_1 here: https://pastebin.com/LeSpqKKk

Without using cor() or cov() or var() or sd() functions in R

hint: generate your OWN functions

  1. Inspect data_1
  2. Calculate mean of \(X\); mean of \(Y\)
  3. Calculate \(Var(X)\) and \(Var(Y)\)
  4. Calculate correlation \((X,Y)\)

Practice:

Covariance and Correlation

  • Both measure linear association of two variables
  • Scale is either standardized (correlation) or in terms of products (covariance)
  • May be inappropriate in presence of non-linearity; outliers