POLI 572B

Michael Weaver

February 16, 2024

Least Squares

Objectives

  • Recap key ideas on estimating causal effects
  • What kind of tools do we need to estimate?
  • Why least squares is a good-enough tool

Introduction

Seen several approaches to estimating causal effects

  • all causal estimands we have seen are average effects: ACE, ATT, etc.
  • all estimands involve comparing the average outcome in a treated state against the average outcome in an untreated state

Introduction

Estimation of these causal effects involves:

  • plugging in the average outcome among some set of untreated cases for the unobserved average counterfactual outcome among treated cases
  • and vice versa, for ACE
  • to the board

Introduction

We need a tool that lets us plug in and compare averages.

  • Least squares is a powerful, if limited, tool for doing this.

Why least squares?

  • We need averages \(\to\) LS is a generalization of the mean
  • A “simple”, yet flexible tool to plug in these values
  • Ubiquitous, often “good enough”

Why least squares?

In order to use least squares in causal inference, we need to understand what the math of regression is doing, as it pertains to:

  • given our need for averages, how is regression related to the mean?
  • how does regression “plug-in” values?
  • how does regression “control” for variables?
  • how does math of regression relate back to the causal estimands of interest?
  • As we move from design-based toward model-based inference, we need to be in control of the tool we use.

Why least squares math?

No causality

  • no effort to prove causality; no assumptions about causal model.

No parameters/estimands

  • no statistical model; no random variables; no potential outcomes

Regression/Least Squares as Algorithm

  • algorithm with mathematical properties
  • without further assumption, plug in data, generate predictions
  • limited interpretation

Objectives

Mechanics of Bivariate Regression

  • the mean (revisited)
  • Relationship between variables
    • Covariance; Correlation
  • Conditional Expectation Function
  • Least Squares
    • algorithm to get CEF
    • mathematical properties of algorithm
    • no assumptions

The Mean: Revisited

Why do we use the mean to summarize values?

  • For random variables, it is the “best guess” of what value we will draw
  • It is a kind of prediction
  • In what sense is it the “best”?

Squared Deviations

Why are we always squaring differences?

Variance

\(\frac{1}{n}\sum\limits_{i = 1}^{n} (x_i - \bar{x})^2\)

Covariance

\(\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Mean Squared Error

\(\frac{1}{n}\sum\limits_{i=1}^n(\hat{y}_i - y_i)^2\)

Squared Deviations

It is linked to distance

What is the distance between two points?

In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]


In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)

\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)

What is the distance between these two points?

\(p = (3,0); q = (0,4)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]

What is the distance between these two points?

\(p = (3,0); q = (0,4)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]

\[d(p,q) = \sqrt{(3 - 0)^2 + (0 - 4)^2}\] \[d(p,q) = \sqrt{3^2 + (-4)^2} = \ ?\]
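
A minimal sketch in R (using the two points above; an added illustration, not from the original slides):

p = c(3, 0)
q = c(0, 4)
sqrt(sum((p - q)^2))  # square root of the sum of squared coordinate differences
## [1] 5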

What is the distance between these two points?

Remember Pythagoras?

What is the distance between two points?

In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]


In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)

\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)

The Mean

One way of thinking about the mean is as a prediction.

We observe many values in \(Y\): \(y_1 \dots y_n\). We want to choose a single value, \(\hat{y}\), that is the best prediction for all the values in \(Y\).

If we say that the “best” prediction is the one with the shortest distance between the prediction \(\hat{y}\) and the actual values \(y_1 \dots y_n\),

then the algorithm that gives us this best prediction is the mean.

  • since distance is the square root of a sum of squared differences, minimizing distance is the same as minimizing squared prediction errors (\(y_i - \hat{y}\)); this is why the mean minimizes the variance of prediction errors.
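
To make this concrete, a minimal sketch in R (a toy vector, added for illustration): any guess other than the mean gives a larger sum of squared prediction errors.

y = c(2, 4, 9)
sse = function(guess) sum((y - guess)^2)  # sum of squared prediction errors

sse(mean(y))  # at the mean (mean(y) is 5)
## [1] 26
sse(4)        # any other guess does worse
## [1] 29
sse(6)
## [1] 29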

Deriving the mean:

Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.

\[Y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]

More generally, we imagine our observed data \(Y\) is a vector with one dimension for each data point (observation)

What is a vector?

  • a vector is an array of numbers of length \(n\).

  • can be portrayed graphically as an arrow from the origin (the point \((0,0)\) or \((0,0,\dots,0)\)) to a point in \(n\) dimensional space

  • vectors can be added, vectors can be multiplied by a number to extend/shorten their length (each element multiplied by same number)
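
A minimal illustration in R (toy vectors, added for illustration) of these two operations:

v = c(3, 5)
w = c(1, 1)

v + w    # element-wise addition
## [1] 4 6
2 * w    # multiplying by a number: each element multiplied by 2
## [1] 2 2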

Deriving the mean:

We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(Y\).

This is equivalent to doing this:

\[Y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \begin{pmatrix}\hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]

Multiplying a \((1,1)\) vector by a constant.

  • \(\hat{y}\) will be on the line that runs through the origin \((0,0)\) and \((1,1)\)
  • Why \((1,1)\)?

Choose \(\hat{y}\) on the blue line at the point that minimizes the distance to \(y\).

Deriving the mean:

\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\) can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):

\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)

and another vector \(\mathbf{e}\), which is the difference between the prediction vector and the vector of observations (the residual or prediction error):

\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)

Deriving the mean:

This means our goal is to minimize the length of \(\mathbf{e}\), the vector of residuals.

How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:

\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]

When is the length of \(\mathbf{e}\) minimized?

  • when angle between \(\hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\) and \(\mathbf{e}\) is \(90^{\circ}\).
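
A minimal sketch in R (using the \(Y = (3,5)\) example above; added for illustration): the prediction built from the mean leaves a residual that is orthogonal to the \((1,1)\) direction.

Y = c(3, 5)
yhat = mean(Y)           # 4
e = Y - yhat * c(1, 1)   # residual vector: (-1, 1)

sum(e * c(1, 1))         # dot product with (1,1) is 0, i.e., a 90-degree angle
## [1] 0
sqrt(sum(e^2))           # length of the residual at the mean
## [1] 1.414214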

Deriving the mean:

Values of \(Y\) in a sample of size \(n\) are represented by a vector in \(n\) dimensional space. \(\hat{y}\) is a prediction of \(Y\).

  • We choose a value \(\hat{y}\) in one dimensional sub-space (typically on a line through: \(\begin{pmatrix} 1_1 \ 1_2 \ldots 1_n \end{pmatrix}\))

  • Such that the difference between \(\hat{y}\) and \(Y\) (the residual \(\mathbf{e}\)) is shortest

Deriving the mean:

What if we have more information about each case \(i\)? Let’s call this information \(X\), with \(x_1 \dots x_n\).

How can we use this to improve our predictions, \(\hat{y}\), of \(Y\)?

  • We could choose a value \(\hat{y}\) in a two dimensional plane (the plane that passes through \(\begin{pmatrix} 1_1 \ 1_2 \ldots 1_n \end{pmatrix}\) and \(\begin{pmatrix} x_1 \ x_2 \ldots x_n \end{pmatrix}\))

  • Such that the difference between \(\hat{y}\) and \(Y\) (the residual \(\mathbf{e}\)) is shortest

  • As we will see, this is least squares
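
A minimal sketch in R (simulated data, added for illustration only): the fitted values from lm() are exactly the projection of \(Y\) onto the plane spanned by the constant vector and \(X\), which is the geometric picture described above.

set.seed(1)
x = rnorm(10)
y = 2 + 3 * x + rnorm(10)

X = cbind(1, x)                                  # design matrix: (1, x)
proj = X %*% solve(t(X) %*% X) %*% t(X) %*% y    # projection of y onto that plane

all.equal(as.numeric(proj), unname(fitted(lm(y ~ x))))
## [1] TRUE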

Digression

If we are thinking about how knowing the value of one variable informs us about the value of another… we might want to think about correlation.

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

Covariance

\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]

\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]

  • Divide by \(n-1\) for sample covariance.

Covariance and Correlation

Variance

Variance is also the covariance of a variable with itself:

\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]

\[Var(X) = \overline{x^2} - \bar{x}^2\]
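
A quick numerical check in R (a toy vector, added for illustration): both ways of writing the variance give the same number.

x = c(1, 2, 6)

mean((x - mean(x))^2)    # definition: average squared deviation
## [1] 4.666667
mean(x^2) - mean(x)^2    # shortcut: mean of squares minus square of the mean
## [1] 4.666667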

Covariance of military aged males and suffrage vote:

x1 = dat$mil_age
y1 = dat$suffrage_diff
mean(x1*y1) - (mean(x1)*mean(y1))
## [1] -61.77359

Covariance of enlistment rate and suffrage vote:

x2 = dat$enlist_rate
y2 = dat$suffrage_diff
mean(x2*y2) - (mean(x2)*mean(y2))
## [1] 0.002691231

Why is the \(Cov(MilAge,Suffrage)\) larger than \(Cov(Enlist,Suffrage)\)?

Why is the \(Cov(Width,Volume)\) larger than \(Cov(Width,Height)\)?

  • Scale of covariance reflects scale of the variables.

  • Can’t directly compare the two covariances

Covariance: Intuition


Correlation

Correlation puts covariance on a standard scale

Covariance

\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Pearson Correlation

\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)

  • Dividing by the product of standard deviations rescales the covariance

  • \(|Cov(X,Y)| \leq \sqrt{Var(X)Var(Y)}\)
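
A minimal sketch in R (simulated data, added for illustration): dividing the covariance by the product of standard deviations reproduces the built-in correlation, and rescaling a variable changes the covariance but not the correlation.

set.seed(2)
x = rnorm(100)
y = 2 * x + rnorm(100)

cov(x, y) / (sd(x) * sd(y))  # covariance put on a standard scale
cor(x, y)                    # same value
cor(100 * x, y)              # rescaling x changes cov(x, y) but not the correlation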

Correlation

  • Correlation coefficient must be between \(-1, 1\)

  • At \(-1\) or \(1\), all points are on a straight line

  • Negative value implies increase in \(X\) associated with decrease \(Y\).

  • If correlation is \(0\), the covariance must be?

  • If \(Var(X)=0\), then \(Cor(X,Y) = ?\)

Correlation: Interpretation

  • Correlation of \((x,y)\) is same as correlation of \((y,x)\)

  • Values closer to -1 or 1 imply “stronger” association

  • Correlations cannot be understood using ratios.

    • Correlation of \(0.8\) is not “twice” as correlated as \(0.4\).
  • Pearson correlation is agnostic about outliers/nonlinearities

  • Correlation is not causation

Practice:

Practice is two-fold: understand the concept and write functions in R

#name your function: my_mean
#function(x); x is an argument, we can add other arguments
my_mean = function(x) {
  n = length(x) #we'll plug in whatever value of x we give to the function 
  s = sum(x)
  return(s/n) #what value(s) should the function return?
}
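
For example, a quick check on a toy vector (added for illustration; not the practice data):

my_mean(c(3, 5, 10))  # should match the built-in mean()
## [1] 6
mean(c(3, 5, 10))
## [1] 6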

Practice:

Get data_1 here: https://pastebin.com/LeSpqKKk

Without using cor() or cov() or var() or sd() functions in R

hint: generate your OWN functions

  1. Inspect data_1
  2. Calculate mean of \(X\); mean of \(Y\)
  3. Calculate \(Var(X)\) and \(Var(Y)\)
  4. Calculate correlation \((X,Y)\)

Covariance and Correlation

  • Both measure linear association of two variables
  • Scale is either standardized (correlation) or in terms of products (covariance)
  • May be inappropriate in the presence of non-linearity or outliers

Conditional Expectation Function

Generalizing the Mean:

The mean is useful…

… but often we want to generate better predictions of \(Y\) using additional information \(X\).

Generalizing the Mean:

The mean is useful…

… but often we want to know if the mean of something \(Y\) is different across different values of something else \(X\).

To put it another way: the mean of \(Y\) is \(E[Y]\) (if we are talking about random variables). Sometimes we want to know \(E[Y | X]\)

  • In causal inference, we want \(E[Y | Z = z]\) or \(E[Y | D = d, X = x]\)
  • differences in means in experiments; plugging in missing counterfactuals when conditioning; plugging in counterfactual trends in DiD

Generalizing the Mean:

We are interested in finding the conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean: \(E[Y]\)

conditional: because it is conditional on values of \(X\): \(E[Y | X]\)

function: because \(E[Y | X] = f(X)\), there is some relationship we can look at between values of \(X\) and \(E[Y]\).

\[E[Y | X = x]\]

Generalizing the Mean:

Another way of thinking about what the conditional expectation function is:

\[E[Y | X = x] = \hat{y} | X = x\]

What is the predicted value \(\hat{y}\), where \(X = x\), such that \(\hat{y}\) has the smallest distance to \(Y\)?

Generalizing the Mean:

These points emphasize the conditional and expectation part of the CEF.

The difficulty is: how do we find the function?

  • a function takes some value of \(X\) and uniquely maps it to some value of \(E[Y]\)
  • we need to “learn” this function from the data
  • depending on how much data we have, we have to make some choice of how to interpolate/extrapolate when “learning” this function
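
One simple way to “learn” the CEF when \(X\) takes only a few discrete values (a minimal sketch in R with simulated data, added for illustration): compute the mean of \(Y\) separately at each value of \(X\).

set.seed(3)
x = sample(0:3, 200, replace = TRUE)
y = 1 + 2 * x + rnorm(200)

tapply(y, x, mean)    # one conditional mean per value of x (a "graph of averages")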

Generalizing the Mean:

There are many ways to define and estimate the function in the conditional expectation function

  • one easy choice is to assume that the CEF is linear.
  • That is to say, \(E[Y | X]\) is linear in \(X\).
  • The function takes the form of an equation of a line.
  • This leads us to bivariate regression

Equation of a line

\(slope = \frac{rise}{run} = \frac{y_2-y_1}{x_2-x_1}\)
  • Change in \(y\) with a 1 unit change in \(x\).

Equation of a line

\(intercept = (y | x=0)\)
  • Value of \(y\) when \(x = 0\). Where the line crosses the \(y\)-axis.

Equation of a line

\(y = intercept + slope*x\)

or, by convention:

\(y = a + bx\)

Generalizing the Mean:

How do we choose the line that best captures:

\[E[Y] = a + b\cdot X\]

or

\[\hat{y} = a + b\cdot X\]
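
As a preview (a minimal sketch in R with simulated data, added for illustration): lm() chooses the intercept \(a\) and slope \(b\) by least squares, that is, to minimize the sum of squared residuals.

set.seed(4)
x = rnorm(50)
y = 1 + 0.5 * x + rnorm(50)

coef(lm(y ~ x))    # least-squares choice of intercept a and slope b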

What line fits this?

Which line?

Graph of Averages