POLI 572B

Michael Weaver

February 28, 2025

Least Squares Continued

Objectives

  • Recap: Math of Least Squares
  • Least Squares \(\to\) CEF
  • Interpretation of Coefficients
    • predicted values and residuals
    • design matrix \(\to\) coefficients
    • dummy variables
    • continuous variables
    • “controlling”
    • non-linearity

Recap

Bivariate Regression

\[\hat{y_i} = b_0 + b_1x_i\]

The slope:

\[b_1 = \frac{Cov(x,y)}{Var(x)}\]

The Intercept:

\[b_0 = \overline{y} - \overline{x}\cdot b_1\]

Multivariate Least Squares:

The mean and least squares are both orthogonal projections of \(y\) onto the column space of \(X\).

Previously we predicted \(Y\) as a linear function of \(x\):

\[\hat{y_i} = b_0 + b_1 \cdot x_i\]

Now, we can imagine predicting \(y\) as a linear function of many variables:

\[\hat{y_i} = b_0 + b_1 x_{1i} + b_2 x_{2i} + \ldots + b_k x_{ki}\]

Deriving Least Squares

Given \(\mathbf{y}\), an \(n \times 1\) dimensional vector of all values \(y\) for \(n\) observations

and \(\mathbf{X}\), an \(n \times 2\) dimensional matrix (\(2\) columns, \(n\) observations). We call this the design matrix. Its columns are a vector of \(1\)s (for the intercept) and a vector \(x\) for our other variable.

\(\mathbf{\hat{y}}\) is an \(n \times 1\) dimensional vector of predicted values (for the mean of \(Y\) conditional on \(X\)), computed as \(\mathbf{X\beta}\). \(\mathbf{\beta}\) is a \(p \times 1\) vector of coefficients that we multiply by \(\mathbf{X}\).

We’ll assume there are only two coefficients in \(\mathbf{\beta}\): \((b_0,b_1)\) so that \(\hat{y_i} = b_0 + b_1 \cdot x_i\), so \(p = 2\)

Deriving Least Squares

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\]

Deriving Least Squares

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]

\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 \cdot b_0 + x_1\cdot b_1 \\ \vdots \\ 1\cdot b_0 + x_n \cdot b_1 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \mathbf{\widehat{y}}\]

\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) gives us the residuals (prediction errors).

\[\widehat{y_i} = b_0 + b_1 \cdot x_{1i} + b_2 \cdot x_{2i} \]

\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]

Mathematical Requirements:

  1. Matrix \(\mathbf{X}\) has “full rank”
  • This means that all of the columns of \(\mathbf{X}\) are linearly independent.
    • cannot have two identical columns
    • cannot have a set of columns whose weighted sum equals another column
  • If \(\mathbf{X}\) is not full rank, \(\mathbf{X'X}\) cannot be inverted and we cannot do least squares (a quick rank check in R is sketched below).
    • but we’ll see more on the intuition for this later.
  2. \(n \geq p\): we need at least as many data points as variables in our equation
  • no longer trivial with multiple regression
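A quick illustration (a minimal sketch with made-up toy data, not the course data) of checking the rank of a design matrix in R:

# Toy design matrix with a linearly dependent column
x1 = c(1, 2, 3, 4, 5)
x2 = 2 * x1                          # x2 is just a rescaled copy of x1
X_bad  = cbind(1, x1, x2)
X_good = cbind(1, x1, c(2, 5, 3, 9, 7))

qr(X_bad)$rank   # 2, but there are 3 columns: not full rank
qr(X_good)$rank  # 3: full rank
# solve(t(X_bad) %*% X_bad) fails here, because X'X is singular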

Regression and CEF

Conditional Expectation Function

the conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean: \(E[Y]\)

conditional: because it is conditional on values of \(X\): \(E[Y | X]\)

function: because \(E[Y | X] = f(X)\), there is some mathematical mapping between values of \(X\) and \(E[Y]\).

\[E[Y | X = x] = f(x)\]

Conditional Expectation Function

Another way of thinking about what the conditional expectation function is:

\[E[Y | X = x] = \hat{y} | X = x\]

What predicted value \(\hat{y}\) should we choose, where \(X = x\), such that our prediction of \(Y\) has the least error?

  • With the CEF, we choose a particular definition of what it means to have the “least error”: minimum (Euclidean) distance

Conditional Expectation Function

With least squares, we find a linear approximation of the CEF

\[\hat{y_i} = b_0 + b_1x_i\]

\[E[Y | X = x_i] = b_0 + b_1x_i\]

We want to understand:

  1. What do we mean by “linear”?
  2. What is the relationship between \(\hat{y}\) and \(E[Y|X=x]\)?
  3. How to generate predicted values
  4. Properties of ‘residuals’ \(\mathbf{e}\)/prediction errors

First, copy this code into R:

earnings = read.csv('https://raw.githubusercontent.com/mdweaver/mdweaver.github.io/refs/heads/master/poli572b/lecture_10/example_7.csv')

Data contains:

AGE: the age of survey respondents in years

INCEARN: the annual earnings (on average) for that age group

respondents: Number of survey respondents in this category

In pairs (one possible solution sketch follows the list):

  1. Create the design matrix X for this equation \(INCEARN_i = b_0 + b_1 AGE_i\) (hint: cbind)
  2. Using the matrix equation, solve for coefficients \(\beta\) in R: (hint: t(), solve(), %*%)
  3. Using lm([your y] ~ [your x], data = [your data]), solve for coefficients \(\beta\) and compare.
  4. Using matrix equation, calculate \(\hat{y}_i\) (hint: multiply \(\mathbf{X}\beta\))
  5. Calculate \(e\), the residuals using \(y\) and \(\hat{y}\)
  6. Plot points for \(Y\) by AGE; plot a line of \(\hat{y}\) by AGE: is this a perfect fit of the CEF?
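One possible solution sketch, using the column names described above (try it yourselves first):

# 1. Design matrix: a column of 1s and AGE
X = cbind(1, earnings$AGE)
y = earnings$INCEARN

# 2. Normal equations: beta = (X'X)^{-1} X'y
beta = solve(t(X) %*% X) %*% t(X) %*% y
beta

# 3. Compare with lm()
lm(INCEARN ~ AGE, data = earnings)

# 4. Predicted values and 5. residuals
y_hat = X %*% beta
e = y - y_hat

# 6. Plot the data (points) and the fitted line
plot(earnings$AGE, y)
lines(earnings$AGE, y_hat)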

CEF is not linear

Does the regression line fit all values of \(E[Y|X=x]\) equally well?

Takeaways:

  • Least squares is “linear” in that we approximate \(Y\) using a linear combination of \(X\) (sum \(X\) multiplied by coefficients: in the span/column space of \(X\))

Graph of Averages

When we use individual data…

X = cbind(1, acs_data$AGE)
y = acs_data$INCEARN

beta = solve(t(X) %*% X) %*% t(X) %*% y

beta
##           [,1]
## [1,] 17501.992
## [2,]  3668.629

Do your estimates agree? Why or why not?

Takeaways:

  • Least squares is “linear” in that we approximate \(Y\) using a linear combination of \(X\) (sum \(X\) multiplied by coefficients: in the span/column space of \(X\))
  • It is the best linear approximation: the prediction \(\hat{y}\) is closest to \(y\) in \(n\)-dimensional space. It will make better predictions where there are more observations with a specific value of \(X = x\)

We were not accounting for the number of people in each value of \(X_i\)

#With aggregate data...
X = cbind(1, earnings$AGE)
y = earnings$INCEARN
w = earnings$respondents            # number of respondents in each age group
w = w * diag(length(w))             # diagonal weight matrix W = diag(respondents)
beta = solve(t(X) %*% w %*% X) %*% t(X) %*% w %*% y   # weighted least squares

beta
##           [,1]
## [1,] 23404.565
## [2,]  3535.247

Least Squares and CEF

Least squares predictions \(\widehat{Y} = \mathbf{X}\beta\) give the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).

  • It is “best” linear approximation in that the overall set of predictions \(\hat{y}\) has least distance to \(y\)
  • Least squares could equivalently fit a line for the mean of \(Y\), \(E(Y | X)\), rather than a prediction for each \(Y_i\). We could use \(E(Y_i | X_i)\) as the dependent variable, but we would need to weight by the number of observations at each value of \(X\).

Least Squares and CEF

Alternatively, least squares gives a particular weighted average of the derivative (slope) of the non-linear CEF (see board)

  • weights are greater closer to the median of \(x\)

Now, in pairs:

  1. What is the mean of residuals e?
  2. Using the cor function: get the correlation between residuals e and AGE

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(1\). The mean of the residuals is always zero (if we include an intercept). Because we include an intercept (\(b_0\)), the regression line goes through the point of averages, so the mean of the residuals is always 0: \(\overline{e} = 0\). This is also true of residuals from the mean.

Why?

the mean of the residuals is always zero.

We choose \(\begin{pmatrix}b_0 \\ b_1 \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).

\[\mathbf{1}'e = 0\]

And if this is true, then \(\sum e_i = 0\), so \(\frac{1}{n}\sum e_i = 0\).

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

Recall that \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e}\)

We chose \(\beta\) \((b_0, b_1)\) such that \(X'e = 0\), so they would be orthogonal.

\(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\);

And, from above, we know that \(\overline{e}=0\);

so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

This also means that residuals \(e\) are always perfectly uncorrelated (Pearson correlation) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.
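A quick numerical check of both facts, refitting the unweighted regression on the aggregate earnings data so that X, y, and beta are unambiguous:

X = cbind(1, earnings$AGE)
y = earnings$INCEARN
beta = solve(t(X) %*% X) %*% t(X) %*% y
e = y - X %*% beta

mean(e)               # zero, up to floating-point error
t(X) %*% e            # X'e = 0: orthogonal to the intercept column and to AGE
cor(earnings$AGE, e)  # zero correlation between residuals and AGE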

Interpreting Regressions

We want to:

  • Understand link between design matrix, \(\hat{y}\), and coefficients
  • Understand dummy variables

Least Squares and Binary Variables

Suppose we fit the following equation using least squares (a quick check in R follows the questions below):

\(Earnings_i = b_0\)

  • What is the design matrix?
  • What is the coefficient \(b_0\) (solve for it; what is it equivalent to?)
  • What is \(\hat{y}\) in this case?
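A quick check, using the aggregate earnings data loaded earlier (the same logic holds for any outcome):

# With only an intercept, the design matrix is a single column of 1s
X0 = matrix(1, nrow = nrow(earnings), ncol = 1)
b0 = solve(t(X0) %*% X0) %*% t(X0) %*% earnings$INCEARN
b0
mean(earnings$INCEARN)            # b0 is just the mean of y
coef(lm(INCEARN ~ 1, earnings))   # lm() agrees; yhat is this mean for every row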

Least Squares and Group Means

Suppose we want to estimate earnings as a function of gender

\(Earnings_i = b_0 + b_1 \ Female_i\)

assuming, as the survey data does, the gender binary

  1. What does the design matrix \(X\) in this regression look like?
  2. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\)?
  3. What is the interpretation of \(\hat{y}\) when \(Female_i = 1\)?
  4. What does the \(b_0\) tell us?
  5. What does the \(b_1\) tell us?
m1 = lm(INCEARN ~ FEMALE, acs_data)
summary(m1) %>% .$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 201941.70   1656.080 121.93958 0.000000e+00
## FEMALE      -57269.02   2883.739 -19.85929 4.132939e-86

What would change if we changed the design matrix so that, instead of a column coded Female (\(1\)) vs Male (\(0\)), we used a column coded Female (\(2\)) vs Male (\(0\))? (A quick check follows the questions below.)

  1. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\)?
  2. What is the interpretation of \(\hat{y}\) when \(Female_i = 2\)?
  3. What does the \(b_0\) tell us?
  4. What does the \(b_1\) tell us?
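A quick check of this rescaling, assuming the individual-level acs_data used in the slides (FEMALE2 is a new, illustrative column name):

acs_data$FEMALE2 = 2 * acs_data$FEMALE    # 2 = Female, 0 = Male
coef(lm(INCEARN ~ FEMALE2, acs_data))
# The intercept (the male mean) is unchanged; the slope is now half the
# difference in means, because yhat must still hit the female mean at FEMALE2 = 2.
# The predicted values are identical to those from m1.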

Least Squares and Group Means

\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\) and \(Male_i = 1\)?
  3. What is the interpretation of \(\hat{y}\) when \(Female_i = 1\) and \(Male_i = 0\)?
  4. What does the \(b_0\) tell us?
  5. What does the \(b_1\) tell us?
  6. What does the \(b_2\) tell us?
m2 = lm(INCEARN ~ FEMALE + MALE, acs_data)

Least Squares and Group Means

\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\) and \(Male_i = 1\)?
  3. What is the interpretation of \(\hat{y}\) when \(Female_i = 1\) and \(Male_i = 0\)?
  4. What does the \(b_0\) tell us?
  5. What does the \(b_1\) tell us?
m3 = lm(INCEARN ~ -1 + FEMALE + MALE, acs_data)

Least Squares and Group Means

Would the \(\hat{y}\) be different in these three models?

\(Earnings_i = b_0 + b_1 \ Female_i\)

#Design matrix
model.matrix(m1) %>% head
##   (Intercept) FEMALE
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      0
## 6           1      1
#Interpret:
coefficients(m1)
## (Intercept)      FEMALE 
##   201941.70   -57269.02

\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)

#Design matrix
model.matrix(m2) %>% head
##   (Intercept) FEMALE MALE
## 1           1      0    1
## 2           1      0    1
## 3           1      0    1
## 4           1      1    0
## 5           1      0    1
## 6           1      1    0
#Interpret:
coefficients(m2)
## (Intercept)      FEMALE        MALE 
##   201941.70   -57269.02          NA

\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)

#Design matrix
model.matrix(m3) %>% head
##   FEMALE MALE
## 1      0    1
## 2      0    1
## 3      0    1
## 4      1    0
## 5      0    1
## 6      1    0
#Interpret:
coefficients(m3)
##   FEMALE     MALE 
## 144672.7 201941.7

Least Squares and Group Means

Lessons from these examples:

  1. Despite different coefficients, least squares can make identical predictions \(\hat{y}\); what changes is the interpretation of the coefficients

  2. We need to specify design matrix/model to get estimates of interest. (e.g. difference in means)

  3. If we have exclusive indicator variables for belonging to distinct categories (sometimes called “dummy variables”), must drop one group if we fit an intercept. (Why?)

  4. If we fit group means of \(y\): residuals are deviations from group means (centered at 0 within each group). (Why?)

Dummy Variables

If there is an intercept

  • one category must be excluded (“reference group”); otherwise linear dependence. (R will do this by default)
  • coefficient is difference in means between the group indicated by the dummy and the intercept (“reference group” mean)

If there is no intercept:

  • all categories must be included
  • separate intercepts for each group: coefficient is the mean of the group indicated by the dummy (when all other variables are \(0\)).

Exercise:

How did different regions in the UK vote on Brexit?

  • Download the brexit data:
brexit = read.csv("https://raw.githubusercontent.com/mdweaver/mdweaver.github.io/refs/heads/master/poli572b/lecture_10/analysisData.csv")
  • Pct_Lev is the vote share in favor of ‘Leaving’ the EU
  • Region is a character indicating the area within the UK.
  • Observations are voting districts

Exercise:

Using the lm function: (lm(y ~ x, data = data))

  1. Tabulate Region using table to see what values there are.
  2. Regress Pct_Lev on Region
  3. Interpret the coefficients. (summary(your_model))
  4. (Without looking at them) What would the predicted values \(\widehat{Y}\) from this regression be - with respect to the regions?
  5. How should we interpret the residuals \(e\)?
  6. Now, run the following:
regions = brexit$Region %>% unique %>% sort
brexit$Region2 = factor(brexit$Region, levels = regions[c(12, 1:11)])

Repeat the regression. How has the interpretation of coefficients changed?

Least Squares with Controls

Interpreting w/ Controls

So far, we have considered fitting group means.

  • indicator/dummy variables have been for exclusive membership in groups
  • what happens if we control for membership in different groups?

Interpreting w/ Controls

We want to estimate differences in earnings across gender, but we want to control for the sector of employment.

In this data, we only have doctors and lawyers, so we can estimate the following:

\(Earnings_i = b_0 + b_1 \ Female_i + b_2 \ Medicine_i\)

Where \(Female_i\) is \(1\) if person is female, \(0\) if male. \(Medicine_i\) is \(1\) if they are a doctor \(0\) if they are a lawyer.

Interpreting w/ Controls

  1. Linear algebra showed us that \(b_1\) captures the relationship between \(Earnings\) and the part of \(Female_i\) that is orthogonal to \(Medicine_i\) and the intercept
  • What is the residual \(Female_i^*\) in this case? (Think about residuals in models with group dummies above)
  • The residual \(Female_i^*\) is each person’s deviation from the mean level of \(Female\) within their profession (law or medicine)

Interpreting w/ Controls

  2. The slope is \(\frac{Cov(x^*, y)}{Var(x^*)}\): the variance of the residuals might differ across \(Medicine_i\) (a sketch of this “partialling out” follows the list)
  • What would happen to \(Var(Female^*)\) among doctors if ALL doctors were female?
  • No variation left: no observations differ from the group mean of \(1\), so they do not contribute to the slope (zero weight)
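One way to see this concretely is the “partialling out” (Frisch–Waugh–Lovell) recipe, sketched here with the same acs_data model (female_star is a new, illustrative column name):

# Residualize FEMALE on MEDICINE (and the intercept), then regress earnings on the residual
acs_data$female_star = residuals(lm(FEMALE ~ MEDICINE, data = acs_data))
coef(lm(INCEARN ~ female_star, data = acs_data))["female_star"]
coef(lm(INCEARN ~ FEMALE + MEDICINE, data = acs_data))["FEMALE"]   # same coefficient
# female_star is each person's deviation from the share female in their profession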

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and profession:

ls = lm(INCEARN ~ FEMALE + MEDICINE , acs_data)

Example

## 
## Call:
## lm(formula = INCEARN ~ FEMALE + MEDICINE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -226109  -93592  -35873   81891  762891 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   176243       2034   86.66   <2e-16 ***
## FEMALE        -55370       2824  -19.61   <2e-16 ***
## MEDICINE       55866       2670   20.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132700 on 9997 degrees of freedom
## Multiple R-squared:  0.07833,    Adjusted R-squared:  0.07814 
## F-statistic: 424.8 on 2 and 9997 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and hours worked:

We are now controlling not for a binary indicator, but for a continuous variable.

ls = lm(INCEARN ~ FEMALE + UHRSWORK , acs_data)

Interpreting w/ Controls

  1. Linear algebra showed us that \(b_1\) captures the relationship between \(Earnings\) and the part of \(Female_i\) that is orthogonal to \(Hours_i\) and the intercept (a check follows this list)
  • What is the residual/orthogonal \(Female_i^*\) in this case?
  • The residual \(Female_i^*\) has exactly \(0\) covariance/correlation with \(Hours_i\)
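The same partialling-out check works with a continuous control, again assuming acs_data (female_star_h is a new, illustrative column name; the 40-hour split is arbitrary, just for illustration):

acs_data$female_star_h = residuals(lm(FEMALE ~ UHRSWORK, data = acs_data))
cor(acs_data$female_star_h, acs_data$UHRSWORK)   # exactly 0, up to floating-point error
# But orthogonality is not independence: the spread of the residual can still
# differ across values of hours worked, which changes each observation's weight
tapply(acs_data$female_star_h, acs_data$UHRSWORK > 40, var)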

Example

## 
## Call:
## lm(formula = INCEARN ~ FEMALE + UHRSWORK, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -243715  -98889  -38988   90987  796111 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 136821.6     6268.7   21.83   <2e-16 ***
## FEMALE      -53694.3     2886.5  -18.60   <2e-16 ***
## UHRSWORK      1241.4      115.3   10.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134800 on 9997 degrees of freedom
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.04879 
## F-statistic: 257.4 on 2 and 9997 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

Gender and hours worked are orthogonal but NOT independent

Interpreting w/ Controls

  2. The slope is \(\frac{Cov(x^*, y)}{Var(x^*)}\): the variance of the residuals might differ across \(Hours_i\).
  • What happens to \(Var(Female^*)\) at values of hours worked where \(Hours_i\) badly predicts \(Female_i\)?
  • Greater variance! Increased weight on those observations

Interpreting w/ Controls

What could we do if we really wanted to “hold hours worked constant”?

  • Compare earnings by gender within groups where hours are the same (think about our indicator variables from earlier)?

  • We can ensure that gender is exactly unrelated to hours worked by fitting an intercept for each value of hours (see the sketch below).

  • In this case, we have 10,000 observations, so this is doable. We are not always so lucky.
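A sketch of that approach with acs_data: treat hours worked as categorical, so each observed value of hours gets its own intercept (m_fe is an illustrative name):

# One dummy per distinct value of hours worked: gender is compared
# within groups where hours are held exactly constant
m_fe = lm(INCEARN ~ FEMALE + factor(UHRSWORK), data = acs_data)
coef(m_fe)["FEMALE"]
# Feasible here because n is large relative to the number of distinct hours
# values; with many groups and few observations per group, this breaks down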

Non-Linearity with Least Squares

Non-linearity with Least Squares

If the CEF, \(E[Y|X]\), is not linear in \(X\), we can still use least squares to model the relationship.

The easiest choice is to use a polynomial expansion of \(X\).

If a straight-line relationship between \(X\) and \(Y\) is clearly wrong, we can model a “U”-shape by adding a squared term of \(X\):

\(Earnings_i = b_0 + b_1 Female_i + b_2 Hours_i + b_3 Hours_i^2\)

It is “linear” in that we still multiply values of \(X\) by \(\beta\) and sum, but we use non-linear transformations of the data.

Can we do better?

m = lm(INCEARN ~ FEMALE + poly(UHRSWORK, 3, raw = T) , acs_data)
summary(m)$coefficients
##                                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)                 -5.546584e+05 8.532942e+04  -6.500201 8.404065e-11
## FEMALE                      -4.987018e+04 2.870964e+03 -17.370536 1.305089e-66
## poly(UHRSWORK, 3, raw = T)1  3.302792e+04 4.423739e+03   7.466063 8.952876e-14
## poly(UHRSWORK, 3, raw = T)2 -4.500185e+02 7.373786e+01  -6.102950 1.079941e-09
## poly(UHRSWORK, 3, raw = T)3  1.934612e+00 3.946976e-01   4.901505 9.660031e-07

Non-linearity with Least Squares

We can incorporate all kinds of non-linearity:

  • polynomials (up to \(n-1\) degrees)
  • logarithms (natural log very common)
  • exponents
  • square-roots

But there are trade-offs…

Non-linearity with Least Squares

  1. Choice of non-linearity may be arbitrary:
  • unless we have good theoretical motivation, which non-linearity do we choose?
  • how do we decide which non-linear function is “best”?
  • worst case scenario: we can fit to arbitrary precision.

Linear fit

Perfect Polynomial fit…

Perfect Polynomial fit?

Non-linearity with Least Squares

We risk overfitting the data, leading to very bad extrapolations/interpolations.

  1. Choice of non-linearity:
  • In general, polynomials not a great idea
  • natural splines fit cubic polynomials in pieces (better)
  • machine learning tools (GAM, KRLS) fit arbitrary functions
  • But we have to cross-validate to assess out-of-sample predictions

Non-linearity with Least Squares

  2. Non-linearity makes interpreting coefficients difficult.
  • If we include \(x\) and \(x^2\), we can’t directly interpret their coefficients separately
  • Non-linearity means the slope is not constant.
  • Coefficients do not summarize the average slope
  • We need to calculate the average partial derivative/average partial effect (sketched below)

See Aronow and Miller, Chapter 4.3.4

Basically, if you non-linearly transform \(Y\) or \(X\), you need to check how to interpret the coefficients.
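For example, with the cubic model m fit above, a rough sketch of the average partial effect of hours worked: evaluate the derivative of the fitted polynomial in hours at each observation, then average (b, H, and ape are illustrative names):

# Average partial effect of UHRSWORK from the cubic fit
b = coef(m)                # b[3], b[4], b[5] are the linear, squared, cubic terms
H = acs_data$UHRSWORK
ape = mean(b[3] + 2 * b[4] * H + 3 * b[5] * H^2)
ape                        # average slope of earnings in hours worked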

Key Takeaways:

  • If we ‘control’ for binary variables, we can exactly “hold things constant”.
    • this is only possible when \(n \gg g\): many more observations than ‘groups’ we want to hold constant
  • If we ‘control’ for continuous variables, least squares only ensures orthogonality between variable of interest and control.
    • orthogonality \(\neq\) independence
    • this ensures only zero linear correlation.
  • We can split the difference w/ non-linear transformations
    • how to choose? how to avoid overfitting?
    • difficult to interpret coefficients for transformed variables.