POLI 572B

Michael Weaver

February 27, 2026

Least Squares Continued

Objectives

  • Recap: Math of Least Squares
  • Least Squares \(\to\) CEF
    • linearity?
    • best linear approximation?
    • properties of residuals?
  • Interpretation of Coefficients
    • design matrix \(\to\) coefficients \(\to\) predicted values
    • dummy variables
    • continuous variables
    • “controlling”
    • non-linearity

Recap

Conditional Expectation Function

the conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean: \(E[Y]\)

conditional: because it is conditional on values of \(X\): \(E[Y | X]\)

function: because \(E[Y | X] = f(X)\), there is some mathematical mapping between values of \(X\) and \(E[Y]\).

\[E[Y | X = x] = f(x)\]

Conditional Expectation Function

Another way of thinking about the conditional expectation function is as a prediction function:

A prediction function takes in values of \(X\) and gives us a prediction of \(Y\): \(\hat{y}\).

  • What makes for a good prediction? Least error.
  • What is our “rule” for judging errors?
  • If the rule is: predictions \(\hat{y}\) have the smallest distance to the true values of \(Y\)
  • this leads to the CEF: \[\hat{y} = E[Y | X = x]\]

Orthogonality

Euclidean Distance between \(y\) and \(\hat{y}\) (true and predicted values) is minimized when prediction errors \(\mathbf{e}\) are orthogonal to predictors \(X\)

  • orthogonal: vectors at 90 degree angle
  • dot product of vectors is \(0\) \(\to\) covariance/correlation is \(0\)

Deriving the mean:

Bivariate Regression

Bivariate regression chooses \(\hat{y}\) as closest possible prediction of \(y\) with the form

\[\mathbf{\hat{y}} = b_0 + b_1\cdot \mathbf{x}\]

An intercept \(b_0\) and coefficient \(b_1\) multiplied by \(\mathbf{x}\)


A conditional expectation function:

\[E(Y | X = x) = b_0 + b_1\cdot x\]

Bivariate Regression

\[\hat{y_i} = b_0 + b_1x_i\]

The slope:

\[b_1 = \frac{Cov(x,y)}{Var(x)}\]

The Intercept:

\[b_0 = \overline{y} - \overline{x}\cdot b_1\]
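These two formulas can be checked directly in R. A minimal sketch on made-up data (not the lecture's earnings data):

```r
# Verify the bivariate least-squares formulas against lm() on toy data
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

b1 <- cov(x, y) / var(x)       # slope: Cov(x, y) / Var(x)
b0 <- mean(y) - mean(x) * b1   # intercept: mean(y) - mean(x) * b1

coef(lm(y ~ x))                # should match c(b0, b1)
```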

To go into multivariate regression, we need linear algebra

Deriving Least Squares

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\]

\(\mathbf{X}\) is called our design matrix: maps to our equation

Deriving Least Squares

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]

\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 \cdot b_0 + x_1\cdot b_1 \\ \vdots \\ 1\cdot b_0 + x_n \cdot b_1 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \mathbf{\widehat{y}}\]

\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) gives us the residuals (prediction errors).

Solving for \(\beta\)

Forcing residuals to be orthogonal to \(X\) leads to…

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \beta\]
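This equation translates directly into R. A minimal sketch on toy data, confirming that the residuals it produces are orthogonal to \(\mathbf{X}\):

```r
# Solve for beta with the normal equations, then check X'e = 0
set.seed(2)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.1)

X <- cbind(1, x)                           # design matrix: column of 1s plus x
beta <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
e <- y - X %*% beta                        # residuals

t(X) %*% e                                 # zero, up to rounding error
```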

Original \(n\)d space

Column space of \(\mathbf{X}\): orthogonal basis

Multivariate Least Squares:

Mean and least squares are orthogonal projection of \(y\) on \(X\).

Previously we predicted \(Y\) as a linear function of \(x\):

\[\hat{y_i} = b_0 + b_1 \cdot x_i\]

Now, we can imagine predicting \(y\) as a linear function of many variables:

\[\hat{y_i} = b_0 + b_1 x_{1i} + b_2 x_{2i} + \ldots + b_k x_{ki}\]

Multivariate Least Squares:

  • When we calculated the mean using matrix algebra, we orthogonally projected the \(n\) dimensional vector \(Y\) onto a point on a one-dimensional line.
  • When we calculated the bivariate regression line, we orthogonally projected the \(n\) dimensional vector \(Y\) onto a \(2\)-dimensional space (one for \(b_0\) and one for \(b_1\))
  • When we use multi-variate regression, we orthogonally project the \(n\) dimensional vector \(Y\) onto a \(p\) dimensional space (one for each coefficient)

Multivariate Least Squares:

What is “orthogonally projecting onto \(p\) dimensions”?

When we project into two dimensions, these dimensions are precisely like the \(x\) and \(y\) axes on a graph: perpendicular/orthogonal to each other.

In multivariate regression, we project \(y\) onto \(p\) orthogonal dimensions in \(\mathbf{X}\). (\((\mathbf{X}'\mathbf{X})^{-1}\) transforms to an orthogonal basis)

  • “under the hood”, regression creates a new version of \(\mathbf{X}\) where each column is orthogonal to the others

Mathematical Requirements:

  1. Matrix \(\mathbf{X}\) has “full rank”
  • This means that all of the columns of \(\mathbf{X}\) are linearly independent.
    • cannot have two identical columns
    • cannot have a set of columns that sum up to another column multiplied by a scalar
  • If \(\mathbf{X}\) is not full rank, \(\mathbf{X}'\mathbf{X}\) cannot be inverted, so we cannot do least squares.
    • but we’ll see more on the intuition for this later.
  2. \(n \geq p\): we need at least as many data points as variables in our equation
  • no longer trivial with multiple regression

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 0.50 & 0.25 & 0.25 \\ 1 & 0.25 & 0.50 & 0.25 \\ 1 & 0.25 & 0.25 & 0.50 \end{pmatrix}\]

Regression and CEF

What is the relationship between least squares and CEF?

Conditional Expectation Function

With least squares, we find a linear approximation of the CEF

\[\hat{y_i} = b_0 + b_1x_i\]

\[E[Y | X = x_i] = b_0 + b_1x_i\]

We want to understand:

  1. What do we mean by “linear”?
  2. What do we mean by “best” linear approximation? (What is the relationship between \(\hat{y}\) and \(E[Y|X=x]\)?)
  3. What are the properties of ‘residuals’ \(\mathbf{e}\)/prediction errors?

First, copy this code into R:

earnings = read.csv('https://raw.githubusercontent.com/mdweaver/mdweaver.github.io/refs/heads/master/poli572b/lecture_10/example_7.csv')

Data contains average earnings by age group: (so it is the CEF of Earnings by Age)

AGE: the age of survey respondents in years

INCEARN: the annual earnings (on average) for that age group

respondents: Number of survey respondents in this category

In pairs:

  1. Create the design matrix X for this equation \(INCEARN_i = b_0 + b_1 AGE_i\) (hint: cbind)
  2. Using the matrix equation, solve for coefficients \(\beta\) in R: (hint: t(), solve(), %*%)
  3. Using lm([your y] ~ [your x], data = [your data]), solve for coefficients \(\beta\) and compare.
  4. Using matrix equation, calculate \(\hat{y}_i\) (hint: multiply \(\mathbf{X}\beta\))
  5. Calculate \(e\), the residuals using \(y\) and \(\hat{y}\)
  6. Plot points for \(Y\) by AGE; plot a line of \(\hat{y}\) by AGE: is this a perfect fit of the CEF?
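If you want to check your work, here is one possible solution sketch. It uses a small made-up stand-in for the earnings data so it runs on its own; swap in the downloaded `earnings` columns for the real exercise.

```r
# Steps 1-6 on a toy stand-in for the earnings CEF data
toy <- data.frame(AGE = 25:34,
                  INCEARN = 20000 + 3000 * (25:34) + rep(c(-500, 500), 5))

X <- cbind(1, toy$AGE)                                # 1. design matrix
beta <- solve(t(X) %*% X) %*% t(X) %*% toy$INCEARN    # 2. (X'X)^{-1} X'y
coef(lm(INCEARN ~ AGE, data = toy))                   # 3. should match beta
y_hat <- X %*% beta                                   # 4. predicted values
e <- toy$INCEARN - y_hat                              # 5. residuals
plot(toy$AGE, toy$INCEARN)                            # 6. data vs. fitted line
lines(toy$AGE, y_hat)
```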

Is the CEF actually linear?

Does the regression line fit all values of \(E[Y|X=x]\) equally well?

Least Squares and CEF, take-away (1)

  • Least squares is “linear” in that we approximate CEF of \(Y\) using a linear combination of \(X\) (sum \(X\) multiplied by coefficients: in the span/column space of \(X\))

  • linear in the sense that we can multiply each element in \(X\) by a scalar (coefficient) and add them up

Graph of Averages

Sometimes CEF is mostly linear. Are some points better approximated than others?

When we use individual data to look at age and earnings…

X = cbind(1, acs_data$AGE)
y = acs_data$INCEARN

beta = solve(t(X) %*% X) %*% t(X) %*% y

beta
##           [,1]
## [1,] 17501.992
## [2,]  3668.629

Do your estimates agree? Why or why not?

We were not accounting for the number of people in each value of \(X_i\)

#With aggregate data...
X = cbind(1, earnings$AGE)
y = earnings$INCEARN
w = earnings$respondents 
w = w * diag(length(w))
beta = solve(t(X) %*% w %*% X) %*% t(X) %*% w %*% y

beta
##           [,1]
## [1,] 23404.565
## [2,]  3535.247
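The weighted matrix solution above is exactly what `lm()` computes with a `weights=` argument. A sketch on toy grouped data:

```r
# Weighted least squares: matrix solution vs. lm(..., weights=)
set.seed(3)
age <- 25:60
inc <- 20000 + 3500 * (age - 25) + rnorm(length(age), sd = 5000)
n_resp <- sample(10:100, length(age), replace = TRUE)  # respondents per age group

X <- cbind(1, age)
W <- diag(n_resp)                                       # diagonal weight matrix
beta_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% inc

coef(lm(inc ~ age, weights = n_resp))                   # should match beta_wls
```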

Least Squares and CEF, take-away (2a)

Even if true CEF is non-linear…

Least Squares predictions \(\widehat{Y} = \mathbf{X}\beta\) give the linear approximation of \(E(Y_i | X_i)\) that has the minimum distance to the true \(E(Y_i | X_i)\) in \(n\)d space.

Because it minimizes distance in \(n\)d space:

  • it makes better predictions for values of \(X = x\) where there are more observations
  • could equivalently fit a line for the conditional mean \(E(Y | X)\) rather than a prediction for each \(Y_i\). We could use \(E(Y_i | X_i)\) as the dependent variable, but we need to account for the number of observations.

Least Squares and CEF, take-away (2b)

Even if true CEF is non-linear…

Alternatively, least squares can be interpreted as a particular weighted average of derivative of non-linear CEF (see Mostly Harmless) (board)

  • weights are greater closer to the median of \(x\)

Now, in pairs:

  1. What is the mean of residuals e?
  2. Using the cor function: get the correlation between residuals e and AGE

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(1\). the mean of the residuals is always zero (if we include an intercept). Because we included an intercept (\(b_0\)), and the regression line goes through the point of averages, the mean of the residuals is always 0. \(\overline{e} = 0\). This is also true of residuals of the mean.

Why?

the mean of the residuals is always zero.

We choose \(\begin{pmatrix}b_0 \\ b_1 \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).

\[\mathbf{1}'e = 0\]

And if this is true, the \(\sum e_i = 0\) so \(\frac{1}{n}\sum e_i = 0\).

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

Recall that \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e}\)

We chose \(\beta\) (\(b_0, b_1\)) such that \(X'e = 0\) so they would be orthogonal.

\(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\);

And, from above, we know that \(\overline{e}=0\);

so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).


This also means that residuals \(e\) are always perfectly uncorrelated (Pearson correlation) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.
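Both facts are easy to confirm numerically. A sketch on toy data:

```r
# Residuals have mean zero and zero covariance with every regressor
set.seed(4)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + x1 - 2 * x2 + rnorm(100)

m <- lm(y ~ x1 + x2)
e <- resid(m)

mean(e)       # ~ 0 (intercept included)
cov(e, x1)    # ~ 0
cov(e, x2)    # ~ 0
```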

Interpreting Regressions

We want to understand:

  • How to generate predicted values: \(\hat{y}\)
  • The link between design matrix, \(\hat{y}\), and coefficients
  • How to interpret coefficients for dummy variables
  • How to interpret coefficients for continuous variables

Regression and Predicted Values

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]

\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 \cdot b_0 + x_1\cdot b_1 \\ \vdots \\ 1\cdot b_0 + x_n \cdot b_1 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \mathbf{\widehat{y}}\]

Least Squares and Binary Variables

Suppose we fit the following equation using least squares:

\(Earnings_i = b_0\)

  • What is the design matrix?
  • How do we calculate \(\hat{y}\) for any case \(i\)?
  • What is the coefficient \(b_0\) (what is it equivalent to?)
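A quick sketch (toy data) of what the intercept-only model delivers:

```r
# An intercept-only regression just recovers the mean of y
set.seed(5)
y <- rnorm(20, mean = 50)
coef(lm(y ~ 1))   # equals mean(y)
```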

Least Squares and Group Means

Suppose we want to estimate earnings as a function of gender

\(Earnings_i = b_0 + b_1 \ Female_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. How do we calculate \(\hat{y}\) when \(Female_i = 0\)?
  3. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\)?
  4. How do we calculate \(\hat{y}\) when \(Female_i = 1\)?
  5. What is the interpretation of \(\hat{y}\) when \(Female_i = 1\)?
  6. What does the \(b_0\) tell us?
  7. What does the \(b_1\) tell us?
m1 = lm(INCEARN ~ FEMALE, acs_data)
summary(m1) %>% .$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 201941.70   1656.080 121.93958 0.000000e+00
## FEMALE      -57269.02   2883.739 -19.85929 4.132939e-86

What would change if we changed the design matrix so that instead of a vector of \(1\)s and a vector for Female (\(1\)) vs Male (\(0\)), we used a vector of \(2\)s and a vector indicating Female (\(2\)) vs Male (\(0\))?

  1. How do we calculate \(\hat{y}\) when \(Female_i = 0\)?
  2. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\)?
  3. How do we calculate \(\hat{y}\) when \(Female_i = 2\)?
  4. What is the interpretation of \(\hat{y}\) when \(Female_i = 2\)?
  5. What does the \(b_0\) tell us?
  6. What does the \(b_1\) tell us?

Least Squares and Group Means

\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. How do we calculate \(\hat{y}\) when \(Female_i = 0\) and \(Male_i = 1\)?
  3. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\) and \(Male_i = 1\)?
  4. How do we calculate \(\hat{y}\) when \(Female_i = 1\) and \(Male_i = 0\)?
  5. What is the interpretation of \(\hat{y}\) when \(Female_i = 1\) and \(Male_i = 0\)?
  6. What does the \(b_0\) tell us?
  7. What does the \(b_1\) tell us?
  8. What does the \(b_2\) tell us?
m2 = lm(INCEARN ~ FEMALE + MALE, acs_data)

Least Squares and Group Means

\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. How do we calculate \(\hat{y}\) when \(Female_i = 0\) and \(Male_i = 1\)?
  3. What is the interpretation of \(\hat{y}\) when \(Female_i = 0\) and \(Male_i = 1\)?
  4. How do we calculate \(\hat{y}\) when \(Female_i = 1\) and \(Male_i = 0\)?
  5. What is the interpretation of \(\hat{y}\) when \(Female_i = 1\) and \(Male_i = 0\)?
  6. What does the \(b_0\) tell us?
  7. What does the \(b_1\) tell us?
m3 = lm(INCEARN ~ -1 + FEMALE + MALE, acs_data)

Least Squares and Group Means

Would the \(\hat{y}\) be different in these three models?

\(Earnings_i = b_0 + b_1 \ Female_i\)

#Design matrix
model.matrix(m1) %>% head
##   (Intercept) FEMALE
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      0
## 6           1      1
#Interpret:
coefficients(m1)
## (Intercept)      FEMALE 
##   201941.70   -57269.02

\(Earnings_i = b_0 + b_1 \ Female_i\), multiplying design matrix by \(2\)

#Design matrix
mm_m1_2 %>% head
##   (Intercept) FEMALE
## 1           2      0
## 2           2      0
## 3           2      0
## 4           2      2
## 5           2      0
## 6           2      2
#Interpret:
coefficients(m1.2)
## mm_m1_2(Intercept)      mm_m1_2FEMALE 
##          100970.85          -28634.51

\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)

#Design matrix
model.matrix(m2) %>% head
##   (Intercept) FEMALE MALE
## 1           1      0    1
## 2           1      0    1
## 3           1      0    1
## 4           1      1    0
## 5           1      0    1
## 6           1      1    0
#Interpret:
coefficients(m2)
## (Intercept)      FEMALE        MALE 
##   201941.70   -57269.02          NA

\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)

#Design matrix
model.matrix(m3) %>% head
##   FEMALE MALE
## 1      0    1
## 2      0    1
## 3      0    1
## 4      1    0
## 5      0    1
## 6      1    0
#Interpret:
coefficients(m3)
##   FEMALE     MALE 
## 144672.7 201941.7

Least Squares and Group Means

Lessons from these examples:

  1. Despite different coefficients, least squares can make identical predictions \(\hat{y}\); take care in interpreting coefficients

  2. We need to specify design matrix/model to get estimates of interest. (e.g. difference in means, vs. group averages)

  3. If we have exclusive indicator variables for belonging to distinct categories (sometimes called “dummy variables”), must drop one group if we fit an intercept. (Why?)

  4. If we fit group means of \(y\): residuals are deviations from group means (centered at 0 within each group). (Why?)
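Lesson 1 can be verified directly. A sketch with made-up earnings data (not the `acs_data` used above): two parameterizations, different coefficients, identical predictions.

```r
# Intercept + dummy vs. two group means without an intercept
set.seed(6)
female <- rbinom(200, 1, 0.5)
inc <- 200000 - 55000 * female + rnorm(200, sd = 10000)

m_a <- lm(inc ~ female)                        # intercept + Female dummy
m_b <- lm(inc ~ -1 + female + I(1 - female))   # group means, no intercept

coef(m_a)   # intercept = male mean; slope = difference in means
coef(m_b)   # female mean and male mean directly
all.equal(unname(fitted(m_a)), unname(fitted(m_b)))   # TRUE
```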

Dummy Variables

Binary variables (0/1) for membership in exclusive categories. (e.g. province of residence)

If there is an intercept

  • one category must be excluded (“reference group”); otherwise linear dependence. (R will do this by default)
  • coefficient is difference in means between the group indicated by the dummy and the intercept (“reference group” mean)

If there is no intercept:

  • all categories must be included
  • separate intercepts for each group: coefficient is the mean of the group indicated by the dummy (when all other variables are \(0\)).

Exercise:

How did different regions in the UK vote on Brexit?

  • Download the brexit data:
brexit = read.csv("https://raw.githubusercontent.com/mdweaver/mdweaver.github.io/refs/heads/master/poli572b/lecture_10/analysisData.csv")
  • Pct_Lev is the vote share in favor of ‘Leaving’ the EU
  • Region is a character indicating the area within the UK.
  • Observations are voting districts

Exercise:

Using the lm function: (lm(y ~ x, data = data))

  1. Tabulate Region using table to see what values there are.
  2. Regress Pct_Lev on Region
  3. (Look at the results: summary(your_model)) Write the design matrix.
  4. (Without looking at them) What would the predicted values \(\widehat{Y}\) from this regression be - with respect to the regions?
  5. Interpret the coefficients.
  6. How should we interpret the residuals \(e\)?
  7. Now, run the following:
regions = brexit$Region %>% unique %>% sort
brexit$Region2 = factor(brexit$Region, levels = regions[c(12, 1:11)])

Repeat the regression. How has the interpretation of coefficients changed?

Least Squares and Continuous Variables

Return to the earnings data…

m = lm(INCEARN ~ AGE, data = earnings, weights = earnings$respondents)
summary(m)$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 23404.565 21609.7018 1.083058 2.856073e-01
## AGE          3535.247   464.7265 7.607156 3.766635e-09
  • How do we generate predicted values of earnings?
  • How do we interpret (Intercept)?
  • does this make sense?
  • How do we interpret AGE?

Least Squares and Continuous Variables

In pairs:

  1. Create a new variable in earnings: AGE_c which is AGE minus mean of AGE
  2. Regress INCEARN on AGE_c
  3. Examine the coefficients:
  • How do we interpret (Intercept)?
  • How do we interpret AGE?
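The point of the exercise can be sketched on toy data: centering a regressor changes only the intercept, which becomes the predicted value at the mean of \(x\) (i.e. \(\overline{y}\)), while the slope is unchanged.

```r
# Centering age: same slope, intercept becomes mean(inc)
set.seed(7)
age <- sample(25:60, 100, replace = TRUE)
inc <- 20000 + 3500 * age + rnorm(100, sd = 10000)

m_raw <- lm(inc ~ age)
m_ctr <- lm(inc ~ I(age - mean(age)))

coef(m_raw)
coef(m_ctr)   # same slope; intercept now equals mean(inc)
```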

Least Squares with Controls

Interpreting w/ Controls

So far, we have considered fitting group means.

  • indicator/dummy variables have been for exclusive membership in groups
  • what happens if we control for membership in different groups?

Interpreting w/ Controls

We want to estimate differences in earnings across gender, but we want to control for the sector of employment.

In this data, we only have doctors and lawyers, so we can estimate the following:

\(Earnings_i = b_0 + b_1 \ Female_i + b_2 \ Medicine_i\)

Where \(Female_i\) is \(1\) if person is female, \(0\) if male. \(Medicine_i\) is \(1\) if they are a doctor \(0\) if they are a lawyer.

Interpreting w/ Controls

  1. Geometric approach showed us \(b_1\) tells us relationship between \(Female_i^*\) — component of \(Female_i\) orthogonal to \(Medicine_i\) and intercept — and \(Earnings\)
  • What will be the value of residual \(Female_i^*\) in this case? (Think about residuals in models with group dummies above)
  • Residual \(Female_i^*\) will be deviation from mean value of \(Female\) in law and medicine

Interpreting w/ Controls

  1. Slope is \(\frac{Cov(x^*, y)}{Var(x^*)}\): Variance of residuals might differ across \(Medicine_i\).
  • What would happen to \(Var(Female^*)\) among doctors if ALL doctors were female?
  • What would be the mean of \(Female\) among doctors?
  • What would \(Female^*\) be?
  • No variation left - no observations differ from the mean of \(1\), do not contribute to the slope (zero weight)

Example

Let’s now find the linear conditional expectation function of earnings, across gender and profession:

ls = lm(INCEARN ~ FEMALE + MEDICINE , acs_data)

Example

## 
## Call:
## lm(formula = INCEARN ~ FEMALE + MEDICINE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -226109  -93592  -35873   81891  762891 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   176243       2034   86.66   <2e-16 ***
## FEMALE        -55370       2824  -19.61   <2e-16 ***
## MEDICINE       55866       2670   20.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132700 on 9997 degrees of freedom
## Multiple R-squared:  0.07833,    Adjusted R-squared:  0.07814 
## F-statistic: 424.8 on 2 and 9997 DF,  p-value: < 2.2e-16

How do we make sense of the slope on FEMALE?

Example


Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and hours worked:

We’re not controlling for a binary indicator, but a continuous variable.

ls = lm(INCEARN ~ FEMALE + UHRSWORK , acs_data)

Interpreting w/ Controls

  1. Linear algebra showed us \(b_1\) tells us the relationship between \(Earnings\) and \(Female_i^*\), the component of \(Female_i\) orthogonal to \(Hours_i\) and the intercept
  • What is the residual/orthogonal \(Female_i^*\) in this case?
  • Residual \(Female_i^*\) will have perfectly \(0\) covariance/correlation with \(Hours_i\)
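This partialling-out logic (the Frisch–Waugh result) can be sketched on toy data: the coefficient on \(Female\) in the full model equals the coefficient from regressing earnings on the residualized \(Female^*\).

```r
# Partialling out: full-model coefficient equals coefficient on residualized x
set.seed(8)
hours <- rnorm(300, mean = 40, sd = 10)
female <- rbinom(300, 1, plogis(-(hours - 40) / 10))   # correlated with hours
inc <- 150000 - 50000 * female + 1200 * hours + rnorm(300, sd = 20000)

m_full <- lm(inc ~ female + hours)
f_star <- resid(lm(female ~ hours))   # component of female orthogonal to hours
m_part <- lm(inc ~ f_star)

c(full = unname(coef(m_full)["female"]),
  partial = unname(coef(m_part)["f_star"]))   # identical slopes
cov(f_star, hours)                            # ~ 0 by construction
```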

Exercise

Using the brexit data:

  1. Using lm, regress Pct_Lev on total_precip_polling and Region
  2. Using lm, regress total_precip_polling on Region and get residuals (m$residuals) as precip_star
  3. Regress precip_star on Region.

Example

## 
## Call:
## lm(formula = INCEARN ~ FEMALE + UHRSWORK, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -243715  -98889  -38988   90987  796111 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 136821.6     6268.7   21.83   <2e-16 ***
## FEMALE      -53694.3     2886.5  -18.60   <2e-16 ***
## UHRSWORK      1241.4      115.3   10.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134800 on 9997 degrees of freedom
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.04879 
## F-statistic: 257.4 on 2 and 9997 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

Gender and hours worked orthogonal but NOT independent

Interpreting w/ Controls

  1. Slope is \(\frac{Cov(x^*, y)}{Var(x^*)}\): Variance of residuals might differ across \(Hours_i\).
  • What happens to \(Var(Female^*)\) for values of hours worked where \(Hours_i\) badly predicts \(Female_i\)?
  • Greater variance! Increased weight on those observations

Example

Interpreting w/ Controls

What could we do if we really wanted to “hold hours worked constant”?

  • Compare earnings by gender within groups where hours are the same (think about our indicator variables from earlier)?
  • We can ensure that gender is exactly unrelated to hours worked by fitting an intercept for each distinct value of hours.
  • In this case, we have 10000 observations, so this is doable. Not always so lucky

Non-Linearity with Least Squares

Non-linearity with Least Squares

If the CEF, \(E[Y|X]\), is not linear in \(X\), we can still use least squares to model this relationship.

The easiest choice is to use a polynomial expansion of \(X\).

If a straight-line relationship between \(X\) and \(Y\) is clearly wrong, we can model a “U”-shape by adding a squared term of \(X\):

\(Earnings_i = b_0 + b_1 Female + b_2 Hours + b_3 Hours^2\)

It is “linear” in that we still multiply values of \(X\) by \(\beta\) and sum, but we use non-linear transformations of the data.

Can we do better?

m = lm(INCEARN ~ FEMALE + poly(UHRSWORK, 3, raw = T) , acs_data)
summary(m)$coefficients
##                                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)                 -5.546584e+05 8.532942e+04  -6.500201 8.404065e-11
## FEMALE                      -4.987018e+04 2.870964e+03 -17.370536 1.305089e-66
## poly(UHRSWORK, 3, raw = T)1  3.302792e+04 4.423739e+03   7.466063 8.952876e-14
## poly(UHRSWORK, 3, raw = T)2 -4.500185e+02 7.373786e+01  -6.102950 1.079941e-09
## poly(UHRSWORK, 3, raw = T)3  1.934612e+00 3.946976e-01   4.901505 9.660031e-07

Non-linearity with Least Squares

We can incorporate all kinds of non-linearity:

  • polynomials (up to \(n-1\) degrees)
  • logarithms (natural log very common)
  • exponents
  • square-roots
  • sine/cosine

But there are trade-offs…

Non-linearity with Least Squares

  1. Choice of non-linearity may be arbitrary:
  • unless we have good theoretical motivation, which non-linearity do we choose?
  • how do we decide which non-linear function is “best”?
  • worst case scenario: we can fit to arbitrary precision.

Linear fit

These prediction errors are intolerable!

Perfect Polynomial fit…

Perfect Polynomial fit?

Non-linearity with Least Squares

We risk overfitting the data, leading to very bad extrapolations/interpolations.

  1. Choice of non-linearity:
  • In general, polynomials not a great idea
  • natural splines fit cubic polynomials in pieces (better)
  • machine learning tools (GAM, KRLS) fit arbitrary functions
  • But we have to cross-validate to assess out-of-sample predictions
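A sketch of the spline alternative on made-up data, using `splines::ns` (which ships with R):

```r
# Natural spline fit vs. a straight line on a clearly non-linear CEF
library(splines)

set.seed(10)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)

m_line <- lm(y ~ x)              # straight line: badly misses the curve
m_ns <- lm(y ~ ns(x, df = 5))    # piecewise cubics, linear beyond boundary knots

summary(m_line)$r.squared
summary(m_ns)$r.squared          # much higher in-sample fit
```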

Non-linearity with Least Squares

  2. Non-linearity makes interpreting coefficients difficult.
  • If we include \(x\) and \(x^2\), we can’t directly interpret these
  • Non linearity means slope is not constant.
  • Coefficients do not summarize the average slope
  • We need to calculate average partial derivative/average partial effect

See Aronow and Miller, Chapter 4.3.4

Basically, if you non-linearly transform \(Y\) or \(X\), you need to check how to interpret.
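One way to summarize a non-linear fit is the average partial effect: average the derivative of the fitted function over the sample. A sketch with a squared term on toy data:

```r
# Average partial effect for a model with x and x^2:
# derivative is b1 + 2*b2*x; average it over the observed x
set.seed(9)
x <- runif(200, 20, 60)
y <- -100 + 30 * x - 0.3 * x^2 + rnorm(200, sd = 5)

m <- lm(y ~ x + I(x^2))
b <- coef(m)
ape <- mean(b["x"] + 2 * b["I(x^2)"] * x)
ape   # close to the true average slope, 30 - 0.6 * mean(x)
```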

Key Takeaways:

  • If we ‘control’ for binary variables, we can exactly “hold things constant”.
    • this is only possible when \(n >> g\): many more observations than ‘groups’ we want to hold constant
  • If we ‘control’ for continuous variables, least squares only ensures orthogonality between variable of interest and control.
    • orthogonality \(\neq\) independence
    • this ensures only zero linear correlation.
  • We can split the difference w/ non-linear transformations
    • how to choose? how to avoid overfitting?
    • difficult to interpret coefficients for transformed variables.