\[\hat{y_i} = b_0 + b_1x_i\]
\[b_1 = \frac{Cov(x,y)}{Var(x)}\]
\[b_0 = \overline{y} - \overline{x}\cdot b_1\]
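As a quick check in R (a sketch; x and y here are made-up vectors, not course data):

x = c(18, 25, 33, 41, 52)      # hypothetical predictor values
y = c(21, 30, 38, 44, 55)      # hypothetical outcome values
b1 = cov(x, y) / var(x)        # slope: Cov(x, y) / Var(x)
b0 = mean(y) - mean(x) * b1    # intercept: ybar - xbar * b1
c(b0, b1)
coef(lm(y ~ x))                # lm() returns the same two numbers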
The mean and least squares are both orthogonal projections of \(y\): the mean projects \(y\) onto a column of \(1\)s, and least squares projects \(y\) onto the columns of \(\mathbf{X}\).
Previously we predicted \(Y\) as a linear function of \(x\):
\[\hat{y_i} = b_0 + b_1 \cdot x_i\]
Now, we can imagine predicting \(y\) as a linear function of many variables:
\[\hat{y_i} = b_0 + b_1 x_{1i} + b_2 x_{2i} + \ldots + b_k x_{ki}\]
Given \(\mathbf{y}\), an \(n \times 1\) vector of the values of \(y\) for all \(n\) observations,
and \(\mathbf{X}\), an \(n \times 2\) matrix (\(2\) columns, \(n\) observations) that we call the design matrix: a column of \(\mathbf{1}\)s (for an intercept) and a column \(x\) for our other variable.
\(\mathbf{\hat{y}}\) is an \(n \times 1\) vector of predicted values (for the mean of \(Y\) conditional on \(X\)), computed as \(\mathbf{X\beta}\), where \(\mathbf{\beta}\) is a \(p \times 1\) vector of coefficients that we multiply by \(\mathbf{X}\).
We’ll assume there are only two coefficients in \(\mathbf{\beta}\): \((b_0, b_1)\), so that \(\hat{y_i} = b_0 + b_1 \cdot x_i\) and \(p = 2\).
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\]
\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]
\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]
\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 \cdot b_0 + x_1\cdot b_1 \\ \vdots \\ 1\cdot b_0 + x_n \cdot b_1 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \mathbf{\widehat{y}}\]
\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) gives us the residuals (prediction errors).
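A minimal sketch of these steps in R (x and y are hypothetical; the coefficients come from the usual normal-equations solution):

x = c(1, 2, 3, 4)
y = c(1.1, 2.3, 2.9, 4.2)
X = cbind(1, x)                          # design matrix: a column of 1s and x
beta = solve(t(X) %*% X) %*% t(X) %*% y  # least squares coefficients (b_0, b_1)
y_hat = X %*% beta                       # predicted values, X beta
e = y - y_hat                            # residuals, e = y - yhat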
\[\widehat{y_i} = b_0 + b_1 \cdot x_{1i} + b_2 \cdot x_{2i} \]
\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]
The conditional expectation function (Angrist and Pischke):
expectation: because it is about the mean: \(E[Y]\)
conditional: because it is conditional on values of \(X\) … \(E[Y |X]\)
function: because \(E[Y | X] = f(X)\), there is some mathematical mapping between values of \(X\) and \(E[Y]\).
\[E[Y | X = x] = f(x)\]
Another way of thinking about what the conditional expectation function is:
\[E[Y | X = x] = \hat{y} | X = x\]
What is our predicted value \(\hat{y}\), where \(X = x\), such that our prediction of \(Y\), \(\hat{y}\) has the least error?
With least squares, we find a linear approximation of the CEF
\[\hat{y_i} = b_0 + b_1x_i\]
\[E[Y | X = x_i] = b_0 + b_1x_i\]
We want to understand the CEF of earnings given age, and how well least squares approximates it.
First, copy this code into R:
earnings = read.csv('https://raw.githubusercontent.com/mdweaver/mdweaver.github.io/refs/heads/master/poli572b/lecture_10/example_7.csv')
Data contains:
AGE: the age of survey respondents in years
INCEARN: the annual earnings (on average) for that age group
respondents: the number of survey respondents in this category
In pairs:
Create the design matrix X for this equation \(INCEARN_i = b_0 + b_1 AGE_i\) (hint: cbind)
Using t(), solve(), and %*%, and then lm([your y] ~ [your x], data = [your data]), solve for the coefficients \(\beta\) and compare.
Plot the data by AGE; plot a line of \(\hat{y}\) by AGE: is this a perfect fit of the CEF? (A sketch of one solution follows below.)
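A hedged sketch of one possible solution (object names are arbitrary):

X = cbind(1, earnings$AGE)               # design matrix: intercept column and AGE
y = earnings$INCEARN
beta = solve(t(X) %*% X) %*% t(X) %*% y  # coefficients via t(), solve(), %*%
beta
coef(lm(INCEARN ~ AGE, data = earnings)) # compare with lm()
plot(earnings$AGE, earnings$INCEARN)     # mean earnings by age
abline(beta[1], beta[2])                 # the least squares line: a perfect fit?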
The CEF is not linear
Does the regression line fit all values of \(E[Y|X=x]\) equally well?
When we use individual data…
## [,1]
## [1,] 17501.992
## [2,] 3668.629
Do your estimates agree? Why or why not?
We were not accounting for the number of people in each value of \(X_i\)
# With aggregate data...
X = cbind(1, earnings$AGE)        # design matrix: intercept column and AGE
y = earnings$INCEARN              # outcome: average earnings for each age
w = earnings$respondents          # number of respondents at each age
W = diag(w)                       # diagonal matrix of weights
beta = solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y  # weighted least squares
beta
## [,1]
## [1,] 23404.565
## [2,] 3535.247
Least squares predictions \(\widehat{Y} = \mathbf{X}\beta\) give the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).
Alternatively, least squares is a particular weighted average of the derivatives of the non-linear CEF (board).
Now, in pairs:
Calculate the residuals e from your regression.
Using the cor function (?cor), get the correlation between the residuals e and AGE. (A sketch follows below.)
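A hedged sketch, assuming the unweighted fit of INCEARN on AGE from the earlier exercise:

fit = lm(INCEARN ~ AGE, data = earnings)
e = resid(fit)                 # residuals
cor(e, earnings$AGE)           # zero, up to floating-point error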
The mathematical procedures we use in regression ensure that:
1. The mean of the residuals is always zero (if we include an intercept). Because we included an intercept (\(b_0\)), and the regression line goes through the point of averages, the mean of the residuals is always 0: \(\overline{e} = 0\). This is also true of the residuals of the mean.
Why is the mean of the residuals always zero?
We choose \(\begin{pmatrix}b_0 \\ b_1 \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).
\[\mathbf{1}'e = 0\]
And if this is true, the \(\sum e_i = 0\) so \(\frac{1}{n}\sum e_i = 0\).
The mathematical procedures we use in regression ensure that:
2. \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.
Recall that \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e}\)
We chose \(\beta\) \((b_0, b_1)\) such that \(X'e = 0\), so they would be orthogonal.
\(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\);
And, from above, we know that \(\overline{e}=0\);
so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).
This also means that residuals \(e\) are always perfectly uncorrelated (Pearson correlation) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.
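A quick numerical check of both claims, using the aggregate earnings data and an unweighted fit:

X = cbind(1, earnings$AGE)
y = earnings$INCEARN
b = solve(t(X) %*% X) %*% t(X) %*% y  # unweighted least squares coefficients
e = y - X %*% b                       # residuals
t(X) %*% e                            # both entries are (numerically) zero: 1'e = 0 and x'e = 0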
We want to estimate differences in earnings across gender. Suppose we fit the following equation using least squares:
\(Earnings_i = b_0\)
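As a hedged sketch (assuming the individual-level acs_data used later in these slides), the intercept-only fit just returns the mean of earnings:

coef(lm(INCEARN ~ 1, data = acs_data))  # intercept-only: b_0 is the mean of earnings
mean(acs_data$INCEARN)                  # the same number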
Suppose we want to estimate earnings as a function of gender
\(Earnings_i = b_0 + b_1 \ Female_i\)
assuming, as the survey data does, the gender binary
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201941.70 1656.080 121.93958 0.000000e+00
## FEMALE -57269.02 2883.739 -19.85929 4.132939e-86
What would change if we altered the design matrix so that, instead of a column of \(1\)s and an indicator coded Female (\(1\)) vs. Male (\(0\)), we used a column of \(2\)s and an indicator coded Female (\(2\)) vs. Male (\(0\))?
\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)
\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)
Would the \(\hat{y}\) be different in these three models?
\(Earnings_i = b_0 + b_1 \ Female_i\)
## (Intercept) FEMALE
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 1
## (Intercept) FEMALE
## 201941.70 -57269.02
\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)
## (Intercept) FEMALE MALE
## 1 1 0 1
## 2 1 0 1
## 3 1 0 1
## 4 1 1 0
## 5 1 0 1
## 6 1 1 0
## (Intercept) FEMALE MALE
## 201941.70 -57269.02 NA
\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)
## FEMALE MALE
## 1 0 1
## 2 0 1
## 3 0 1
## 4 1 0
## 5 0 1
## 6 1 0
## FEMALE MALE
## 144672.7 201941.7
Lessons from these examples:
Despite different coefficients, least squares can make identical predictions \(\hat{y}\); what changes is the interpretation of the coefficients.
We need to specify the design matrix/model to get the estimates of interest (e.g., a difference in means).
If we have exclusive indicator variables for belonging to distinct categories (sometimes called “dummy variables”), must drop one group if we fit an intercept. (Why?)
If we fit group means of \(y\): residuals are deviations from group means (centered at 0 within each group). (Why?)
If there is an intercept, we must drop one of the group indicators (R will do this by default).
If there is no intercept, we can include an indicator for every group (see the sketch below).
How did different regions in the UK vote on Brexit?
Load the brexit data:
brexit = read.csv("https://raw.githubusercontent.com/mdweaver/mdweaver.github.io/refs/heads/master/poli572b/lecture_10/analysisData.csv")
Pct_Lev is the vote share in favor of ‘Leaving’ the EU.
Region is a character indicating the area within the UK.
Using the lm function (lm(y ~ x, data = data)):
Examine Region using table to see what values there are.
Regress Pct_Lev on Region and inspect the results with summary(your_model).
Then run:
regions = brexit$Region %>% unique %>% sort
brexit$Region2 = factor(brexit$Region, levels = regions[c(12, 1:11)])
Repeat the regression. How has the interpretation of coefficients changed?
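One possible sketch of these steps (assuming the brexit data and Region2 created above):

table(brexit$Region)                       # which regions appear, and how often
m1 = lm(Pct_Lev ~ Region, data = brexit)
summary(m1)                                # intercept = mean for the baseline (alphabetically first) region
m2 = lm(Pct_Lev ~ Region2, data = brexit)
summary(m2)                                # same fitted values; coefficients now compare to the new baseline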
So far, we have considered fitting group means.
We want to estimate differences in earnings across gender, but we want to control for the sector of employment.
In this data, we only have doctors and lawyers, so we can estimate the following:
\(Earnings_i = b_0 + b_1 \ Female_i + b_2 \ Medicine_i\)
Where \(Female_i\) is \(1\) if the person is female and \(0\) if male; \(Medicine_i\) is \(1\) if they are a doctor and \(0\) if they are a lawyer.
Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and profession:
##
## Call:
## lm(formula = INCEARN ~ FEMALE + MEDICINE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -226109 -93592 -35873 81891 762891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176243 2034 86.66 <2e-16 ***
## FEMALE -55370 2824 -19.61 <2e-16 ***
## MEDICINE 55866 2670 20.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132700 on 9997 degrees of freedom
## Multiple R-squared: 0.07833, Adjusted R-squared: 0.07814
## F-statistic: 424.8 on 2 and 9997 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and hours worked:
Now we are controlling not for a binary indicator but for a continuous variable.
##
## Call:
## lm(formula = INCEARN ~ FEMALE + UHRSWORK, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243715 -98889 -38988 90987 796111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136821.6 6268.7 21.83 <2e-16 ***
## FEMALE -53694.3 2886.5 -18.60 <2e-16 ***
## UHRSWORK 1241.4 115.3 10.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134800 on 9997 degrees of freedom
## Multiple R-squared: 0.04898, Adjusted R-squared: 0.04879
## F-statistic: 257.4 on 2 and 9997 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
In this regression, gender and hours worked are made orthogonal, but they are NOT independent.
What could we do if we really wanted to “hold hours worked constant”?
Compare earnings by gender within groups where hours worked are the same (think about our indicator variables from earlier).
We can ensure that gender is exactly unrelated to hours worked by fitting a separate intercept for each value of hours worked.
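One hedged way to implement this in R, assuming the acs_data and UHRSWORK used in the output above:

m_fe = lm(INCEARN ~ FEMALE + factor(UHRSWORK), data = acs_data)  # one intercept per hours value
coef(m_fe)["FEMALE"]   # gender difference comparing people with the same hours worked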
If the CEF, \(E[Y|X]\), is not linear in \(X\), we can still use least squares to model this relationship.
The easiest choice is to use a polynomial expansion of \(X\).
If a straight-line relationship between \(X\) and \(Y\) is clearly wrong, we can model a “U”-shape by adding a squared term of \(X\):
\(Earnings_i = b_0 + b_1 Female_i + b_2 Hours_i + b_3 Hours_i^2\)
It is “linear” in that we still multiply values of \(X\) by \(\beta\) and sum, but we use non-linear transformations of the data.
Can we do better?
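The output below comes from a cubic polynomial in hours worked; judging by the coefficient names, the call was something like this (a sketch, assuming the same acs_data):

m_cubic = lm(INCEARN ~ FEMALE + poly(UHRSWORK, 3, raw = T), data = acs_data)
summary(m_cubic)$coefficients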
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.546584e+05 8.532942e+04 -6.500201 8.404065e-11
## FEMALE -4.987018e+04 2.870964e+03 -17.370536 1.305089e-66
## poly(UHRSWORK, 3, raw = T)1 3.302792e+04 4.423739e+03 7.466063 8.952876e-14
## poly(UHRSWORK, 3, raw = T)2 -4.500185e+02 7.373786e+01 -6.102950 1.079941e-09
## poly(UHRSWORK, 3, raw = T)3 1.934612e+00 3.946976e-01 4.901505 9.660031e-07
We can incorporate all kinds of non-linearity:
But there are trade-offs…
[Figure: Linear fit]
[Figure: Perfect polynomial fit…]
[Figure: Perfect polynomial fit?]
We risk overfitting the data, leading to very bad extrapolations/interpolations.
See Aronow and Miller, Chapter 4.3.4
Basically, if you non-linearly transform \(Y\) or \(X\), you need to check how to interpret the coefficients.