February 11, 2020

Objectives

(1) Sampling Error

• Sampling Bias vs Random Sampling Error
• Random Sampling

Polling the US 2020 Election

Everyone wanted to know: what fraction of people will vote for Biden? What fraction will vote for Trump?

Sampling

Sometimes we cannot evaluate descriptive claims directly

We would have to observe too many cases.

Example:

“How many Americans plan to vote for Donald Trump?”

We can’t interview all eligible voters

Sampling

Key terms:

population: full set of cases (countries, individuals, etc.) we’re interested in describing

sample: a subset of the population that we observe and measure

inference: description of the (unmeasured) population we make based on a (measured) sample

and there is uncertainty about what is true about the population, because we only measure a sample

Example:

Measuring Presidential vote choice in the US:

The population:

• All registered voters in the US

The sample:

• 1000 registered voters chosen at random (but how many people did they call who didn’t answer?)

The inference:

• 52% of Americans intended to vote for Biden, 42% for Trump, with some uncertainty due to sampling
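To make "some uncertainty due to sampling" concrete, here is a minimal sketch of the usual 95% margin of error for a sample proportion. It assumes a simple random sample and the normal approximation; the function name `margin_of_error` and the multiplier `z = 1.96` are choices made for this illustration, not something the poll itself reports.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

n = 1000  # sample size from the example above
for candidate, share in [("Biden", 0.52), ("Trump", 0.42)]:
    moe = margin_of_error(share, n)
    print(f"{candidate}: {share:.0%} +/- {moe:.1%}")
```

With 1000 respondents, both margins come out at roughly 3 percentage points.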

Sampling

For sampling to give us a good answer, we need to

1. Ensure the sample is representative of the population (does not differ from the population)
2. Know the level of uncertainty associated with our inference

sampling error:

The difference between the value of the measure for the sample and the true value of the measure for the population

$\mathrm{Value}_{sample} - \mathrm{Value}_{population} \neq 0 \xrightarrow{then} \mathrm{sampling \ error}$

sampling error:

If the cases in the sample are not representative of the population, we get:

1. sampling bias: sampling error that is consistently in the same direction. We consistently include too many people of one type and exclude too many people of another type.

Not Sampling Error:

If people who supported Trump incorrectly reported that they would vote for Biden or a third party candidate

• That is measurement error.

sampling error:

2. random sampling error: in choosing people for a sample, we will by chance get samples where the average is too high or too low compared to the population, but these errors would cancel out if we repeated the sampling procedure. This produces the uncertainty of sampling (e.g., the margin of error).

Example

With random samples, sometimes we get a sample with too many Biden supporters, sometimes we get a sample with too many Trump supporters.
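A small simulation can illustrate this. The sketch below assumes a hypothetical population in which exactly 52% support Biden and treats each respondent as an independent draw (that is, a very large population). The names `TRUE_SHARE` and `run_poll` are made up for this example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_SHARE = 0.52   # assumed (hypothetical) population share supporting Biden
SAMPLE_SIZE = 1000
N_POLLS = 500       # how many times we repeat the same sampling procedure

def run_poll(true_share, n):
    """Draw n voters at random; return the sample share supporting Biden."""
    return sum(random.random() < true_share for _ in range(n)) / n

poll_results = [run_poll(TRUE_SHARE, SAMPLE_SIZE) for _ in range(N_POLLS)]
errors = [result - TRUE_SHARE for result in poll_results]

too_high = sum(e > 0 for e in errors)   # polls with too many Biden supporters
too_low = sum(e < 0 for e in errors)    # polls with too many Trump supporters
mean_error = sum(errors) / len(errors)

print(f"polls too high: {too_high}, too low: {too_low}")
print(f"average error across polls: {mean_error:+.4f}")
```

Individual polls miss in both directions, but the average error across repeated polls should sit close to zero.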

Random Sampling

random sampling: sampling cases from the population in a manner that gives all cases an equal probability of being chosen.

This procedure creates samples that:

• on average, give unbiased inferences about the population (regardless of sample size)
• are unbiased in that, across all samples, the sample means on average equal the population mean
• even if the sample mean is sometimes too high and sometimes too low
• have random sampling errors of a known size (this produces known uncertainty)
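The two properties above (unbiased at any sample size, and random error of a known size) can be checked by simulation. This is only a sketch: it assumes a hypothetical population share of 0.52, and it compares the observed spread of sample shares to the textbook standard error formula $\sqrt{p(1-p)/n}$, which holds for simple random samples.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

TRUE_SHARE = 0.52   # hypothetical population share
N_SAMPLES = 400     # number of repeated random samples per sample size

def sample_share(true_share, n):
    """Share of supporters in one random sample of size n."""
    return sum(random.random() < true_share for _ in range(n)) / n

summary = {}
for n in (50, 500, 5000):
    shares = [sample_share(TRUE_SHARE, n) for _ in range(N_SAMPLES)]
    summary[n] = {
        "mean_of_sample_shares": statistics.mean(shares),
        "observed_spread": statistics.stdev(shares),
        "predicted_spread": math.sqrt(TRUE_SHARE * (1 - TRUE_SHARE) / n),
    }

for n, row in summary.items():
    print(
        f"n={n:>4}: mean={row['mean_of_sample_shares']:.3f}, "
        f"observed spread={row['observed_spread']:.4f}, "
        f"predicted spread={row['predicted_spread']:.4f}"
    )
```

The mean of the sample shares should stay near 0.52 for every sample size, while the spread shrinks as the sample grows and tracks the predicted value.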

How much social contact do we have during?

How many people, approximately, do you come into close contact with (<= 2m) each day?

Go here: https://www.menti.com/gdezmsmxkn

Example:

We want to know how many close contacts students in this course have on average…

The “population” is students registered in this class.

Students on Zoom/responding to poll are the sample

Example: Contacts

When we take the average close contacts of the sample (people taking poll in class today)…

and use it as our estimate of the average close contacts of the population (all students registered in this course)…

we are making an inference.

Example: Contacts

Was this sample a random sample of the students in the course? Why or why not?

Can you think of any reasons this sample (students in class on Zoom) would suffer from sampling bias?

Example: Contacts

When samples are not random, they may suffer from sampling bias, and the random errors are of unknown size.
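To see what sampling bias could look like here, the sketch below compares random samples with a self-selected sample. Everything in it is hypothetical: the 200 contact counts are invented, and the response mechanism (students with more contacts are less likely to answer the poll) is just one assumed way a non-random sample could differ from the population.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: daily close contacts for 200 registered students.
population = [random.randint(0, 20) for _ in range(200)]
population_mean = statistics.mean(population)

def random_sample(pop, n):
    """Random sampling: every student has the same chance of being chosen."""
    return random.sample(pop, n)

def self_selected_sample(pop):
    """Assumed mechanism: students with more contacts are less likely to respond."""
    return [x for x in pop if random.random() < 1 - x / 25]

# Repeat each sampling procedure many times and average the sample means.
random_means = [statistics.mean(random_sample(population, 50)) for _ in range(1000)]
biased_means = [statistics.mean(self_selected_sample(population)) for _ in range(1000)]

print(f"population mean:                    {population_mean:.2f}")
print(f"average of random-sample means:     {statistics.mean(random_means):.2f}")
print(f"average of self-selected means:     {statistics.mean(biased_means):.2f}")
```

Under this assumed mechanism the self-selected samples are too low on average, and repeating the procedure does not make the error cancel out.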

Random Sampling

Let’s now imagine that the population is students in class today…

To illustrate random sampling error: We can take random samples from the survey responses you completed.

See here

Blue line = Population Mean; Red line = average of ALL sample means

Sampling Error vs. Measurement Error

Sampling error can sometimes produce measurement error

… but it is not always measurement error.

Sampling Error vs. Measurement Error

Measurement Error:

• Incorrectly describe the world because you incorrectly observe values for the case(s) you study

$\mathrm{Value_{Case \ Truth}} - \mathrm{Value_{Case \ Obs.}} \neq 0 \xrightarrow{then} \mathrm{measurement \ error}$

Sampling Error:

• Incorrectly describe the world because you sample cases that are different from the population you want to learn about

$\mathrm{Value_{Population}} - \mathrm{Value_{Sample}} \neq 0 \xrightarrow{then} \mathrm{sampling \ error}$

Sampling Error vs. Measurement Error

Sampling error produces measurement error when you are making descriptive claims about the population that you sample. (the case we want to measure is the population)

• e.g. Sampling Americans to find out how everyone will vote in an election.

Sampling error does not produce measurement error when you are evaluating claims about the cases that you sample. (the cases we measure are, e.g. the survey respondents)

• e.g. surveying a group of people about their attitudes toward transgender people after randomly exposing half of them to contact with a transgender rights canvasser (Broockman and Kalla 2016)

Sampling Error vs. Measurement Error

Measurement Error can lead to incorrect inferences about a population even if there is no sampling bias

• If we want to find how pervasive negative racial stereotypes are
• If we run a large random sample survey and are able to question everyone we try to contact (no non-response)
• People may still incorrectly report their true beliefs about negative racial stereotypes.

Example: Non-citizen Voting

Using an election survey…

• Start with a large, random sample of adult Americans in 2010 (~55,400 people)
• Select a sample of non-citizens: respondents who indicate they are non-citizens ($N = 489$, or about $1\%$ of people)
• Count who among the “non-citizen sample” voted ($13$)
• Conclude that 3.5% of non-citizens voted in 2010 (~700k), and up to 14.7% in 2008 (~2.8 million people)

So, wait, was Trump right?

Discuss: Do you find this persuasive? Why or why not?

Two Big Problems:

Problem One: Measurement Error

The political scientists who run the survey point out:

• The citizenship question suffers from (low) measurement error.
• Among those surveyed in both 2010 and 2012: $99.7\%$ gave the same answer on citizenship, $0.19\%$ went from “non-citizen” to “citizen” (maybe true), $0.11\%$ went from “citizen” to “non-citizen” (definitely false)
• Measurement error: misclassifies about $0.1\%$ of people.

Two Big Problems:

Problem Two: Sampling Error

Measurement error in classifying individuals as citizens/non-citizens leads to a sample of “non-citizens” that includes both citizens and non-citizens:

• citizens $\gg$ non-citizens $\to$ many more citizens who are misclassified as “non-citizens”
• We have sampling error… the sample does not reflect the population Richman et al. and Trump want to make inferences about.
• It could be that the “non-citizen” voting is driven entirely by voting among misclassified citizens.
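A rough back-of-the-envelope calculation shows why this is plausible. The sketch below uses the approximate figures from these slides (about 55,400 respondents, 489 self-reported non-citizens, 13 reported voters, about 0.1% misclassification). The 40% turnout among citizens is a purely illustrative assumption, not a number from the survey, and applying the 0.1% rate to all other respondents is a simplification.

```python
# Approximate figures taken from the example above.
SURVEY_N = 55_400
SELF_REPORTED_NONCITIZENS = 489
REPORTED_VOTERS = 13
MISCLASSIFICATION_RATE = 0.001   # about 0.1% of people give the wrong answer

# Hypothetical illustration value, NOT from the source.
ASSUMED_CITIZEN_TURNOUT = 0.40

# Simplification: treat everyone else in the survey as a citizen.
approx_citizens = SURVEY_N - SELF_REPORTED_NONCITIZENS

# Citizens we would expect to land in the "non-citizen" sample by mistake.
misclassified_citizens = approx_citizens * MISCLASSIFICATION_RATE

# Voters those misclassified citizens would contribute under the assumed turnout.
expected_voters_from_misclassified = misclassified_citizens * ASSUMED_CITIZEN_TURNOUT

# Turnout at which misclassified citizens alone would account for all 13 voters.
break_even_turnout = REPORTED_VOTERS / misclassified_citizens

print(f"misclassified citizens in 'non-citizen' sample: ~{misclassified_citizens:.0f}")
print(f"expected voters among them (assumed turnout):   ~{expected_voters_from_misclassified:.0f}")
print(f"turnout needed to explain all {REPORTED_VOTERS} voters: {break_even_turnout:.0%}")
```

Under these assumptions, roughly 55 citizens would be misclassified as “non-citizens”, and a turnout of about 24% among them would be enough to produce all 13 reported votes.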