### (1) **Sampling Error**

- **Sampling Bias** vs **Random Sampling Error**
- **Random Sampling**

February 11, 2020

Ahead of the election:

Everyone wanted to know: what fraction of people will vote for Biden? What fraction will vote for Trump?

We would have to observe **too many** cases.

**“How many Americans plan to vote for Donald Trump?”**

We can’t interview **all eligible voters**…

**population**: full set of cases (countries, individuals, etc.) we’re interested in describing

**sample**: a *subset* of the population that we observe and measure

**inference**: description of the (unmeasured) **population** we make *based on a (measured) sample*

and there is **uncertainty** about what is true about the population, because we **only measure a sample**

Measuring Presidential vote choice in the US:

The **population**:

- All registered voters in the US

The **sample**:

- 1000 registered voters chosen **at random** *(but how many people did they call who didn’t answer?)*

The **inference**:

- 52% of Americans intended to vote for Biden, 42% for Trump, with some **uncertainty** due to sampling

For **sampling** to give us a good answer, we need to

- Ensure the sample is **representative** of the population (does not differ from the population)
- Know the level of **uncertainty** associated with our inference

**Sampling error**: the difference between the value of a measure for the sample and the true value of that measure for the population

\[\mathrm{Value}_{sample} - \mathrm{Value}_{population} \neq 0 \xrightarrow{then} \mathrm{sampling \ error}\]

If the cases in the sample are not representative of the population, we get:

\(1\). **sampling bias**: sampling error is consistently in the same direction. Consistently include too many people of some type, exclude too many people of another type.

If people who supported Trump incorrectly reported that they would vote for Biden or a third party candidate

- That is **measurement error**.

\(2\). **random sampling error**: in choosing people for a sample, by chance, we get samples where the average is **too high** or **too low** compared to the population, but these errors would cancel out (if we repeated the sampling procedure). Produces the **uncertainty** of sampling (e.g., margin of error).

With **random** samples, sometimes we get a sample with too many Biden supporters, sometimes we get a sample with too many Trump supporters.
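The contrast between the two kinds of sampling error can be sketched with a small simulation. Everything here is hypothetical: the 52% population share, the sample sizes, and the response rates (Biden supporters answering 60% of the time, others 40%) are assumptions chosen only to illustrate the idea, not figures from any real poll.

```python
import random
import statistics

TRUE_SHARE = 0.52  # hypothetical population share supporting Biden


def draw_sample(rng, size, response_rates=None):
    """Return the Biden share among `size` respondents.

    response_rates maps supporter type (True = Biden supporter) to the
    probability that a contacted person answers; None means everyone answers.
    """
    biden = 0
    respondents = 0
    while respondents < size:
        is_biden = rng.random() < TRUE_SHARE  # contact one voter at random
        if response_rates is not None and rng.random() >= response_rates[is_biden]:
            continue  # contacted, but did not respond
        respondents += 1
        biden += is_biden
    return biden / size


rng = random.Random(1)
n_samples, size = 300, 500

# Random sampling: every contacted voter ends up in the sample.
random_shares = [draw_sample(rng, size) for _ in range(n_samples)]

# Biased sampling (assumed rates): Biden supporters respond more often.
biased_rates = {True: 0.6, False: 0.4}
biased_shares = [draw_sample(rng, size, biased_rates) for _ in range(n_samples)]

random_avg = statistics.mean(random_shares)
biased_avg = statistics.mean(biased_shares)
print(f"truth: {TRUE_SHARE:.3f}")
print(f"random samples: average {random_avg:.3f}, "
      f"range {min(random_shares):.3f} to {max(random_shares):.3f}")
print(f"biased samples: average {biased_avg:.3f}")
```

Individual random samples land above and below the truth, but their average sits close to it. Under the assumed response rates, the biased samples miss in the same direction every time, and repeating the procedure does not make the error cancel out.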

**random sampling**: sampling cases from the population in a manner that gives **all cases** an **equal probability** of being chosen.

This procedure creates **samples** that:

- on average, give **unbiased** inferences about the population (**regardless of sample size**)
  - **unbiased** in that, across all samples, on average the sample means are the same as the population mean
  - even if sometimes the sample mean is too high and sometimes too low
- have **random sampling errors** with a known **size** (produces **known uncertainty**)
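As a rough illustration of what "known size" can mean, the sketch below computes an approximate margin of error for a poll like the one above (52% support, 1000 respondents). It assumes a simple random sample and the usual normal approximation with a 95% level; neither that formula nor the 95% level is stated in these slides, and the function name is made up for the example.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p from a
    simple random sample of size n (normal approximation, z = 1.96
    corresponds to a 95% level)."""
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(0.52, 1000)
print(f"52% +/- {100 * moe:.1f} percentage points")

# Quadrupling the sample size halves the margin of error.
print(f"n = 4000: +/- {100 * margin_of_error(0.52, 4000):.1f} percentage points")
```

With 1000 respondents the margin comes out at roughly 3 percentage points. A larger sample shrinks the uncertainty, while unbiasedness itself does not depend on the sample size.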

How much social contact do we have during?

How many people, approximately, do you come into close contact with (<= 2m) each day?

Go here: https://www.menti.com/gdezmsmxkn

We want to know how many close contacts students in this course have on average…

The “**population**” is students registered in this class.

Students on Zoom/responding to poll are the **sample**

When we take the average close contacts of the **sample** (people taking poll in class today)…

and use it as our estimate of the average close contacts of the **population** (all students registered in this course)…

we are making an **inference**.

Was this **sample** a **random sample** of the students in the course? Why or why not?

Can you think of any reasons this **sample** (students in class on Zoom) would suffer from **sampling bias**?

When samples are **not random** they may suffer from sampling **bias** and the **random errors** are of unknown size

Let’s now imagine that the **population** is students in class today…

To illustrate random sampling error: We can take **random samples** from the survey responses you completed.

Blue line = Population Mean; Red line = average of ALL sample means
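The same idea can be checked exactly on a tiny example. The six contact counts below are invented for illustration (they are not the class poll responses). Because the "population" is so small, we can list every possible sample of three students, each of which is equally likely under random sampling.

```python
from itertools import combinations
from statistics import mean

# Hypothetical close-contact counts for a "population" of six students
population = [0, 2, 3, 5, 8, 12]
population_mean = mean(population)

# Every possible sample of 3 students (20 in total)
sample_means = [mean(sample) for sample in combinations(population, 3)]
average_of_sample_means = mean(sample_means)

print(f"population mean:             {population_mean:.2f}")
print(f"average of all sample means: {average_of_sample_means:.2f}")
print(f"smallest sample mean:        {min(sample_means):.2f}")
print(f"largest sample mean:         {max(sample_means):.2f}")
```

Single samples can be far off (from about 1.7 to about 8.3 here), yet the average across all possible samples matches the population mean of 5. That is what the blue and red lines coinciding would show.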

**sampling error** can **sometimes** produce **measurement error** …

… but it may not **always** be **measurement error**.

**Measurement Error**:

- Incorrectly describe the world because you **incorrectly** observe values for the **case(s)** you study

\[\mathrm{Value_{Case \ Truth}} - \mathrm{Value_{Case \ Obs.}} \neq 0 \xrightarrow{then} \mathrm{measurement \ error}\]

**Sampling Error**:

- Incorrectly describe the world because you sample **cases that are different** from the population you want to learn about

\[\mathrm{Value_{Population}} - \mathrm{Value_{Sample}} \neq 0 \xrightarrow{then} \mathrm{sampling \ error}\]

**Sampling error** produces **measurement error** when you are making descriptive claims **about the population** that you sample. (the case we want to measure

- e.g. Sampling Americans to find out how **everyone will vote** in an election.

**Sampling error** does not produce **measurement error** when you are evaluating claims **about the cases** that you sample. (the cases we measure are, e.g.

- e.g. surveying a group of people about transgender attitudes after randomly exposing half of them to contact with a transgender rights canvasser (Broockman and Kalla 2016)

**Measurement Error** can lead to incorrect inferences about a population **even if there is no sampling bias**

- If we want to find how pervasive negative racial stereotypes are
- If we run a large random sample survey and are able to question everyone we try to contact (no non-response)
- People may still **incorrectly report** their true beliefs about negative racial stereotypes.

In addition to winning the Electoral College in a landslide, I won the popular vote if you deduct the millions of people who voted illegally

— Donald J. Trump (@realDonaldTrump) November 27, 2016

White House senior advisor doubles down on voter fraud claims: “Voter fraud is a serious problem in this country” pic.twitter.com/DC6lVPQznz

— ABC News (@ABC) February 12, 2017

- A large, random sample of adult Americans in 2010 (~**55,400 people**)
- Select a **sample of non-citizens**: respondents who indicate they are non-citizens (\(N = 489\), or about \(1\%\) of people)
- They then count who among the “non-citizen sample” voted (\(13\))
- Conclude that 3.5% of non-citizens voted in 2010 (~700k), up to 14.7% in 2008 (~2.8 million people)

**Discuss: Do you find this persuasive? Why or why not?**

The political scientists who run the survey point out:

- Citizenship question suffers from (low) measurement error.
- Those surveyed in 2010 and 2012: \(99.7\%\) gave the same answer on citizenship, \(0.19\%\) went from “non-citizen” to “citizen” (maybe true), \(0.11\%\) went from **“citizen” to “non-citizen” (definitely false)**
- **measurement error:** misclassifies \(0.1\%\) of people.

**measurement error** in classifying individuals as citizens/non-citizens leads to a sample of “non-citizens” that includes both **citizens** and non-citizens:

- citizens \(\gg\) non-citizens \(\to\) many more **citizens** who are **misclassified** as **“non-citizens”**
- We have **sampling error**… the sample does not reflect the population Richman et al./Trump want to make inferences about.
- It could be that the “non-citizen” voting is driven entirely by voting among misclassified citizens.
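A back-of-the-envelope calculation shows why this is plausible. The sketch uses only the figures quoted above, plus two simplifying assumptions that are ours rather than the survey team's: everyone not recorded as a non-citizen is a citizen, and the 0.1% misclassification rate applies to citizens being recorded as non-citizens.

```python
n_respondents = 55_400       # approximate survey size (from the text)
n_noncitizen_sample = 489    # respondents recorded as non-citizens
n_voters_in_sample = 13      # recorded "non-citizens" who voted
misclass_rate = 0.001        # share of people misclassified (from the text)

# Assumption: all other respondents are citizens, and the 0.1% rate
# describes citizens who end up recorded as non-citizens.
n_citizens = n_respondents - n_noncitizen_sample
expected_misclassified = n_citizens * misclass_rate

# Turnout among misclassified citizens that would account for all 13 voters
turnout_needed = n_voters_in_sample / expected_misclassified

print(f"citizens expected in the 'non-citizen' sample: {expected_misclassified:.0f}")
print(f"turnout among them needed to explain all 13 voters: {turnout_needed:.0%}")
```

Under these assumptions, roughly 55 of the 489 "non-citizens" would actually be citizens, and a turnout of about 24% among them would be enough to produce all 13 observed voters without a single non-citizen voting.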