Research Questions lead to a search for data …
Historical newspaper content
How and why did public discourse about lynching change over time?
- Need a way of measuring discourse:
- Full text of historical newspapers.
Historical newspaper content
Article on Sam Hose
Historical newspaper content
But there are nearly 1 million hits for “lynching” from 1880 to 1920.
There are 600 million newspaper pages on this site alone.
How can we access all of this text?
Voting for Black Suffrage
How did white Northerners come to support black suffrage after the US Civil War?
- Look at military service and voting for black suffrage in state referenda, like those held in Minnesota.
Voting for Black Suffrage
Historians have collected data on county votes for and against black suffrage.
But how do we get county-level estimates of military service?
This random dude’s website.
There are lots of results (more than 20 thousand men); we could copy and paste…?
Civil War and Partisanship
What are the effects of wartime experience during the American Civil War on post-war political beliefs?
- We need individual data on partisanship in the 19th century… Good luck finding that!!!
People’s Guides from 1874 Indiana

Civil War and Partisanship
There are 70,000+ names across a few thousand pages.
How do we access this data? Hire undergrads or data entry workers?
Accessing data
What do these examples have in common?
- Useful data is available in digital format (online databases, online search tools, scans of text/tables)
- It is impossible/prohibitively expensive to collect this data manually
- Data can be obtained (relatively) easily using a programming language to automate the process.
Options for access:
- “Scraping”
- Ask for permission
- Digitize it yourself
We'll discuss each in turn, then talk about lessons learned
Scraping
“Scraping” refers to using some programming tools to automate the process of
- accessing some website/service that is hosted on the internet (wide variety of possibilities)
- extracting the data we desire
- and processing that data to make it useful
There are lots of online resources; for a primer I've written, see here
Scraping: Process
- Find the data/tool you want
- Figure out how to use a programming language to access that data.
- Use programming to structure the data for your needs
- Scale up: automate the rest of the data collection
Ways of Scraping
Usually, I scrape in one of three ways:
- API: a request that directly interacts with some database or tool and returns data in a friendly format (usually JSON). (59%)
- E.g., passing an image to Google Cloud Vision; geolocating.
- HTML: a request for an HTML page that contains some data. The data is unfriendly to use because it is in a markup language meant for visual display. (40%)
- Webdriver: websites wanting to protect proprietary data have more layers of protection; you may need to code a “robot” that drives a real browser. (1%)
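A minimal sketch of the first two approaches, assuming the requests and beautifulsoup4 libraries; the URLs, query parameters, and CSS selector are placeholders, not any real archive's API:

```python
import requests
from bs4 import BeautifulSoup

# API-style: the endpoint returns structured JSON directly.
api_resp = requests.get(
    "https://api.example-archive.org/search",   # placeholder endpoint
    params={"q": "lynching", "year": 1899},
    timeout=30,
)
records = api_resp.json()  # already-structured data

# HTML-style: the data is buried in markup meant for visual display,
# so we have to parse it back out ourselves.
html_resp = requests.get("https://example-archive.org/results?q=lynching",
                         timeout=30)
soup = BeautifulSoup(html_resp.text, "html.parser")
cells = [td.get_text(strip=True) for td in soup.select("table.results td")]
```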
Scraping: Examples
Download online database: American Civil War Database
- relational database of soldiers, units, battles, etc.
- With subscription, entire database is easily browsable online
Scraping: Examples
Download online database: full count 1860 US Census (Fold3)
- Subscription genealogy site.
- Access to census images is behind a paywall.
- But page metadata (including all transcribed personal data) is exposed behind the scenes
- Data is hidden in background requests (see the sketch below)
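What “hidden in background requests” looks like in practice, as a hedged sketch: in the browser's developer tools (Network tab), you can find the background request the image viewer makes for page metadata as JSON and replay it directly. The endpoint, cookie name, and field names below are hypothetical:

```python
import requests

session = requests.Session()
# A logged-in session cookie copied from the browser (name is hypothetical).
session.cookies.set("sessionid", "PASTE_COOKIE_HERE")

# Replay the background request the image viewer makes behind the scenes.
meta = session.get(
    "https://www.example-genealogy.com/page-metadata",  # placeholder URL
    params={"page_id": 12345},
    timeout=30,
).json()

# The transcribed personal data rides along with the image metadata.
for person in meta.get("persons", []):
    print(person.get("name"), person.get("age"), person.get("birthplace"))
```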
Scraping: Examples
Searching a database: Historical Newspaper Archives for lynching discourse
- First pass: submitted search queries to find which newspapers matched on which dates
- Second pass: which newspapers are NOT talking about lynching? Download an index of what newspapers were available on each date.
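A sketch of the two-pass loop against a hypothetical archive API; the point is that the second pass builds the “denominator” of newspapers that could have matched on each date:

```python
import requests
from datetime import date, timedelta

BASE = "https://newspaper-archive.example.org"  # placeholder archive

day, hits, available = date(1880, 1, 1), [], []
while day <= date(1920, 12, 31):
    # Pass 1: which newspapers matched the keyword on this date?
    matches = requests.get(f"{BASE}/search",
                           params={"q": "lynching", "date": day.isoformat()},
                           timeout=30).json()
    hits += [(day, m["paper_id"]) for m in matches["results"]]

    # Pass 2: which newspapers exist on this date at all (the denominator)?
    index = requests.get(f"{BASE}/index",
                         params={"date": day.isoformat()},
                         timeout=30).json()
    available += [(day, p["paper_id"]) for p in index["papers"]]
    day += timedelta(days=1)
```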
Scraping: Lessons Learned
- Need to understand how internet works (requests, responses)
- Learn a programming language to automate sending requests, processing responses
- Think about how to save data (relational databases are best; see the sketch after this list)
- Be a sleuth (find where data is exposed)
- People protect proprietary data; you need to be careful
- Need a “denominator”: index of all items being searched
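On the storage point, a minimal sketch using Python's built-in sqlite3; the schema is illustrative only. Storing non-matches alongside matches keeps the “denominator” in the same relational structure as the results:

```python
import sqlite3

con = sqlite3.connect("archive.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS newspapers (
    id    INTEGER PRIMARY KEY,
    name  TEXT,
    state TEXT
);
CREATE TABLE IF NOT EXISTS hits (
    newspaper_id INTEGER REFERENCES newspapers(id),
    date         TEXT,
    matched      INTEGER  -- 1 if the keyword appeared, 0 otherwise
);
""")
con.execute("INSERT INTO newspapers VALUES (?, ?, ?)",
            (1, "Example Gazette", "GA"))   # placeholder paper
con.execute("INSERT INTO hits VALUES (?, ?, ?)", (1, "1899-04-23", 1))
con.commit()
```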
Asking for Access
When data is proprietary and well-protected, scraping may be impossible or get your IP address blocked.
Sometimes you can request access to the full database in an offline format:
- CWDB (Yale)
- ProQuest Newspapers (Yale)
- HathiTrust (full text of lots and lots of books)
Asking for Access
How do you request access from a faceless corporation?
Lessons Learned:
- Librarians are your friend
- Librarians can request subscriptions to archives, negotiate full data dumps to be held offline
- Better positioned to sort out legal issues over copyright
- If an archive is run through a non-profit, contact them directly (Indiana 1862 Draft)
Digitizing Archives Yourself
Sometimes you find digital scans of books/documents with usable text or tables. But you have images, not data.
If the number of pages is large, automation may be a good option (the upfront costs are worthwhile)
Digitizing Archives: Examples
Historical railroad/telegraph stations:
- scanned thousands of pages of tables listing stations
Digitizing Archives: Examples
Historical railroad/telegraph stations:
- tried to use software like ABBYY FineReader
- Ultimately, there were too many errors; human transcription was needed
Digitizing Archives: Examples
Indiana People’s Guides list partisanship of 30,000 people in 1874

Digitizing Archives: Examples
Indiana People’s Guides
- Thousands of pages, lots of names
- Google Vision reads very well
- Python script: send images to Google, assemble letters into words, words into lines, lines into biographical entries.
- Regular expressions to extract data (see the sketch below).
- ~1 week of work to learn this the first time
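A hedged sketch of such a pipeline using the google-cloud-vision client (requires credentials); the filename, the entry format, and the regular expression are hypothetical stand-ins for the real ones:

```python
import re
from google.cloud import vision

# OCR one scanned page with Google Cloud Vision.
client = vision.ImageAnnotatorClient()
with open("peoples_guide_p001.jpg", "rb") as f:   # placeholder filename
    image = vision.Image(content=f.read())
response = client.document_text_detection(image=image)
text = response.full_text_annotation.text  # Vision's assembled lines

# Hypothetical pattern for entries like "Smith, John, farmer, Republican"
entry = re.compile(r"(?P<name>[A-Z][a-z]+, [A-Z][a-z]+), "
                   r"(?P<occupation>[a-z ]+), (?P<party>Republican|Democrat)")
for m in entry.finditer(text):
    print(m.group("name"), m.group("occupation"), m.group("party"))
```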
How to use large digital archives?
Trade-off between “breadth” and “depth”
- cannot actually read all of the content carefully
- have to decide how to synoptically “read” the content
- options that more easily “read” more data may be too simplistic (limit depth)
- options that more carefully “read” the archive may be too costly (limit breadth)
Depth/Breadth Trade-Off
When deciding, think about:
- measurement error:
- is the archive providing the DV, the IV, or a control?
- is measurement error likely to be random or not?
- your research design:
- do you need variation across all/most of the archive?
- or can you use a narrow subset, or even random samples?
- what does the archive actually contain?
- you need to manually inspect the archives
Manual Inspection
Before choosing how to use the archive, you need to see what it contains.
Manual Inspection: Lynching discourse
- Searched for all articles on 30+ specific lynching events
- specific event keywords
- read hundreds/thousands of articles
- Read historiography of lynching newspaper coverage
- generated lists of keywords
- added keywords derived from searching for specific lynchings
Manual Inspection: Lynching discourse
Lessons learned:
- huge variation in how much space is given to lynching coverage: informed analysis decisions
- attempting to code manually was too slow/costly; showed breadth > depth for my research design
- with new projects, randomly sample across time
Choices to make
Text archives
If archives are documents to be “read” and classified, you can use these methods, ordered from least to most depth, most to least breadth, and most to least measurement error:
- Keyword dictionaries
- Machine learning classifiers
- Hand Coding
Text Archives: Keyword Dictionary
What is it?
- lists of keywords that correspond to specific meanings/content in the text
- count up the number of keywords present
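A minimal sketch of dictionary counting; the keyword lists here are toy examples, not the actual dictionaries used in the project:

```python
import re

# Toy keyword lists; real dictionaries were built from historiography
# and from reading articles on specific lynching events.
ANTI = ["outrage", "barbarism", "mob violence", "lawlessness"]
PRO = ["justified", "deserved", "swift justice"]

def count_keywords(text, keywords):
    text = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(k) + r"\b", text))
               for k in keywords)

article = "The mob violence visited on this man was an outrage upon the law."
score = count_keywords(article, ANTI) - count_keywords(article, PRO)
print(score)  # anti-minus-pro keyword score for this article
```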
Text Archives: Keyword Dictionary
Pros:
- easy to implement
- easy computation
Cons:
- words have meaning in context
- word meanings change over time
- i.e., keywords may not capture what you intend
Text Archives: Keyword Dictionary
Think about:
- size of dictionary
- underlying text quality (fewer matches if noisy)
- how will you validate it?
Text Archives: Machine Learning
What is it?
- Statistical algorithms that learn to classify texts based on their features (usually words, sometimes metadata)
- There are many varieties, but an important distinction is between “supervised” (you must label some data) and “unsupervised” methods.
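A minimal supervised example with scikit-learn; the six labeled snippets are placeholders for a real hand-coded training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder training data; in practice, labels come from hand coding.
texts = [
    "the mob's act was an outrage against law and order",
    "the community approved of the swift justice",
    "editors condemned the barbarism of the lynchers",
    "the paper declared the punishment was deserved",
    "a lawless mob defied the courts",
    "citizens justified the action of the crowd",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = anti-lynching discourse

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# Cross-validated accuracy is a first check against overfitting.
print(cross_val_score(clf, texts, labels, cv=3).mean())
clf.fit(texts, labels)
print(clf.predict(["the editorial denounced the mob"]))
```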
Text Archives: Machine Learning
Pros:
- potentially uses more information and word context to classify documents
- lots of work done on how to make these algorithms work well (thanks, tech monopolies)
Text Archives: Machine Learning
Cons:
- Requires baseline data (can be costly to obtain)
- Only as good as the training data (available ≠ good)
- Risk of overfitting
- Models may be uninterpretable black box
- How will you validate predictions?
Text Archives: Machine Learning
Think about:
- How do you generate a representative sample to train on?
- Many methods were developed for clean, natively digital text
- You may need to address spelling and transcription errors in historical archives
- Options like Word2Vec or FastText may help
- What is the unit of classification? For example, newspaper pages vs. blocks of text
Text Archives: Hand Coding
What is it?
Hire research assistants to classify a sample of documents using a codebook.
Text Archives: Hand Coding
Pros:
- You probably need to do this anyway to validate keywords/machine learning
- Humans can follow clear coding guidelines
- Random sampling for inference
Cons:
- Costly in both time and money
- May generate insufficient data for research design (lynching coverage)
Choices to make
Record Linkage
If archives contain records of people/organizations to be linked:
- Deterministic rules (APSR civil war)
- Probabilistic Models (FastLink)
- Machine Learning (Feigenbaum)
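A sketch of the deterministic approach, using only the standard library; the blocking rule and field names are illustrative (probabilistic tools like FastLink instead model partial agreement on each field, e.g. via Jaro-Winkler distance):

```python
def deterministic_link(soldier, census):
    """Link on exact surname + first initial + birth year within one year.
    An illustrative rule only, not the one used in any particular paper."""
    candidates = [
        row for row in census
        if row["last"] == soldier["last"]
        and row["first"][:1] == soldier["first"][:1]
        and abs(row["birth_year"] - soldier["birth_year"]) <= 1
    ]
    # Accept only a unique candidate: an ambiguous link is worse than none.
    return candidates[0] if len(candidates) == 1 else None

soldier = {"last": "Smith", "first": "J.", "birth_year": 1840}
census = [{"last": "Smith", "first": "John", "birth_year": 1841}]
print(deterministic_link(soldier, census))
```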
Record Linkage
Things to think about:
- historical data have more errors:
- unknown birth years;
- name spellings are weird;
- lots of abbreviated names/initials only
- hand matching to get a “ground truth” is hard to do
- what do you do about non-matches?
Record Linkage: Applications
Matching soldiers to hometowns:
- Township-level returns on votes for black suffrage in 1857/1865 for Iowa and Wisconsin
- Iowa and Wisconsin enlistment data, including place of residence
- If not uniquely linked to township by name, link soldier to 1860 census records using FastLink
- Distribute unmatched among possible communities, weighting by population
Record Linkage: Applications
Matching Indiana soldiers to post-war partisanship
- link soldiers to the 1860 Census using FastLink (to confirm they are “findable” and verify residence)
- link soldiers to 1874 People’s Guide (deterministic)
Need to validate
We need to show that our synoptic reading of the archive is valid
- Are the measures capturing what we intend?
- What kind of measurement error?
- Are there possible issues of inclusion/exclusion?
- Think about addressing measurement error/sample problems using research design
Validating Measures
Need to evaluate whether measures from the archive align with another data source:
- Hand-coded data
- Another authoritative database (Adjutant General Reports)
- “Out of sample” predictions
Validation Metrics
Continuous Measures:
- Usually some kind of correlation
Binary/Categorical Classification:
- “recall” or true positive rate: fraction of true members of category that are identified as matches by an algorithm
- “precision” or PPV: fraction of cases identified as matches by an algorithm that are correct matches.
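Both metrics in a few lines, comparing an algorithm's flagged set against a hand-coded ground truth (the article identifiers are toy examples):

```python
def precision_recall(flagged, truth):
    tp = len(flagged & truth)                # true positives
    return tp / len(flagged), tp / len(truth)

# Toy example: articles a keyword rule flagged vs. hand-coded coverage.
flagged = {"a1", "a2", "a3", "a4"}
truth = {"a2", "a3", "a5"}
precision, recall = precision_recall(flagged, truth)
print(precision, recall)  # 0.5, 0.667: half of flags correct, 2/3 found
```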
Validation: Example 1
When measuring “coverage” of lynching, I count any mention of “lynching” within 7 days of a lynching event as “coverage”.
- How well does that definition do when applied to hand-coded coverage of lynching?
Validation: Example 2
When measuring lynching discourse, I generated a dictionary of pro- and anti-lynching keywords. The overall score was the anti-lynching minus pro-lynching keyword count.
How does this compare to hand-coded lynching discourse?
Validation: Example 3
When estimating effects of wartime experience on post-war partisanship, I only match ~30% of soldiers after the war.
- Is there attrition bias?
- If there were, we would expect likely Democrats and likely Republicans to have different rates of attrition
- Use machine learning to predict partisanship based on name, birth year, place of birth for 1874 respondents not matched to veterans
Predictions for veterans identified in 1874

Predictions of 1860 Presidential Vote in 9 counties

Predictions of 1860 Presidential Vote in townships

Validating Measures
Lessons learned:
- Need a “ground truth” data set: find one or make one
- Helpful to have a “standard candle” to give meaning to effect sizes (e.g. black newspapers and lynching keywords)
- Think about whether measurement error is a problem:
- is it increasing standard errors (DV), causing attenuation bias (IV), or inducing unknown bias (control variables)?
- If there are problems, address in analysis:
- some anti-lynching keywords performed poorly
- lynching discourse measure sensitive to number of keywords found
Checking the Sample
You need to see whether the archive you have systematically excludes data
- unknown unknowns
- you don’t always have a record of what is missing
Best advice:
- Find a “full” record for all or a subset of objects in the archive
- Compare what is available to what isn’t (for the possible set)
Checking the Sample: Example
Historical Newspaper Digitization:
- Compare digitized newspapers to the list of KNOWN newspapers in library catalogs, identified by the Library of Congress
- Uneven digitization rates across states
- A function of state government funding priorities
Checking the Sample: Example
Attrition in effects of wartime experience on partisanship:
- Matched soldiers to 1860 census to focus on people who are “findable” pre-treatment
- Gives baseline attributes of those who are missing post-treatment
- Check for imbalance in attrition
Bonus: heterogeneous effects by “latent partisanship”
- effects should be stronger on those who demographically look like Democrats
What to do about unrepresentative archives:
- Research design that accounts for unrepresentativeness
- Lynching coverage analysis uses newspaper fixed effects to account for changing composition of sample
- Make claims about the sample, not the population