Scraping
“Scraping” means using programming tools to
automate the process of
- accessing data hosted on a website/service (wide variety of
possibilities)
- extracting the data we desire
Lots of online resources exist; for a primer I’ve made, see here
Scraping: Process
- Find the data you want
- Figure out how to use a programming language to access that data
- Structure the data for your needs
- Scale up: automate the rest of the data collection
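The four steps above can be sketched in a few lines of Python. This is only a minimal sketch: the names `fetch` and `scrape`, the `parse` callback, and the one-second pause are assumptions for illustration, not a fixed recipe.

```python
import time
from urllib.request import urlopen


def fetch(url):
    """Access: request one page and return its text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls, parse, fetch=fetch, delay=1.0):
    """Structure each page with `parse`, then scale up over every URL."""
    rows = []
    for url in urls:
        page = fetch(url)
        rows.extend(parse(page))  # parse returns a list of records for one page
        time.sleep(delay)         # pause between requests to go easy on the server
    return rows
```

Passing `fetch` as an argument makes it possible to try the loop on saved pages before sending any real requests.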
Ways of Scraping
Usually, I scrape in one of three ways:
- Downloading browseable databases:
  - HTML: a request for an HTML page that contains some data. The data is
    unfriendly to use, because it is in a markup language meant for visual
    display
- Automating searches in databases
- Accessing data via an API:
  - a request that directly interacts with some database or tool and
    returns data in a friendly format (usually JSON)
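To see why HTML is less friendly than JSON, here is a rough comparison using only the Python standard library. The two sample strings and the names in them are made up for illustration; real pages are messier.

```python
import json
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Made-up HTML: the data is wrapped in display markup and must be dug out.
html_text = (
    "<table><tr><td>John Smith</td><td>Seaman</td></tr>"
    "<tr><td>James Lee</td><td>Landsman</td></tr></table>"
)
collector = CellCollector()
collector.feed(html_text)
html_rows = collector.rows

# Made-up JSON: the same data arrives already structured.
json_text = '[{"name": "John Smith", "rank": "Seaman"}, {"name": "James Lee", "rank": "Landsman"}]'
api_rows = json.loads(json_text)
```

The HTML route needs a custom parser that knows the page layout; the JSON route is one call.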
Scraping: Browsing
African American Sailors in the Civil War
Scraping: Browsing
Download online database: full count 1860 US Census
Fold3.com
- Subscription genealogy site.
- Access to census images is behind a paywall.
- But page metadata (including all personal data transcribed) is
exposed behind the scenes
- Data is hidden in background requests (explain)
Scraping: Searching
Searching a database: Historical newspaper archives
for lynching discourse
Automate searches and saving of results.
- First pass: submitted search queries to find which newspapers
mentioned lynching on which dates
- Second pass: which newspapers are NOT talking about lynching?
Download an index of what newspapers were available on each date.
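The second pass boils down to a set difference: papers available on a date minus papers with a search hit on that date. A minimal sketch, assuming both passes have already been saved as dictionaries keyed by date (the paper names and dates below are placeholders, not real results):

```python
def silent_papers(available, mentions):
    """Return, per date, the papers that were available but had no search hit."""
    silent = {}
    for date, papers in available.items():
        hits = mentions.get(date, set())
        silent[date] = sorted(set(papers) - set(hits))
    return silent


# Placeholder data standing in for the downloaded index and the search results.
available = {
    "1892-03-10": {"Paper A", "Paper B", "Paper C"},
    "1892-03-11": {"Paper A", "Paper B"},
}
mentions = {"1892-03-10": {"Paper A"}}

result = silent_papers(available, mentions)
```

Without the index from the second pass, a missing search hit could mean either silence or a paper that was never in the archive for that date.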
Scraping: APIs
What if you want natively digital data?
- Facebook users (Mueller and Schwarz), Twitter networks (w/ Celene
Reynolds), Google Trends, Canvas (sure, why not)
- These services have an API: an interface for requesting/receiving
data. Usually there is an explicit guide.
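Most API work follows the same pattern: build a URL with query parameters, send it, and read the JSON that comes back. A sketch under stated assumptions: the endpoint `https://api.example.com/v1/search`, the parameter names, and the `results`/`id`/`text` fields are all hypothetical; the real ones come from each service's guide.

```python
import json
from urllib.parse import urlencode


def build_url(base_url, params):
    """Attach query parameters to an API endpoint."""
    return f"{base_url}?{urlencode(params)}"


def parse_results(response_text):
    """Pull (id, text) pairs out of a JSON response with a 'results' list."""
    payload = json.loads(response_text)
    return [(item["id"], item["text"]) for item in payload["results"]]


# Hypothetical endpoint and parameters.
url = build_url("https://api.example.com/v1/search", {"q": "census 1860", "page": 2})

# Hypothetical response body, as the service might return it.
sample_response = '{"results": [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]}'
records = parse_results(sample_response)
```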
Scraping: Lessons Learned
- Need to understand how the internet works (requests, responses)
- Learn a programming language to automate sending requests,
processing responses
- Write robust code (it should survive errors and interruptions)
- Think about how to save data (SQL databases are best)
- Be a sleuth (find where data is exposed)
- People protect proprietary data, think about permissions
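Two of these lessons, robust code and saving to SQL, fit in one short sketch. It uses Python's built-in `sqlite3`; the table name `pages`, the helper names, and the retry settings are assumptions for illustration.

```python
import sqlite3
import time


def with_retries(func, attempts=3, wait=1.0):
    """Call func(); on failure wait and try again, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(wait * attempt)


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    return conn


def save_page(conn, url, body):
    """Store one response; re-saving the same URL overwrites the old copy."""
    conn.execute("INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body))
    conn.commit()


def already_saved(conn, url):
    """Check whether a URL was collected already, so a restart can skip it."""
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    return row is not None
```

Checking `already_saved` before each request lets a crashed job pick up where it stopped instead of starting over.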
Digitize it Yourself
This usually means optical character recognition (OCR):
- ABBYY FineReader (licensed software)
- Google Vision (price per image)
- Questions to ask:
  - What is the quality of the text image? Can it be improved?
  - How is the text formatted?
  - How tolerable are errors?
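One way to put a number on "how tolerable are errors" is to hand-type a small sample and compare it with the OCR output. A rough sketch using Python's `difflib`; the function name `similarity` and the sample strings are made up, and this score is only one of several possible measures.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, true_text):
    """Return a 0-to-1 similarity score between OCR output and a hand-typed transcription."""
    return SequenceMatcher(None, ocr_text, true_text).ratio()


# Made-up example: "m" misread as "rn" and "o" misread as "a".
score = similarity("Srnith, Jahn", "Smith, John")
```

A score of 1.0 means the two strings match exactly; lower scores mean more cleanup work downstream.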
Digitizing Archives: Examples
Indiana People’s Guides list partisanship of 30,000
people in 1874

Digitizing Archives: Examples
Indiana People’s Guides
- Thousands of pages, lots of names
- Google Vision
- Python script: send images to Google, letters to words, words to
lines, lines to biographical entries.
- R script to extract and format data.
- ~1 week of work to learn this the first time
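The "words to lines" step can be sketched as grouping words by their vertical position on the page. This is not the original script: it assumes the OCR output has already been reduced to one dictionary per word with `text`, `x`, and `y` keys (hypothetical names), and the tolerance of 10 pixels is a guess that would need tuning per scan.

```python
def words_to_lines(words, y_tolerance=10):
    """Group words whose y positions are close, then read each group left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        # Same line if this word sits within the tolerance of the previous word.
        if lines and abs(word["y"] - lines[-1][-1]["y"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    ]


# Made-up word positions, slightly uneven as on a real scan.
sample_words = [
    {"text": "John", "x": 80, "y": 102},
    {"text": "Smith,", "x": 10, "y": 100},
    {"text": "farmer", "x": 140, "y": 99},
    {"text": "Amos", "x": 85, "y": 131},
    {"text": "Jones,", "x": 10, "y": 130},
]
text_lines = words_to_lines(sample_words)
```

Grouping lines into biographical entries works the same way one level up, using whatever visual cue marks the start of an entry in the source.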
Digitizing Archives: Examples
USCT Monthly Reports:

Digitizing Archives: Examples
Prize list
