Tools for Data Collection

Michael Weaver

November 15, 2022

Outline

Getting data

  • access online data
  • access data from digital files
  • digitize it yourself
  • getting more out of data

Using data

  • geospatial raster data
  • geospatial merging
  • merging
  • fuzzy matching
  • record linkage

Automation

Skills for accessing data help you:

  1. Make a project feasible by gaining access to the necessary data
  2. Make a project feasible by reducing time/cost of data collection
  3. Multiply the return on your labor (work smarter, not harder)

Even with limited financial support, PhD/MA students can produce more high-quality, unique research.

  • More time for thinking, reading, and writing

Goals

  • Exposure to many tools
  • Not a detailed “how-to” guide
  • Ask questions today about possible applications
  • Reach out to get further guidance
  • A project is better than a workshop for learning

Getting Data

Scraping

“Scraping” refers to using some programming tools to automate the process of

  • accessing data hosted on a website/service (wide variety of possibilities)
  • extracting the data we desire

Lots of online resources exist; for a primer I’ve made, see here

Scraping: Process

  1. Find the data you want
  2. Figure out how to use a programming language to access that data
  3. Structure the data for your needs
  4. Scale up: automate the rest of the data collection
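
The four steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the code behind any project described here: the URL pattern, the `fetch` and `scrape` helpers, and the row format are all hypothetical, and the `parse` step is whatever function suits your own source.

```python
import time
import urllib.request

# Hypothetical URL pattern (step 1: the data you found lives on numbered pages).
BASE_URL = "https://example.org/records?page={page}"


def fetch(url):
    """Step 2: send one request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(pages, parse, fetch=fetch, delay=1.0):
    """Step 4: repeat fetch + parse for every page, pausing between requests."""
    rows = []
    for page in pages:
        body = fetch(BASE_URL.format(page=page))
        rows.extend(parse(body))  # Step 3: structure the data into rows
        time.sleep(delay)         # pause so the server is not flooded
    return rows
```

Passing `fetch` as an argument makes it possible to try the loop on saved pages before sending any real requests.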

Ways of Scraping

Usually, I scrape in one of three ways:

  1. Downloading browseable databases:
    • HTML: a request for an HTML page that contains some data. Data is unfriendly to use, because it is in markup language for visual display
  2. Automating searches in databases:
    • might be HTML or JSON
  3. Accessing data via an API:
    • a request that directly interacts with some database or tool and returns data in a friendly format (usually JSON)
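
To make the HTML versus JSON contrast concrete, here is a minimal sketch using only the Python standard library. The record (name and age) is invented for illustration, and the `TableCells` class and the two `rows_from_*` helpers are hypothetical names; real pages are usually messier than this one-table example.

```python
import json
from html.parser import HTMLParser

# Invented example: the same record delivered as HTML and as JSON.
html_page = (
    "<table><tr><th>name</th><th>age</th></tr>"
    "<tr><td>John Smith</td><td>34</td></tr></table>"
)
json_body = '[{"name": "John Smith", "age": 34}]'


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:          # tolerate a cell with no enclosing row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def rows_from_html(text):
    """HTML: dig the values out of display markup; everything is a string."""
    parser = TableCells()
    parser.feed(text)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def rows_from_json(text):
    """JSON: one call, and types such as numbers are preserved."""
    return json.loads(text)
```

The HTML route needs a custom parser and returns the age as text, while the JSON route is a single call that keeps the age as a number.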

Scraping: Browsing

African American Sailors in the Civil War

Scraping: Browsing

Download online database: full-count 1860 US Census on Fold3.com

  • Subscription genealogy site.
  • Access to census images is behind a paywall.
  • But page metadata (including all personal data transcribed) is exposed behind the scenes
  • Data is hidden in background requests (requests the browser makes behind the scenes while showing a page)

Scraping: Searching

Searching a database: Historical newspaper archives for lynching discourse

Automate searches and saving of results.

  • First pass: submitted search queries to find which newspapers mentioned lynching on which dates
  • Second pass: which newspapers are NOT talking about lynching? Download an index of what newspapers were available on each date.
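
The logic of combining the two passes is a set difference per date. The sketch below assumes each pass has already been reduced to a mapping from date to a set of newspaper titles; all titles and dates are invented, and `silent_papers` is a hypothetical helper name.

```python
# Invented data for illustration only.
available = {  # second pass: index of newspapers available on each date
    "1892-03-10": {"Paper A", "Paper B", "Paper C"},
    "1892-03-11": {"Paper A", "Paper C"},
}
mentions = {  # first pass: newspapers whose search results hit on each date
    "1892-03-10": {"Paper B"},
}


def silent_papers(available, mentions):
    """For each date, list the available newspapers with no search hit."""
    return {
        date: sorted(papers - mentions.get(date, set()))
        for date, papers in available.items()
    }
```

Without the availability index, a newspaper that is missing from the archive on a given date would be indistinguishable from one that was published but did not mention the topic.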

Scraping: APIs

What if you want natively digital data?

  • Facebook users (Mueller and Schwarz), Twitter networks (w/ Celene Reynolds), Google Trends, Canvas (sure, why not)
  • These services have an API: an interface for requesting/receiving data. Usually there is an explicit guide.
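
A typical API interaction looks something like the sketch below. It does not follow any particular service named above: the endpoint, the `q` and `cursor` parameters, and the `items` and `next_cursor` response fields are all assumptions standing in for whatever the service's own guide specifies.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.example.org/v1/posts"  # hypothetical endpoint


def get_json(url):
    """Send one request and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_all(query, get_json=get_json):
    """Collect every page of results by following an assumed 'next_cursor' field."""
    results, cursor = [], None
    while True:
        params = {"q": query}
        if cursor:
            params["cursor"] = cursor
        page = get_json(API_URL + "?" + urllib.parse.urlencode(params))
        results.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:  # no further pages
            return results
```

Real APIs usually also require an access token and enforce rate limits; check the guide for how each is handled.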

Scraping: Lessons Learned

  1. Need to understand how the internet works (requests, responses)
  2. Learn a programming language to automate sending requests, processing responses
  3. Write robust code (expect failed requests and interruptions)
  4. Think about how to save data (SQL databases are best)
  5. Be a sleuth (find where data is exposed)
  6. People protect proprietary data; think about permissions
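
Lessons 3 and 4 can be sketched together: retry a failing request, and save each result to a SQL database so an interrupted run can resume. This is one possible pattern under stated assumptions, using Python's built-in `sqlite3` module; the table layout and the helper names (`with_retries`, `open_store`, `save_page`, `already_saved`) are hypothetical.

```python
import sqlite3
import time


def with_retries(task, attempts=3, wait=2.0):
    """Run task(); after a failure, wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise                   # give up after the last attempt
            time.sleep(wait * attempt)  # wait longer after each failure


def open_store(path="scrape.db"):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def save_page(conn, url, body):
    """Store one response; the primary key prevents duplicate rows on reruns."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))


def already_saved(conn, url):
    """Check whether a URL was collected in an earlier run."""
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    return row is not None
```

Checking `already_saved` before each request lets a scraper that crashed overnight pick up where it stopped instead of starting over.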

Extracting Data from Documents

Data may be locked in large or numerous PDF or Doc files.

  • pdftools package in R can read PDF text
  • tabulizer package in R can extract tables

Extracting: Tables

Digitize it Yourself

This usually means optical character recognition: OCR

  • ABBYY FineReader (licensed software)
  • Google Vision (price per image)
  • Questions to ask:
    • what is the quality of the text image? can it be improved?
    • how is text formatted?
    • how tolerable are errors?

Digitizing Archives: Examples

Indiana People’s Guides list partisanship of 30,000 people in 1874

Digitizing Archives: Examples

Indiana People’s Guides

  • Thousands of pages, lots of names
  • Google Vision
  • Python script: send images to Google, then assemble letters into words, words into lines, and lines into biographical entries.
  • R script to extract and format data.
  • ~1 week of work to learn this the first time
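
The reassembly step (words to lines, lines to entries) might look something like the sketch below. It is not the script used for the People's Guides and makes no call to Google Vision: it assumes the OCR output has already been reduced to (text, x, y) tuples, the sample words are invented, and the rule that an entry starts with an upper-case surname followed by a comma is a hypothetical formatting convention.

```python
# Invented OCR output: (text, x, y) with y increasing down the page.
words = [
    ("John,", 60, 101), ("SMITH,", 10, 100), ("farmer;", 110, 99),
    ("Republican.", 20, 120),
    ("JONES,", 10, 140), ("Mary,", 62, 141), ("teacher.", 112, 140),
]


def words_to_lines(words, y_tolerance=5):
    """Group words whose y is within y_tolerance of the line's first word."""
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1]["y"]) <= y_tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    # Within each line, order the words left to right.
    return [" ".join(t for _, t in sorted(line["words"])) for line in lines]


def lines_to_entries(lines):
    """Start a new entry at each line that opens with 'SURNAME,' (assumed rule)."""
    entries = []
    for line in lines:
        parts = line.split()
        first = parts[0] if parts else ""
        starts_entry = first.isupper() and first.endswith(",")
        if starts_entry or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line  # continuation of the previous entry
    return entries
```

The tolerance and the entry rule depend entirely on how the source is typeset, which is why the formatting questions asked before digitizing matter so much.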

Digitizing Archives: Examples

USCT Monthly Reports:

Digitizing Archives: Examples

Prize list