Tools for Data Collection

Michael Weaver

November 15, 2022

Outline

Getting data

  • access online data
  • access data from digital files
  • digitize it yourself
  • getting more out of data

Using data

  • geospatial raster data
  • geospatial merging
  • merging
  • fuzzy matching
  • record linkage

Automation

Skills for accessing data help you:

  1. Make a project feasible by gaining access to necessary data
  2. Make a project feasible by reducing the time/cost of data collection
  3. Multiply the return on your labor (work smarter, not harder)

Even with limited financial support, PhD/MA students can produce more high-quality, unique research.

  • More time for thinking, reading, and writing

Goals

  • Exposure to many tools
  • Not a detailed “how-to” guide
  • Ask questions today about possible applications
  • Reach out to get further guidance
  • A project is better than a workshop for learning

Getting Data

Scraping

“Scraping” refers to using programming tools to automate the process of

  • accessing data hosted on a website/service (wide variety of possibilities)
  • extracting the data we desire

Lots of online resources exist, but for a primer I’ve made, see here.

Scraping: Process

  1. Find the data you want
  2. Figure out how to use a programming language to access that data
  3. Structure the data for your needs
  4. Scale up: automate the rest of the data collection

Ways of Scraping

Usually, I scrape in one of three ways:

  1. Downloading browseable databases:
  • HTML: a request for an HTML page that contains some data. The data is unfriendly to use because it is embedded in markup meant for visual display (see the sketch below)
  2. Automating searches in databases:
  • results might come back as HTML or JSON
  3. Accessing data via an API:
  • a request that directly interacts with some database or tool and returns data in a friendly format (usually JSON)
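
A minimal sketch of the first approach, using the rvest package in R (the URL and table structure here are hypothetical):

library(rvest)

# Hypothetical URL: any page that displays the data as an HTML table
url = "https://example.com/roster?page=1"

page = read_html(url)                          # send the request, parse the markup
tbl = html_table(html_element(page, "table"))  # first <table> node as a data frame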

Scraping: Browsing

African American Sailors in the Civil War

Scraping: Browsing

Downloading an online database: the full-count 1860 US Census on Fold3.com

  • Subscription genealogy site.
  • Access to census images is behind a paywall.
  • But page metadata (including all transcribed personal data) is exposed behind the scenes
  • The data is hidden in background requests, which can be captured directly
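
A hedged sketch of grabbing such a background request (the endpoint and parameters are made up; the real ones appear in your browser's developer tools, under Network):

library(httr)
library(jsonlite)

# Hypothetical endpoint spotted in the Network tab
url = "https://example.com/api/records"

res = GET(url, query = list(page = 1))        # mimic the background request
stop_for_status(res)                          # fail loudly on HTTP errors
records = fromJSON(content(res, as = "text")) # JSON arrives analysis-ready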

Scraping: Searching

Searching a database: Historical newspaper archives for lynching discourse

Automate searches and save the results.

  • First pass: submitted search queries to find which newspapers mentioned lynching on which dates
  • Second pass: which newspapers are NOT talking about lynching? Download an index of what newspapers were available on each date.

Scraping: APIs

What if you want natively digital data?

  • Facebook users (Mueller and Schwarz), Twitter networks (w/ Celene Reynolds), Google Trends, Canvas (sure, why not)
  • These services have an API: an interface for requesting/receiving data. Usually there is an explicit guide.

Scraping: Lessons Learned

  1. Need to understand how the internet works (requests, responses)
  2. Learn a programming language to automate sending requests and processing responses
  3. Write robust code that fails gracefully (see the sketch below)
  4. Think about how to save data (SQL databases are best)
  5. Be a sleuth (find where data is exposed)
  6. People protect proprietary data; think about permissions
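
A sketch of lessons 3 and 4 together: wrap each request in error handling and save results to a SQLite database as you go (the URL and table name are hypothetical):

library(httr)
library(DBI)

con = dbConnect(RSQLite::SQLite(), "scrape.db")   # results survive crashes

for (i in 1:500) {
  res = tryCatch(
    GET(paste0("https://example.com/api/records?page=", i)),
    error = function(e) NULL                      # network failure: skip, retry later
  )
  if (is.null(res) || http_error(res)) next
  row = data.frame(page = i, body = content(res, as = "text"))
  dbWriteTable(con, "pages", row, append = TRUE)  # save each page immediately
  Sys.sleep(1)                                    # be polite: rate-limit requests
}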

Extracting Data from Documents

Data may be locked in large (or many) PDF or Word files.

  • the pdftools package in R can read PDF text
  • the tabulizer package in R can extract tables (see the sketch below)

Extracting: Tables
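
A minimal sketch of both packages (the file name and page number are hypothetical):

library(pdftools)
library(tabulizer)

pages = pdf_text("report.pdf")                   # raw text, one string per page
tables = extract_tables("report.pdf", pages = 4) # list of matrices, one per table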

Digitize it Yourself

This usually means optical character recognition (OCR):

  • ABBYY FineReader (licensed software)
  • Google Vision (priced per image; see the sketch below)
  • Questions to ask:
    • what is the quality of the text image? can it be improved?
    • how is the text formatted?
    • how tolerable are errors?
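
A rough sketch of sending one image to Google Vision's REST endpoint (request shape per Google's documentation; the file name and key variable are placeholders):

library(httr)
library(jsonlite)
library(base64enc)

img = base64encode("scan_001.png")   # Vision expects base64-encoded image bytes

body = list(requests = list(list(
  image = list(content = img),
  features = list(list(type = "DOCUMENT_TEXT_DETECTION"))
)))

res = POST(
  "https://vision.googleapis.com/v1/images:annotate",
  query = list(key = Sys.getenv("GCLOUD_API_KEY")),
  body = toJSON(body, auto_unbox = TRUE),
  content_type_json()
)
ocr = fromJSON(content(res, as = "text"))  # returns recognized words with positions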

Digitizing Archives: Examples

The Indiana People’s Guides list the partisanship of 30,000 people in 1874

Digitizing Archives: Examples

Indiana People’s Guides

  • Thousands of pages, lots of names
  • Google Vision
  • Python script: send images to Google, then assemble letters into words, words into lines, lines into biographical entries.
  • R script to extract and format data.
  • ~1 week of work to learn this the first time

Digitizing Archives: Examples

USCT Monthly Reports

Digitizing Archives: Examples

Prize list

Extending Data

Once you have data, you can also use online tools to extend/modify it:

  • automate the coding of data (machine learning)
  • pre-built classification tools
  • extract locations, dates, syntactic dependencies, etc.
  • geo-referencing

Extending Data: Machine Learning

What is it?

  • Statistical algorithms that learn to classify texts based on their features (usually words, sometimes metadata)
  • Many varieties, but an important distinction is between “supervised” (you must label some data) and “unsupervised”
  • Workflow: collect data; code a training set; “learn” on part of the training set; validate on the other part; classify the full data set
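
A toy sketch of that workflow, using regularized logistic regression (glmnet) as a stand-in for the classifiers discussed below; in real use the features would be word counts from your documents:

library(glmnet)

set.seed(1)
# Stand-in for a document-term matrix and hand-coded labels
x = matrix(rpois(1000 * 50, 1), nrow = 1000)  # 1,000 docs, 50 word features
y = rbinom(1000, 1, plogis(x[, 1] - 1))       # labels loosely tied to feature 1

train = sample(1000, 700)                     # split the hand-coded data

fit = cv.glmnet(x[train, ], y[train], family = "binomial")  # "learn"

# Validate on the held-out coded data before trusting the classifier
pred = predict(fit, x[-train, ], type = "class", s = "lambda.min")
mean(pred == y[-train])                       # out-of-sample accuracy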

Extending Data: Machine Learning

Pros:

  • can automatically code large quantities of data
  • a more sophisticated way to classify documents
  • lots of work done on how to make these algorithms work well (thanks, tech monopolies)

Cons:

  • Requires ground-truth data (can be costly to obtain)
  • Only as good as the training data (available ≠ good)
  • Risk of overfitting
  • How will you validate predictions?

Extending Data: Machine Learning

Examples:

  • Civil War era Partisanship
  • USCT unit reports: entity extraction

Software recommendations:

  • easy to use out-of-the-box
  • automatic “tuning”
  • fastText is one good option

Extending Data: Trained Models

You can use existing pre-trained AI models.

But if their training data differs from your data, they may not work well.

Extending Data: Natural Language Processing

NLP parses clean textual data:

  • parts of speech, syntactic dependencies, entities (people, locations, dates, etc.)
  • can query texts based on parts of speech and dependencies

Extending Data: Natural Language Processing

Lots of useful tools:

  • spaCy (accessed in R via spacyr): a high-quality, neural-net-based natural language processor
  • query spaCy parses using rsyntax
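
A minimal spacyr example (assumes spaCy and a language model are installed; the sentence is invented):

library(spacyr)
spacy_initialize()  # start a spaCy session via Python

txt = "A mob lynched two men near Memphis, Tennessee on June 8, 1892."
parsed = spacy_parse(txt, pos = TRUE, entity = TRUE, dependency = TRUE)

entity_extract(parsed, type = "all")  # people, places, dates, etc.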

Extending Data: Geocoding

You can add latitude/longitude to data or get estimated travel times:

  • Google Maps API
  • other geocoding APIs
  • the ggmap package implements the Google Maps Geocoding API in R (see the sketch below)
  • not free, but relatively cheap
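
A short ggmap sketch (requires registering a billing-enabled Google API key):

library(ggmap)
register_google(key = Sys.getenv("GOOGLE_MAPS_KEY"))

geocode("Yangon, Myanmar")  # returns longitude/latitude for a place-name string

# or append coordinates to a whole data frame with an 'address' column:
# df = mutate_geocode(df, address)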

Working with Data

We want to merge data together, but this can be tricky if:

  • some data is geospatial raster images
  • data is geospatial in nature
  • data is large
  • names/place names are “messy”
  • you don’t have unique identifiers

Working with: Raster Data

Lots of potentially interesting geospatial data comes as a raster:

  • temperature, rainfall, cloud cover, agricultural suitability, elevation
  • night-time lights, population, cell signal

We want to compute values based on this data for points or boundaries

Working with: Raster Data

Example: Myanmar townships

  • terrain ruggedness
  • wealth/inequality using night-time lights
  • exposure to cell service/Facebook

Working with: Raster Data

Multiple R packages exist, but use the terra package:

  • it is faster
  • it handles larger data
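
A sketch of zonal extraction with terra (file names are hypothetical):

library(terra)

elev = rast("elevation.tif")       # raster layer
townships = vect("townships.shp")  # polygon boundaries

tri = terrain(elev, v = "TRI")     # terrain ruggedness index

# mean ruggedness within each township polygon
rugged = extract(tri, townships, fun = mean, na.rm = TRUE)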

Working with: Geospatial Data

We may want to merge data by geographic overlap, or get distances within a geospatial network:

  • create stable boundaries over time
  • proximity of units to some treatment
  • distances along roads/railroads

Working with: Geospatial Data

The R package sf is easiest to use.

  • compute distances, intersections, containment, nearest features, etc.
  • spatial networks require additional packages
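
A few sf one-liners (file names are hypothetical):

library(sf)

units = st_read("townships.shp")   # polygons
events = st_read("events.shp")     # points

st_distance(events)                # pairwise distances between points
st_intersects(units, events)       # which events fall inside which unit
st_nearest_feature(events, units)  # index of the nearest unit to each event
st_join(events, units)             # merge attributes by spatial overlap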

Working with: Large/Complex Data

The data.table package in R:

  • handles large data very well
  • can handle complex aggregations of data
  • can handle complex merges
  • relatively concise code
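
For example, the call below computes vote shares, the effective number of parties, and a largest-remainder seat allocation for every electoral district (dapil) in 2004 Indonesian election results:
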
dapil_elections_2004 = results_2004_clean[, {
  p_no = no_partai_politik;
  dapil_seats = meta_total_seats %>% unique;
  all_total_votes = sum(total_votes, na.rm = T);
  p_vs = total_votes / all_total_votes;
  enp = 1 / sum(p_vs^2);
  ni_vs = p_vs[which(p_no %in% p_NI_2004)];
  enp_NI =  1 / sum((ni_vs/sum(ni_vs, na.rm = T))^2, na.rm = T);
  icit_vs = p_vs[which(p_no %in% p_ICIT_2004)];
  enp_ICIT = 1 / sum((icit_vs/sum(icit_vs, na.rm = T))^2, na.rm = T);
  ic_vs = p_vs[which(p_no %in% p_IC_2004)];
  enp_IC = 1 / sum( (ic_vs/sum(ic_vs, na.rm = T)) ^2, na.rm = T);
  it_vs = p_vs[which(p_no %in% p_IT_2004)];
  enp_IT = 1 / sum( (it_vs/sum(it_vs, na.rm = T)) ^2, na.rm = T);
  quota = round(all_total_votes/dapil_seats);
  first_round_seats = floor(total_votes / quota);
  remaining_seats = dapil_seats - sum(first_round_seats, na.rm = T);
  remainder = total_votes - (first_round_seats*quota);
  remainder_rank = frankv(remainder, order = -1L);
  remainder_quota = (remainder_rank <= remaining_seats);
  won_seats = first_round_seats + remainder_quota;
  last_seat = (remainder_rank %in% (remaining_seats + 0:1));
  last_seat_win = remainder_rank %in% (remaining_seats);
  last_seat_lose = remainder_rank %in% (remaining_seats + 1);
  last_seat_mov_pct = (remainder[which(last_seat_win)] - 
                         remainder[which(last_seat_lose)]) / all_total_votes;
  party_last_seat_winner = p_no[which(last_seat_win)];
  party_last_seat_loser = p_no[which(last_seat_lose)];
  winner_remainder = remainder[which(last_seat_win)];
  loser_remainder = remainder[which(last_seat_lose)];
  winner_first_round_seats = first_round_seats[which(last_seat_win)];
  loser_first_round_seats = first_round_seats[which(last_seat_lose)];
  winner_total_votes = total_votes[which(last_seat_win)];
  loser_total_votes = total_votes[which(last_seat_lose)];
  list(
    bw = last_seat_mov_pct,
    enp = enp,
    enp_NI = enp_NI,
    enp_ICIT = enp_ICIT,
    enp_IC = enp_IC,
    enp_IT = enp_IT, 
    remaining_seats = remaining_seats,
    remaining_seats_pct = remaining_seats/dapil_seats,
    party_last_seat_winner = party_last_seat_winner,
    party_last_seat_loser = party_last_seat_loser,
    winner_first_round_seats = winner_first_round_seats,
    loser_first_round_seats = loser_first_round_seats,
    winner_total_votes = winner_total_votes, 
    loser_total_votes = loser_total_votes,
    winner_remainder = winner_remainder,
    loser_remainder = loser_remainder,
    NI_close = any(p_no[which(last_seat)] %in% p_NI_2004),
    ICIT_close = any(p_no[which(last_seat)] %in% p_ICIT_2004),
    IC_close = any(p_no[which(last_seat)] %in% p_IC_2004),
    IT_close = any(p_no[which(last_seat)] %in% p_IT_2004),
    all_total_votes = all_total_votes,
    meta_total_votes = unique(meta_total_votes) %>% sum,
    dapil_seats = dapil_seats,
    all_NI_vs = sum(total_votes[which(p_no %in% p_NI_2004)]),
    all_ICIT_vs = sum(total_votes[which(p_no %in% p_ICIT_2004)]),
    all_IC_vs = sum(total_votes[which(p_no %in% p_IC_2004)]),
    all_IT_vs = sum(total_votes[which(p_no %in% p_IT_2004)]),
    all_NI_fr_seats = sum(first_round_seats[which(p_no %in% p_NI_2004)]),
    all_ICIT_fr_seats = sum(first_round_seats[which(p_no %in% p_ICIT_2004)]),
    all_IC_fr_seats = sum(first_round_seats[which(p_no %in% p_IC_2004)]),
    all_IT_fr_seats = sum(first_round_seats[which(p_no %in% p_IT_2004)]),
    all_NI_seats = sum(won_seats[which(p_no %in% p_NI_2004)]),
    all_ICIT_seats = sum(won_seats[which(p_no %in% p_ICIT_2004)]),
    all_IC_seats = sum(won_seats[which(p_no %in% p_IC_2004)]),
    all_IT_seats = sum(won_seats[which(p_no %in% p_IT_2004)]),
    all_NI_remainder = sum(remainder[which(p_no %in% p_NI_2004)]),
    all_ICIT_remainder = sum(remainder[which(p_no %in% p_ICIT_2004)]),
    all_IC_remainder = sum(remainder[which(p_no %in% p_IC_2004)]),
    all_IT_remainder = sum(remainder[which(p_no %in% p_IT_2004)]),
    IT_count = sum(total_votes[which(p_no %in% p_IT_2004)] > 0), 
    NI_count = sum(total_votes[which(p_no %in% p_NI_2004)] > 0),
    IC_count = sum(total_votes[which(p_no %in% p_IC_2004)] > 0),
    ICIT_count = sum(total_votes[which(p_no %in% p_ICIT_2004)] > 0),
    quota = quota
  )
}, by = list(province, kabupaten, kab_code, dapil)]

Working with: Large/Complex Data
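
An overlap join with foverlaps: for each soldier, count the company-mates whose service intervals overlapped with theirs, and tally deaths, combat deaths, and disabilities among those company-mates: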

setkey(a, companypk, indate, expected_out)
setkey(b, companypk, indate, expected_out)

company_casualty_rate = foverlaps(na.omit(a), na.omit(b), by.x = c('companypk', 'indate', 'expected_out'), by.y = c('companypk', 'indate', 'expected_out')) %>% 
  .[personpk != i.personpk, list(in_company_n = .N, 
                                 company_deaths = sum(i.died),
                                 company_kia = sum(i.kia),
                                 company_disabled = sum(i.disabled)), by = list(personpk, companypk)]

Working with: Messy Data

Fuzzy Matching

  • if names/place names have spelling variants or typos
  • you can use string-similarity metrics (e.g. Jaro-Winkler scores)
  • the stringdist package in R makes this easy and fast (see the sketch below)
  • easier than correcting errors by hand
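
A quick stringdist illustration:

library(stringdist)

# Jaro-Winkler similarity: 1 = identical, 0 = nothing in common
stringsim("Jonathon Smith", "Jonathan Smith", method = "jw")

# match each messy name to its closest entry in a clean reference list
messy = c("Smiht, John", "Waever, Michael")
clean = c("Smith, John", "Weaver, Michael")
clean[amatch(messy, clean, method = "jw", maxDist = 0.2)]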

Working with: Record Linkage

If data contain records of people/organizations to be linked:

  1. Deterministic rules (APSR civil war)
  2. Probabilistic models (fastLink; see the sketch below)
  3. Machine learning (Feigenbaum)
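
A hedged fastLink sketch (the data frames and column names are hypothetical; see the package documentation for the full options):

library(fastLink)

# dfA: enlistment records; dfB: 1860 census
fl = fastLink(
  dfA = enlistments, dfB = census_1860,
  varnames = c("firstname", "lastname", "age", "county"),
  stringdist.match = c("firstname", "lastname"),  # allow fuzzy name matches
  numeric.match = "age"                           # allow small age discrepancies
)

# pull the linked pairs above the default posterior match threshold
matched = getMatches(dfA = enlistments, dfB = census_1860, fl.out = fl)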

Working with: Record Linkage

Matching soldiers to hometowns:

  • Township-level returns on votes for black suffrage in 1857/1865 for Iowa and Wisconsin
  • Iowa and Wisconsin enlistment data, including place of residence
  • If a soldier is not uniquely linked to a township by name, link him to 1860 census records using fastLink
  • Distribute unmatched soldiers among possible communities, weighting by population

Questions