Skills needed for accessing data help:
Even with limited financial support, PhD/MA students can produce more high-quality, unique research.
“Scraping” refers to using some programming tools to automate the process of
There are lots of online resources; for a primer I’ve made, see here
Usually, I scrape in one of three ways:
African American Sailors in the Civil War
Download online database: full count 1860 US Census Fold3.com
Searching a database: Historical newspaper archives for lynching discourse
Automate searches and saving of results.
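As a rough illustration of automating a search and saving the results, here is a minimal sketch using the rvest package. The HTML snippet, the CSS selector, and the record labels are all hypothetical stand-ins for a real search-results page; they are inlined so the example runs offline.

```r
library(rvest)

# Hypothetical search-results page (an assumption, not a real site),
# inlined so that the example runs without a network connection.
page <- minimal_html('
  <ul class="results">
    <li><a href="/record/1">Record A</a></li>
    <li><a href="/record/2">Record B</a></li>
  </ul>')

# Select every result link, then pull out its label and target URL.
links <- html_elements(page, "ul.results a")
results <- data.frame(
  label = html_text2(links),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Save the parsed results so the search never has to be repeated.
out_file <- file.path(tempdir(), "results.csv")
write.csv(results, out_file, row.names = FALSE)
```

In a real scraper, `read_html()` on the search URL would take the place of `minimal_html()`, usually inside a loop over queries with a `Sys.sleep()` pause between requests.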
What if you want natively digital data?
Data may be locked in large/many PDF or Doc files.
pdftools package in R can read PDF text
tabulizer package in R can extract tables
This usually means optical character recognition: OCR
Indiana People’s Guides list partisanship of 30,000 people in 1874
Indiana People’s Guides
USCT Monthly Reports:
Prize list
Once you have data, you can also use online tools to extend/modify it:
What is it?
Pros:
Cons:
Examples:
Software recommendations:
You can just use existing trained AI models:
But if the training data differ from your data, the model may not work well.
NLP parses clean textual data:
Lots of useful tools:
spacyr: high-quality, neural-net based natural language processor
rsyntax
You can add latitude/longitude to data; or get estimated travel times:
ggmap implements Google Maps Geocoding API in R
We want to merge data together, but this can be tricky if:
Lots of potentially interesting geospatial data comes as a raster:
We want to compute values based on this data for points or boundaries
Example: Myanmar townships
Multiple R packages exist, but use the terra package
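To make "compute values for points or boundaries" concrete, here is a small sketch with terra. The raster, the points, and the polygon are synthetic and purely illustrative; with real data, the raster and boundaries would come from files read via `rast()` and `vect()`.

```r
library(terra)

# Synthetic 10 x 10 raster on a 0..10 grid; cells are numbered row by row
# starting from the top-left corner (hypothetical data, not a real layer).
r <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10)
values(r) <- 1:ncell(r)

# Two hypothetical points: one in the top-left cell, one in the bottom-right.
pts <- vect(cbind(x = c(0.5, 9.5), y = c(9.5, 0.5)), type = "points", crs = crs(r))
pt_vals <- extract(r, pts)            # one row per point: ID + raster value

# One hypothetical boundary polygon; summarise the cells it covers.
poly <- vect("POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))", crs = crs(r))
poly_mean <- extract(r, poly, fun = mean)   # one row per polygon: ID + mean
```

The same two calls to `extract()` carry over to a case like township boundaries: pass the boundary polygons and a summary function to get one value per unit.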
We may want to merge data by geographic overlap; get distances in geospatial network
R package sf is easiest to use.
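A minimal sketch of merging by geographic overlap with sf follows. The points, the two square "districts", and the helper `make_square()` are hypothetical, and coordinates are in arbitrary planar units with no CRS. The distance shown is straight-line only; distances along a network would need additional tooling.

```r
library(sf)

# Three hypothetical points (arbitrary planar coordinates, no CRS).
pts <- st_as_sf(
  data.frame(id = 1:3, x = c(0.5, 1.5, 5), y = c(0.5, 0.5, 5)),
  coords = c("x", "y")
)

# Helper (hypothetical) building a unit square with lower-left corner (x0, y0).
make_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Two adjacent hypothetical districts.
polys <- st_sf(
  district = c("A", "B"),
  geometry = st_sfc(make_square(0, 0), make_square(1, 0))
)

# Spatial join: attach to each point the district it falls within.
# Points outside every polygon get NA (left join is the default).
joined <- st_join(pts, polys, join = st_within)

# Pairwise straight-line distances between the points.
d <- st_distance(pts)
```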
data.table package in R:
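library(data.table)  # grouped j-expressions and frankv()
library(magrittr)    # %>% pipe
# For each electoral district (dapil): effective number of parties (enp),
# largest-remainder seat allocation, and the margin for the last seat
dapil_elections_2004 = results_2004_clean[, {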
p_no = no_partai_politik;
dapil_seats = meta_total_seats %>% unique;
all_total_votes = sum(total_votes, na.rm = T);
p_vs = total_votes / all_total_votes;
enp = 1 / sum(p_vs^2);
ni_vs = p_vs[which(p_no %in% p_NI_2004)];
enp_NI = 1 / sum((ni_vs/sum(ni_vs, na.rm = T))^2, na.rm = T);
icit_vs = p_vs[which(p_no %in% p_ICIT_2004)];
enp_ICIT = 1 / sum((icit_vs/sum(icit_vs, na.rm = T))^2, na.rm = T);
ic_vs = p_vs[which(p_no %in% p_IC_2004)];
enp_IC = 1 / sum( (ic_vs/sum(ic_vs, na.rm = T)) ^2, na.rm = T);
it_vs = p_vs[which(p_no %in% p_IT_2004)];
enp_IT = 1 / sum( (it_vs/sum(it_vs, na.rm = T)) ^2, na.rm = T);
quota = round(all_total_votes/dapil_seats);
first_round_seats = floor(total_votes / quota);
remaining_seats = dapil_seats - sum(first_round_seats, na.rm = T);
remainder = total_votes - (first_round_seats*quota);
remainder_rank = frankv(remainder, order = -1L);
remainder_quota = (remainder_rank <= remaining_seats);
won_seats = first_round_seats + remainder_quota;
last_seat = (remainder_rank %in% (remaining_seats + 0:1));
last_seat_win = remainder_rank %in% (remaining_seats);
last_seat_lose = remainder_rank %in% (remaining_seats + 1);
last_seat_mov_pct = (remainder[which(last_seat_win)] -
remainder[which(last_seat_lose)]) / all_total_votes;
party_last_seat_winner = p_no[which(last_seat_win)];
party_last_seat_loser = p_no[which(last_seat_lose)];
winner_remainder = remainder[which(last_seat_win)];
loser_remainder = remainder[which(last_seat_lose)];
winner_first_round_seats = first_round_seats[which(last_seat_win)];
loser_first_round_seats = first_round_seats[which(last_seat_lose)];
winner_total_votes = total_votes[which(last_seat_win)];
loser_total_votes = total_votes[which(last_seat_lose)];
list(
bw = last_seat_mov_pct,
enp = enp,
enp_NI = enp_NI,
enp_ICIT = enp_ICIT,
enp_IC = enp_IC,
enp_IT = enp_IT,
remaining_seats = remaining_seats,
remaining_seats_pct = remaining_seats/dapil_seats,
party_last_seat_winner = party_last_seat_winner,
party_last_seat_loser = party_last_seat_loser,
winner_first_round_seats = winner_first_round_seats,
loser_first_round_seats = loser_first_round_seats,
winner_total_votes = winner_total_votes,
loser_total_votes = loser_total_votes,
winner_remainder =winner_remainder,
loser_remainder = loser_remainder,
NI_close = any(p_no[which(last_seat)] %in% p_NI_2004),
ICIT_close = any(p_no[which(last_seat)] %in% p_ICIT_2004),
IC_close = any(p_no[which(last_seat)] %in% p_IC_2004),
IT_close = any(p_no[which(last_seat)] %in% p_IT_2004),
all_total_votes = all_total_votes,
meta_total_votes = unique(meta_total_votes) %>% sum,
dapil_seats = dapil_seats,
all_NI_vs = sum(total_votes[which(p_no %in% p_NI_2004)]),
all_ICIT_vs = sum(total_votes[which(p_no %in% p_ICIT_2004)]),
all_IC_vs = sum(total_votes[which(p_no %in% p_IC_2004)]),
all_IT_vs = sum(total_votes[which(p_no %in% p_IT_2004)]),
all_NI_fr_seats = sum(first_round_seats[which(p_no %in% p_NI_2004)]),
all_ICIT_fr_seats = sum(first_round_seats[which(p_no %in% p_ICIT_2004)]),
all_IC_fr_seats = sum(first_round_seats[which(p_no %in% p_IC_2004)]),
all_IT_fr_seats = sum(first_round_seats[which(p_no %in% p_IT_2004)]),
all_NI_seats = sum(won_seats[which(p_no %in% p_NI_2004)]),
all_ICIT_seats = sum(won_seats[which(p_no %in% p_ICIT_2004)]),
all_IC_seats = sum(won_seats[which(p_no %in% p_IC_2004)]),
all_IT_seats = sum(won_seats[which(p_no %in% p_IT_2004)]),
all_NI_remainder = sum(remainder[which(p_no %in% p_NI_2004)]),
all_ICIT_remainder = sum(remainder[which(p_no %in% p_ICIT_2004)]),
all_IC_remainder = sum(remainder[which(p_no %in% p_IC_2004)]),
all_IT_remainder = sum(remainder[which(p_no %in% p_IT_2004)]),
IT_count = sum(total_votes[which(p_no %in% p_IT_2004)] > 0),
NI_count = sum(total_votes[which(p_no %in% p_NI_2004)] > 0),
IC_count = sum(total_votes[which(p_no %in% p_IC_2004)] > 0),
ICIT_count = sum(total_votes[which(p_no %in% p_ICIT_2004)] > 0),
quota = quota
)
}, by = list(province, kabupaten, kab_code, dapil)]
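The seat-allocation steps in the block above (quota, first-round seats, remainders ranked for the leftover seats) can be checked on a toy district. This is a sketch under stated assumptions: the four parties, their vote counts, and the five seats are invented for illustration, and only the core allocation logic is reproduced.

```r
library(data.table)

# Hypothetical district: 4 parties, 10,000 votes, 5 seats.
votes <- data.table(
  party       = c("A", "B", "C", "D"),
  total_votes = c(4300, 3100, 1700, 900)
)
dapil_seats <- 5

alloc <- votes[, {
  quota             <- round(sum(total_votes) / dapil_seats)   # votes per full seat
  first_round_seats <- floor(total_votes / quota)              # seats won by full quotas
  remaining_seats   <- dapil_seats - sum(first_round_seats)    # seats still unfilled
  remainder         <- total_votes - first_round_seats * quota # leftover votes
  remainder_rank    <- frankv(remainder, order = -1L)          # 1 = largest remainder
  won_seats         <- first_round_seats + (remainder_rank <= remaining_seats)
  # Margin between the party that takes the last seat and the first loser,
  # as a share of all votes cast in the district.
  margin <- (remainder[remainder_rank == remaining_seats] -
             remainder[remainder_rank == remaining_seats + 1]) / sum(total_votes)
  list(party = party, first_round_seats = first_round_seats,
       remainder = remainder, won_seats = won_seats, last_seat_margin = margin)
}]
```

Here the quota is 2,000, parties A and B take three seats in the first round, and the two leftover seats go to the largest remainders (C, then B), so B edges out D for the last seat by 200 votes.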
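# Overlap join: pair each soldier with the others in the same company
# whose service intervals overlap, then total casualties among those peers
setkey(a, companypk, indate, expected_out)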
setkey(b, companypk, indate, expected_out)
company_casualty_rate = foverlaps(na.omit(a), na.omit(b), by.x = c('companypk', 'indate', 'expected_out'), by.y = c('companypk', 'indate', 'expected_out')) %>%
.[personpk != i.personpk, list(in_company_n = .N,
company_deaths = sum(i.died),
company_kia = sum(i.kia),
company_disabled = sum(i.disabled)), by = list(personpk, companypk)]
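To see what the overlap join above produces, here is a toy version. The four service spells, with dates coded as integer day numbers, are invented for illustration; the column names follow the code above.

```r
library(data.table)

# Hypothetical service spells: persons 1 and 2 overlap in company 10,
# person 3 serves later in company 10, person 4 is alone in company 20.
spells <- data.table(
  personpk     = 1:4,
  companypk    = c(10L, 10L, 10L, 20L),
  indate       = c(1L, 5L, 20L, 1L),
  expected_out = c(10L, 15L, 30L, 10L),
  died         = c(0L, 1L, 0L, 1L)
)
a <- copy(spells)
b <- copy(spells)
setkey(a, companypk, indate, expected_out)
setkey(b, companypk, indate, expected_out)

# Same company AND overlapping [indate, expected_out] intervals.
# Columns from `a` that clash with `b` get an "i." prefix in the result.
pairs <- foverlaps(a, b)

# Drop self-matches, then summarise each person's overlapping peers.
peers <- pairs[personpk != i.personpk,
               list(in_company_n = .N, company_deaths = sum(i.died)),
               by = list(personpk, companypk)]
```

Only persons 1 and 2 appear in `peers`: each has one overlapping comrade, and person 1's peer group records one death (person 2). Persons with no overlapping peers drop out, so a merge back to the full roster would be needed to keep them.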
Fuzzy Matching
stringdist package in R makes this easy and fast
If data contain records of people/organizations to be linked:
Matching soldiers to hometowns:
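A minimal sketch of fuzzy matching with stringdist, in the spirit of linking hometowns as recorded to a clean list of place names. Both vectors are invented for illustration, and the distance method (optimal string alignment) and the cutoff of 2 edits are assumptions to tune on real data.

```r
library(stringdist)

# Hypothetical hometowns as transcribed, with typos and one unlisted place.
recorded  <- c("Indianapolus", "Terre Hautte", "Fort Wayn", "Springfield")
# Hypothetical clean list of place names to match against.
gazetteer <- c("Indianapolis", "Terre Haute", "Fort Wayne", "Evansville")

# For each recorded name, find the closest gazetteer entry within 2 edits;
# names with no entry that close are left unmatched (NA).
idx <- amatch(tolower(recorded), tolower(gazetteer), method = "osa", maxDist = 2)

matches <- data.frame(
  recorded = recorded,
  matched  = gazetteer[idx],
  distance = stringdist(tolower(recorded), tolower(gazetteer[idx]), method = "osa"),
  stringsAsFactors = FALSE
)
```

Keeping the distance next to each match makes it easy to review borderline links by hand before merging.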