Skills for accessing data help:
Even with limited financial support, PhD/MA students can produce more high-quality, unique research.
Examples:
African American Sailors in the Civil War
Downloading an online database: the full-count 1860 US Census from Fold3.com
Searching a database: historical newspaper archives for lynching discourse
Automating searches and the saving of results.
What if you have natively digital data?
Mostly, we access resources on the internet using HTTP.
As internet users, the programs/apps we use send requests of various types to a server (a computer somewhere else).
The server then sends us a response that indicates the status of our request (did it go OK?) and any data.
In my experience, we largely only need two request types: GET and POST.
Right-click in your browser, select “Inspect”, and choose the “Network” tab (on Safari, you may first need to enable the Develop menu under Settings).
http://chroniclingamerica.loc.gov/search/titles/results/?state=Alabama&page=2
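To illustrate, here is a sketch of the same GET request sent from R with httr2 (the package we use below); everything after the “?” becomes query parameters:

library(httr2)

# Recreate the URL above: the query string (state=Alabama, page=2)
# becomes req_url_query() arguments.
resp <- request("http://chroniclingamerica.loc.gov/search/titles/results/") |>
  req_url_query(state = "Alabama", page = 2) |>
  req_perform()

resp_status(resp)  # 200 means the request went OK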
The request includes a “form” that is submitted:
Requests also have headers, which contain information about where the request comes from, as well as “cookies” (used to track permission to access content… and just for tracking).
Click on a request, examine the “Headers” tab…
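As a sketch, submitting a form and setting headers with httr2 looks like this; the URL and form field names are hypothetical:

library(httr2)

# Hypothetical form submission: real field names depend on the site's form.
resp <- request("https://example.com/search") |>
  req_body_form(query = "union navy", page = "1") |>      # the submitted "form"
  req_headers("User-Agent" = "my-research-script 0.1") |> # where the request comes from
  req_perform()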
When we submit requests, the format of the response may vary:
HTML: a text markup language that is converted to a visual page in a browser.
<tag>stuff</tag>
<tag id="style1" class="style2">stuff</tag>
<tag attribute="something">stuff</tag>
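As a quick sketch of how such tags are selected later with rvest, minimal_html() builds a page from a string (the <p> tags here stand in for <tag>):

library(rvest)

page <- minimal_html('
  <p id="style1" class="style2">stuff</p>
  <p>other stuff</p>
')

page |> html_elements("p") |> html_text()       # text of every <p>
page |> html_element("#style1") |> html_text()  # the element with id="style1"
page |> html_elements(".style2") |> html_text() # elements with class="style2"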
JSON: a structured data object. It contains:
- dictionaries (in “{}”) of “keys” (labels), a “:”, and values (the things being labeled), separated by commas
- lists of values (in “[]”), separated by commas
{'Key1':'Value', 'Key2': ['Value1','Value2']}
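In R, the jsonlite package (my suggestion; it is not named on the slide) turns JSON like this into lists and vectors:

library(jsonlite)

# Note: valid JSON uses double quotes around keys and strings.
json_text <- '{"Key1": "Value", "Key2": ["Value1", "Value2"]}'
parsed <- fromJSON(json_text)

parsed$Key1  # "Value"
parsed$Key2  # c("Value1", "Value2") -- the JSON list becomes a vector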
Other responses are possible:
What you do…
“Scraping” refers to using programming tools to automate the process of requesting web pages and extracting data from them.
Procedure:
For sending requests: httr2
For parsing HTML: rvest
Let’s go here: https://en.wikipedia.org/w/index.php?title=Category:Ships_of_the_Union_Navy
Let’s go here: http://www.dalbydata.com/user.php?action=civwarsearch
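For the first page, a sketch of the httr2 + rvest procedure; the CSS selector is an assumption about how Wikipedia currently structures category listings:

library(httr2)
library(rvest)

resp <- request("https://en.wikipedia.org/w/index.php") |>
  req_url_query(title = "Category:Ships_of_the_Union_Navy") |>
  req_perform()

page <- resp_body_html(resp)

# Category members are links inside div.mw-category (assumed selector)
ships <- page |> html_elements("div.mw-category a") |> html_text()
head(ships)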
Whereas “scraping” refers to pulling data from webpages that you’d visit with your browser…
An API (application programming interface) uses HTTP requests to directly access structured data (typically in JSON format).
APIs may be useful to:
There are lots of APIs for various purposes. The next session will cover using APIs for LLMs like GPT.
Two situations: a documented public API, or an undocumented API behind a website.
Procedure:
Let’s say we want to get latitude and longitude for a list of addresses.
We could look them up in Google Maps one by one, but this would take a long time.
We could instead use the OpenStreetMap API.
The Nominatim API from OSM is public.
Let’s turn to our R Notebook.
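As a preview, a minimal sketch of geocoding one address with Nominatim; the endpoint and parameters follow the public API, whose usage policy asks for an identifying User-Agent and about one request per second:

library(httr2)

geocode <- function(address) {
  resp <- request("https://nominatim.openstreetmap.org/search") |>
    req_url_query(q = address, format = "json", limit = 1) |>
    req_headers("User-Agent" = "my-research-project (me@example.edu)") |>  # hypothetical contact
    req_perform()
  hit <- resp_body_json(resp)[[1]]
  data.frame(address = address,
             lat = as.numeric(hit$lat),
             lon = as.numeric(hit$lon))
}

geocode("1600 Pennsylvania Ave NW, Washington, DC")
# For a list of addresses: loop over them with Sys.sleep(1) between requests.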
https://www.fold3.com/publication/120/us-navy-survivors-certificates-1861-1910/
Open the Network tool; start clicking through the site; look for requests, filtering by “XHR”; check the “Request” and “Response” tabs.
A common situation is:
We have images or PDFs with text data or tables that we want to digitize.
If your data is in a natively digital PDF… you can get text and tables.
Which tool to use: tabulapdf in R, or camelot in Python.
PDF table extraction is finicky, with occasional installation trouble.
The consensus is that camelot in Python is “best”.
PDF table extraction often requires some code to “clean up” the table; a sketch follows:
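This sketch assumes a natively digital file report.pdf whose first table has a name column and a numeric column; the cleanup steps shown are typical but vary by document:

library(tabulapdf)
library(dplyr)

# extract_tables() returns one data frame per detected table in the PDF
tables <- extract_tables("report.pdf", output = "tibble")  # hypothetical file
tab <- tables[[1]]

# Typical cleanup: fix header names, strip stray characters, coerce types
tab_clean <- tab |>
  rename(name = 1, count = 2) |>  # column positions are assumptions
  mutate(count = as.numeric(gsub("[^0-9.]", "", count)))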
If your data is in a scanned image, you can get text and tables, but it will be harder.
We need to use some combination of OCR (optical character recognition) and layout parsing.
tesseract: an open-source OCR engine with an R package. (Alternatively, the Google Vision API.)
Example: the Indiana People’s Guides list the partisanship of 30,000 people in 1874.
Let’s see how to get the text…
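A minimal sketch with the tesseract package, assuming one scanned page saved as page.png:

library(tesseract)

eng <- tesseract("eng")                # English-language OCR engine
text <- ocr("page.png", engine = eng)  # hypothetical scanned image
cat(text)

# ocr_data() gives word-level output with confidence scores and bounding
# boxes -- useful for layout parsing
words <- ocr_data("page.png", engine = eng)
head(words)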
It is common to need to process the extracted data further:
If you need to pull out specific segments of text (e.g. names? addresses?)…
stringr package: reference guide here.
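For instance, a sketch pulling a name-like and an address-like segment out of one directory line; the regular expressions are rough illustrations, not general patterns:

library(stringr)

line <- "SMITH, John, farmer, 42 Main St."

str_extract(line, "^[A-Z]+, [A-Za-z]+")    # "SMITH, John" (SURNAME, Forename)
str_extract(line, "\\d+ [A-Za-z]+ St\\.")  # "42 Main St." (number + street)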
Fuzzy matching: if data contain records of people/organizations to be linked, the stringdist package in R makes this easy and fast.
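A sketch of matching two messy name lists with stringdist; Jaro-Winkler ("jw") is a common choice for names, not a prescription:

library(stringdist)

a <- c("Smith, John", "Doe, Jane")
b <- c("Smyth, Jon", "Doe, Jayne", "Brown, Bob")

# Distance between every pair of names (rows = a, columns = b)
d <- stringdistmatrix(a, b, method = "jw")

# For each name in a, the closest candidate in b
b[apply(d, 1, which.min)]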
When scraping or using APIs… the RSQLite package (a sketch follows):
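A minimal sketch, assuming the point is to save results to disk as you go so an interrupted scrape can resume without losing work:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "scrape_results.sqlite")  # hypothetical file

# Append each batch of results as it arrives; nothing is lost if the job dies
page_results <- data.frame(url = "https://example.com/p1", text = "...")
dbWriteTable(con, "results", page_results, append = TRUE)

dbGetQuery(con, "SELECT COUNT(*) FROM results")
dbDisconnect(con)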