What is scraping? (And why might it be useful?)
Ways of scraping
How to get data
How to process data
Common Issues
In Python: requests, bs4, lxml, json, re, selenium
In R: httr, rvest, jsonlite, etc.
I use Python (usually) by preference/habit. Not a requirement.
How and why did public discourse about lynching change over time?
But there are more than 1 million hits for “lynching” from 1880 to 1920.
There are 600 million newspaper pages on this site alone.
How can we examine all of this text?
How did white Northerners come to support black suffrage after the US Civil War?
Historians have collected data on county votes for and against black suffrage.
But how do we get county-level estimates of military service?
There are lots of results (more than 20 thousand men); we could copy and paste… but this is time consuming.
How can we extract this database for use?
Enforcement of federal laws against gender discrimination in American universities usually requires someone to file a Title IX complaint.
In recent years, the use of this complaint process has become more widespread. How did this diffusion happen?
One answer might be that knowledge of the Title IX complaint process diffused among universities that are peers or by emulation of respected institutions.
How do we construct a network of universities that shows which institutions see other institutions as peers or worthy of emulation?
There are nearly 2,000 accredited universities and colleges, and sometimes many official Twitter accounts at each university.
Lurking on Twitter to put this network together would take forever…
What are the effects of wartime experience during civil war on post-war political beliefs?
We can look at this in the context of the American Civil War: did exposure to combat and slavery cause a shift in partisanship?
There are 70,000+ names across a few thousand pages.
Rather than hire undergraduates, we can try computer vision.
How do we use this tool on thousands of pages of text?
What do these examples have in common?
“Scraping” refers to using some programming tools to automate the process of collecting and processing data from websites.
Even with limited financial support, PhD/MA students can produce more high-quality, unique research.
Mostly, we access resources on the internet using HTTP.
As internet users, the programs/apps we use send requests of various types to a server (a computer somewhere else).
The server then sends us a response that indicates the status of our request (did it go OK?) and any data.
In my experience, we only ever need to use two request types: GET and POST.
A POST request includes a ‘form’ of fields that is submitted.
Requests also have headers, which contain information about where the request comes from, as well as “cookies” (used to track permission to access content… and just for tracking).
When we submit requests, the format of the response may vary (e.g., HTML or JSON).
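As a concrete illustration, here is a minimal Python sketch with the requests package (example.com is a stand-in URL) showing the pieces just described: the request, its headers, and the status and format of the response.

```python
import requests

# Send a GET request; custom headers (e.g., a User-Agent) travel with it
resp = requests.get(
    "https://example.com",
    headers={"User-Agent": "my-research-scraper"},
)

print(resp.status_code)                  # did the request go OK? (200 = yes)
print(resp.headers.get("Content-Type"))  # format of the response (HTML, JSON, ...)
print(resp.text[:200])                   # the beginning of the returned data
```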
To see what kinds of requests and responses are involved in accessing the data you want, you need to use the developer tools in your browser:
Chrome: open with Ctrl+Shift+I (Cmd+Option+I on Mac), or via More Tools → Developer Tools
Firefox: open with Ctrl+Shift+I (Cmd+Option+I on Mac), or press F12
Let’s practice:
Open your developer tools in a new tab; Click on the “Network” tab. Then open a webpage (ideally one you’d like to get data from)
Explore the different requests that are made on a page, and click and explore the headers, request, response
This is a great chance to ask questions.
Questions
Usually, I scrape in one of three ways:
Need to locate places within unusual sets of boundaries (e.g. historical boundaries, electoral constituencies):
We can use APIs to do this (e.g., from R or via carto.com).
If we use a web service API, it will have instructions on how to use it: like this
In this case, all we need is the URL of the API:
https://maps.googleapis.com/maps/api/geocode/json
The documentation tells us the request requires an address field and a key (so they can charge you). And we learn that it is a GET request.
In Python
In R
Either way, we are writing code to open this URL and read the response:
https://maps.googleapis.com/maps/api/geocode/json?address=Diez%20Vistas%20Trail&key=YOUR_API_KEY
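For instance, a minimal Python sketch of this request with the requests package (YOUR_API_KEY is a placeholder you would replace with your own key):

```python
import requests

# Base URL of the Google geocoding API (from the documentation above)
url = "https://maps.googleapis.com/maps/api/geocode/json"

# requests handles the URL encoding of the parameters for us
params = {"address": "Diez Vistas Trail", "key": "YOUR_API_KEY"}

resp = requests.get(url, params=params)
data = resp.json()  # parse the JSON response into a Python dictionary

# With a valid key, the coordinates sit under results -> geometry -> location
if data.get("status") == "OK":
    print(data["results"][0]["geometry"]["location"])
```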
Sandbox time:
NOTE:
The httr package in R and the requests package in Python convert plain text to URL encoding when submitting requests.
Lots of APIs; some are free, most cost money.
They may have a “client”: a package/module in R/Python that makes the API easier to access.
When we want to extract data from an HTML page, things get a bit more complicated.
There is no documentation. We have to discover what’s going on for ourselves.
Use the developer tools to find the relevant requests (usually of type html or xhr).
Let’s revisit the list of Civil War soldiers from Minnesota:
Open the page; Open the developer tools; click on “Network”; search for last names that start with “b”.
In Python:
In R:
We were able to get the HTML from the page containing the table by:
If it is a POST request, we sometimes need to extract hidden fields in the form. These can be detected in the HTML of the search page using R/Python (see the sketch below).
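A hedged sketch of that pattern in Python; the URLs and field names below are placeholders, not the Minnesota site’s actual ones, which you would read off the developer tools:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URLs and field names -- the real ones come from the
# developer tools and the search page's HTML, as described above.
search_page_url = "https://example.org/soldiers/search"
results_url = "https://example.org/soldiers/results"

session = requests.Session()

# 1. Open the search page and pull out any hidden form fields
soup = BeautifulSoup(session.get(search_page_url).text, "html.parser")
form_data = {
    tag["name"]: tag.get("value", "")
    for tag in soup.find_all("input", type="hidden")
    if tag.get("name")
}

# 2. Add our own search terms and submit the form as a POST request
form_data["last_name"] = "b"
resp = session.post(results_url, data=form_data)
print(resp.status_code)  # the response body contains the HTML table of soldiers
```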
Let’s return to our example of historical newspaper coverage of lynching. This page lets us search for words in newspaper pages (behind a paywall)
I’ll open and search for “lynching” (you can too, but it will redirect you away from results), using the developer tools.
Let’s explore the requests: what can we find?
This link returns a JSON object containing the matching newspaper pages (the first 30)
This is an API, but we have to figure out how the parameters work: let’s look at the request more closely.
Note! You cannot look at the search results page without an account, BUT you can open the API request results without a login!
Once we know how the API works, we can produce a script that finds the number of search results, and keeps pulling groups of 30 or more search results until we have them all.
(This is what the search results page does when you scroll down. I’ll show you)
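A hedged sketch of such a script; the URL, parameter names, and JSON keys are placeholders standing in for whatever the developer tools reveal:

```python
import requests

# Placeholder URL and parameter names -- the real ones come from inspecting
# the API request in the developer tools, as described above.
base_url = "https://example.com/api/search"
params = {"terms": "lynching", "start": 0, "count": 30}

records = []
while True:
    resp = requests.get(base_url, params=params)
    page = resp.json()
    hits = page.get("records", [])      # assumed key holding the matching pages
    if not hits:
        break                           # no more results: we have them all
    records.extend(hits)
    params["start"] += params["count"]  # ask for the next group of 30

print(len(records), "results collected")
```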
Sandbox time: Try changing parts of the URL for the API call to see how it changes the results. (Change the search term, starting point, number of results)
Any questions?
Go to https://tracktherecovery.org/
Open the developer tools, click on network.
Click on “explore the data”; under “public health” click on “time outside home”; click on “counties”
Can you find the request that gives us data on time spent outside the home by county?
We have seen data returned as either JSON objects or as HTML.
We focus on HTML, because it requires more work.
In Python: the json.loads() function turns the JSON into a Python dictionary.
In R: httr will convert JSON to a list object.
You select data from the JSON object by selecting specific values corresponding to specific keys.
In Python
In R
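For the Python case, a minimal sketch with a made-up JSON string:

```python
import json

# A small JSON string standing in for an API response
raw = '{"status": "OK", "results": [{"name": "Duluth", "count": 42}]}'

data = json.loads(raw)  # now an ordinary Python dictionary

# Select values by key (and by position, for lists nested inside)
print(data["status"])               # "OK"
print(data["results"][0]["count"])  # 42
```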
HTML is harder to use, because it organizes data for visual display.
Content in HTML is nested inside tags (e.g., <p>, <div>, <table>, <tr>, <td>).
Tags have attributes (class, id, style, etc.) that take on specific values.
Tags can be nested within other tags.
When extracting content from HTML, it is useful to right click and select “Inspect Element”. This will open the HTML and show the tag containing the data.
Let’s do this with the Minnesota Civil War soldiers.
What tags contain the data? What tags contain data on individual soldiers?
In Python (for R see here)
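A minimal sketch of what that looks like with bs4; the HTML below is a toy stand-in for the page we pulled down, assuming the soldiers sit in rows of a table:

```python
from bs4 import BeautifulSoup

# `html` would be the text of the response we fetched earlier;
# here is a tiny stand-in with the same kind of structure.
html = """
<table id="soldiers">
  <tr><td>Babcock, John</td><td>1st Minnesota</td></tr>
  <tr><td>Baker, William</td><td>2nd Minnesota</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the tag containing the data, then the nested tags for each soldier
table = soup.find("table", id="soldiers")
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)
```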
If your data is behind a paywall or a login, you may be unable to access pages without additional work.
Requests use server resources: if you are making too many requests…
Larger scraping projects present a few difficulties:
Organize requests to be completed and response data using SQL
I use SQLite for larger projects (examples)
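A minimal sketch of that pattern with Python’s built-in sqlite3 module (table and column names are illustrative):

```python
import sqlite3
import requests

con = sqlite3.connect("scrape.db")
con.execute("""CREATE TABLE IF NOT EXISTS requests
               (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)""")
con.execute("""CREATE TABLE IF NOT EXISTS responses
               (url TEXT, body TEXT)""")

# Queue up the requests to be completed (re-running the script is safe)
con.execute("INSERT OR IGNORE INTO requests (url) VALUES (?)",
            ("https://example.com/page1",))
con.commit()

# Work through whatever has not been fetched yet, saving each response
for (url,) in con.execute("SELECT url FROM requests WHERE done = 0").fetchall():
    body = requests.get(url).text
    con.execute("INSERT INTO responses VALUES (?, ?)", (url, body))
    con.execute("UPDATE requests SET done = 1 WHERE url = ?", (url,))
    con.commit()
```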
Error Handling:
Requests may fail or time out. You want to handle errors so that they don’t stop the entire scraping project. This means writing code for what to do when errors occur.
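A minimal sketch of this kind of error handling with requests (the URL is a placeholder):

```python
import time
import requests

def fetch(url, retries=3):
    """Try a request a few times; give up (and move on) if it keeps failing."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()   # treat 4xx/5xx responses as errors too
            return resp.text
        except requests.RequestException as err:
            print(f"Attempt {attempt + 1} failed for {url}: {err}")
            time.sleep(5)             # wait a little before retrying
    return None                       # record the failure, don't crash the run

page = fetch("https://example.com/page1")
```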
If you have thousands of requests to make, it can take a long time. Computers are fast, but you have to wait for data to transfer.
One way to speed things up is to send multiple requests at once: parallel processing or running asynchronous tasks.
gevent (use case: Census, military records)
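A minimal sketch of sending several requests at once with gevent (placeholder URLs):

```python
# Monkey patching must come first so that requests' networking is cooperative
from gevent import monkey
monkey.patch_all()

import gevent
import requests

urls = ["https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"]

def fetch(url):
    return url, requests.get(url, timeout=30).status_code

# Spawn one lightweight task per request and wait for them all to finish
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)

for job in jobs:
    print(job.value)
```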
Many organizations release data as large PDFs. If the PDFs are natively digital (not scans), you can automate the extraction of these tables.
tabulizer package in R (see here)
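Since the other examples here are in Python, a rough Python analogue is the tabula-py package (a substitute for the R tabulizer package named above, not what the slide uses); a minimal sketch with a placeholder file name:

```python
import tabula  # the tabula-py package; requires Java installed

# Read every table from a natively digital PDF into pandas DataFrames
tables = tabula.read_pdf("county_report.pdf", pages="all")

# Save each extracted table for later cleaning/analysis
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)
```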
Even if I ask RAs to manually go through online search results: