We will start by loading the rvest
package, which will help us scrape data from the web.
library(rvest)
Here we will learn how to scrape the number of new social security number holders by year in the US, and then the collected data so that we can generate a plot showing the evolution in this variable over time.
The first step is to read the html code from the website we want to scrape, using the read_html()
function. If we want to see the html in text format, we can then use html_text()
.
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"
html <- read_html(url) # reading the html code into memory
html # not very informative
## {html_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n <!-- BANNER -->\r\n <aside class="accessibility" id="accessi ...
substr(html_text(html), 1, 1000) # first 1000 characters
## [1] "Social Security number holders\r\ntd,th {text-align:center}\r\n(window.BOOMR_mq=window.BOOMR_mq||[]).push([\"addVar\",{\"rua.upush\":\"false\",\"rua.cpush\":\"false\",\"rua.upre\":\"false\",\"rua.cpre\":\"false\",\"rua.uprl\":\"false\",\"rua.cprl\":\"false\",\"rua.cprf\":\"false\",\"rua.trans\":\"\",\"rua.cook\":\"false\",\"rua.ims\":\"false\",\"rua.ufprl\":\"false\",\"rua.cfprl\":\"false\",\"rua.isuxp\":\"false\",\"rua.texp\":\"norulematch\"}]);!function(e){var n=\"https://s.go-mpulse.net/boomerang/\";if(\"False\"==\"True\")e.BOOMR_config=e.BOOMR_config||{},e.BOOMR_config.PageParams=e.BOOMR_config.PageParams||{},e.BOOMR_config.PageParams.pci=!0,n=\"https://s2.go-mpulse.net/boomerang/\";if(window.BOOMR_API_key=\"LERZW-HECFS-R8H4E-23UQ7-ERMQB\",function(){function e(){if(!o){var e=document.createElement(\"script\");e.id=\"boomr-scr-as\",e.src=window.BOOMR.url,e.async=!0,i.parentNode.appendChild(e),o=!0}}function t(e){o=!0;var n,t,a,r,d=document,O=window;if(window.BOOMR.snippetMethod=e?\"if\":\"i\",t=function(e,n){var t=d.createElement(\"script\");t.id=n||\"boomr-if-as"
To extract all the tables in the html code automatically, we use html_table()
. Note that it returns a list of data frames, so in order to work with this dataset, we will have to subset the second element of this list.
tab <- html_table(html, fill=TRUE)
str(tab)
## List of 1
## $ :'data.frame': 142 obs. of 4 variables:
## ..$ Year of birth: int [1:142] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 ...
## ..$ Male : chr [1:142] "118,399" "108,279" "122,031" "112,475" ...
## ..$ Female : chr [1:142] "97,606" "98,855" "115,694" "120,059" ...
## ..$ Total : chr [1:142] "216,005" "207,134" "237,725" "232,534" ...
pop <- tab[[1]]
Now let’s clean the data so that we can use it for our analysis. We need to convert the population values into a numeric format, which requires deleting the commas. We will also change the variable names so that it’s easier to work with them.
pop$Male <- as.numeric(gsub(",", "", pop$Male))
pop$Female <- as.numeric(gsub(",", "", pop$Female))
names(pop) <- c("year", "male", "female", "total")
And now we can plot to see how the number of people applying for a Social Security Number in the US has increased over time.
plot(pop$year, pop$male, xlab="Year of birth", ylab="New SSN petitions",
col="darkgreen", type="l")
lines(pop$year, pop$female, col="red")
legend(x="topleft", c("Male", "Female"), lty=1, col=c("green", "red"))
When there are multiple tables on the website, scraping them becomes a bit more complicated. Let’s work through a common case scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.
url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE)
length(tables)
## [1] 15
The function now returns 15 different tables. I had to use the option fill=TRUE
because some of the tables appear to have incomplete rows.
In this case, identifying the part of the html code that contains the table is a better approach. To do so, let’s take a look at the source code of the website. In Google Chrome, go to View > Developer > View Source. All browsers should have similar options to view the source code of a website.
In the source code, search for the text of the page (e.g. 2021
rank), where
is the line break. Right above it you will see: <table class="wikitable sortable"...">
. This is the CSS selector that contains the table. (You can also find this information, probably more easily, by right-clicking anywhere on the table, and choosing Inspect).
Now that we now what we’re looking for, let’s use html_nodes()
to identify all the elements of the page that have that CSS class. (Note that we need to use a dot before the name of the class to indicate it’s CSS.)
wiki <- html_nodes(html, '.wikitable')
length(wiki)
## [1] 8
There are 8 tables in total, and we will extract the second one.
pop <- html_table(wiki[[2]], fill=TRUE)
str(pop)
## 'data.frame': 331 obs. of 11 variables:
## $ 2021rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ City : chr "New York[d]" "Los Angeles" "Chicago" "Houston" ...
## $ State[c] : chr "New York" "California" "Illinois" "Texas" ...
## $ 2021estimate : chr "8,467,513" "3,849,297" "2,696,555" "2,288,250" ...
## $ 2020census : chr "8,804,190" "3,898,747" "2,746,388" "2,304,580" ...
## $ Change : chr "−3.82%" "−1.27%" "−1.81%" "−0.71%" ...
## $ 2020 land area : chr "300.5 sq mi" "469.5 sq mi" "227.7 sq mi" "640.4 sq mi" ...
## $ 2020 land area : chr "778.3 km2" "1,216.0 km2" "589.7 km2" "1,658.6 km2" ...
## $ 2020 population density: chr "29,298/sq mi" "8,304/sq mi" "12,061/sq mi" "3,599/sq mi" ...
## $ 2020 population density: chr "11,312/km2" "3,206/km2" "4,657/km2" "1,390/km2" ...
## $ Location : chr ".mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-"| __truncated__ "34°01′N 118°25′W / 34.01°N 118.41°W / 34.01; -118.41 (2 Los Angeles)" "41°50′N 87°41′W / 41.83°N 87.68°W / 41.83; -87.68 (3 Chicago)" "29°47′N 95°23′W / 29.78°N 95.39°W / 29.78; -95.39 (4 Houston)" ...
As in the previous case, we still need to clean the data before we can use it. For this particular example, let’s see if this dataset provides evidence in support of Zipf’s law for population ranks.
We’ll use regular expressions to remove endnotes and commas in the population numbers, and clean the variable names. (We’ll come back to this later in the course.)
pop$city_name <- gsub('\\[.*\\]', '', pop$City)
pop$population <- pop[,"2020census"]
pop$population <- as.numeric(gsub(",", "", pop$population))
pop$rank <- pop[,"2021rank"]
pop <- pop[,c("rank", "population", "city_name")]
Now we’re ready to generate the figure:
library(ggplot2)
p <- ggplot(pop, aes(x=rank, y=population, label=city_name))
pq <- p + geom_point() + geom_text(hjust=-.1, size=3) +
scale_x_log10("log(rank)") +
scale_y_log10("log(population)", labels=scales::comma) +
theme_minimal()
pq
We can also check if this distribution follows Zipf’s law estimating a log-log regression.
lm(log(rank) ~ log(population), data=pop)
##
## Call:
## lm(formula = log(rank) ~ log(population), data = pop)
##
## Coefficients:
## (Intercept) log(population)
## 21.631 -1.379