Scraping web data in table format

We will start by loading the rvest package, which will help us scrape data from the web.

library(rvest)

The goal of this exercise is to scrape the data on Social Security card applications by year of birth and sex from the Social Security Administration website, and then clean it so that we can plot how the number of applications has changed over time.

The first step is to read the HTML code from the website we want to scrape, using the read_html() function. If we want to see the HTML as plain text, we can then use html_text().

url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"
html <- read_html(url) # reading the html code into memory
html # not very informative
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="oact-sidebar-rows">\r\n<!-- PAGE CONTAINER -->\r\n<div  ...
substr(html_text(html), 1, 1000) # first 1000 characters
## [1] "Social Security number holders\r\n\r\n\r\n\r\n\r\n\n\n  Skip to content\n\n\n\n        \n            Social Security\n        \n        \n                Search\n            \n                Menu\n            \n                Languages\n            \n                Sign in / up\n        \n\r\n\tSocial Security number holders\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n      Popular Baby Names\r\n    \r\n      Background information\r\n      Beyond the top 1000 names\r\n      Social Security card holders graph\r\n  \r\n  \r\n   Our popular name data come from applications for Social Security cards. The table \r\n   below shows the number of applicants for a Social Security card by year of birth and \r\n   sex. The number of such applicants is restricted to U. S. births where the\r\n   year of birth, sex, State of birth (50 States and District of Columbia) are known, \r\n   and where the given name is at least 2 characters long.\r\n   \r\nNumber of Social Security card holders bornin the U. S.\r\n by year of birth and sex\r\nYear ofbirth\r\n Male\r\n Female\r\n"

To extract all the tables in the HTML code automatically, we use html_table(). Note that it returns a list of data frames, so in order to work with this dataset, we will have to subset the second element of the list, which contains the table we are interested in.

tab <- html_table(html, fill=TRUE)
str(tab)
## List of 2
##  $ :'data.frame':    140 obs. of  558 variables:
##   ..$ X1  : chr [1:140] "Popular Baby Names\r\n    \r\n      Background information\r\n      Beyond the top 1000 names\r\n      Social S"| __truncated__ "Year ofbirth" "1880" "1881" ...
##   ..$ X2  : chr [1:140] "Our popular name data come from applications for Social Security cards. The table \r\n   below shows the number"| __truncated__ "Male" "118,400" "108,282" ...
##   ..$ X3  : chr [1:140] "Year ofbirth" "Female" "97,605" "98,855" ...
##   ..$ X4  : chr [1:140] "Male" "Total" "216,005" "207,137" ...
##   ..$ X5  : chr [1:140] "Female" NA NA NA ...
##   ..$ X6  : chr [1:140] "Total" NA NA NA ...
##   ..$ X7  : int [1:140] 1880 NA NA NA NA NA NA NA NA NA ...
##   ..$ X8  : chr [1:140] "118,400" NA NA NA ...
##   ..$ X9  : chr [1:140] "97,605" NA NA NA ...
##   ..$ X10 : chr [1:140] "216,005" NA NA NA ...
##   ..$ X11 : int [1:140] 1881 NA NA NA NA NA NA NA NA NA ...
##   ..$ X12 : chr [1:140] "108,282" NA NA NA ...
##   ..$ X13 : chr [1:140] "98,855" NA NA NA ...
##   ..$ X14 : chr [1:140] "207,137" NA NA NA ...
##   ..$ X15 : int [1:140] 1882 NA NA NA NA NA NA NA NA NA ...
##   ..$ X16 : chr [1:140] "122,031" NA NA NA ...
##   ..$ X17 : chr [1:140] "115,695" NA NA NA ...
##   ..$ X18 : chr [1:140] "237,726" NA NA NA ...
##   ..$ X19 : int [1:140] 1883 NA NA NA NA NA NA NA NA NA ...
##   ..$ X20 : chr [1:140] "112,477" NA NA NA ...
##   ..$ X21 : chr [1:140] "120,059" NA NA NA ...
##   ..$ X22 : chr [1:140] "232,536" NA NA NA ...
##   ..$ X23 : int [1:140] 1884 NA NA NA NA NA NA NA NA NA ...
##   ..$ X24 : chr [1:140] "122,738" NA NA NA ...
##   ..$ X25 : chr [1:140] "137,586" NA NA NA ...
##   ..$ X26 : chr [1:140] "260,324" NA NA NA ...
##   ..$ X27 : int [1:140] 1885 NA NA NA NA NA NA NA NA NA ...
##   ..$ X28 : chr [1:140] "115,945" NA NA NA ...
##   ..$ X29 : chr [1:140] "141,948" NA NA NA ...
##   ..$ X30 : chr [1:140] "257,893" NA NA NA ...
##   ..$ X31 : int [1:140] 1886 NA NA NA NA NA NA NA NA NA ...
##   ..$ X32 : chr [1:140] "119,041" NA NA NA ...
##   ..$ X33 : chr [1:140] "153,735" NA NA NA ...
##   ..$ X34 : chr [1:140] "272,776" NA NA NA ...
##   ..$ X35 : int [1:140] 1887 NA NA NA NA NA NA NA NA NA ...
##   ..$ X36 : chr [1:140] "109,313" NA NA NA ...
##   ..$ X37 : chr [1:140] "155,422" NA NA NA ...
##   ..$ X38 : chr [1:140] "264,735" NA NA NA ...
##   ..$ X39 : int [1:140] 1888 NA NA NA NA NA NA NA NA NA ...
##   ..$ X40 : chr [1:140] "129,906" NA NA NA ...
##   ..$ X41 : chr [1:140] "189,445" NA NA NA ...
##   ..$ X42 : chr [1:140] "319,351" NA NA NA ...
##   ..$ X43 : int [1:140] 1889 NA NA NA NA NA NA NA NA NA ...
##   ..$ X44 : chr [1:140] "119,032" NA NA NA ...
##   ..$ X45 : chr [1:140] "189,219" NA NA NA ...
##   ..$ X46 : chr [1:140] "308,251" NA NA NA ...
##   ..$ X47 : int [1:140] 1890 NA NA NA NA NA NA NA NA NA ...
##   ..$ X48 : chr [1:140] "119,701" NA NA NA ...
##   ..$ X49 : chr [1:140] "201,661" NA NA NA ...
##   ..$ X50 : chr [1:140] "321,362" NA NA NA ...
##   ..$ X51 : int [1:140] 1891 NA NA NA NA NA NA NA NA NA ...
##   ..$ X52 : chr [1:140] "109,265" NA NA NA ...
##   ..$ X53 : chr [1:140] "196,566" NA NA NA ...
##   ..$ X54 : chr [1:140] "305,831" NA NA NA ...
##   ..$ X55 : int [1:140] 1892 NA NA NA NA NA NA NA NA NA ...
##   ..$ X56 : chr [1:140] "131,453" NA NA NA ...
##   ..$ X57 : chr [1:140] "224,913" NA NA NA ...
##   ..$ X58 : chr [1:140] "356,366" NA NA NA ...
##   ..$ X59 : int [1:140] 1893 NA NA NA NA NA NA NA NA NA ...
##   ..$ X60 : chr [1:140] "121,041" NA NA NA ...
##   ..$ X61 : chr [1:140] "225,232" NA NA NA ...
##   ..$ X62 : chr [1:140] "346,273" NA NA NA ...
##   ..$ X63 : int [1:140] 1894 NA NA NA NA NA NA NA NA NA ...
##   ..$ X64 : chr [1:140] "124,893" NA NA NA ...
##   ..$ X65 : chr [1:140] "235,971" NA NA NA ...
##   ..$ X66 : chr [1:140] "360,864" NA NA NA ...
##   ..$ X67 : int [1:140] 1895 NA NA NA NA NA NA NA NA NA ...
##   ..$ X68 : chr [1:140] "126,643" NA NA NA ...
##   ..$ X69 : chr [1:140] "247,106" NA NA NA ...
##   ..$ X70 : chr [1:140] "373,749" NA NA NA ...
##   ..$ X71 : int [1:140] 1896 NA NA NA NA NA NA NA NA NA ...
##   ..$ X72 : chr [1:140] "129,071" NA NA NA ...
##   ..$ X73 : chr [1:140] "251,993" NA NA NA ...
##   ..$ X74 : chr [1:140] "381,064" NA NA NA ...
##   ..$ X75 : int [1:140] 1897 NA NA NA NA NA NA NA NA NA ...
##   ..$ X76 : chr [1:140] "121,942" NA NA NA ...
##   ..$ X77 : chr [1:140] "248,275" NA NA NA ...
##   ..$ X78 : chr [1:140] "370,217" NA NA NA ...
##   ..$ X79 : int [1:140] 1898 NA NA NA NA NA NA NA NA NA ...
##   ..$ X80 : chr [1:140] "132,104" NA NA NA ...
##   ..$ X81 : chr [1:140] "274,144" NA NA NA ...
##   ..$ X82 : chr [1:140] "406,248" NA NA NA ...
##   ..$ X83 : int [1:140] 1899 NA NA NA NA NA NA NA NA NA ...
##   ..$ X84 : chr [1:140] "115,193" NA NA NA ...
##   ..$ X85 : chr [1:140] "247,490" NA NA NA ...
##   ..$ X86 : chr [1:140] "362,683" NA NA NA ...
##   ..$ X87 : int [1:140] 1900 NA NA NA NA NA NA NA NA NA ...
##   ..$ X88 : chr [1:140] "162,133" NA NA NA ...
##   ..$ X89 : chr [1:140] "317,752" NA NA NA ...
##   ..$ X90 : chr [1:140] "479,885" NA NA NA ...
##   ..$ X91 : int [1:140] 1901 NA NA NA NA NA NA NA NA NA ...
##   ..$ X92 : chr [1:140] "115,595" NA NA NA ...
##   ..$ X93 : chr [1:140] "254,229" NA NA NA ...
##   ..$ X94 : chr [1:140] "369,824" NA NA NA ...
##   ..$ X95 : int [1:140] 1902 NA NA NA NA NA NA NA NA NA ...
##   ..$ X96 : chr [1:140] "132,750" NA NA NA ...
##   ..$ X97 : chr [1:140] "280,334" NA NA NA ...
##   ..$ X98 : chr [1:140] "413,084" NA NA NA ...
##   ..$ X99 : int [1:140] 1903 NA NA NA NA NA NA NA NA NA ...
##   .. [list output truncated]
##  $ :'data.frame':    138 obs. of  4 variables:
##   ..$ Year ofbirth: int [1:138] 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 ...
##   ..$ Male        : chr [1:138] "118,400" "108,282" "122,031" "112,477" ...
##   ..$ Female      : chr [1:138] "97,605" "98,855" "115,695" "120,059" ...
##   ..$ Total       : chr [1:138] "216,005" "207,137" "237,726" "232,536" ...
pop <- tab[[2]]
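
Hard-coding the index works here, but we could also pick the element programmatically by looking at its column names. A sketch based on the tab list created above; the column name "Male" is taken from the str() output:

# keep the data frame whose columns include a "Male" count
idx <- which(sapply(tab, function(df) "Male" %in% names(df)))
pop <- tab[[idx[1]]]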

Now let’s clean the data so that we can use it in our analysis. We need to convert the counts to numeric format, which requires removing the commas. We will also rename the variables so that they are easier to work with.

pop$Male <- as.numeric(gsub(",", "", pop$Male))
pop$Female <- as.numeric(gsub(",", "", pop$Female))
names(pop) <- c("year", "male", "female", "total")

And now we can plot the data to see how the number of Social Security card applicants in the US, by year of birth, has changed over time.

plot(pop$year, pop$male, xlab="Year of birth", ylab="New SSN applications",
      col="darkgreen", type="l")
lines(pop$year, pop$female, col="red")
legend(x="topleft", c("Male", "Female"), lty=1, col=c("darkgreen", "red"))

Scraping web data in table format: a more advanced example

When there are multiple tables on a website, scraping them becomes a bit more complicated. Let’s work through a common scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.

url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE)
length(tables)
## [1] 12

The function now returns 12 different tables. I had to use the option fill=TRUE because some of the tables appear to have incomplete rows.

In this case, a better approach is to identify the part of the HTML code that contains the table. To do so, let’s take a look at the source code of the website. In Google Chrome, go to View > Developer > View Source. All browsers have similar options to view the source code of a website.

In the source code, search for the text of the page (e.g. 2017 rank). Right above it you will see: <table class="wikitable sortable" ...>. wikitable is the CSS class of the element that contains the table. (You can also find this information by right-clicking anywhere on the table and choosing Inspect.)

Now that we know what we’re looking for, let’s use html_nodes() to identify all the elements of the page that have that CSS class. (Note that we need to put a dot before the name of the class to indicate it is a CSS class.)

wiki <- html_nodes(html, '.wikitable')
length(wiki)
## [1] 7
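
One way to figure out which of these nodes holds the ranking we want is to look at the dimensions of each table (a minimal sketch, using the wiki node set created above):

# rows and columns of each 'wikitable' on the page
sapply(wiki, function(node) dim(html_table(node, fill = TRUE)))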

There are 7 tables with this class in total, and we will extract the second one, which contains the city ranking.

pop <- html_table(wiki[[2]])
str(pop)
## 'data.frame':    311 obs. of  11 variables:
##  $ 2017
## rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ City                   : chr  "New York[6]" "Los Angeles" "Chicago" "Houston[7]" ...
##  $ State[5]               : chr  "New York" "California" "Illinois" "Texas" ...
##  $ 2017
## estimate         : chr  "8,622,698" "3,999,759" "2,716,450" "2,312,717" ...
##  $ 2010
## Census           : chr  "8,175,133" "3,792,621" "2,695,598" "2,100,263" ...
##  $ Change                 : chr  "+5.47%" "+5.46%" "+0.77%" "+10.12%" ...
##  $ 2016 land area         : chr  "301.5 sq mi" "468.7 sq mi" "227.3 sq mi" "637.5 sq mi" ...
##  $ 2016 land area         : chr  "780.9 km2" "1,213.9 km2" "588.7 km2" "1,651.1 km2" ...
##  $ 2016 population density: chr  "28,317/sq mi" "8,484/sq mi" "11,900/sq mi" "3,613/sq mi" ...
##  $ 2016 population density: chr  "10,933/km2" "3,276/km2" "4,600/km2" "1,395/km2" ...
##  $ Location               : chr  "40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W / 40.6635; -73.9387 (1 New York City)" "34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W / 34.0194; -118.4108 (2 Los Angeles)" "41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / 41.8376; -87.6818 (3 Chicago)" "29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W / 29.7866; -95.3909 (4 Houston)" ...

As in the previous case, we still need to clean the data before we can use it. For this particular example, let’s see if this dataset provides evidence in support of Zipf’s law for population ranks.

We’ll use regular expressions to remove the footnote markers from the city names and the commas from the population figures, and we will create new variables with simpler names. (We’ll come back to regular expressions later in the course.)

pop$City <- gsub('\\[.*\\]', '', pop$City) # remove footnote markers such as "[6]"
library(stringr)
# note that the scraped column names contain a line break ("\n")
pop$population <- as.numeric(str_replace_all(pop[,"2017\nestimate"], ",", ""))
pop$rank <- pop[,"2017\nrank"]
head(pop[,c("City", "population", "rank")])
##           City population rank
## 1     New York    8622698    1
## 2  Los Angeles    3999759    2
## 3      Chicago    2716450    3
## 4      Houston    2312717    4
## 5      Phoenix    1626078    5
## 6 Philadelphia    1580863    6
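
Before plotting, it is worth confirming that the conversion to numeric did not introduce missing values (a quick sketch):

# any NA here would indicate a population value that failed to parse
sum(is.na(pop$population))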

Now we’re ready to generate the figure:

library(ggplot2)
p <- ggplot(pop, aes(x=rank, y=population, label=City))
pq <- p + geom_point() + geom_text(hjust=-.1, size=3) +
    scale_x_log10("log(rank)") + scale_y_log10("log(population)")
pq

We can also check whether this distribution follows Zipf’s law by estimating a log-log regression.

lm(log(rank) ~ log(population), data=pop)
## 
## Call:
## lm(formula = log(rank) ~ log(population), data = pop)
## 
## Coefficients:
##     (Intercept)  log(population)  
##          21.437           -1.365
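
Under a strict Zipf’s law the slope of this log-log regression would be close to -1; the estimate here, about -1.37, is somewhat steeper, so the law holds only approximately for these cities. To see the fit visually, we can overlay a regression line on the log-log figure. A sketch reusing the pq object from above; note that geom_smooth() fits log(population) on log(rank), i.e. with the axes swapped relative to the lm() call:

pq + geom_smooth(method = "lm", se = FALSE)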