Scraping web data in table format

We will start by loading the rvest package, which will help us scrape data from the web.
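If the package is not installed yet, you can install it from CRAN first (a one-time step):

install.packages("rvest")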

library(rvest)

The goal of this exercise is to scrape the population data from the Census website, and then clean it so that we can generate a population pyramid.

The first step is to read the HTML code from the website we want to scrape, using the read_html() function. If we want to see the HTML in text format, we can then use html_text().

url <- paste("http://www.census.gov/population/international/data/idb/",
    "region.php?N=%20Results%20&T=10&A=separate&RT=0&Y=2016&R=-1&C=US",
    sep="")
html <- read_html(url) # reading the html code into memory
html # not very informative
## {xml_document}
## <html>
## [1] <head>\n<meta name="DESCRIPTION" content="International Data Base (I ...
## [2] <body onload="hide_region_tables_region(); SelectedReport();">\n     ...
substr(html_text(html), 1, 1000) # first 1000 characters
## [1] "\n                        <!--\n                        @media print {\n                        .query_source {page-break-after:always;}\n                        .noprint {display:none; }\n                        }\n                        -->\n                \n                $(document).ready(function() {\n                        $('.download').click(function(){\n                                s.linkTrackVars='eVar3,events';\n                                s.linkTrackEvents='event2';\n                                s.events='event2';\n                                s.eVar3=\"IDB PHP-driven Download\";\n                                s.tl(false,'o',\"Download\");\n                        });\n                });\n                International Programs - Region Summary - U.S. Census Bureau\n//<![CDATA[\nvar focusObj;\n\nfunction unselectAll( id )\n{\n        var y = document.getElementById( id );\n\n        for(x=0;x<y.length;x++)\n                y.options[x].selected = false;\n}\n\n\nfunction limitSelect( obj, "

To extract all the tables in the HTML code automatically, we use html_table(). Note that it returns a list of data frames, so to work with this dataset we will have to extract the first element of the list.

tab <- html_table(html)
str(tab)
## List of 1
##  $ :'data.frame':    22 obs. of  9 variables:
##   ..$ Year                 : int [1:22] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##   ..$ Age                  : chr [1:22] "Total" "0-4" "5-9" "10-14" ...
##   ..$ Both Sexes Population: chr [1:22] "323,995,528" "20,038,560" "20,403,263" "20,595,524" ...
##   ..$ Male Population      : chr [1:22] "159,690,758" "10,248,131" "10,417,020" "10,517,509" ...
##   ..$ Female Population    : chr [1:22] "164,304,770" "9,790,429" "9,986,243" "10,078,015" ...
##   ..$ Percent Both Sexes   : num [1:22] 100 6.2 6.3 6.4 6.5 6.9 7.1 6.7 6.4 6.1 ...
##   ..$ Percent Male         : num [1:22] 100 6.4 6.5 6.6 6.8 7.2 7.3 6.9 6.5 6.1 ...
##   ..$ Percent Female       : num [1:22] 100 6 6.1 6.1 6.3 6.6 6.9 6.6 6.3 6 ...
##   ..$ Sex Ratio            : num [1:22] 97.2 104.7 104.3 104.4 104.7 ...
pop <- tab[[1]]

Now let’s clean the data so that we can use it for our analysis. First, I’ll remove the row with the totals. (Note that I could just do pop[-1,], but imagine that we run this code again one year from now and the format has changed. Matching the row by its Age label, as below, is more robust to such changes.)

pop <- pop[pop$Age != "Total", ] # keep every row whose Age label is not "Total"
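One caveat about the more obvious alternative, pop[-which(pop$Age=="Total"),]: if no row matches, -which() returns an empty index, and subsetting with an empty index drops every row. A quick illustration of why logical subsetting is the safer choice:

x <- c("a", "b")
which(x == "Total")     # integer(0): no match
x[-which(x == "Total")] # character(0): every element silently dropped!
x[x != "Total"]         # "a" "b": logical subsetting leaves the data intact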

Let’s also change the variable names so that they’re easier to work with. (Variable names with spaces are legal in R, but they have to be wrapped in backticks every time they’re used.)

names(pop)[which(names(pop)=="Male Population")] <- "Male"
names(pop)[which(names(pop)=="Female Population")] <- "Female"

Finally, we need to convert the population values to a numeric format, which requires deleting the commas first.

pop$Male <- as.numeric(gsub(",", "", pop$Male))
pop$Female <- as.numeric(gsub(",", "", pop$Female))
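As an optional sanity check (not part of the original workflow), we can verify that the conversion did not silently turn any malformed values into NAs:

stopifnot(!anyNA(pop$Male), !anyNA(pop$Female)) # stops with an error if any value failed to parse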

The data is now clean, but we need to convert it to long format if we want to use ggplot2 to generate the population pyramid. The following two approaches are equivalent:

df <- data.frame(
    Age = rep(pop$Age, 2),
    value = c(pop$Male, pop$Female),
    Gender = rep(c("Male", "Female"), each=nrow(pop)))
head(df)
##     Age    value Gender
## 1   0-4 10248131   Male
## 2   5-9 10417020   Male
## 3 10-14 10517509   Male
## 4 15-19 10813192   Male
## 5 20-24 11547150   Male
## 6 25-29 11725696   Male
# using melt() from the reshape package
library(reshape)
df <- melt(pop, 
    id.vars="Age", measure.vars=c("Male", "Female"),
    variable_name = "Gender")
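For reference, the same reshape can also be done with the more recent tidyr package (a sketch, assuming tidyr is installed; the rest of the exercise does not depend on it):

library(tidyr)
df <- pivot_longer(pop[, c("Age", "Male", "Female")],
    cols = c("Male", "Female"),
    names_to = "Gender", values_to = "value")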

Finally, we need to make the population values negative for one of the two gender categories so that the population pyramid displays properly, and convert the age variable into a factor so that the categories keep their natural order instead of ggplot2’s default alphabetical ordering.

df$value[df$Gender=="Male"] <- -df$value[df$Gender=="Male"]
df$Age <- factor(df$Age, levels=pop$Age)

Now we’re ready to generate the population pyramid figure!

library(ggplot2)
p <- ggplot(df, aes(x=Age, y=value, fill=Gender))
pq <- p + geom_bar(data=df[df$Gender=="Male",], stat="identity") + 
    geom_bar(data=df[df$Gender=="Female",], stat="identity") +  
    coord_flip() + theme_minimal() +
    theme(axis.title.x=element_blank(), axis.text.x=element_blank())
pq
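Since the male values are already negative, the same figure can also be produced with a single geom_bar() call, which stacks positive and negative values on opposite sides of the axis:

p + geom_bar(stat="identity") + coord_flip() + theme_minimal() +
    theme(axis.title.x=element_blank(), axis.text.x=element_blank())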

Scraping web data in table format: a more involved example

When there are multiple tables on a website, scraping them becomes a bit more complicated. Let’s work through a common scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.

url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE)
length(tables)
## [1] 62

The function now returns 62 different tables. I had to use the option fill=TRUE because some of the tables appear to have incomplete rows.
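One brute-force way to locate the table we want among the 62 candidates (a sketch, not the approach we will use) is to search each data frame for a column header we expect to see:

# indices of the scraped tables that contain a "2016 rank" column
which(sapply(tables, function(t) "2016 rank" %in% names(t)))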

In this case, a better approach is to identify the part of the HTML code that contains the table we want. To do so, let’s take a look at the source code of the website. In Google Chrome, go to View > Developer > View Source; all browsers have a similar option for viewing the source code of a website.

In the source code, search for some text from the page (e.g. 2016 rank). Right above it you will see: <table class="wikitable sortable" style="text-align:center">. This is the opening tag of the table we want, and its class attribute (wikitable) is what we will use to select it. (You can also find this information by right-clicking anywhere on the table and choosing Inspect.)

Now that we know what we’re looking for, let’s use html_nodes() to identify all the elements of the page that have that class. (Note that we need to prefix the class name with a dot to indicate it is a CSS class selector.)

wiki <- html_nodes(html, '.wikitable')
length(wiki)
## [1] 6

There are 6 tables in total, and we will extract the first one.

pop <- html_table(wiki[[1]])
str(pop)
## 'data.frame':    307 obs. of  9 variables:
##  $ 2016 rank              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ City                   : chr  "New York[6]" "Los Angeles" "Chicago" "Houston[7]" ...
##  $ State[5]               : chr  "New York" "California" "Illinois" "Texas" ...
##  $ 2016 estimate          : chr  "8,537,673" "3,976,322" "2,704,958" "2,303,482" ...
##  $ 2010 Census            : chr  "8,175,133" "3,792,621" "2,695,598" "2,100,263" ...
##  $ Change                 : chr  "7000443466791304800♠+4.43%" "7000484364243092050♠+4.84%" "6999347232784710470♠+0.35%" "7000967588344888229♠+9.68%" ...
##  $ 2014 land area         : chr  "7002302600000000000♠302.6 sq mi\n783.8 km2" "7002468700000000000♠468.7 sq mi\n1,213.9 km2" "7002227600000000000♠227.6 sq mi\n589.6 km2" "7002599600000000000♠599.6 sq mi\n1,552.9 km2" ...
##  $ 2010 population density: chr  "7004270120000000000♠27,012 per sq mi\n10,430 km−2" "7003809200000000000♠8,092 per sq mi\n3,124 km−2" "7004118420000000000♠11,842 per sq mi\n4,572 km−2" "7003350100000000000♠3,501 per sq mi\n1,352 km−2" ...
##  $ Location               : chr  "40°39′51″N 73°56′19″W / 40.6643°N 73.9385°W / 40.6643; -73.9385 (1 New York City)" "34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W / 34.0194; -118.4108 (2 Los Angeles)" "41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / 41.8376; -87.6818 (3 Chicago)" "29°46′50″N 95°23′11″W / 29.7805°N 95.3863°W / 29.7805; -95.3863 (4 Houston)" ...

As in the previous case, we still need to clean the data before we can use it; note, for example, the hidden numeric sort keys (the digits before each ♠ character) that Wikipedia embeds in its sortable columns. For this particular example, let’s see if this dataset provides evidence in support of Zipf’s law for population ranks.

We’ll use regular expressions to remove the footnote markers from the city names and the commas from the population counts, and create variables with names that are easier to work with.

pop$City <- gsub('\\[.*\\]', '', pop$City) # strip footnote markers such as "[6]"
pop$population <- as.numeric(gsub(",", "", pop[,"2016 estimate"])) # remove commas, convert
pop$rank <- pop[,"2016 rank"] # copy to a name without spaces
head(pop[,c("City", "population", "rank")])
##           City population rank
## 1     New York    8537673    1
## 2  Los Angeles    3976322    2
## 3      Chicago    2704958    3
## 4      Houston    2303482    4
## 5      Phoenix    1615017    5
## 6 Philadelphia    1567872    6

Now we’re ready to generate the figure:

library(ggplot2)
p <- ggplot(pop, aes(x=rank, y=population, label=City))
pq <- p + geom_point() + geom_text(hjust=-.1, size=3) +
    scale_x_log10("log(rank)") + scale_y_log10("log(population)")
pq

We can also check whether this distribution follows Zipf’s law by estimating a log-log regression. If the law holds exactly, the coefficient on log(population) should be close to -1.

lm(log(rank) ~ log(population), data=pop)
## 
## Call:
## lm(formula = log(rank) ~ log(population), data = pop)
## 
## Coefficients:
##     (Intercept)  log(population)  
##          21.442           -1.367
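The estimated slope, -1.37, is noticeably steeper than the -1 implied by Zipf’s law, so for this set of U.S. cities the law holds only approximately.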