We will start by loading the rvest
package, which will help us scrape data from the web.
library(rvest)
The goal of this exercise is to scrape the population data from the Census website, and then clean it so that we can generate a population pyramid.
The first step is to read the html code from the website we want to scrape, using the read_html()
function. If we want to see the html in text format, we can then use html_text()
.
url <- paste("http://www.census.gov/population/international/data/idb/",
"region.php?N=%20Results%20&T=10&A=separate&RT=0&Y=2016&R=-1&C=US",
sep="")
html <- read_html(url) # reading the html code into memory
html # not very informative
## {xml_document}
## <html>
## [1] <head>\n<meta name="DESCRIPTION" content="International Data Base (I ...
## [2] <body onload="hide_region_tables_region(); SelectedReport();">\n ...
substr(html_text(html), 1, 1000) # first 1000 characters
## [1] "\n <!--\n @media print {\n .query_source {page-break-after:always;}\n .noprint {display:none; }\n }\n -->\n \n $(document).ready(function() {\n $('.download').click(function(){\n s.linkTrackVars='eVar3,events';\n s.linkTrackEvents='event2';\n s.events='event2';\n s.eVar3=\"IDB PHP-driven Download\";\n s.tl(false,'o',\"Download\");\n });\n });\n International Programs - Region Summary - U.S. Census Bureau\n//<![CDATA[\nvar focusObj;\n\nfunction unselectAll( id )\n{\n var y = document.getElementById( id );\n\n for(x=0;x<y.length;x++)\n y.options[x].selected = false;\n}\n\n\nfunction limitSelect( obj, "
To extract all the tables in the html code automatically, we use html_table()
. Note that it returns a list of data frames, so in order to work with this dataset, we will have to subset the first element of this list.
tab <- html_table(html)
str(tab)
## List of 1
## $ :'data.frame': 22 obs. of 9 variables:
## ..$ Year : int [1:22] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## ..$ Age : chr [1:22] "Total" "0-4" "5-9" "10-14" ...
## ..$ Both Sexes Population: chr [1:22] "323,127,513" "19,927,037" "20,429,799" "20,618,233" ...
## ..$ Male Population : chr [1:22] "159,078,923" "10,186,853" "10,429,923" "10,518,732" ...
## ..$ Female Population : chr [1:22] "164,048,590" "9,740,184" "9,999,876" "10,099,501" ...
## ..$ Percent Both Sexes : num [1:22] 100 6.2 6.3 6.4 6.5 6.9 7.1 6.7 6.4 6.1 ...
## ..$ Percent Male : num [1:22] 100 6.4 6.6 6.6 6.8 7.2 7.3 6.9 6.5 6.1 ...
## ..$ Percent Female : num [1:22] 100 5.9 6.1 6.2 6.3 6.6 6.9 6.6 6.3 6 ...
## ..$ Sex Ratio : num [1:22] 97 105 104 104 105 ...
pop <- tab[[1]]
Now let’s clean the data so that we can use it for our analysis. First, I’ll remove the row with the totals. (Note that I could just do pop[-1,]
, but imagine that we run this code again one year from now and the format has changed. The approach below is more robust to such changes.)
pop <- pop[-which(pop$Age=="Total"),]
Let’s also change the variable names so that they’re easier to work with. (R doesn’t like variable names with spaces.)
names(pop)[which(names(pop)=="Male Population")] <- "Male"
names(pop)[which(names(pop)=="Female Population")] <- "Female"
Finally, the other problem we need to fix is to convert the population values into a numeric format, which requires deleting the commas.
pop$Male <- as.numeric(gsub(",", "", pop$Male))
pop$Female <- as.numeric(gsub(",", "", pop$Female))
The data is now clean, but we need to convert it to long format if we want to use ggplot2 to generate the population pyramid. The following two approaches are equivalent:
df <- data.frame(
Age = rep(pop$Age, 2),
value = c(pop$Male, pop$Female),
Gender = rep(c("Male", "Female"), each=nrow(pop)))
head(df)
## Age value Gender
## 1 0-4 10186853 Male
## 2 5-9 10429923 Male
## 3 10-14 10518732 Male
## 4 15-19 10801846 Male
## 5 20-24 11491065 Male
## 6 25-29 11631474 Male
# using melt() from the reshape package
library(reshape)
df <- melt(pop,
id.vars="Age", measure.vars=c("Male", "Female"),
variable_name = "Gender")
Finally, we need to make the population value negative for one of the two gender categories so that the population pyramid is properly displayed, and also convert the age variable into a factor so that the ordering of the categories is not alphabetical (default with ggplot2).
df$value[df$Gender=="Male"] <- -df$value[df$Gender=="Male"]
df$Age <- factor(df$Age, levels=pop$Age)
Now we’re ready to generate the population pyramid figure!
library(ggplot2)
p <- ggplot(df, aes(x=Age, y=value, fill=Gender))
pq <- p + geom_bar(data=df[df$Gender=="Male",], stat="identity") +
geom_bar(data=df[df$Gender=="Female",], stat="identity") +
coord_flip() + theme_minimal() +
theme(axis.title.x=element_blank(), axis.text.x=element_blank())
pq
When there are multiple tables on the website, scraping them becomes a bit more complicated. Let’s work through a common case scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.
url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE)
length(tables)
## [1] 62
The function now returns 62 different tables. I had to use the option fill=TRUE
because some of the tables appear to have incomplete rows.
In this case, identifying the part of the html code that contains the table is a better approach. To do so, let’s take a look at the source code of the website. In Google Chrome, go to View > Developer > View Source. All browsers should have similar options to view the source code of a website.
In the source code, search for the text of the page (e.g. 2016 rank). Right above it you will see: <table class="wikitable sortable" style="text-align:center">
. This is the CSS selector that contains the table. (You can also find this information by right-clicking anywhere on the table, and choosing Inspect).
Now that we now what we’re looking for, let’s use html_nodes()
to identify all the elements of the page that have that CSS class. (Note that we need to use a dot before the name of the class to indicate it’s CSS.)
wiki <- html_nodes(html, '.wikitable')
length(wiki)
## [1] 6
There are 6 tables in total, and we will extract the first one.
pop <- html_table(wiki[[1]])
str(pop)
## 'data.frame': 307 obs. of 11 variables:
## $ 2016
## rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ City : chr "New York[6]" "Los Angeles" "Chicago" "Houston[7]" ...
## $ State[5] : chr "New York" "California" "Illinois" "Texas" ...
## $ 2016
## estimate : chr "8,537,673" "3,976,322" "2,704,958" "2,303,482" ...
## $ 2010
## Census : chr "8,175,133" "3,792,621" "2,695,598" "2,100,263" ...
## $ Change : chr "7000443466791304800♠+4.43%" "7000484364243092050♠+4.84%" "6999347232784710470♠+0.35%" "7000967588344888229♠+9.68%" ...
## $ 2014 land area : chr "302.6 sq mi" "468.7 sq mi" "227.6 sq mi" "599.6 sq mi" ...
## $ 2014 land area : chr "783.7 km2" "1,213.9 km2" "589.5 km2" "1,553.0 km2" ...
## $ 2010 population density: chr "27,012/sq mi" "8,092/sq mi" "11,842/sq mi" "3,501/sq mi" ...
## $ 2010 population density: chr "10,429/km2" "3,124/km2" "4,572/km2" "1,352/km2" ...
## $ Location : chr "40°39′51″N 73°56′19″W / 40.6643°N 73.9385°W / 40.6643; -73.9385 (1 New York City)" "34°01′10″N 118°24′39″W / 34.0194°N 118.4108°W / 34.0194; -118.4108 (2 Los Angeles)" "41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W / 41.8376; -87.6818 (3 Chicago)" "29°46′50″N 95°23′11″W / 29.7805°N 95.3863°W / 29.7805; -95.3863 (4 Houston)" ...
As in the previous case, we still need to clean the data before we can use it. For this particular example, let’s see if this dataset provides evidence in support of Zipf’s law for population ranks.
We’ll use regular expressions to remove endnotes and commas in the population numbers, and clean the variable names. (We’ll come back to this later in the course.)
pop$City <- gsub('\\[.*\\]', '', pop$City)
library(stringr)
pop$population <- as.numeric(str_replace_all(pop[,"2016\nestimate"], ",", ""))
pop$rank <- pop[,"2016\nrank"]
head(pop[,c("City", "population", "rank")])
## City population rank
## 1 New York 8537673 1
## 2 Los Angeles 3976322 2
## 3 Chicago 2704958 3
## 4 Houston 2303482 4
## 5 Phoenix 1615017 5
## 6 Philadelphia 1567872 6
Now we’re ready to generate the figure:
library(ggplot2)
p <- ggplot(pop, aes(x=rank, y=population, label=City))
pq <- p + geom_point() + geom_text(hjust=-.1, size=3) +
scale_x_log10("log(rank)") + scale_y_log10("log(population)")
pq
We can also check if this distribution follows Zipf’s law estimating a log-log regression.
lm(log(rank) ~ log(population), data=pop)
##
## Call:
## lm(formula = log(rank) ~ log(population), data = pop)
##
## Coefficients:
## (Intercept) log(population)
## 21.442 -1.367