### Scraping web data in table format

We will start by loading the `rvest` package, which will help us scrape data from the web.

```{r, message=FALSE}
library(rvest)
```

The goal of this exercise is to scrape the number of new Social Security number holders by year in the US, and then clean the data so that we can plot how this variable evolves over time.

The first step is to read the html code from the website we want to scrape, using the `read_html()` function. If we want to see the html in text format, we can then use `html_text()`.

```{r}
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"
html <- read_html(url) # reading the html code into memory
html # not very informative
substr(html_text(html), 1, 1000) # first 1000 characters
```

To extract all the tables in the html code automatically, we use `html_table()`. Note that it returns a list of data frames, so in order to work with this dataset, we will have to extract the second element of this list.

```{r}
tab <- html_table(html, fill=TRUE)
str(tab) # inspect the list to find the table we want
pop <- tab[[2]]
```

Now let's clean the data so that we can use it for our analysis. We need to convert the population values into a numeric format, which requires deleting the commas. We will also change the variable names so that they are easier to work with.

```{r}
# remove the thousands separators, then convert to numeric
pop$Male <- as.numeric(gsub(",", "", pop$Male))
pop$Female <- as.numeric(gsub(",", "", pop$Female))
names(pop) <- c("year", "male", "female", "total")
```

And now we can plot the data to see how the number of people applying for a Social Security number in the US has changed over time.

```{r}
plot(pop$year, pop$male, xlab="Year of birth", ylab="New SSN petitions",
     col="darkgreen", type="l")
lines(pop$year, pop$female, col="red")
# legend colors must match the colors used for the lines above
legend(x="topleft", c("Male", "Female"), lty=1, col=c("darkgreen", "red"))
```

### Scraping web data in table format: a more advanced example

When there are multiple tables on the website, scraping them becomes a bit more complicated.
Let's work through a common scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.

```{r}
url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE)
length(tables) # number of tables found on the page
```

The function now returns 12 different tables. I had to use the option `fill=TRUE` because some of the tables appear to have incomplete rows (in rvest 1.0.0 and later this argument is deprecated, since missing cells are always filled in automatically).

In this case, identifying the part of the html code that contains the table is a better approach. To do so, let's take a look at the source code of the website. In Google Chrome, go to _View_ > _Developer_ > _View Source_. All browsers should have similar options to view the source code of a website.

In the source code, search for text that appears in the table (e.g. _2017 rank_). Right above it you will see: `