### Scraping web data in table format
We will start by loading the `rvest` package, which will help us scrape data from the web.
```{r, message=FALSE}
library(rvest)
```
Here we will learn how to scrape the number of new Social Security Number holders by year in the US, and then clean the collected data so that we can generate a plot showing the evolution of this variable over time.
The first step is to read the html code from the website we want to scrape, using the `read_html()` function. If we want to see the html in text format, we can then use `html_text()`.
```{r}
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"
html <- read_html(url) # reading the html code into memory
html # not very informative
substr(html_text(html), 1, 1000) # first 1000 characters
```
To extract all the tables in the html code automatically, we use `html_table()`. Note that it returns a list of data frames, so in order to work with this dataset, we will have to subset the first element of this list.
```{r}
tab <- html_table(html, fill=TRUE)
str(tab)
pop <- tab[[1]]
```
Now let's clean the data so that we can use it for our analysis. We need to convert the population values into a numeric format, which requires deleting the commas. We will also change the variable names so that it's easier to work with them.
```{r}
pop$Male <- as.numeric(gsub(",", "", pop$Male))
pop$Female <- as.numeric(gsub(",", "", pop$Female))
names(pop) <- c("year", "male", "female", "total")
```
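As a quick sanity check, the comma-stripping conversion can be tried on a couple of hand-typed values (the numbers here are made up for illustration):
```{r}
raw <- c("1,880,406", "92,316")   # made-up values in the scraped format
as.numeric(gsub(",", "", raw))    # 1880406 92316
```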
And now we can plot to see how the number of people applying for a Social Security Number in the US has increased over time.
```{r}
plot(pop$year, pop$male, xlab="Year of birth", ylab="New SSN petitions",
     col="darkgreen", type="l")
lines(pop$year, pop$female, col="red")
legend(x="topleft", c("Male", "Female"), lty=1, col=c("darkgreen", "red"))
```
### Scraping web data in table format: a more advanced example
When there are multiple tables on the website, scraping them becomes a bit more complicated. Let's work through a common case scenario: scraping a table from Wikipedia with a list of the most populated cities in the United States.
```{r}
url <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
html <- read_html(url)
tables <- html_table(html, fill=TRUE)
length(tables)
```
The function now returns 15 different tables. I had to use the option `fill=TRUE` because some of the tables appear to have incomplete rows.
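With that many tables, checking the dimensions of each list element is a quick way to spot the one you want. A self-contained sketch, using stand-in data frames in place of the scraped list:
```{r}
# Stand-ins for the list html_table() returns (contents are made up)
tables_demo <- list(data.frame(a = 1:2), data.frame(x = 1:5, y = 6:10))
sapply(tables_demo, ncol)  # 1 2
sapply(tables_demo, nrow)  # 2 5
```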
In this case, identifying the part of the html code that contains the table is a better approach. To do so, let's take a look at the source code of the website. In Google Chrome, go to _View_ > _Developer_ > _View Source_. All browsers should have similar options to view the source code of a website.
In the source code, search for the text of the page (e.g. _2021\nrank_), where _\n_ is the line break. Right above it you will see: `