A common web scraping scenario is when the data we want is available in plain HTML, but scattered across different parts of a web page rather than presented in a table format. In this scenario, we need to find a way to extract each element separately and then assemble the pieces into a data frame manually.
The motivating example here will be the website ipaidabribe.com, which contains a database of self-reports of bribes in India. We want to learn how much people were asked to pay for different services, and by which departments.
url <- 'http://ipaidabribe.com/reports/paid'
We will also be using rvest, but in a slightly different way: prior to scraping, we need to identify the CSS selector of each element we want to extract.
A very useful tool for this purpose is SelectorGadget, an extension for the Google Chrome browser. You can install it from http://selectorgadget.com/. Then go back to the ipaidabribe website and open the extension. Click on an element you want to extract, and then click on any other highlighted elements that you do not want, to exclude them. Once only the elements you’re interested in are highlighted, copy the CSS selector that SelectorGadget generates and paste it into R.
Now we’re ready to scrape the website:
library(rvest, warn.conflicts=FALSE)
## Loading required package: xml2
bribes <- read_html(url) # reading the HTML code
amounts <- html_nodes(bribes, ".paid-amount span") # extract the nodes matching the CSS selector
amounts # content of the selected nodes
## {xml_nodeset (10)}
## [1] <span>Paid INR 1,500\r\n </span>
## [2] <span>Paid INR 2,400\r\n </span>
## [3] <span>Paid INR 5,000\r\n </span>
## [4] <span>Paid INR 200\r\n </span>
## [5] <span>Paid INR 15,000\r\n </span>
## [6] <span>Paid INR 44,000\r\n </span>
## [7] <span>Paid INR 200\r\n </span>
## [8] <span>Paid INR 43,000\r\n </span>
## [9] <span>Paid INR 3,000\r\n </span>
## [10] <span>Paid INR 2,000\r\n </span>
html_text(amounts)
## [1] "Paid INR 1,500\r\n "
## [2] "Paid INR 2,400\r\n "
## [3] "Paid INR 5,000\r\n "
## [4] "Paid INR 200\r\n "
## [5] "Paid INR 15,000\r\n "
## [6] "Paid INR 44,000\r\n "
## [7] "Paid INR 200\r\n "
## [8] "Paid INR 43,000\r\n "
## [9] "Paid INR 3,000\r\n "
## [10] "Paid INR 2,000\r\n "
We still need to do some cleaning before the data is usable:
amounts <- html_text(amounts)
(amounts <- gsub("Paid INR | |\r|\n|,", "", amounts)) # remove text, white space, and commas
## [1] "1500" "2400" "5000" "200" "15000" "44000" "200" "43000"
## [9] "3000" "2000"
(amounts <- as.numeric(amounts)) # convert to numeric
## [1] 1500 2400 5000 200 15000 44000 200 43000 3000 2000
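As a side note, the same cleaning can be done in one step with readr::parse_number(), which drops any non-numeric prefix and handles the commas. This is only an alternative sketch, assuming the readr package is installed; it is not part of the workflow below.
# Alternative cleaning step (sketch, assumes the readr package is installed)
# parse_number() strips the "Paid INR " prefix, the commas, and the whitespace
library(readr)
parse_number(html_text(html_nodes(bribes, ".paid-amount span")))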
Let’s do another one: the transaction during which the bribe occurred.
transaction <- html_nodes(bribes, ".transaction a")
(transaction <- html_text(transaction))
## [1] "False Allegations"
## [2] "Police Verification for Passport"
## [3] "Registration of Flat or Apartment"
## [4] "Police Verification for Passport"
## [5] "Transfer of Property"
## [6] "School or College Related Activities"
## [7] "Traffic Violations"
## [8] "School or College Related Activities"
## [9] "Arrested by Police"
## [10] "Filing FIR"
And one more: the department that is responsible for these transactions.
# department responsible for each transaction
dept <- html_nodes(bribes, ".name a")
(dept <- html_text(dept))
## [1] "Police" "Police"
## [3] "Stamps and Registration" "Passport"
## [5] "Stamps and Registration" "Education"
## [7] "Police" "Education"
## [9] "Police" "Police"
This was just one page, but there are many more. How do we scrape the rest? First, following good coding practice, we will write a function that takes the URL of a page, scrapes it, and returns the information we want.
scrape_bribe <- function(url){
  bribes <- read_html(url)
  # variables that we're interested in
  amounts <- html_text(html_nodes(bribes, ".paid-amount span"))
  amounts <- as.numeric(gsub("Paid INR | |\r|\n|,", "", amounts))
  transaction <- html_text(html_nodes(bribes, ".transaction a"))
  dept <- html_text(html_nodes(bribes, ".name a"))
  # putting everything together into a data frame
  df <- data.frame(
    amounts = amounts,
    transaction = transaction,
    dept = dept,
    stringsAsFactors = FALSE)
  return(df)
}
We will then start a list of data frames and store the data frame for the initial page in the first position of that list.
bribes <- list()
bribes[[1]] <- scrape_bribe(url)
str(bribes)
## List of 1
## $ :'data.frame': 10 obs. of 3 variables:
## ..$ amounts : num [1:10] 1500 2400 5000 200 15000 44000 200 43000 3000 2000
## ..$ transaction: chr [1:10] "False Allegations" "Police Verification for Passport" "Registration of Flat or Apartment" "Police Verification for Passport" ...
## ..$ dept : chr [1:10] "Police" "Police" "Stamps and Registration" "Passport" ...
How do we scrape the following pages? Note that their URLs contain page=XX, where XX is 10, 20, 30… So we will create a base URL and then append these numbers to it. (Note that for this exercise we will only scrape the first 5 pages.)
base_url <- "http://ipaidabribe.com/reports/paid?page="
pages <- seq(0, 40, by=10)
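Before looping, it can help to check that the URLs we are about to visit look right; this is just a quick sanity check, not part of the original workflow.
# Sanity check: the URLs we will be scraping
paste(base_url, pages, sep="") # should end in ?page=0, ?page=10, ..., ?page=40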
Now we just need to loop over the pages, use the function we created earlier to scrape the information, and add it to the list. Note that we add a pause of a couple of seconds between HTTP requests to avoid overloading the site, as well as a message that informs us of the progress of the loop.
for (i in 2:length(pages)){
  # informative message about progress of loop
  message(i, '/', length(pages))
  # prepare URL
  url <- paste(base_url, pages[i], sep="")
  # scrape website
  bribes[[i]] <- scrape_bribe(url)
  # wait a couple of seconds between URL calls
  Sys.sleep(2)
}
## 2/5
## 3/5
## 4/5
## 5/5
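In a longer scraping job, some requests will occasionally fail. A common pattern, not part of the loop above, is to wrap each call in tryCatch() so that one bad page does not stop the whole loop. A minimal sketch:
# Defensive version of the loop (sketch; the simple loop above is enough here)
for (i in 2:length(pages)){
  message(i, '/', length(pages))
  url <- paste(base_url, pages[i], sep="")
  bribes[[i]] <- tryCatch(scrape_bribe(url),
    error = function(e){ message("Failed to scrape: ", url); NULL })
  Sys.sleep(2)
}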
The final step is to convert the list of data frames into a single data frame that we can work with, using the function do.call(rbind, LIST) (where LIST is a list of data frames).
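As an aside, before we run do.call() below, note that dplyr::bind_rows() does the same job on a list of data frames; the line that follows is only a sketch of that alternative and assumes the dplyr package is installed.
# Alternative (sketch, assumes the dplyr package is installed):
# bind_rows() combines a list of data frames just like do.call(rbind, ...)
bribes_df <- dplyr::bind_rows(bribes)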
bribes <- do.call(rbind, bribes)
head(bribes)
## amounts transaction dept
## 1 1500 False Allegations Police
## 2 2400 Police Verification for Passport Police
## 3 5000 Registration of Flat or Apartment Stamps and Registration
## 4 200 Police Verification for Passport Passport
## 5 15000 Transfer of Property Stamps and Registration
## 6 44000 School or College Related Activities Education
str(bribes)
## 'data.frame': 50 obs. of 3 variables:
## $ amounts : num 1500 2400 5000 200 15000 44000 200 43000 3000 2000 ...
## $ transaction: chr "False Allegations" "Police Verification for Passport" "Registration of Flat or Apartment" "Police Verification for Passport" ...
## $ dept : chr "Police" "Police" "Stamps and Registration" "Passport" ...
Let’s get some quick descriptive statistics to check everything worked. First, what is the most common transaction during which a bribe was paid?
tab <- table(bribes$transaction) # frequency table
tab <- sort(tab, decreasing=TRUE) # sorting the table from most to least common
head(tab)
##
## Police Verification for Passport Traffic Violations
## 7 6
## Customs Check and Clearance Police Harassment
## 2 2
## School or College Related Activities 7/12 Extract
## 2 1
What was the average bribe payment?
summary(bribes$amounts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 600 2200 22656 10000 250000
And what was the average payment for each department?
agg <- aggregate(bribes$amounts, by=list(dept=bribes$dept), FUN=mean)
agg[order(agg$x, decreasing = TRUE),] # ordering from highest to lowest
## dept x
## 8 Municipal Services 101450.000
## 13 Railways 87000.000
## 5 Education 42333.333
## 6 Electricity and Power Supply 34750.000
## 11 Police 10829.412
## 2 Banking 10000.000
## 4 Customs, Excise and Service Tax 10000.000
## 15 Stamps and Registration 8750.000
## 3 Commercial Tax, Sales Tax, VAT 5000.000
## 7 Health and Family Welfare 4000.000
## 16 Transport 1333.333
## 10 Passport 1075.000
## 9 Others 1000.000
## 14 Revenue 1000.000
## 17 Urban Development Authorities 1000.000
## 12 Post Office 100.000
## 1 Airports 1.000
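To visualize these averages, a quick barplot is enough; the sketch below uses base R graphics and is just an optional check, not part of the original analysis.
# Quick visual check of average payment by department (sketch, base R graphics)
par(mar=c(5, 12, 2, 2)) # widen the left margin so department names fit
barplot(sort(agg$x), names.arg=agg$dept[order(agg$x)],
  horiz=TRUE, las=1, xlab="Average bribe payment (INR)")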