Scraping unstructured data

Scraping web data in unstructured format

A common web scraping scenario is one where the data we want is available in plain HTML but is scattered across different parts of the page rather than laid out in a table. In this case we need to extract each element separately and then assemble the pieces into a data frame ourselves.

The motivating example here will be the website ipaidabribe.com, which contains a database of self-reports of bribes in India. We want to learn how much people were asked to pay for different services, and by which departments.

url <- 'http://ipaidabribe.com/reports/paid'

We will also be using rvest, but in a slightly different way: prior to scraping, we need to identify the CSS selector of each element we want to extract.

A very useful tool for this purpose is SelectorGadget, an extension for the Google Chrome browser. Go to http://selectorgadget.com/ to install it. Then go back to the ipaidabribe website and open the extension. Click on an element you want to extract, and then click on any highlighted elements that you do not want to extract. Once only the elements you’re interested in are highlighted, copy the CSS selector and paste it into R.
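To make the idea of a CSS selector concrete, here is a minimal offline sketch. The HTML snippet is made up for illustration (it only imitates the class names used in this tutorial and is not copied from the real site), so treat the structure as an assumption.

```r
library(rvest)

# Hypothetical HTML, invented to mirror the class names used in this tutorial
page <- read_html('
  <ul>
    <li class="report">
      <div class="paid-amount"><span>Paid INR 200</span></div>
      <div class="transaction"><a href="#">Meter Installation</a></div>
    </li>
    <li class="report">
      <div class="paid-amount"><span>Paid INR 1,500</span></div>
      <div class="transaction"><a href="#">VAT Registration</a></div>
    </li>
  </ul>')

# ".paid-amount span" matches every <span> nested inside an element of class "paid-amount"
amount_text <- html_text(html_nodes(page, ".paid-amount span"))
# ".transaction a" matches every <a> nested inside an element of class "transaction"
transaction_text <- html_text(html_nodes(page, ".transaction a"))

amount_text
transaction_text
```

Each selector returns one value per report, in page order, which is what later lets us line the vectors up as columns of a data frame.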

Now we’re ready to scrape the website:

library(rvest, warn.conflicts=FALSE)

## Loading required package: xml2

bribes <- read_html(url) # reading the HTML code
amounts <- html_nodes(bribes, ".paid-amount span") # select the nodes matching the CSS selector
amounts # nodes matched by the selector

## {xml_nodeset (10)}
##  [1] <span>Paid INR 200\r\n                        </span>
##  [2] <span>Paid INR 54,000\r\n                        </span>
##  [3] <span>Paid INR 1,000\r\n                        </span>
##  [4] <span>Paid INR 5,000\r\n                        </span>
##  [5] <span>Paid INR 3,000\r\n                        </span>
##  [6] <span>Paid INR 500\r\n                        </span>
##  [7] <span>Paid INR 1,00,000\r\n                        </span>
##  [8] <span>Paid INR 2,000\r\n                        </span>
##  [9] <span>Paid INR 1,200\r\n                        </span>
## [10] <span>Paid INR 500\r\n                        </span>

html_text(amounts)

##  [1] "Paid INR 200\r\n                        "     
##  [2] "Paid INR 54,000\r\n                        "  
##  [3] "Paid INR 1,000\r\n                        "   
##  [4] "Paid INR 5,000\r\n                        "   
##  [5] "Paid INR 3,000\r\n                        "   
##  [6] "Paid INR 500\r\n                        "     
##  [7] "Paid INR 1,00,000\r\n                        "
##  [8] "Paid INR 2,000\r\n                        "   
##  [9] "Paid INR 1,200\r\n                        "   
## [10] "Paid INR 500\r\n                        "

We still need to do some cleaning before the data is usable:

amounts <- html_text(amounts)
(amounts <- gsub("Paid INR | |\r|\n|,", "", amounts)) # remove text, white space, and commas

##  [1] "200"    "54000"  "1000"   "5000"   "3000"   "500"    "100000"
##  [8] "2000"   "1200"   "500"

(amounts <- as.numeric(amounts)) # convert to numeric

##  [1]    200  54000   1000   5000   3000    500 100000   2000   1200    500
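The cleaning step can be checked offline on a couple of the raw strings shown above. The sketch below uses a different pattern from the one in the tutorial (an assumption on my part, not the original code): it drops every character that is not a digit. That is only safe while the amounts are whole numbers, because it would also remove a decimal point.

```r
# Raw strings copied from the html_text() output above
raw_amounts <- c("Paid INR 200\r\n                        ",
                 "Paid INR 1,00,000\r\n                        ")

# Alternative pattern (an assumption): remove everything that is not a digit
clean_amounts <- as.numeric(gsub("[^0-9]", "", raw_amounts))
clean_amounts
```

Note that "1,00,000" uses Indian digit grouping and is the number 100000, so removing the commas gives the right value.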

Let’s do another one: the transactions during which the bribes occurred.

transaction <- html_nodes(bribes, ".transaction a")
(transaction <- html_text(transaction))

##  [1] "Customs Check and Clearance"      "Temporary Connection"            
##  [3] "Bribing of Government Officers"   "VAT Registration"                
##  [5] "Duplicate Driving License"        "Police Verification for Passport"
##  [7] "Meter Installation"               "Tax Clearance Certificate"       
##  [9] "Berth Allocation"                 "Police Verification for Passport"

And one more: the department responsible for each transaction.

# department names are the link text inside elements of class "name"
dept <- html_nodes(bribes, ".name a")
(dept <- html_text(dept))

##  [1] "Customs, Excise and Service Tax" "Electricity and Power Supply"   
##  [3] "Public Services"                 "Commercial Tax, Sales Tax, VAT" 
##  [5] "Transport"                       "Police"                         
##  [7] "Electricity and Power Supply"    "Commercial Tax, Sales Tax, VAT" 
##  [9] "Railways"                        "Police"

This was just for one page, but note that there are many pages. How do we scrape the rest? First, following good coding practice, we will write a function that takes the URL of a page, scrapes it, and returns the information we want.

scrape_bribe <- function(url){
    bribes <- read_html(url)
    # variables that we're interested in
    amounts <- html_text(html_nodes(bribes, ".paid-amount span"))
    # strip the "Paid INR " prefix, white space, and commas, then convert to numeric
    amounts <- as.numeric(gsub("Paid INR | |\r|\n|,", "", amounts))
    transaction <- html_text(html_nodes(bribes, ".transaction a"))
    dept <- html_text(html_nodes(bribes, ".name a"))
    # putting together into a data frame
    df <- data.frame(
        amounts = amounts,
        transaction = transaction,
        dept = dept,
        stringsAsFactors = FALSE)
    return(df)
}
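One thing to watch for in a function like this: if a selector matches fewer elements than the others, data.frame() either stops with an error or silently recycles the shorter vector when the lengths divide evenly. The sketch below shows one possible guard. The helper name build_bribe_df is hypothetical and not part of the original code.

```r
# Hypothetical helper: refuse to build the data frame when the selectors
# returned different numbers of elements
build_bribe_df <- function(amounts, transaction, dept) {
    n <- c(length(amounts), length(transaction), length(dept))
    if (length(unique(n)) != 1) {
        stop("selectors returned different numbers of elements: ",
             paste(n, collapse = ", "))
    }
    data.frame(amounts = amounts, transaction = transaction, dept = dept,
               stringsAsFactors = FALSE)
}

# matching lengths: a two-row data frame
ok <- build_bribe_df(c(200, 500),
                     c("Meter Installation", "VAT Registration"),
                     c("Electricity and Power Supply", "Transport"))
nrow(ok)

# mismatched lengths: capture the error message instead of stopping the script
bad <- tryCatch(build_bribe_df(c(200, 500), "Meter Installation", c("A", "B")),
                error = function(e) conditionMessage(e))
bad
```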

We then start a list of data frames and store the data frame for the first page in its first position.

bribes <- list()
bribes[[1]] <- scrape_bribe(url)
str(bribes)

## List of 1
##  $ :'data.frame':    10 obs. of  3 variables:
##   ..$ amounts    : num [1:10] 200 54000 1000 5000 3000 500 100000 2000 1200 500
##   ..$ transaction: chr [1:10] "Customs Check and Clearance" "Temporary Connection" "Bribing of Government Officers" "VAT Registration" ...
##   ..$ dept       : chr [1:10] "Customs, Excise and Service Tax" "Electricity and Power Supply" "Public Services" "Commercial Tax, Sales Tax, VAT" ...

How should we handle the following pages? Note that their URLs contain page=XX, where XX is 10, 20, 30… So we will create a base URL and then append these numbers. (Note that for this exercise we will only scrape the first 5 pages.)

base_url <- "http://ipaidabribe.com/reports/paid?page="
pages <- seq(0, 40, by=10)
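As a quick check of the URL scheme, the full set of page URLs can be built in one step. This is a small sketch using paste0(), which is equivalent to paste(..., sep="").

```r
base_url <- "http://ipaidabribe.com/reports/paid?page="
pages <- seq(0, 40, by = 10)

# paste0() is vectorised, so a single call builds all five URLs
urls <- paste0(base_url, pages)
urls
```

Printing the URLs before starting a scrape is a cheap way to catch an off-by-one error in the page offsets.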

Now we just need to loop over the pages, use the function we created earlier to scrape each one, and add the result to the list. The loop starts at 2 because the first page is already stored in the list. Note that we pause for a couple of seconds between HTTP requests to avoid overloading the server, and print a message that informs us of the progress of the loop.

for (i in 2:length(pages)){
    # informative message about progress of loop
    message(i, '/', length(pages))
    # prepare URL
    url <- paste(base_url, pages[i], sep="")
    # scrape website
    bribes[[i]] <- scrape_bribe(url)
    # wait a couple of seconds between URL calls
    Sys.sleep(2)
}

## 2/5

## 3/5

## 4/5

## 5/5
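In a longer scrape, a single failed request would stop a loop like this one. A possible variation, sketched here under assumptions, wraps each call in tryCatch() so that failures are recorded and the loop carries on. To keep the sketch runnable offline, fake_scrape is a made-up stand-in for scrape_bribe() that fails on one URL on purpose.

```r
# Stand-in for scrape_bribe() (hypothetical); fails on one page on purpose
fake_scrape <- function(url) {
    if (grepl("page=20$", url)) stop("simulated HTTP error")
    data.frame(url = url, stringsAsFactors = FALSE)
}

urls <- paste0("http://ipaidabribe.com/reports/paid?page=", seq(0, 40, by = 10))
results <- list()
failed <- character(0)

for (i in seq_along(urls)) {
    # return NULL instead of stopping when the scraper raises an error
    page <- tryCatch(fake_scrape(urls[i]), error = function(e) NULL)
    if (is.null(page)) {
        failed <- c(failed, urls[i])            # remember which page to retry
    } else {
        results[[length(results) + 1]] <- page  # keep the successful page
    }
    # Sys.sleep(2) would go here when making real requests
}

failed
length(results)
```

The vector failed can then be passed through the same loop again once the scrape has finished.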

The final step is to convert the list of data frames into a single data frame that we can work with, using the function do.call(rbind, LIST) (where LIST is a list of data frames).

bribes <- do.call(rbind, bribes)
head(bribes)

##   amounts                      transaction                            dept
## 1     200      Customs Check and Clearance Customs, Excise and Service Tax
## 2   54000             Temporary Connection    Electricity and Power Supply
## 3    1000   Bribing of Government Officers                 Public Services
## 4    5000                 VAT Registration  Commercial Tax, Sales Tax, VAT
## 5    3000        Duplicate Driving License                       Transport
## 6     500 Police Verification for Passport                          Police

str(bribes)

## 'data.frame':    50 obs. of  3 variables:
##  $ amounts    : num  200 54000 1000 5000 3000 500 100000 2000 1200 500 ...
##  $ transaction: chr  "Customs Check and Clearance" "Temporary Connection" "Bribing of Government Officers" "VAT Registration" ...
##  $ dept       : chr  "Customs, Excise and Service Tax" "Electricity and Power Supply" "Public Services" "Commercial Tax, Sales Tax, VAT" ...
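To see what do.call(rbind, LIST) does, it can help to try it on a tiny list. The values below are made up for illustration.

```r
# Toy pages standing in for the scraped data frames (values are made up)
page_list <- list(
    data.frame(amounts = c(200, 500), dept = c("Police", "Transport"),
               stringsAsFactors = FALSE),
    data.frame(amounts = 1000, dept = "Railways", stringsAsFactors = FALSE)
)

# do.call() passes each list element as a separate argument, so this is
# the same as calling rbind(page_list[[1]], page_list[[2]])
combined <- do.call(rbind, page_list)
nrow(combined)
combined$amounts
```

This works for any number of pages, as long as every data frame in the list has the same column names.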

Let’s get some quick descriptive statistics to check everything worked. First, what is the most common transaction during which a bribe was paid?

tab <- table(bribes$transaction) # frequency table
tab <- sort(tab, decreasing=TRUE)   # sorting the table from most to least common
head(tab)

## 
## Police Verification for Passport               Traffic Violations 
##                                7                                4 
##          Background Verification                       Check Post 
##                                2                                2 
##                    Contract Work               Garbage Collection 
##                                2                                2
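The same frequency table can also be expressed as shares rather than counts. Here is a small sketch on a made-up vector of transactions, using prop.table() from base R.

```r
# Made-up transactions, for illustration only
transactions <- c("Traffic Violations", "Police Verification for Passport",
                  "Police Verification for Passport", "Check Post",
                  "Police Verification for Passport", "Traffic Violations")

tab <- sort(table(transactions), decreasing = TRUE)  # counts, most common first
shares <- prop.table(tab)                            # counts divided by the total

names(tab)[1]
shares
```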

What was the average bribe payment?

summary(bribes$amounts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100     500    2000   32000    3500 1000000

And what was the average payment for each department?

agg <- aggregate(bribes$amounts, by=list(dept=bribes$dept), FUN=mean)
agg[order(agg$x, decreasing = TRUE),] # ordering from highest to lowest

##                               dept          x
## 7        Health and Family Welfare 501750.000
## 16                         Revenue 123000.000
## 5     Electricity and Power Supply  77000.000
## 12                          Police  22070.000
## 8                       Income Tax  10000.000
## 9               Municipal Services   6000.000
## 4                        Education   5000.000
## 19   Urban Development Authorities   5000.000
## 17         Stamps and Registration   4166.667
## 2   Commercial Tax, Sales Tax, VAT   3500.000
## 6                           Forest   3000.000
## 3  Customs, Excise and Service Tax   1600.000
## 18                       Transport   1485.714
## 1                         Airports   1000.000
## 11                        Passport   1000.000
## 14                 Public Services   1000.000
## 13      Public Sector Undertakings    800.000
## 15                        Railways    800.000
## 10                          Others    500.000
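With only a few reports per department, a single very large payment can dominate the mean, as the gap between the median and the mean in the summary above suggests. One possible complement, sketched on made-up data, is to compute the median alongside the mean. This version uses the formula interface of aggregate(), which keeps the column name amounts instead of x.

```r
# Made-up data with one very large payment in the Police group
toy <- data.frame(
    dept = c("Police", "Police", "Police", "Transport", "Transport"),
    amounts = c(500, 1000, 100000, 1000, 2000),
    stringsAsFactors = FALSE)

# formula interface: summarise amounts within each value of dept
agg_mean <- aggregate(amounts ~ dept, data = toy, FUN = mean)
agg_median <- aggregate(amounts ~ dept, data = toy, FUN = median)

# one row per department, with both statistics side by side
both <- merge(agg_mean, agg_median, by = "dept",
              suffixes = c("_mean", "_median"))
both
```

In the toy data the Police median is 1000 while the mean is pulled above 30000 by the single large value.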


Pablo Barbera

August 29, 2017
