Scraping web data behind web forms

The most difficult scenario for web scraping is when the data is hidden behind multiple pages that can only be accessed by entering information into web forms. There are a few approaches that might work in these cases, with varying degrees of difficulty and reliability, but in my experience the best method is to use Selenium.

Selenium automates web browsing sessions and was originally designed for testing purposes. You can simulate clicks, enter information into web forms, add waiting time between actions, and more.

To learn how it works, we will scrape the website Monitor Legislativo, which provides information about the candidates in the recent Venezuelan legislative elections.

url <- 'http://eligetucandidato.org/filtro/'

As you can see on that page, the information we want to scrape is hidden behind two drop-down menus (state and party). Let’s see how we can use Selenium to get to it.

The first step is to load the two packages we will need, RSelenium and wdman. Then, we will start a headless browser running in the background.

library(RSelenium)
library(wdman)

server <- phantomjs(port=5000L)
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
browser <- remoteDriver(browserName = "phantomjs", port=5000L)

Note that you may need to change the server port if it is already in use. Now we can open an instance of PhantomJS and navigate to the URL:

browser$open()
## [1] "Connecting to remote server"
## $browserName
## [1] "phantomjs"
## 
## $version
## [1] "2.1.1"
## 
## $driverName
## [1] "ghostdriver"
## 
## $driverVersion
## [1] "1.2.0"
## 
## $platform
## [1] "linux-unknown-64bit"
## 
## $javascriptEnabled
## [1] TRUE
## 
## $takesScreenshot
## [1] TRUE
## 
## $handlesAlerts
## [1] FALSE
## 
## $databaseEnabled
## [1] FALSE
## 
## $locationContextEnabled
## [1] FALSE
## 
## $applicationCacheEnabled
## [1] FALSE
## 
## $browserConnectionEnabled
## [1] FALSE
## 
## $cssSelectorsEnabled
## [1] TRUE
## 
## $webStorageEnabled
## [1] FALSE
## 
## $rotatable
## [1] FALSE
## 
## $acceptSslCerts
## [1] FALSE
## 
## $nativeEvents
## [1] TRUE
## 
## $proxy
## $proxy$proxyType
## [1] "direct"
## 
## 
## $id
## [1] "45a66c10-9388-11e8-8b45-5da65e55ac46"
browser$navigate(url)

Here’s how we would check that it worked:

src <- browser$getPageSource()
substr(src, 1, 1000)
## [1] "<!DOCTYPE html><html lang=\"es-ES\" class=\" js flexbox canvas canvastext no-webgl touch no-geolocation postmessage websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent no-video no-audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths js_active  vc_mobile  vc_transform \" style=\"height: auto; overflow: auto;\"><head>\n\t<meta charset=\"UTF-8\">\n\t\n\t<title>Elige Tu Candidato |   Resultados de la consulta</title>\n\n\t\n\t\t\t\n\t\t\t\t\t\t<meta name=\"viewport\" content=\"width=device-width,initial-scale=1,user-scalable=no\">\n\t\t\n\t<link rel=\"profile\" href=\"http://gmpg.org/xfn/11\">\n\t<link rel=\"pingback\" href=\"http://eligetucandidato.org/xmlrpc.php\">\n\t<link rel=\"shortcut icon\" type=\"image/x-icon\" href=\"http://eligetucandidato.org/wp-content/uploads/2015"
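Since getPageSource() returns the full page source as one long string, we could also hand it to an HTML parser at this point instead of inspecting it with substr(). Here is a minimal sketch using rvest (not used in the rest of this example; it assumes the package is installed) that extracts the page title we can see in the raw source above:

library(rvest)

# parse the source returned by Selenium and pull out the <title> element
read_html(src[[1]]) %>% html_node("title") %>% html_text()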

We can see what the website looks like at any time by taking screenshots. This will become very useful as we start playing with the web form.

browser$screenshot(display=TRUE)

Let’s assume we want to see the results for the state of Distrito Capital and the GPP party. First, we use selectorGadget to identify the form elements we need to interact with. Then, clicking on the lists of states and parties shows the exact values we need to enter.

state <- browser$findElement(using = 'id', value="estado")
state$sendKeysToElement(list("Distrito Capital"))
browser$screenshot(display=TRUE)

party <- browser$findElement(using = 'id', value="partido")
party$sendKeysToElement(list("GPP"))
browser$screenshot(display=TRUE)

That seemed to work! Finally, let’s locate the Send button and click on it.

send <- browser$findElement(using = 'id', value="enviar")
send$clickElement()
browser$screenshot(display=TRUE)

From this website, we want to scrape the URL of each candidate’s page. Note that there are some duplicates, which we will need to remove. As in the previous cases, we can use selectorGadget to help us identify the elements we want to extract. Then, for each of these elements, we extract the URL and the name of the candidate.

links <- browser$findElements(using="css selector", value=".portfolio_title a")

urls <- rep(NA, length(links))
names <- rep(NA, length(links))
for (i in 1:length(links)){
    urls[i] <- links[[i]]$getElementAttribute('href')[[1]]
    names[i] <- links[[i]]$getElementText()[[1]]
}

# removing duplicates
urls <- urls[!duplicated(urls)]
urls
## [1] "http://eligetucandidato.org/perfil-candidato/?id=6432672"
## [2] "http://eligetucandidato.org/perfil-candidato/?id=4376240"

The final step would be to scrape the information from each of these candidate websites, and do all of this inside loops over states and parties, and then candidates. I will not cover the rest of the example in class, but I’m pasting the code below in case you want to use parts of it for your projects.

Before we switch topics, one more thing: after scraping, it’s important to clean up by closing the browser we opened in the background.

browser$close()
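
The PhantomJS process we started through wdman also keeps running in the background. Assuming the server object created above, it can be shut down with the stop() function it exposes:

server$stop()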

And now, here’s the rest of the code:

############################################################
## first, we scrape the list of districts and parties
############################################################

url <- 'http://eligetucandidato.org/filtro/'
txt <- readLines(url)

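# the <option> entries for the 'estado' and 'partido' drop-down menus sit on
# these lines of the raw HTML; the hard-coded indices may break if the page changes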
estados <- txt[288:311]
estados <- gsub(".*>(.*)<.*", estados, repl="\\1")

partidos <- txt[315:317]
partidos <- gsub(".*>(.*)<.*", partidos, repl="\\1")

############################################################
## now, loop over pages to extract the URL for each candidate
############################################################

library(RSelenium)
library(wdman)

server <- phantomjs(port=4446L, verbose=FALSE)
browser <- remoteDriver(browserName = "phantomjs", port=4446L)

browser$open()

candidatos <- c()

for (estado in estados){
    message(estado)
    for (partido in partidos){
        message(partido)

        browser$navigate(url)
        # input: estado
        Sys.sleep(1)
        more <- browser$findElement(using = 'id', value="estado")
        more$sendKeysToElement(list(estado))
        # input: partido
        Sys.sleep(1)
        more <- browser$findElement(using = 'id', value="partido")
        more$sendKeysToElement(list(partido))
        # click on "buscar"
        more <- browser$findElement(using = 'id', value="enviar")
        more$clickElement()
        # extracting URLS
        more <- browser$findElements(using = 'xpath', value="//a[@target='_self']")
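        # the results are loaded dynamically, so retry for up to 10 seconds
        # until at least two candidate links are present on the page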
        count <- 0
        while (length(more)<2 & count < 10){
            Sys.sleep(1)
            more <- browser$findElements(using = 'xpath', value="//a[@target='_self']")
            count <- count + 1
        }
        urls <- unlist(sapply(more, function(x) x$getElementAttribute('href')))
        urls <- unique(urls)

        candidatos <- c(candidatos, urls)
        message(length(urls), ' candidatos nuevos, ', length(candidatos), ' en total')


    }
}

writeLines(candidatos, con=file("candidatos-urls.txt"))

############################################################
## download the html code of candidates' URLS
############################################################

for (url in candidatos){
    id <- gsub('.*id=', '', url)
    filename <- paste0('html/', id, '.html')
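    # skip candidates whose page we have already downloaded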
    if (file.exists(filename)){ next }
    message(url)
    html <- readLines(url)
    writeLines(html, con=file(filename))
    Sys.sleep(2)
}

############################################################
## extract information from html
############################################################

fls <- list.files("html", full.names=TRUE)
df <- list()

for (i in 1:length(fls)){

    txt <- readLines(fls[i])

    # id
    id <- gsub('html/(.*).html', fls[i], repl="\\1")
    # name
    name <- txt[grep("displayCenter vc_single_image", txt)]
    name <- gsub(".*title=\"(.*)\">.*", name, repl='\\1')
    # partido
    partido <- txt[grep("Partido Político", txt)]
    partido <- gsub('.*</strong> (.*)</span.*', partido, repl="\\1")
    # edad
    edad <- txt[grep("Edad", txt)]
    edad <- gsub('.*ong>(.*)</span.*', edad, repl="\\1")
    # circuito
    circuito <- txt[grep("Circuito", txt)]
    circuito <- gsub('.*Circuito (.*)</span.*', circuito, repl="\\1")[1]
    # estado
    estado <- txt[grep("Circuito", txt)]
    estado <- gsub('.*\">(.*) –&nbsp;.*', estado, repl="\\1")[1]
    # twitter
    twitter <- txt[grep("fa-twitter.*@", txt)]
    twitter <- gsub('.*@(.*)</spa.*', twitter, repl="\\1")
    if (length(twitter)==0){ twitter <- NA }

    df[[i]] <- data.frame(
        id = id, name = name, partido = partido, edad = edad,
        circuito = circuito, estado = estado, twitter = twitter,
        stringsAsFactors=F)
}

df <- do.call(rbind, df)
df <- df[order(df$estado, df$partido, df$circuito),]

write.csv(df, file="venezuela-monitor-legislativo-data.csv",
    row.names=FALSE)