For reasons that often escape my understanding, many governmental agencies do not release data in a machine-readable format; instead, they just upload a series of PDF files to their website. Similarly, textual documents (parliamentary speeches, press releases, etc.) are commonly released just in PDF format. PDF (Portable Document Format) documents are just containers for a series of different types of objects (text, images, fonts, and metadata), stored in such a way that it can be displayed in exactly the same way across different operating systems. Precisely because of its versatility, it is hard to come up with a single method to extract data contained in a PDF file. But there are two general cases, which we will cover today: table data (e.g. election results) and textual data (e.g. speeches). Note that all the cases below assume that the actual text or data is embedded as such in the document, and not just as images (e.g. scanned text). If you cannot select text or data in your document, and copy and paste somewhere else, the examples here unfortunately won't be useful. For those cases, other approaches based on OCR (Optical Character Recognization) would be more appropriate, but go beyond the scope of this course. #### Extracting tables from PDF files First, we'll learn how to use the [tabulizer package](https://github.com/ropensci/tabulizer), created by Thomas Leeper and currently maintained by Tom Paskhalis. It connects R to the [Tabula java library](https://github.com/tabulapdf/tabula), which can be used to extract tables from PDF documents. Note that `tabulizer` depends on rJava, which can be somewhat complicated to install on a Windows computer. See [here](https://github.com/ropensci/tabulizer#installation) for instructions on how to install it in your own laptop. The first example will be relatively easy -- the file `docs/2016results.pdf` contains the certified election results for the 2016 presidential election. The goal here is to extract the table on the first page. We will use the `extract_tables` function. ```{r} library(tabulizer) d <- extract_tables("~/data/2016results.pdf", pages=1) ``` Note that `tabula` is sufficiently smart to extract only the table and discard the rest. Just like `html_table` in `rvest`, `extract_tables` will return a list of data frames, so we'll have to select just the first element. ```{r} results <- d[[1]] ``` As usual, we will need to clean the data -- removing the first and last row, assigning variable names, removing characters from numeric elements... ```{r} results <- results[-c(1, nrow(results)),] results <- data.frame(results, stringsAsFactors=F) names(results) <- c("state", "total", "trump", "clinton") results$trump <- gsub(" .*$", "", results$trump) results$clinton <- gsub(" .*$", "", results$clinton) ``` Let's now check what states did each candidate win: ```{r} results$state[results$clinton > results$trump] results$state[results$clinton < results$trump] ``` Let's now work on a more complex example. What happens when there are multiple tables in the same page? `tabulizer` has an interactive tool that will help you identify the specific parts of a page that contain the table, the `extract_areas` function. It will display a viewer window where you can see the entire page, and then you can select the part of the page that contains the table. ```{r, eval=FALSE} d <- extract_areas("~/data/multiple-tables.pdf") ``` In this case it may not be as useful, because the regular `extract_tables` would have also worked, but for very large pages, or when you want to modify the default selected area, this can come in handy: ```{r} d <- extract_tables("~/data/multiple-tables.pdf") tab <- d[[3]][-(1:2),] performance <- as.numeric(substr(tab[,3], 1, 6)) # and now produce a bar plot par(mai=c(1,2,1,1)) barplot( performance, names = tab[,1], horiz=TRUE, las=1) ``` #### Extracting text from PDF files The other common scenario consists on extracting text that is embedded in PDF files. As noted above, how easy it is to convert the PDF file into machine-readable text will depend on whether the text is internally stored as such, and not as an image. There are different methods to extract the text. Here I'll show the one that in my opinion is better -- `pdftotext`, an open-source tool that is part of the [Xpdf project](http://www.foolabs.com/xpdf/about.html). If you want to install it in your laptop, you can [download it here](http://www.foolabs.com/xpdf/download.html), but it is already installed in your RStudio Server. We will run `pdftotext` not in the R console, but in your operating system's console. For example, if you're using a Mac, you would open the terminal and type the code. Since the way to do this varies across systems, we will instead run it from within R using the `system` function. ```{r} system("pdftotext") ``` As you can see, we can ran pdftotext with different configurations. Which one is best will depend on your application. As a first example, let's look at a press release from the European Court of Human Rights about the outcome a case. ```{r} system("pdftotext ~/data/press-release.pdf") ``` The output file (in plain text) will have the same name unless we change it. ```{r} system("pdftotext ~/data/press-release.pdf ~/data/press-release-output.txt") ``` If you look at the text of the file, you can see some of what we discussed earlier - any text that is internally represented as an image cannot be parsed. We can also choose the specific pages to parse: ```{r} system("pdftotext -f 1 -l 2 ~/data/press-release.pdf") ``` Let's now work on a more advanced example. The document `docs/arrests.pdf` contains a [list of Argentinian citizens arrested during the Pinochet regime](http://ftp2.errepar.com/bo/2013/04/SUPLE_17-04-2013.pdf). We'll try to parse the list here into a data frame format. ```{r} system("pdftotext ~/data/arrests.pdf") ``` Note that by default `pdftotext` will try to ignore the column layout, but if we wanted we would keep it: ```{r} system("pdftotext -layout ~/data/arrests.pdf ~/data/arrests-layout.txt") ``` We can now use regular expressions to identify the blocks of text with the names of the arrests (because they are always in the first article of each decree), as well as the dates (because they all have "Bs. As." right before): ```{r} txt <- readLines("~/data/arrests.txt") txt[200:250] # names of those arrested ar.init <- grep("Arréstese.*", txt) ar.ends <- grep("Art. 2.*", txt) txt[ ar.init[1] : ar.ends[1] ] # dates of arrests dates <- grep("^Bs\\. As\\.", txt) txt[ dates[1] ] ``` Let's try to scrape the data for the first set of arrests and later on we'll generalize: ```{r} init <- ar.init[1] end <- ar.ends[ar.ends > init][1] # the first end line after the line we just chose # this is what we will try to extract: (data <- txt[init:end]) # now let's convert everything into a single string data <- paste(data, collapse=" ") # note that everything before "Ejecutivo Nacional" and starting with "Art. 2" is useless data <- gsub(".*Nacional a:? (.*)\\. Art.*", data, repl="\\1") # and let's split it back into substrings divided by ";" data <- strsplit(data, ";")[[1]] # we're almost there! Note that the name is everything *before* the parenthesis names <- gsub(" ?(.*) \\(.*", data, repl="\\1") # and the DNI (ID number) is everything *after* dni <- gsub(".*\\((.*)\\)", data, repl="\\1") # now, let's go back to the date date <- txt[tail(dates[dates < init], n=1)] # first date after init date <- gsub("Bs\\. As\\., ", "", date) # remove Bs. As. # put everything into a data frame df <- data.frame(date, names, dni, stringsAsFactors=F) ``` That seemed to work! Let's now replicate it for the entire dataset, inside a loop: ```{r} arrests <- c() for (init in ar.init){ # extracting text end <- ar.ends[ar.ends > init][1] data <- txt[init:end] # cleaning text data <- paste(data, collapse=" ") data <- gsub(".*Nacional a:? (.*)\\. Art.*", data, repl="\\1") data <- strsplit(data, ";")[[1]] # extracting names and DNI names <- gsub("(.*) \\(.*", data, repl="\\1") dni <- gsub(".*\\((.*)\\)", data, repl="\\1") # extracting dates date <- txt[tail(dates[dates < init], n=1)] # first date after init date <- gsub("Bs\\. As\\., ", "", date) # remove Bs. As. # everything into a data frame df <- data.frame(date, names, dni, stringsAsFactors=F) arrests <- rbind(arrests, df) } ``` If you look at the data frame, you'll see it's not completely perfect, but the rest we could edit them by hand, or tweak the function until it works.