For reasons that often escape my understanding, many governmental agencies do not release data in a machine-readable format; instead, they just upload a series of PDF files to their website. Similarly, textual documents (parliamentary speeches, press releases, etc.) are commonly released just in PDF format.
PDF (Portable Document Format) documents are just containers for a series of different types of objects (text, images, fonts, and metadata), stored in such a way that they can be displayed in exactly the same way across different operating systems.
Precisely because of its versatility, it is hard to come up with a single method to extract data contained in a PDF file. But there are two general cases, which we will cover today: table data (e.g. election results) and textual data (e.g. speeches).
Note that all the cases below assume that the actual text or data is embedded as such in the document, and not just as images (e.g. scanned text). If you cannot select text or data in your document and copy and paste it somewhere else, the examples here unfortunately won’t be useful. For those cases, other approaches based on OCR (Optical Character Recognition) would be more appropriate, but they go beyond the scope of this course.
First, we’ll learn how to use the tabulizer package, created by Thomas Leeper and currently maintained by Tom Paskhalis. It connects R to the Tabula Java library, which can be used to extract tables from PDF documents.
Note that tabulizer depends on rJava, which can be somewhat complicated to install on a Windows computer. See here for instructions on how to install it on your own laptop.
The first example will be relatively easy: the file docs/2016results.pdf contains the certified election results for the 2016 presidential election. The goal here is to extract the table on the first page. We will use the extract_tables function.
library(tabulizer)
d <- extract_tables("~/data/2016results.pdf", pages=1)
Note that tabula is sufficiently smart to extract only the table and discard the rest.
Just like html_table in rvest, extract_tables will return a list (by default, of character matrices rather than data frames), so we’ll have to select just the first element.
results <- d[[1]]
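Before picking an element, it can help to check how many tables came back and how large each one is. The snippet below is only a sketch: fake_tables is a made-up list of small tables (stored here as character matrices) standing in for the output of extract_tables, so it runs without a PDF.

```r
# Hypothetical stand-in for the list returned by extract_tables():
# two character matrices of different sizes (made-up content).
fake_tables <- list(
  matrix(c("State", "AL", "AK", "Votes", "10", "20"), ncol = 2),
  matrix(c("Note", "Source: made-up data"), ncol = 1)
)

# Number of tables found
n_tables <- length(fake_tables)

# Rows and columns of each table; each column of the result is one table
table_dims <- sapply(fake_tables, dim)

# Pick the table with the most rows, which is often the one we want
largest <- fake_tables[[which.max(table_dims[1, ])]]
```

With a real PDF, the same sapply call on the object returned by extract_tables should show which element holds the table of interest.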
As usual, we will need to clean the data – removing the first and last row, assigning variable names, removing characters from numeric elements…
results <- results[-c(1, nrow(results)),]
results <- data.frame(results, stringsAsFactors=F)
names(results) <- c("state", "total", "trump", "clinton")
results$trump <- gsub(" .*$", "", results$trump)
results$clinton <- gsub(" .*$", "", results$clinton)
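One caveat with this kind of cleaning: after gsub the columns are still character vectors, and character vectors are compared alphabetically rather than numerically. The sketch below uses made-up values (not taken from the PDF) and assumes that counts use commas as thousands separators; clean_count is a hypothetical helper, not part of tabulizer.

```r
# Made-up vote counts in the style "count (share)"; not real data
raw <- c("1,318,255 (62.1%)", "95,369 (30.3%)")

# Hypothetical helper: drop everything after the first space,
# remove thousands separators, and convert to numeric
clean_count <- function(x) {
  x <- gsub(" .*$", "", x)
  as.numeric(gsub(",", "", x))
}

counts <- clean_count(raw)

# Kept as text, "95,369" sorts after "1,318,255"; as a number it is smaller
as_text <- gsub(" .*$", "", raw)
```

Whether this extra step is needed depends on how the numbers are formatted in the extracted table, so it is worth inspecting the columns before comparing them.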
Let’s now check which states each candidate won:
results$state[results$clinton > results$trump]
## [1] "CA" "CO" "CT" "DE" "DC" "HI" "IL" "ME" "MD" "MA" "MN" "NV" "NH" "NJ"
## [15] "NM" "NY" "OR" "RI" "VT" "VA" "WA"
results$state[results$clinton < results$trump]
## [1] "AL" "AK" "AZ" "AR" "FL" "GA" "ID" "IN" "IA" "KS" "KY" "LA" "MI" "MS"
## [15] "MO" "MT" "NE" "NC" "ND" "OH" "OK" "PA" "SC" "SD" "TN" "TX" "UT" "WV"
## [29] "WI" "WY"
Let’s now work on a more complex example. What happens when there are multiple tables on the same page? tabulizer has an interactive tool that will help you identify the specific parts of a page that contain the table, the extract_areas function. It will display a viewer window where you can see the entire page, and then you can select the part of the page that contains the table.
d <- extract_areas("~/data/multiple-tables.pdf")
In this case it may not be as useful, because the regular extract_tables would have also worked, but for very large pages, or when you want to modify the default selected area, this can come in handy:
d <- extract_tables("~/data/multiple-tables.pdf")
tab <- d[[3]][-(1:2),]
performance <- as.numeric(substr(tab[,3], 1, 6))
# and now produce a bar plot
par(mai=c(1,2,1,1))
barplot(performance, names.arg = tab[,1], horiz=TRUE, las=1)
The other common scenario consists of extracting text that is embedded in PDF files. As noted above, how easy it is to convert the PDF file into machine-readable text will depend on whether the text is internally stored as such, and not as an image.
There are different methods to extract the text. Here I’ll show the one that in my opinion works best: pdftotext, an open-source tool that is part of the Xpdf project. If you want to install it on your laptop, you can download it here, but it is already installed in your RStudio Server.
pdftotext is a command-line tool, so it is normally run not in the R console but in your operating system’s console. For example, if you’re using a Mac, you would open the terminal and type the command there. Since the way to do this varies across systems, we will instead run it from within R using the system function. Calling it without any arguments prints the available options:
system("pdftotext")
As you can see, we can run pdftotext with different configurations. Which one is best will depend on your application. As a first example, let’s look at a press release from the European Court of Human Rights about the outcome of a case.
system("pdftotext ~/data/press-release.pdf")
The output file (in plain text) will have the same name as the PDF, with a .txt extension, unless we specify a different one.
system("pdftotext ~/data/press-release.pdf ~/data/press-release-output.txt")
If you look at the text of the file, you can see some of what we discussed earlier - any text that is internally represented as an image cannot be parsed.
We can also choose the specific pages to parse:
system("pdftotext -f 1 -l 2 ~/data/press-release.pdf")
Let’s now work on a more advanced example. The document docs/arrests.pdf contains a list of Argentinian citizens arrested during the Videla regime. We’ll try to parse the list here into a data frame format.
system("pdftotext ~/data/arrests.pdf")
Note that by default pdftotext will try to ignore the column layout, but we can keep it with the -layout option:
system("pdftotext -layout ~/data/arrests.pdf ~/data/arrests-layout.txt")
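Since every call follows the same pattern, the command string can also be assembled programmatically. The following is a minimal sketch, not part of pdftotext itself: pdftotext_cmd is a hypothetical helper that only knows about the -f, -l and -layout options used in this section, and it builds the command without running it.

```r
# Hypothetical helper that assembles a pdftotext command line.
# Only the -f, -l and -layout options used in this section are supported.
pdftotext_cmd <- function(input, output = NULL, first = NULL, last = NULL,
                          layout = FALSE) {
  args <- character(0)
  if (!is.null(first)) args <- c(args, "-f", first)
  if (!is.null(last))  args <- c(args, "-l", last)
  if (layout)          args <- c(args, "-layout")
  # path.expand() resolves "~" before quoting, because a quoted "~" is not
  # expanded by the shell; shQuote() protects paths that contain spaces
  args <- c(args, shQuote(path.expand(input)))
  if (!is.null(output)) args <- c(args, shQuote(path.expand(output)))
  paste(c("pdftotext", args), collapse = " ")
}

cmd <- pdftotext_cmd("my file.pdf", "out.txt", first = 1, last = 2,
                     layout = TRUE)
# The resulting string could then be passed to system(cmd)
```

Building the string first makes it easy to print and inspect the command before handing it to system.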
We can now use regular expressions to identify the blocks of text with the names of those arrested (because they are always in the first article of each decree), as well as the dates (because they are always preceded by “Bs. As.”):
txt <- readLines("~/data/arrests.txt")
## Warning in readLines("~/data/arrests.txt"): incomplete final line found on
## '~/data/arrests.txt'
txt[200:250]
## [1] ""
## [2] "Artículo 1° — Déjase sin efecto el arresto a"
## [3] "disposición del Poder Ejecutivo Nacional para"
## [4] "con las personas de: José Luis GHIGO (Cl"
## [5] "7.490.739); Jorge Luis VIAGGIO (CI 969.148);"
## [6] "Jorge Martín VIAGGIO (MI 11.313.823)."
## [7] ""
## [8] "Bs. As., 2/6/1976"
## [9] ""
## [10] "Art. 2° — Comuníquese, cúmplase y ARCHIVESE. — VIDELA."
## [11] "#F4429688F#"
## [12] "#I4429691I#"
## [13] "Decreto S 660/1976"
## [14] ""
## [15] "Que constituye una primordial responsabilidad de Gobierno consolidar la paz interior,"
## [16] "asegurar la tranquilidad y el orden públicos"
## [17] "y preservar los permanentes intereses de la"
## [18] "República,"
## [19] ""
## [20] "VISTO los Decretos 1368 del 6 de noviembre de"
## [21] "1974 y 2717 del 1° de octubre de 1975, y"
## [22] ""
## [23] "EL PRESIDENTE"
## [24] "DE LA NACION ARGENTINA"
## [25] "DECRETA:"
## [26] "Artículo 1° — Arréstese a disposición del Poder Ejecutivo Nacional a: Carlos Alberto CAMPAGNOLE (CI 3.052.291); Anibal Pablo CARIBONI (CI 2.842.010); Eladio CASTRO (CI 1.084.218);"
## [27] "Jorge Elchanan KICK (CI 2.764.855); Jesús COTTU (MI 5.990.093); Alfredo Ernesto ROSSI (MI"
## [28] "8.238.382); José Nicanor CASAS (MI 8.563.258);"
## [29] "Carlos Alberto ALAGA (MI 7.639.595); Francisco CAMACHO LOPEZ (DNI 11.516.388); Daniel"
## [30] "ILLANES (MI 8.666.915); Oscar Jorge COMAS"
## [31] "(MI 8.150.103); Víctor Eduardo CARVAJAL (MI"
## [32] "7.808.205); Tristán BALAGUER ZAPATA (MI"
## [33] "3.153.865); Alfredo Rafael AVILA (MI 7.639.727);"
## [34] "Sohar Ramón COSTA (MI 6.764.667); Adolfo"
## [35] "Edgardo SILVEIRA (DNI 10.800.925); Héctor"
## [36] "Gustavo LOPEZ (DNI 11.836.730); Juan Luis"
## [37] "FEFA (MI 8.619.248); César Ambrosio GIOJA (M.I. 6.941.527); Jorge Alfredo FRIAS (MI"
## [38] "11.388.797); Ramón FABREGA (MI 6.766.994);"
## [39] "Antonio D’AMICO LICATA (DNI 10.029.790); José"
## [40] "Luis GIOJA (MI 7.807.820); Miguel Angel NEIRA (MI 5.404.917); Juan Carlos SALGADO (MI"
## [41] "7.646.641); Raúl Héctor CANO (MI 5.543.644);"
## [42] "Damián MARTHOFLEARCH (CI 123.071); Ricardo Sergio Ramón VIERA (MI 4.098.387); Margarita Juana HOBSON (LC 1.829.060); Hugo Anibal"
## [43] "BALCAZA (MI 5.083.543); Miguel Angel RAGONESE (C.I. nº 9.512.615); Horacio Oscar SARAGOVI (CI 6.115.554)."
## [44] "Art. 2° — Las personas mencionadas en el"
## [45] "Artículo 1° deberán permanecer en el lugar de"
## [46] "detención que al efecto se determine."
## [47] "Art. 3° — Comuníquese, cúmplase y ARCHIVESE. — VIDELA."
## [48] "#F4429685F#"
## [49] ""
## [50] "#I4429688I#"
## [51] "Decreto S 659/1976"
# names of those arrested
ar.init <- grep("Arréstese.*", txt)
ar.ends <- grep("Art. 2.*", txt)
txt[ ar.init[1] : ar.ends[1] ]
## [1] "Artículo 1° — Arréstese a disposición del"
## [2] "Poder Ejecutivo Nacional a: Hermes Carlos ACCATOLI (MI 7.358.480); Orlando Jesús"
## [3] "SCARTZZINI (MI 6.715.577); Carlos Alberto BARRERA (MI 4.145.476); Hugo Ignacio MOLINA"
## [4] "OLIVA (MI 7.970.980); Blanca Nélida HOYOS"
## [5] "(DNI 10.941.195); Aurelio Santos CHAPARRO"
## [6] "(MI 7.058.160); Miguel Angel PEREZ VALDEZ"
## [7] "(MI 6.831.888); Santos Luis ORTEGA SARA (MI"
## [8] "12.484.795); Eduardo Federico LAVINI ROSSI (MI 8.441.378); Daniel Ernesto HERRERA"
## [9] "(DNI 10.012.765); Roberto Angel RUCCI (MI"
## [10] ""
## [11] "8.556.965); Miguel Angel ERREGUERENA (DNI"
## [12] "10.370.835); Patricia Yolanda MOLINARI (DNI"
## [13] "11.991.647); Guillermo Eduardo CANGARO (DNI"
## [14] "13.267.111); Ricardo Alfredo VALENTE (DNI"
## [15] "13.233.200)."
## [16] "Art. 2° — Las personas mencionadas en el"
# dates of arrests
dates <- grep("^Bs\\. As\\.", txt)
txt[ dates[1] ]
## [1] "Bs. As., 18/8/1976"
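A detail worth keeping in mind with these patterns: an unescaped dot matches any character, and a pattern without ^ can match anywhere in a line. The sketch below illustrates the difference on a toy vector; the first two lines are shortened from the text above and the last two are made up for contrast.

```r
# Toy lines: the first two are shortened from the decrees,
# the last two are made up to show false matches
toy <- c("Bs. As., 2/6/1976",
         "Art. 2° Las personas mencionadas en el",
         "see Art. 20 of the decree",
         "Bsx Asy, not a date")

# Loose patterns: unescaped dots, and no anchor for the article
loose_art  <- grep("Art. 2.*", toy)
loose_date <- grep("^Bs. As.", toy)

# Stricter patterns: anchored at the start of the line, literal dots,
# and a non-digit required after the article number
strict_art  <- grep("^Art\\. 2[^0-9]", toy)
strict_date <- grep("^Bs\\. As\\.", toy)
```

Here the loose patterns also pick up the two made-up lines, while the stricter ones return only the lines that really open an article or a date.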
Let’s try to scrape the data for the first set of arrests and later on we’ll generalize:
init <- ar.init[1]
end <- ar.ends[ar.ends > init][1] # the first end line after the line we just chose
# this is what we will try to extract:
(data <- txt[init:end])
## [1] "Artículo 1° — Arréstese a disposición del"
## [2] "Poder Ejecutivo Nacional a: Hermes Carlos ACCATOLI (MI 7.358.480); Orlando Jesús"
## [3] "SCARTZZINI (MI 6.715.577); Carlos Alberto BARRERA (MI 4.145.476); Hugo Ignacio MOLINA"
## [4] "OLIVA (MI 7.970.980); Blanca Nélida HOYOS"
## [5] "(DNI 10.941.195); Aurelio Santos CHAPARRO"
## [6] "(MI 7.058.160); Miguel Angel PEREZ VALDEZ"
## [7] "(MI 6.831.888); Santos Luis ORTEGA SARA (MI"
## [8] "12.484.795); Eduardo Federico LAVINI ROSSI (MI 8.441.378); Daniel Ernesto HERRERA"
## [9] "(DNI 10.012.765); Roberto Angel RUCCI (MI"
## [10] ""
## [11] "8.556.965); Miguel Angel ERREGUERENA (DNI"
## [12] "10.370.835); Patricia Yolanda MOLINARI (DNI"
## [13] "11.991.647); Guillermo Eduardo CANGARO (DNI"
## [14] "13.267.111); Ricardo Alfredo VALENTE (DNI"
## [15] "13.233.200)."
## [16] "Art. 2° — Las personas mencionadas en el"
# now let's convert everything into a single string
data <- paste(data, collapse=" ")
# note that everything up to "Nacional a:" and everything from "Art. 2" onwards is not needed
data <- gsub(".*Nacional a:? (.*)\\. Art.*", "\\1", data)
# and let's split it back into substrings divided by ";"
data <- strsplit(data, ";")[[1]]
# we're almost there! Note that the name is everything *before* the parenthesis
names <- gsub(" ?(.*) \\(.*", "\\1", data)
# and the DNI (ID number) is everything *inside* it
dni <- gsub(".*\\((.*)\\)", "\\1", data)
# now, let's go back to the date
date <- txt[tail(dates[dates < init], n=1)] # last date before init
date <- gsub("Bs\\. As\\., ", "", date) # remove Bs. As.
# put everything into a data frame
df <- data.frame(date, names, dni, stringsAsFactors=FALSE)
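The cleaning steps above can also be wrapped in a small function, which makes them easier to test on a short example before running them on the whole file. This is only a sketch: parse_block is a hypothetical helper name, and the block below is a shortened version of the first decree shown above.

```r
# Hypothetical helper wrapping the cleaning steps for one block of lines
parse_block <- function(block) {
  one <- paste(block, collapse = " ")
  # keep only the list of people between "Nacional a:" and ". Art"
  one <- gsub(".*Nacional a:? (.*)\\. Art.*", "\\1", one)
  # one element per person, without surrounding spaces
  people <- trimws(strsplit(one, ";")[[1]])
  data.frame(
    names = gsub("(.*) \\(.*", "\\1", people),   # text before the parenthesis
    dni   = gsub(".*\\((.*)\\)", "\\1", people), # text inside the parenthesis
    stringsAsFactors = FALSE
  )
}

# Shortened block based on the first decree
block <- c("Poder Ejecutivo Nacional a: Hermes Carlos ACCATOLI (MI 7.358.480); Orlando Jesús",
           "SCARTZZINI (MI 6.715.577).",
           "Art. 2° Las personas mencionadas en el")
parsed <- parse_block(block)
```

The result has one row per person, with the name and the ID number in separate columns.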
That seemed to work! Let’s now replicate it for the entire dataset, inside a loop:
arrests <- c()
for (init in ar.init){
  # extracting text
  end <- ar.ends[ar.ends > init][1]
  data <- txt[init:end]
  # cleaning text
  data <- paste(data, collapse=" ")
  data <- gsub(".*Nacional a:? (.*)\\. Art.*", "\\1", data)
  data <- strsplit(data, ";")[[1]]
  # extracting names and DNI (the " ?" drops the space left after each ";")
  names <- gsub(" ?(.*) \\(.*", "\\1", data)
  dni <- gsub(".*\\((.*)\\)", "\\1", data)
  # extracting dates
  date <- txt[tail(dates[dates < init], n=1)] # last date before init
  date <- gsub("Bs\\. As\\., ", "", date) # remove Bs. As.
  # everything into a data frame
  df <- data.frame(date, names, dni, stringsAsFactors=FALSE)
  arrests <- rbind(arrests, df)
}
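Growing a data frame with rbind inside a loop is fine for a file of this size. For larger files, a common alternative is to collect the pieces in a list and bind them once at the end. The sketch below uses made-up rows standing in for the df built in each iteration, and also shows how the day/month/year strings could be converted into proper dates.

```r
# Made-up per-decree results standing in for the `df` of each iteration
pieces <- list(
  data.frame(date = "18/8/1976", names = c("Person A", "Person B"),
             dni = c("MI 1", "MI 2"), stringsAsFactors = FALSE),
  data.frame(date = "2/6/1976", names = "Person C",
             dni = "CI 3", stringsAsFactors = FALSE)
)

# Bind all pieces in one step instead of growing a data frame in the loop
all_rows <- do.call(rbind, pieces)

# Convert day/month/year strings into Date objects
all_rows$date <- as.Date(all_rows$date, format = "%d/%m/%Y")
```

Once the column is stored as Date, the arrests can be sorted or counted by month with the usual date functions.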
If you look at the data frame, you’ll see it’s not perfect, but we could fix the remaining errors by hand, or tweak the regular expressions until they work.
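One way to find the rows that need attention is to flag ID numbers that do not follow the expected shape. The sketch below uses a made-up data frame and assumes that a well-formed ID is a document type in capital letters followed by a dotted number, as in “MI 7.358.480”. Less common variants that appear in the text, such as “C.I. nº”, would also be flagged, which errs on the side of caution.

```r
# Made-up rows: the third ID still contains leftover text
check <- data.frame(
  names = c("Person A", "Person B", "Person C"),
  dni   = c("MI 7.358.480", "DNI 10.941.195", "MI 8.556.965). Art. 2"),
  stringsAsFactors = FALSE
)

# Assumed shape: capital letters (or dots), a space, then digits and dots
looks_ok <- grepl("^[A-Z.]+ [0-9.]+$", check$dni)

# Rows that need a closer look
suspicious <- check[!looks_ok, ]
```

Applied to the arrests data frame, a check like this would narrow down the rows to inspect by hand.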