PDF (Portable Document Format) documents are just containers for a series of different types of objects (text, images, fonts, and metadata), stored in such a way that it can be displayed in exactly the same way across different operating systems.

Precisely because of its versatility, it is hard to come up with a single method to extract data contained in a PDF file. Today will cover perhaps the most general case: textual data (e.g. speeches).

Note that all the cases below assume that the actual text is embedded as such in the document, and not just as images (e.g. scanned text). If you cannot select text or data in your document, and copy and paste somewhere else, the examples here unfortunately won’t be useful. For those cases, other approaches based on OCR (Optical Character Recognization) would be more appropriate, but go beyond the scope of this course.

Extracting text from PDF files

As noted above, how easy it is to convert the PDF file into machine-readable text will depend on whether the text is internally stored as such, and not as an image.

There are different methods to extract the text. Here I’ll show the one that in my opinion is better – pdftotext, an open-source tool that is part of the Xpdf project. If you want to install it in your laptop, you can download it here. You’ll then need to place the pdftotext binary in your home folder.

We will run pdftotext not in the R console, but in your operating system’s console. For example, if you’re using a Mac, you would open the terminal and type the code. Since the way to do this varies across systems, we will instead run it from within R using the system function.

system("~/pdftotext")

As you can see, we can ran pdftotext with different configurations. Which one is best will depend on your application. As a first example, let’s look at a press release from the European Court of Human Rights about the outcome a case.

system("~/pdftotext ../data/press-release.pdf")

The output file (in plain text) will have the same name unless we change it.

system("~/pdftotext ../data/press-release.pdf ../data/press-release-output.txt")

If you look at the text of the file, you can see some of what we discussed earlier - any text that is internally represented as an image cannot be parsed.

We can also choose the specific pages to parse:

system("~/pdftotext -f 1 -l 2 ../data/press-release.pdf")

Let’s now work on a more advanced example. The document docs/arrests.pdf contains a list of Argentinian citizens arrested during the military dictatorship in 1976. We’ll try to parse the list here into a data frame format.

system("~/pdftotext -enc 'UTF-8' ../data/arrests.pdf")

Note that by default pdftotext will try to ignore the column layout, but if we wanted we would keep it:

system("~/pdftotext -enc 'UTF-8' -layout ../data/arrests.pdf ../data/arrests-layout.txt")

We can now use regular expressions to identify the blocks of text with the names of the arrests (because they are always in the first article of each decree), as well as the dates (because they all have “Bs. As.” right before):

txt <- readLines("../data/arrests.txt")
## Warning in readLines("../data/arrests.txt"): incomplete final line found on '../
## data/arrests.txt'
# txt[200:250]
# names of those arrested
ar.init <- grep("Arréstese.*", txt)
ar.ends <- grep("Art. 2.*", txt)
txt[ ar.init[1] : ar.ends[1] ]
## [1] "Artículo 1° — Arréstese a disposición del Poder Ejecutivo Nacional a: Hermes Carlos ACCATOLI (MI 7.358.480); Orlando Jesús SCARTZZINI (MI 6.715.577); Carlos Alberto BARRERA (MI 4.145.476); Hugo Ignacio MOLINA OLIVA (MI 7.970.980); Blanca Nélida HOYOS (DNI 10.941.195); Aurelio Santos CHAPARRO (MI 7.058.160); Miguel Angel PEREZ VALDEZ (MI 6.831.888); Santos Luis ORTEGA SARA (MI 12.484.795); Eduardo Federico LAVINI ROSSI (MI 8.441.378); Daniel Ernesto HERRERA (DNI 10.012.765); Roberto Angel RUCCI (MI"
## [2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [3] "8.556.965); Miguel Angel ERREGUERENA (DNI 10.370.835); Patricia Yolanda MOLINARI (DNI 11.991.647); Guillermo Eduardo CANGARO (DNI 13.267.111); Ricardo Alfredo VALENTE (DNI 13.233.200)."                                                                                                                                                                                                                                                                                                                              
## [4] "Art. 2° — Las personas mencionadas en el Artículo 1° deberán permanecer en el lugar de detención que al efecto se determine."
# dates of arrests
dates <- grep("^Bs\\. As\\.", txt)
txt[ dates[1] ]
## [1] "Bs. As., 18/8/1976"

Let’s try to scrape the data for the first set of arrests and later on we’ll generalize:

init <- ar.init[1]
end <- ar.ends[ar.ends > init][1] # the first end line after the line we just chose
# this is what we will try to extract:
(data <- txt[init:end])
## [1] "Artículo 1° — Arréstese a disposición del Poder Ejecutivo Nacional a: Hermes Carlos ACCATOLI (MI 7.358.480); Orlando Jesús SCARTZZINI (MI 6.715.577); Carlos Alberto BARRERA (MI 4.145.476); Hugo Ignacio MOLINA OLIVA (MI 7.970.980); Blanca Nélida HOYOS (DNI 10.941.195); Aurelio Santos CHAPARRO (MI 7.058.160); Miguel Angel PEREZ VALDEZ (MI 6.831.888); Santos Luis ORTEGA SARA (MI 12.484.795); Eduardo Federico LAVINI ROSSI (MI 8.441.378); Daniel Ernesto HERRERA (DNI 10.012.765); Roberto Angel RUCCI (MI"
## [2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [3] "8.556.965); Miguel Angel ERREGUERENA (DNI 10.370.835); Patricia Yolanda MOLINARI (DNI 11.991.647); Guillermo Eduardo CANGARO (DNI 13.267.111); Ricardo Alfredo VALENTE (DNI 13.233.200)."                                                                                                                                                                                                                                                                                                                              
## [4] "Art. 2° — Las personas mencionadas en el Artículo 1° deberán permanecer en el lugar de detención que al efecto se determine."
# now let's convert everything into a single string
data <- paste(data, collapse=" ")
# note that everything before "Ejecutivo Nacional" and starting with "Art. 2" is useless
data <- gsub(".*Nacional a:? (.*)\\. Art.*", data, repl="\\1")
# and let's split it back into substrings divided by ";"
data <- strsplit(data, ";")[[1]]
# we're almost there! Note that the name is everything *before* the parenthesis
names <- gsub(" ?(.*) \\(.*", data, repl="\\1")
# and the DNI (ID number) is everything *after*
dni <- gsub(".*\\((.*)\\)", data, repl="\\1")

# now, let's go back to the date
date <- txt[tail(dates[dates < init], n=1)] # first date after init
date <- gsub("Bs\\. As\\., ", "", date) # remove Bs. As.

# put everything into a data frame
df <- data.frame(date, names, dni, stringsAsFactors=F)

That seemed to work! Let’s now replicate it for the entire dataset, inside a loop:

arrests <- c()

for (init in ar.init){
    
    # extracting text
    end <- ar.ends[ar.ends > init][1]
    data <- txt[init:end]
    # cleaning text
    data <- paste(data, collapse=" ")
    data <- gsub(".*Nacional a:? (.*)\\. Art.*", data, repl="\\1")
    data <- strsplit(data, ";")[[1]]
    # extracting names and DNI
    names <- gsub("(.*) \\(.*", data, repl="\\1")
    dni <- gsub(".*\\((.*)\\)", data, repl="\\1")
    # extracting dates
    date <- txt[tail(dates[dates < init], n=1)] # first date after init
    date <- gsub("Bs\\. As\\., ", "", date) # remove Bs. As.
    # everything into a data frame
    df <- data.frame(date, names, dni, stringsAsFactors=F)
    arrests <- rbind(arrests, df)
}

If you look at the data frame, you’ll see it’s not completely perfect, but the rest we could edit them by hand, or tweak the function until it works.