library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

Creating your own dictionary

Dictionaries are named lists, each consisting of a “key” and a set of entries that define the equivalence class for that key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary of articles and conjunctions using the dictionary() constructor:

posDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
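
Before applying the dictionary to a full corpus, it can help to try it on a single sentence. The following is a minimal sketch: the sentence and the object names toyToks, toyDfm and toyCounts are made up for this illustration and are not part of the original walkthrough.

```r
library(quanteda)

# Same dictionary as defined above
posDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))

# A made-up sentence (not taken from the inaugural corpus)
toyToks <- tokens("The cat and a dog sat on an old mat, but not for long.")

# dfm() lowercases by default, so "The" matches the entry "the"
toyDfm <- dfm_lookup(dfm(toyToks), dictionary = posDict)

# Convert to a plain matrix to inspect the counts per key
toyCounts <- as.matrix(toyDfm)
toyCounts
```

Note that "for" in this sentence is a preposition, yet it is still counted under conjunctions: a dictionary matches word forms, not grammatical function.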

Let’s create a DFM with the data_corpus_inaugural corpus (which comes with quanteda) and apply the dictionary.

posDfm <- dfm_lookup(dfm(tokens(data_corpus_inaugural)),
                     dictionary = posDict)
head(posDfm)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs              articles conjunctions
##   1789-Washington      140           73
##   1793-Washington       14            4
##   1797-Adams           232          192
##   1801-Jefferson       154          109
##   1805-Jefferson       168          126
##   1809-Madison         128           63

If we plot the counts of articles and conjunctions over time (across the speeches), we see a lot of variation. Much of it arises because the raw number of articles and conjunctions is a function of document length.

plot(x = docvars(data_corpus_inaugural, "Year"), 
     y = posDfm[, "articles"],
     type = "p", pch = 16, col = "orange",
     ylim = range(posDfm),
     xlab = "Year", ylab = "Term frequency")
# points() always adds to the current plot, so no extra argument is needed
points(x = docvars(data_corpus_inaugural, "Year"), 
       y = posDfm[, "conjunctions"],
       pch = 3, col = "blue")
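
One way to check the claim that raw counts track document length is to correlate the counts with the number of tokens per speech. This is only a sketch; the names toks, docLength, posCounts, corArticles and corConjunctions are introduced here for illustration.

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural)
posDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
posDfm <- dfm_lookup(dfm(toks), dictionary = posDict)

# Tokens per speech (punctuation included, because tokens() keeps it by default)
docLength <- ntoken(toks)

# Correlation between speech length and the raw dictionary counts
posCounts <- as.matrix(posDfm)
corArticles <- cor(docLength, posCounts[, "articles"])
corConjunctions <- cor(docLength, posCounts[, "conjunctions"])
c(articles = corArticles, conjunctions = corConjunctions)
```

Correlations close to 1 would indicate that most of the variation in the raw counts simply reflects how long each speech is.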

If we replicate the graph, this time weighting posDfm so that each count becomes a proportion within its speech (which removes the effect of speech length), we still find that the balance between articles and conjunctions is not stable over time.

During the 19th century the balance is more or less stable, but after that the share of conjunctions relative to articles rises steadily. Since the 1990s both types of function words seem to be used about equally often. One possible explanation is that recent speeches have been shown on TV, with more frequent pauses for applause, so presidents may prefer to speak in shorter, simpler sentences.

posDfmWeight <- dfm_weight(posDfm, scheme = "prop")
head(posDfmWeight)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs               articles conjunctions
##   1789-Washington 0.6572770    0.3427230
##   1793-Washington 0.7777778    0.2222222
##   1797-Adams      0.5471698    0.4528302
##   1801-Jefferson  0.5855513    0.4144487
##   1805-Jefferson  0.5714286    0.4285714
##   1809-Madison    0.6701571    0.3298429
# base R plot
plot(x = docvars(data_corpus_inaugural, "Year"), 
     y = posDfmWeight[, "articles"],
     type = "p", pch = 16, col = "orange",
     ylim = range(posDfmWeight),
     xlab = "Year", ylab = "Relative term frequency")
# points() always adds to the current plot, so no extra argument is needed
points(x = docvars(data_corpus_inaugural, "Year"), 
       y = posDfmWeight[, "conjunctions"],
       pch = 3, col = "blue")
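
The "prop" weighting used for posDfmWeight can be reproduced by hand, which makes clear what the proportions are relative to. The sketch below uses base R only and copies the counts for the first two speeches from the head(posDfm) output shown earlier; the names counts and props are introduced here for illustration.

```r
# Raw counts for the first two speeches, copied from head(posDfm)
counts <- matrix(c(140, 73,
                    14,  4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("1789-Washington", "1793-Washington"),
                                 features = c("articles", "conjunctions")))

# Divide each row by its total, so the columns sum to 1 within each document
props <- counts / rowSums(counts)
round(props, 7)
```

Because posDfm contains only the two dictionary keys, each proportion is a share of the dictionary matches in a speech (for example 140 / (140 + 73) for articles in 1789), not a share of all tokens in that speech.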

# Plot with easier to see trends
library(ggplot2)
library(reshape2)
pdw <- convert(posDfmWeight, to = "data.frame")
pdw$year <- as.numeric(substr(docnames(posDfmWeight), 1, 4))
pdw <- melt(pdw, id.vars = c("year", "doc_id"))
ggplot(pdw, aes(x = year, y = value, colour = variable)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(x = "Year", y = "Relative term frequency")
## `geom_smooth()` using formula 'y ~ x'

Hierarchical dictionaries

Dictionaries may also be hierarchical, where a top-level key can consist of subordinate keys, each a list of its own. For instance, list(articles = list(definite = "the", indefinite = c("a", "an"))) defines a valid list for articles.

Let’s explore this idea by creating a dictionary of articles and conjunctions with two levels, one for definite and indefinite articles, and one for coordinating and subordinating conjunctions.

posDictHier <- list(
  article = list(definite = "the", indefinite = c("a", "an")),
  conjunction = list(
    coordinating = c("and", "but", "or", "nor", "for", "yet", "so"),
    subordinating = c("although", "because", "since", "unless")
  )
)

Now let’s apply this to the data_corpus_inaugural object, and examine the resulting features.

posDfmHier <- dfm_lookup(dfm(tokens(data_corpus_inaugural)),
                         dictionary = dictionary(posDictHier))
head(posDfmHier)
## Document-feature matrix of: 6 documents, 4 features (8.33% sparse) and 4 docvars.
##                  features
## docs              article.definite article.indefinite conjunction.coordinating
##   1789-Washington              116                 24                       73
##   1793-Washington               13                  1                        4
##   1797-Adams                   163                 69                      192
##   1801-Jefferson               130                 24                      109
##   1805-Jefferson               143                 25                      126
##   1809-Madison                 104                 24                       63
##                  features
## docs              conjunction.subordinating
##   1789-Washington                         4
##   1793-Washington                         0
##   1797-Adams                              1
##   1801-Jefferson                          0
##   1805-Jefferson                          3
##   1809-Madison                            2

What happened to the hierarchy when the keys were turned into “features”? The names of the different levels are joined by a dot (“.”), so each lowest-level key becomes one feature.
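
The naming scheme can be inspected directly with featnames(). The sketch below uses a made-up sentence purely to produce a small dfm; toyDfmHier and hierNames are illustrative names, not objects from the walkthrough.

```r
library(quanteda)

posDictHier <- dictionary(list(
  article = list(definite = "the", indefinite = c("a", "an")),
  conjunction = list(
    coordinating = c("and", "but", "or", "nor", "for", "yet", "so"),
    subordinating = c("although", "because", "since", "unless")
  )
))

# A made-up sentence, used only to build a small dfm
toyDfmHier <- dfm_lookup(dfm(tokens("The cat and a dog left because it rained.")),
                         dictionary = posDictHier)

# Each feature name is the path of keys joined by "."
hierNames <- featnames(toyDfmHier)
hierNames

# Split on the literal dot to recover the two levels of each feature
strsplit(hierNames, ".", fixed = TRUE)
```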

Do the subcategories sum to the two general categories? Let’s double check…

posDfmHierAlt <- dfm_lookup(posDfmHier, dictionary = dictionary(list(
  article = c("article.definite", "article.indefinite"), 
  conjunction = c("conjunction.coordinating", "conjunction.subordinating")
)))
head(posDfmHierAlt)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs              article conjunction
##   1789-Washington     140          77
##   1793-Washington      14           4
##   1797-Adams          232         193
##   1801-Jefferson      154         109
##   1805-Jefferson      168         129
##   1809-Madison        128          65
head(posDfm)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs              articles conjunctions
##   1789-Washington      140           73
##   1793-Washington       14            4
##   1797-Adams           232          192
##   1801-Jefferson       154          109
##   1805-Jefferson       168          126
##   1809-Madison         128           63

Note that they do for the article category but not for the conjunctions. This is to be expected: the first version (posDfm) included only coordinating conjunctions, whereas the hierarchical dictionary also contains subordinating conjunctions.
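
An alternative to the second lookup is to collapse the hierarchy in a single step with the levels argument of dfm_lookup(). The following is a sketch under the assumption that levels = 1 behaves as described in the quanteda documentation, that is, it keeps only the top-level keys and pools their subordinate entries. The sentence and the names toyDfm, toyTop and toyAll are made up for this illustration.

```r
library(quanteda)

posDictHier <- dictionary(list(
  article = list(definite = "the", indefinite = c("a", "an")),
  conjunction = list(
    coordinating = c("and", "but", "or", "nor", "for", "yet", "so"),
    subordinating = c("although", "because", "since", "unless")
  )
))

# A made-up sentence, used only to build a small dfm
toyDfm <- dfm(tokens("The cat and a dog left because it rained."))

# Assumption: levels = 1 keeps only the top-level keys
toyTop <- dfm_lookup(toyDfm, dictionary = posDictHier, levels = 1)

# Default lookup keeps every level of the hierarchy
toyAll <- dfm_lookup(toyDfm, dictionary = posDictHier)

topCounts <- as.matrix(toyTop)
allCounts <- as.matrix(toyAll)
topCounts
allCounts
```

If the assumption holds, the top-level totals equal the sums of the corresponding subcategory columns, which is the same check performed with posDfmHierAlt above.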