library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
Dictionaries are named lists, consisting of a “key” and a set of
entries defining the equivalence class for the given key. To create a
simple dictionary of parts of speech, for instance, we could define a
dictionary consisting of articles and conjunctions, using the
dictionary() constructor:
posDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
Let’s create a DFM with the data_corpus_inaugural corpus (which comes
with quanteda) and apply the dictionary.
posDfm <- dfm_lookup(dfm(tokens(data_corpus_inaugural)),
                     dictionary = posDict)
head(posDfm)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs              articles conjunctions
##   1789-Washington      140           73
##   1793-Washington       14            4
##   1797-Adams           232          192
##   1801-Jefferson       154          109
##   1805-Jefferson       168          126
##   1809-Madison         128           63
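To make it easier to see what dfm_lookup() is counting, here is a minimal sketch on a single made-up sentence. The names toyDict, toyToks and toyDfm and the sentence itself are hypothetical and not part of the original example; the conjunction list is shortened for brevity.

```r
library(quanteda)

# Hypothetical toy dictionary and sentence, used only for illustration
toyDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or")))
toyToks <- tokens("The cat and a dog saw an owl, but the owl slept.")

# dfm() lowercases by default, so "The" is counted together with "the"
toyDfm <- dfm_lookup(dfm(toyToks), dictionary = toyDict)

# One row (the document), one column per dictionary key
as.matrix(toyDfm)
```

The four articles ("The", "a", "an", "the") and the two conjunctions ("and", "but") are collapsed into the two features articles and conjunctions; all other tokens are dropped.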
If we plot the counts of articles and conjunctions over time (across the speeches), we see a lot of variation. This is because the raw number of articles and conjunctions is largely a function of document length.
plot(x = docvars(data_corpus_inaugural, "Year"),
     y = posDfm[, "articles"],
     type = "p", pch = 16, col = "orange",
     ylim = range(posDfm),
     xlab = "Year", ylab = "Term frequency")
points(x = docvars(data_corpus_inaugural, "Year"),
       y = posDfm[, "conjunctions"],
       pch = 3, col = "blue")
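One way to check that raw counts track document length is to correlate them with the number of tokens per speech. The sketch below rebuilds posDfm so that it runs on its own; the names toks, docLength and posCounts are introduced here for illustration, and the token counts include punctuation because tokens() is called with its defaults.

```r
library(quanteda)

posDict <- dictionary(list(articles = c("the", "a", "an"),
                           conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
toks <- tokens(data_corpus_inaugural)
posDfm <- dfm_lookup(dfm(toks), dictionary = posDict)

# Total number of tokens in each speech
docLength <- ntoken(toks)

# Correlation of each raw dictionary count with document length
posCounts <- as.matrix(posDfm)
cor(docLength, posCounts[, "articles"])
cor(docLength, posCounts[, "conjunctions"])
```

Correlations close to 1 would indicate that most of the variation in the plot reflects how long each speech is rather than how it is written.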
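If we replicate the graph, but this time weight posDfm with
scheme = "prop" (which divides each count by its row total, so that
each value is the category's share of all dictionary matches in a
speech and no longer depends on the length of the speech), we still
find a similar pattern: usage of articles vs conjunctions is not
stable over time.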
During the 19th century usage is more or less stable, but after that the share of conjunctions relative to articles rises steadily. Since the 1990s both types of function words seem to be used about equally. This probably has to do with the fact that recent speeches have been shown on TV, with more frequent pauses for applause, so presidents probably prefer to speak in shorter, simpler sentences.
posDfmWeight <- dfm_weight(posDfm, scheme = "prop")
head(posDfmWeight)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs               articles conjunctions
##   1789-Washington 0.6572770    0.3427230
##   1793-Washington 0.7777778    0.2222222
##   1797-Adams      0.5471698    0.4528302
##   1801-Jefferson  0.5855513    0.4144487
##   1805-Jefferson  0.5714286    0.4285714
##   1809-Madison    0.6701571    0.3298429
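The proportions can be reproduced by hand, which makes clear what the weighting does here. The following base R sketch copies the raw counts for the first two speeches from the earlier head(posDfm) output and divides each row by its total; the names counts and props are used only for this illustration.

```r
# Raw counts for the first two speeches, copied from head(posDfm)
counts <- matrix(c(140, 73,
                    14,  4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1789-Washington", "1793-Washington"),
                                 c("articles", "conjunctions")))

# Divide each row by its total, so that every row sums to 1
props <- counts / rowSums(counts)
round(props, 7)
```

For example, 140 / (140 + 73) gives the 0.6572770 reported for articles in 1789-Washington. The denominator is the number of dictionary matches in the speech, not the total number of tokens.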
# base R plot
plot(x = docvars(data_corpus_inaugural, "Year"),
     y = posDfmWeight[, "articles"],
     type = "p", pch = 16, col = "orange",
     ylim = range(posDfmWeight),
     xlab = "Year", ylab = "Relative term frequency")
points(x = docvars(data_corpus_inaugural, "Year"),
       y = posDfmWeight[, "conjunctions"],
       pch = 3, col = "blue")
# Plot with easier to see trends
library(ggplot2)
library(reshape2)
pdw <- convert(posDfmWeight, to = "data.frame")
pdw$year <- as.numeric(substr(docnames(posDfmWeight), 1, 4))
pdw <- melt(pdw, id.vars = c("year", "doc_id"))
ggplot(pdw, aes(x = year, y = value, colour = variable)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(x = "Year", y = "Relative term frequency")
## `geom_smooth()` using formula 'y ~ x'
Dictionaries may also be hierarchical, where a top-level key can
consist of subordinate keys, each a list of its own. For instance,
list(articles = list(definite = "the", indefinite = c("a", "an")))
defines a valid list for articles.
Let’s explore this idea by creating a dictionary of articles and conjunctions with two levels, one for definite and indefinite articles, and one for coordinating and subordinating conjunctions.
posDictHier <- list(
  article = list(definite = "the", indefinite = c("a", "an")),
  conjunction = list(
    coordinating = c("and", "but", "or", "nor", "for", "yet", "so"),
    subordinating = c("although", "because", "since", "unless")
  )
)
Now let’s apply this to the data_corpus_inaugural object, and examine
the resulting features.
posDfmHier <- dfm_lookup(dfm(tokens(data_corpus_inaugural)),
                         dictionary = dictionary(posDictHier))
head(posDfmHier)
## Document-feature matrix of: 6 documents, 4 features (8.33% sparse) and 4 docvars.
##                  features
## docs              article.definite article.indefinite conjunction.coordinating
##   1789-Washington              116                 24                       73
##   1793-Washington               13                  1                        4
##   1797-Adams                   163                 69                      192
##   1801-Jefferson               130                 24                      109
##   1805-Jefferson               143                 25                      126
##   1809-Madison                 104                 24                       63
##                  features
## docs              conjunction.subordinating
##   1789-Washington                         4
##   1793-Washington                         0
##   1797-Adams                              1
##   1801-Jefferson                          0
##   1805-Jefferson                          3
##   1809-Madison                            2
What happened to the hierarchies, to make them into “features”? Each nested key is flattened into a single feature name, with the different levels joined by a dot (“.”).
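The naming rule is easy to inspect on a small case. This sketch uses a hypothetical two-level dictionary (toyDictHier) and a made-up sentence, and simply lists the feature names produced by the lookup.

```r
library(quanteda)

# Hypothetical toy dictionary with two levels, for illustration only
toyDictHier <- dictionary(list(
  article = list(definite = "the", indefinite = c("a", "an")),
  conjunction = list(coordinating = c("and", "but"),
                     subordinating = c("because", "unless"))
))
toyDfm <- dfm(tokens("The cat and a dog left because the owl hooted."))

# Each nested key becomes one feature named "<top-level key>.<sub-key>"
toyFeatures <- featnames(dfm_lookup(toyDfm, dictionary = toyDictHier))
toyFeatures
```

Every sub-key appears as a feature, including keys that match nothing in the text.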
Do the subcategories sum to the two general categories? Let’s double check…
posDfmHierAlt <- dfm_lookup(posDfmHier, dictionary = dictionary(list(
  article = c("article.definite", "article.indefinite"),
  conjunction = c("conjunction.coordinating", "conjunction.subordinating")
)))
head(posDfmHierAlt)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs              article conjunction
##   1789-Washington     140          77
##   1793-Washington      14           4
##   1797-Adams          232         193
##   1801-Jefferson      154         109
##   1805-Jefferson      168         129
##   1809-Madison        128          65
head(posDfm)
## Document-feature matrix of: 6 documents, 2 features (0.00% sparse) and 4 docvars.
##                  features
## docs              articles conjunctions
##   1789-Washington      140           73
##   1793-Washington       14            4
##   1797-Adams           232          192
##   1801-Jefferson       154          109
##   1805-Jefferson       168          126
##   1809-Madison         128           63
Note that they do for the article category but not for the
conjunctions. This is to be expected: the first version (posDfm)
included only coordinating conjunctions, whereas the hierarchical
dictionary also contains subordinating conjunctions.
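As an alternative to re-applying a second dictionary to the feature names, dfm_lookup() has a levels argument that restricts which levels of a hierarchical dictionary are applied. The sketch below is not part of the original walkthrough; it uses a hypothetical toy dictionary and sentence, and assumes that levels = 1 collapses all sub-keys into their top-level key, as the quanteda documentation describes.

```r
library(quanteda)

# Hypothetical toy dictionary with two levels, for illustration only
toyDictHier <- dictionary(list(
  article = list(definite = "the", indefinite = c("a", "an")),
  conjunction = list(coordinating = c("and", "but"),
                     subordinating = c("because", "unless"))
))
toyDfm <- dfm(tokens("The cat and a dog left because the owl hooted."))

# All levels: one feature per sub-key
fullDfm <- dfm_lookup(toyDfm, dictionary = toyDictHier)
# Top level only: sub-keys are merged into "article" and "conjunction"
topDfm <- dfm_lookup(toyDfm, dictionary = toyDictHier, levels = 1)

as.matrix(fullDfm)
as.matrix(topDfm)
```

In this toy case the top-level counts equal the sums of their sub-keys: three articles ("The", "a", "the") and two conjunctions ("and", "because").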