As we discussed earlier, before we can run any kind of automated text
analysis, we need to go through several “preprocessing” steps to prepare
the text before it can be passed to a statistical model. We’ll use the
quanteda package here.
You can install the packages we will use in this script with the code below:
install.packages("quanteda")
install.packages("quanteda.textplots")
The basic unit of work for the quanteda package is called a corpus,
which represents a collection of text documents with some associated
metadata. Documents are the subunits of a corpus. You can use summary
to get some information about your corpus.
library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tweets <- read.csv("../data/trump-tweets.csv",
stringsAsFactors = FALSE)
twcorpus <- corpus(tweets)
summary(twcorpus, n=10)
## Corpus consisting of 56571 documents, showing 10 documents:
##
## Text Types Tokens Sentences id isRetweet isDeleted
## text1 10 10 1 9.845497e+16 f f
## text2 42 50 3 1.234653e+18 f f
## text3 23 24 1 1.218011e+18 t f
## text4 49 61 3 1.304875e+18 f f
## text5 25 25 2 1.218160e+18 t f
## text6 21 24 2 1.217963e+18 t f
## text7 8 8 2 1.223641e+18 f f
## text8 1 1 1 1.319502e+18 f f
## text9 1 1 1 1.319501e+18 f f
## text10 1 1 1 1.319501e+18 f f
## device favorites retweets date isFlagged
## TweetDeck 49 255 2011-08-02 18:07:48 f
## Twitter for iPhone 73748 17404 2020-03-03 01:34:50 f
## Twitter for iPhone 0 7396 2020-01-17 03:22:47 f
## Twitter for iPhone 80527 23502 2020-09-12 20:10:58 f
## Twitter for iPhone 0 9081 2020-01-17 13:13:59 f
## Twitter for iPhone 0 25048 2020-01-17 00:11:56 f
## Twitter for iPhone 285863 30209 2020-02-01 16:14:02 f
## Twitter for iPhone 130822 19127 2020-10-23 04:52:14 f
## Twitter for iPhone 153446 20275 2020-10-23 04:46:53 f
## Twitter for iPhone 102150 14815 2020-10-23 04:46:49 f
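The columns shown after Sentences in the summary are document variables (docvars) that quanteda carried over from the CSV. As a small sketch (not part of the original script), they can be inspected directly with docvars:
head(docvars(twcorpus))            # all document-level metadata
head(docvars(twcorpus, "device"))  # a single variable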
Once we have a corpus, we can convert it to tokens using the tokens function.
toks <- tokens(twcorpus)
toks[[1]]
## [1] "Republicans" "and" "Democrats" "have" "both"
## [6] "created" "our" "economic" "problems" "."
tokens has many useful options (check out ?tokens for more information).
Let’s use it to remove punctuation and URLs while keeping Twitter
features such as hashtags and @-mentions…
toks <- tokens(twcorpus,
what="word",
remove_punct = TRUE,
remove_url=TRUE,
verbose=TRUE)
## Creating a tokens object from a corpus input...
## ...starting tokenization
## ...text1 to text10000
## ...preserving hyphens
## ...preserving social media tags (#, @)
## ...segmenting into words
## ...text10001 to text20000
## ...preserving hyphens
## ...preserving social media tags (#, @)
## ...segmenting into words
## ...text20001 to text30000
## ...preserving hyphens
## ...preserving social media tags (#, @)
## ...segmenting into words
## ...text30001 to text40000
## ...preserving hyphens
## ...preserving social media tags (#, @)
## ...segmenting into words
## ...text40001 to text50000
## ...preserving hyphens
## ...preserving social media tags (#, @)
## ...segmenting into words
## ...text50001 to text56571
## ...preserving hyphens
## ...preserving social media tags (#, @)
## ...segmenting into words
## ...76,293 unique types
## ...removing separators, punctuation, URLs
## ...complete, elapsed time: 3.79 seconds.
## Finished constructing tokens from 56,571 documents.
toks[[1]]
## [1] "Republicans" "and" "Democrats" "have" "both"
## [6] "created" "our" "economic" "problems"
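Because the tokenizer preserved social media tags, we could, for instance, keep only the hashtags with tokens_select. This is just an illustrative sketch, not part of the original script:
hashtags <- tokens_select(toks, pattern = "#*")   # glob pattern: tokens starting with #
topfeatures(dfm(hashtags), 10)                    # most frequent hashtags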
By default, tokens keeps each word as a separate token, but we can, for
example, use tokens_ngrams to create n-grams – all combinations of one,
two, three, etc. consecutive words (e.g. it will consider “human”,
“rights”, and “human rights” all as tokens).
toks_ngrams <- tokens_ngrams(toks, n=1:2)
toks_ngrams[[1]]
## [1] "Republicans" "and" "Democrats"
## [4] "have" "both" "created"
## [7] "our" "economic" "problems"
## [10] "Republicans_and" "and_Democrats" "Democrats_have"
## [13] "have_both" "both_created" "created_our"
## [16] "our_economic" "economic_problems"
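If instead we only care about a handful of specific multi-word expressions, we can fuse them into single tokens with tokens_compound. A minimal sketch, using “human rights” as an illustrative phrase:
toks_comp <- tokens_compound(toks, pattern = phrase("human rights"))
# "human" followed by "rights" now becomes the single token "human_rights"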
Another text pre-processing technique we can apply to the tokens object
is stemming. In quanteda, stemming relies on the SnowballC package’s
implementation of the Porter stemmer:
toks_stems <- tokens_wordstem(toks)
toks_stems[[1]]
## [1] "Republican" "and" "Democrat" "have" "both"
## [6] "creat" "our" "econom" "problem"
A very useful feature of tokens objects is keywords in context, which returns all the appearances of a word (or combination of words) in its immediate context.
kwic(toks, "immigration", window=5)[1:5,]
## Keyword-in-context with 5 matches.
## [text1172, 16] U.S History My opponent's insane | immigration |
## [text2944, 16] Are the Strongest Opponents of | Immigration |
## [text3052, 32] make excuses for their failed | immigration |
## [text3063, 34] Executive Order to temporarily suspend | immigration |
## [text4461, 28] on the Border amp Illegal | Immigration |
##
## plan completely eliminates U.S borders
##
## policies I wonder what O
## into the United States
## He loves our Military amp
kwic(toks, "healthcare", window=5)[1:5,]
## Keyword-in-context with 5 matches.
## [text470, 31] Second Amendment and Deliver Great | Healthcare |
## [text847, 37] very well be bigger than | healthcare |
## [text1054, 18] about his plan to keep | healthcare |
## [text1057, 27] Nominee Well she didn't support | Healthcare |
## [text1778, 19] underlying conditions as well as | healthcare |
##
## Eric has my Complete and
## itself Congratulations America
## affordable protect
## or my opening up 5000
## workers and
kwic(toks, "clinton", window=5)[1:5,]
## Keyword-in-context with 5 matches.
## [text1542, 6] Obama worked harder for Hillary | Clinton |
## [text1542, 10] Hillary Clinton and the losing | Clinton |
## [text3245, 27] political campaign by Crooked Hillary | Clinton |
## [text3314, 9] Biden's town hall with Hillary | Clinton |
## [text3662, 13] but refusing to touch the | Clinton |
##
## and the losing Clinton Campaign
## Campaign than she worked for
## while Hillary was under FBI
## got off to a fantastic
## Foundation
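kwic also accepts multi-word patterns when they are wrapped in phrase(). The query below is only illustrative:
head(kwic(toks, phrase("fake news"), window = 5))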
We can then convert a tokens object into a document-feature matrix
using the dfm function.
twdfm <- dfm(toks, verbose=TRUE)
## Creating a dfm from a tokens input...
## ...lowercasing
## ...found 56,571 documents, 50,086 features
## ...complete, elapsed time: 0.648 seconds.
## Finished constructing a 56,571 x 50,086 sparse dfm.
twdfm
## Document-feature matrix of: 56,571 documents, 50,086 features (99.96% sparse) and 8 docvars.
## features
## docs republicans and democrats have both created our economic problems i
## text1 1 1 1 1 1 1 1 1 1 0
## text2 0 1 0 0 0 0 3 0 0 1
## text3 0 1 0 0 0 0 0 0 0 0
## text4 0 0 1 1 0 0 1 0 0 0
## text5 0 1 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 56,565 more documents, reached max_nfeat ... 50,076 more features ]
The dfm shows the number of times each word appears in each document (tweet):
twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (66.00% sparse) and 8 docvars.
## features
## docs republicans and democrats have both created our economic problems i
## text1 1 1 1 1 1 1 1 1 1 0
## text2 0 1 0 0 0 0 3 0 0 1
## text3 0 1 0 0 0 0 0 0 0 0
## text4 0 0 1 1 0 0 1 0 0 0
## text5 0 1 0 0 0 0 0 0 0 0
In a large corpus like this, many features appear in only one or two
documents. In some cases it’s a good idea to remove those features, to
speed up the analysis or because they’re not relevant. We can trim the dfm:
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
## - in fewer than 3 documents: 35,060
## Total features removed: 35,060 (70.0%).
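dfm_trim can also filter by overall term frequency, or use proportions instead of raw counts. The thresholds below are only illustrative:
twdfm_small <- dfm_trim(twdfm, min_termfreq = 5, max_docfreq = 0.90,
                        docfreq_type = "prop", verbose = TRUE)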
It’s often also desirable to take a look at a wordcloud of the most frequent features to see if there’s anything weird.
library(quanteda.textplots)
textplot_wordcloud(twdfm, rotation=0,
min_size=2, max_size=5,
max_words=50)
What is going on? We probably want to remove words and symbols which are
not of interest to our analysis, such as http here, as well as stopwords:
very common words in a given language that mostly act as connectors
(e.g. “a”, “the”, “is”) and carry little meaning on their own. We can
inspect the most frequent features with topfeatures:
topfeatures(twdfm, 25)
## the to and a
## 45989 26284 21105 19123
## of is in for
## 18010 16193 15809 12835
## you @realdonaldtrump i rt
## 11565 11107 10955 10158
## on will be great
## 10132 8303 8194 7649
## that are it with
## 7558 7416 7092 6558
## we trump our have
## 6354 6339 6080 5865
## amp
## 5682
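Many of these top features are exactly the stopwords we want to drop. quanteda ships standard stopword lists through the stopwords() function; for example:
head(stopwords("english"), 10)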
We can remove the stopwords when we create the dfm object:
twdfm <- dfm(toks, remove=c(stopwords("english"), "t.co", "https", "rt",
                            "amp", "http", "t.c", "can", "u"), verbose=TRUE)
## Creating a dfm from a tokens input...
## ...lowercasing
## ...found 56,571 documents, 50,086 features
## Warning: 'remove' is deprecated; use dfm_remove() instead
## ...
## removed 180 features
## ...complete, elapsed time: 0.822 seconds.
## Finished constructing a 56,571 x 49,906 sparse dfm.
textplot_wordcloud(twdfm, rotation=0, min_size=2, max_size=5, max_words=50)
## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## @realdonaldtrump could not be fit on page. It will not be plotted.
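Note that, as the warning above indicates, the remove argument of dfm is deprecated in recent quanteda versions. A sketch of the equivalent, non-deprecated workflow is to filter at the tokens stage (or to call dfm_remove on an existing dfm):
droplist <- c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u")
toks_clean <- tokens_remove(toks, pattern = droplist)
twdfm <- dfm(toks_clean, verbose = TRUE)
# equivalently, on an existing dfm: twdfm <- dfm_remove(twdfm, pattern = droplist)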