As we discussed earlier, text needs to go through several “preprocessing” steps before it can be passed to a statistical model for any type of automated analysis. We’ll use the quanteda package here.

You can install the packages we will use in this script with the code below:
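
A minimal sketch of such an install step, assuming quanteda v3 or later, where textplot_wordcloud lives in the separate quanteda.textplots package (on older versions quanteda alone should be enough). The package list is an assumption based on the functions used in this script.

```r
# Assumed package list: quanteda for corpus/tokens/dfm,
# quanteda.textplots for textplot_wordcloud (quanteda v3 or later)
pkgs <- c("quanteda", "quanteda.textplots")

# Install only the packages that are not already available
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)
```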


Pre-processing steps

1. Corpus objects

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

tweets <- read.csv("../data/trump-tweets.csv",
                   stringsAsFactors = FALSE)
twcorpus <- corpus(tweets)
summary(twcorpus, n=10)
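
To see what a corpus holds without the tweets file, here is a small sketch on a made-up data frame. The object names toy and toy_corpus and the device column are hypothetical. It assumes the text sits in a column called text, which is named explicitly through text_field.

```r
library(quanteda)

# Hypothetical toy data standing in for the tweets file
toy <- data.frame(
  text = c("Republicans and Democrats have both created our economic problems.",
           "Immigration reform is a great idea!"),
  device = c("web", "phone"),
  stringsAsFactors = FALSE
)

# The text_field column becomes the documents; every other column
# is stored as a document-level variable (docvar)
toy_corpus <- corpus(toy, text_field = "text")

ndoc(toy_corpus)      # number of documents
docvars(toy_corpus)   # metadata attached to each document
summary(toy_corpus)   # types, tokens and sentences per document
```

The docvars travel with the documents through later steps, so they remain available after tokenization and in the dfm.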

2. Tokenization

Once we have a corpus, we can convert it to tokens using the tokens function.

toks <- tokens(twcorpus)
##  [1] "Republicans" "and"         "Democrats"   "have"        "both"       
##  [6] "created"     "our"         "economic"    "problems"    "."
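
The same step can be tried on a single sentence. This sketch uses a hypothetical named character vector instead of the tweet corpus; tokens accepts either.

```r
library(quanteda)

# One document, named d1 (the name is made up for illustration)
toy_toks <- tokens(c(d1 = "Republicans and Democrats have both created our economic problems."))

# A tokens object behaves like a named list of character vectors
as.list(toy_toks)$d1

# Number of tokens per document; punctuation counts as a token by default
ntoken(toy_toks)
```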

tokens has many useful options (check out ?tokens for more information). Let’s actually use it to remove punctuation, keep Twitter features…

toks <- tokens(twcorpus,
               remove_punct = TRUE)
## [1] "Republicans" "and"         "Democrats"   "have"        "both"       
## [6] "created"     "our"         "economic"    "problems"
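
As a rough illustration of how these options change the result, the sketch below tokenizes one made-up sentence three ways. The sentence and object names are hypothetical; remove_punct and remove_numbers are arguments of tokens.

```r
library(quanteda)

txt <- c(d1 = "Wow, 3 great rallies today!")

# Default: punctuation and numbers are kept as tokens
all_toks <- tokens(txt)

# Drop punctuation only
no_punct <- tokens(txt, remove_punct = TRUE)

# Drop punctuation and numbers
no_punct_num <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)

# Compare the token counts of the three versions
c(ntoken(all_toks), ntoken(no_punct), ntoken(no_punct_num))
```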

By default, tokens will keep just individual words, but we can use tokens_ngrams to create n-grams: contiguous sequences of one, two, three, etc. words (e.g. with n=1:2 it will treat “human”, “rights”, and “human rights” as tokens).

toks_ngrams <- tokens_ngrams(toks, n=1:2)
##  [1] "Republicans"       "and"               "Democrats"        
##  [4] "have"              "both"              "created"          
##  [7] "our"               "economic"          "problems"         
## [10] "Republicans_and"   "and_Democrats"     "Democrats_have"   
## [13] "have_both"         "both_created"      "created_our"      
## [16] "our_economic"      "economic_problems"
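
The “human rights” case can be reproduced directly. This is a small sketch on a made-up three-word text; note that quanteda joins the words of an n-gram with an underscore by default.

```r
library(quanteda)

toy_toks <- tokens("human rights matter")

# n = 1:2 keeps the unigrams and adds every pair of adjacent words
toy_ngrams <- tokens_ngrams(toy_toks, n = 1:2)
as.list(toy_ngrams)[[1]]
```

Only adjacent words are combined, so “human_matter” is never produced.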

Another text pre-processing technique we can apply to the tokens object is stemming. In quanteda, stemming relies on the SnowballC package’s implementation of the Porter stemmer:

toks_stems <- tokens_wordstem(toks)
## [1] "Republican" "and"        "Democrat"   "have"       "both"      
## [6] "creat"      "our"        "econom"     "problem"
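
A short sketch of the same call on a hypothetical four-word text, using the default stemmer language:

```r
library(quanteda)

toy_toks <- tokens("Republicans created economic problems")

# Stemming strips suffixes so that related word forms collapse to one
# feature; the stems are not necessarily dictionary words
toy_stems <- tokens_wordstem(toy_toks)
as.list(toy_stems)[[1]]
```

Stemming does not change case, so “Republicans” becomes “Republican” rather than “republican”.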

A very useful feature of tokens objects is keywords in context, which returns all the appearances of a word (or combination of words) in its immediate context.

kwic(toks, "immigration", window=5)[1:5,]
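
To make the output format concrete, here is a sketch on two made-up documents. The texts and object names are hypothetical. Wrapping a pattern in phrase is how kwic can be asked for a multi-word expression.

```r
library(quanteda)

toy_toks <- tokens(c(d1 = "We need immigration reform now",
                     d2 = "Legal immigration helps the economy"))

# Single keyword with two tokens of context on each side
kw <- as.data.frame(kwic(toy_toks, pattern = "immigration", window = 2))
kw[, c("docname", "pre", "keyword", "post")]

# Multi-word pattern: phrase() matches the words as a sequence
kw_phrase <- as.data.frame(
  kwic(toy_toks, pattern = phrase("immigration reform"), window = 2)
)
kw_phrase[, c("docname", "pre", "keyword", "post")]
```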

3. Creating the document-feature matrix

Finally, we can convert a tokens object into a document-feature matrix using the dfm function.

twdfm <- dfm(toks, verbose=TRUE)
## Document-feature matrix of: 56,571 documents, 50,086 features (99.96% sparse) and 8 docvars.
##        features
## docs    republicans and democrats have both created our economic problems i
##   text1           1   1         1    1    1       1   1        1        1 0
##   text2           0   1         0    0    0       0   3        0        0 1
##   text3           0   1         0    0    0       0   0        0        0 0
##   text4           0   0         1    1    0       0   1        0        0 0
##   text5           0   1         0    0    0       0   0        0        0 0
##   text6           0   0         0    0    0       0   0        0        0 0
## [ reached max_ndoc ... 56,565 more documents, reached max_nfeat ... 50,076 more features ]

The dfm will show the count of times each word appears in each document (tweet):

twdfm[1:5, 1:10]
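
On a toy example the counts are easy to verify by eye. This sketch uses two hypothetical documents and converts the sparse dfm to an ordinary matrix for display, which is only sensible for small inputs.

```r
library(quanteda)

toy_toks <- tokens(c(d1 = "Our economy our jobs", d2 = "great jobs"))

# dfm() lowercases by default, so "Our" and "our" are counted together
toy_dfm <- dfm(toy_toks)

# Rows are documents, columns are features, cells are counts
as.matrix(toy_dfm)
```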

In a large corpus like this, many features appear in only one or two documents. In some cases it’s a good idea to remove those features, either to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
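
The effect of min_docfreq can be checked on a small made-up dfm. In this sketch (documents and names are hypothetical) only features appearing in at least two documents survive.

```r
library(quanteda)

toy_dfm <- dfm(tokens(c(d1 = "tax cuts now",
                        d2 = "tax reform",
                        d3 = "tax cuts")))

# Number of documents each feature appears in
docfreq(toy_dfm)

# Keep features found in two or more documents
toy_trimmed <- dfm_trim(toy_dfm, min_docfreq = 2)
featnames(toy_trimmed)
```

min_docfreq counts documents, not total occurrences; dfm_trim also has a min_termfreq argument for thresholds on the total count.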

It’s often also desirable to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rotation=0,
                   min_size=2, max_size=5,
                   max_words=50)

What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as http here. Words that carry little substantive meaning are called stopwords; these are typically common connectors in a given language (e.g. “a”, “the”, “is”). We can also see the most frequent features using topfeatures:

topfeatures(twdfm, 25)
##              the               to              and                a 
##            45989            26284            21105            19123 
##               of               is               in              for 
##            18010            16193            15809            12835 
##              you @realdonaldtrump                i               rt 
##            11565            11107            10955            10158 
##               on             will               be            great 
##            10132             8303             8194             7649 
##             that              are               it             with 
##             7558             7416             7092             6558 
##               we            trump              our             have 
##             6354             6339             6080             5865 
##              amp 
##             5682

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(toks, remove=c(
  stopwords("english"), "", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
textplot_wordcloud(twdfm, rotation=0, min_size=2, max_size=5, max_words=50)
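
A hedged alternative, assuming quanteda v3 or later, where passing remove to dfm is deprecated: drop the unwanted tokens with tokens_remove before building the matrix (dfm_remove does the same on an existing dfm). The sentence and the custom_stop vector below are made up for illustration.

```r
library(quanteda)

toy_toks <- tokens("RT the economy is great and the jobs are great",
                   remove_punct = TRUE)

# Standard English stopwords plus a corpus-specific term
custom_stop <- c(stopwords("english"), "rt")

# Pattern matching is case-insensitive by default, so "RT" is removed too
toy_clean <- tokens_remove(toy_toks, pattern = custom_stop)
toy_dfm <- dfm(toy_clean)

featnames(toy_dfm)
topfeatures(toy_dfm, 1)
```

Removing stopwords at the tokens stage also keeps them out of anything else built from the same tokens, such as n-grams.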
