As we discussed earlier, before we can run any kind of automated text analysis, we need to apply several “preprocessing” steps to the raw text so that it can be passed to a statistical model. We’ll use the quanteda package here.

You can install the packages we will use in this script with the code below:

install.packages("quanteda")
install.packages("quanteda.textplots")

Pre-processing steps

1. Corpus objects

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tweets <- read.csv("../data/trump-tweets.csv", 
                      stringsAsFactors = FALSE)
twcorpus <- corpus(tweets)
summary(twcorpus, n=10)
## Corpus consisting of 56571 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences           id isRetweet isDeleted
##   text1    10     10         1 9.845497e+16         f         f
##   text2    42     50         3 1.234653e+18         f         f
##   text3    23     24         1 1.218011e+18         t         f
##   text4    49     61         3 1.304875e+18         f         f
##   text5    25     25         2 1.218160e+18         t         f
##   text6    21     24         2 1.217963e+18         t         f
##   text7     8      8         2 1.223641e+18         f         f
##   text8     1      1         1 1.319502e+18         f         f
##   text9     1      1         1 1.319501e+18         f         f
##  text10     1      1         1 1.319501e+18         f         f
##              device favorites retweets                date isFlagged
##           TweetDeck        49      255 2011-08-02 18:07:48         f
##  Twitter for iPhone     73748    17404 2020-03-03 01:34:50         f
##  Twitter for iPhone         0     7396 2020-01-17 03:22:47         f
##  Twitter for iPhone     80527    23502 2020-09-12 20:10:58         f
##  Twitter for iPhone         0     9081 2020-01-17 13:13:59         f
##  Twitter for iPhone         0    25048 2020-01-17 00:11:56         f
##  Twitter for iPhone    285863    30209 2020-02-01 16:14:02         f
##  Twitter for iPhone    130822    19127 2020-10-23 04:52:14         f
##  Twitter for iPhone    153446    20275 2020-10-23 04:46:53         f
##  Twitter for iPhone    102150    14815 2020-10-23 04:46:49         f
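
The metadata columns from the CSV (id, device, retweets, date, etc.) are stored as document variables. As a quick sketch, they can be accessed with docvars:

# all document variables as a data frame
head(docvars(twcorpus))
# or a single variable by name
head(docvars(twcorpus, "retweets"))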

2. Tokenization

Once we have a corpus, we can convert it to tokens using the tokens function.

toks <- tokens(twcorpus)
toks[[1]]
##  [1] "Republicans" "and"         "Democrats"   "have"        "both"       
##  [6] "created"     "our"         "economic"    "problems"    "."

tokens has many useful options (check out ?tokens for more information). Let’s use it to remove punctuation and URLs, while keeping Twitter features such as hashtags and @-mentions:

toks <- tokens(twcorpus,
               what="word",
               remove_punct = TRUE,
               remove_url=TRUE,
               verbose=TRUE)
## Creating a tokens object from a corpus input...
##  ...starting tokenization
##  ...text1 to text10000
##  ...preserving hyphens
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...text10001 to text20000
##  ...preserving hyphens
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...text20001 to text30000
##  ...preserving hyphens
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...text30001 to text40000
##  ...preserving hyphens
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...text40001 to text50000
##  ...preserving hyphens
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...text50001 to text56571
##  ...preserving hyphens
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...76,293 unique types
##  ...removing separators, punctuation, URLs
##  ...complete, elapsed time: 3.79 seconds.
## Finished constructing tokens from 56,571 documents.
toks[[1]]
## [1] "Republicans" "and"         "Democrats"   "have"        "both"       
## [6] "created"     "our"         "economic"    "problems"

By default, tokens keeps individual words, but we can use tokens_ngrams to create n-grams: all sequences of one, two, three, etc. adjacent words (e.g. with n=1:2 it will treat “human”, “rights”, and “human rights” all as tokens).

toks_ngrams <- tokens_ngrams(toks, n=1:2)
toks_ngrams[[1]]
##  [1] "Republicans"       "and"               "Democrats"        
##  [4] "have"              "both"              "created"          
##  [7] "our"               "economic"          "problems"         
## [10] "Republicans_and"   "and_Democrats"     "Democrats_have"   
## [13] "have_both"         "both_created"      "created_our"      
## [16] "our_economic"      "economic_problems"

Another text pre-processing technique we can apply to the tokens object is stemming. In quanteda, stemming relies on the SnowballC package’s implementation of the Porter stemmer:

toks_stems <- tokens_wordstem(toks)
toks_stems[[1]]
## [1] "Republican" "and"        "Democrat"   "have"       "both"      
## [6] "creat"      "our"        "econom"     "problem"

A very useful feature of tokens objects is keyword-in-context (kwic), which returns every appearance of a word (or combination of words) along with its immediate context.

kwic(toks, "immigration", window=5)[1:5,]
## Keyword-in-context with 5 matches.                                                                      
##  [text1172, 16]       U.S History My opponent's insane | immigration |
##  [text2944, 16]         Are the Strongest Opponents of | Immigration |
##  [text3052, 32]          make excuses for their failed | immigration |
##  [text3063, 34] Executive Order to temporarily suspend | immigration |
##  [text4461, 28]              on the Border amp Illegal | Immigration |
##                                        
##  plan completely eliminates U.S borders
##                                        
##  policies I wonder what O              
##  into the United States                
##  He loves our Military amp
kwic(toks, "healthcare", window=5)[1:5,]
## Keyword-in-context with 5 matches.                                                                 
##   [text470, 31] Second Amendment and Deliver Great | Healthcare |
##   [text847, 37]           very well be bigger than | healthcare |
##  [text1054, 18]             about his plan to keep | healthcare |
##  [text1057, 27]    Nominee Well she didn't support | Healthcare |
##  [text1778, 19]   underlying conditions as well as | healthcare |
##                                
##  Eric has my Complete and      
##  itself Congratulations America
##  affordable protect            
##  or my opening up 5000         
##  workers and
kwic(toks, "clinton", window=5)[1:5,]
## Keyword-in-context with 5 matches.                                                                 
##   [text1542, 6]       Obama worked harder for Hillary | Clinton |
##  [text1542, 10]        Hillary Clinton and the losing | Clinton |
##  [text3245, 27] political campaign by Crooked Hillary | Clinton |
##   [text3314, 9]        Biden's town hall with Hillary | Clinton |
##  [text3662, 13]             but refusing to touch the | Clinton |
##                                 
##  and the losing Clinton Campaign
##  Campaign than she worked for   
##  while Hillary was under FBI    
##  got off to a fantastic         
##  Foundation
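
The keyword can also be a multi-word pattern if we wrap it in phrase; a minimal sketch (the query itself is just an illustration):

kwic(toks, phrase("fake news"), window=5)[1:5,]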

3. Creating the document-feature matrix

Finally, we can convert a tokens object into a document-feature matrix using the dfm function.

twdfm <- dfm(toks, verbose=TRUE)
## Creating a dfm from a tokens input...
##  ...lowercasing
##  ...found 56,571 documents, 50,086 features
##  ...complete, elapsed time: 0.648 seconds.
## Finished constructing a 56,571 x 50,086 sparse dfm.
twdfm
## Document-feature matrix of: 56,571 documents, 50,086 features (99.96% sparse) and 8 docvars.
##        features
## docs    republicans and democrats have both created our economic problems i
##   text1           1   1         1    1    1       1   1        1        1 0
##   text2           0   1         0    0    0       0   3        0        0 1
##   text3           0   1         0    0    0       0   0        0        0 0
##   text4           0   0         1    1    0       0   1        0        0 0
##   text5           0   1         0    0    0       0   0        0        0 0
##   text6           0   0         0    0    0       0   0        0        0 0
## [ reached max_ndoc ... 56,565 more documents, reached max_nfeat ... 50,076 more features ]

The dfm records how many times each word appears in each document (tweet):

twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (66.00% sparse) and 8 docvars.
##        features
## docs    republicans and democrats have both created our economic problems i
##   text1           1   1         1    1    1       1   1        1        1 0
##   text2           0   1         0    0    0       0   3        0        0 1
##   text3           0   1         0    0    0       0   0        0        0 0
##   text4           0   0         1    1    0       0   1        0        0 0
##   text5           0   1         0    0    0       0   0        0        0 0

In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, either to speed up the analysis or because they’re not substantively relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 documents: 35,060
##   Total features removed: 35,060 (70.0%).
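
dfm_trim can also filter by total term frequency, or by proportions rather than raw document counts; a minimal sketch with arbitrary thresholds (not part of the analysis below):

twdfm_small <- dfm_trim(twdfm, min_termfreq = 5,
                        min_docfreq = 0.001, docfreq_type = "prop",
                        verbose=TRUE)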

It’s often also useful to look at a wordcloud of the most frequent features, to check whether anything looks off.

library(quanteda.textplots)
textplot_wordcloud(twdfm, rotation=0, 
                   min_size=2, max_size=5, 
                   max_words=50)

What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as http here. Many of the most frequent features are stopwords: common connector words in a given language (e.g. “a”, “the”, “is”) that carry little substantive meaning. We can inspect the most frequent features with topfeatures:

topfeatures(twdfm, 25)
##              the               to              and                a 
##            45989            26284            21105            19123 
##               of               is               in              for 
##            18010            16193            15809            12835 
##              you @realdonaldtrump                i               rt 
##            11565            11107            10955            10158 
##               on             will               be            great 
##            10132             8303             8194             7649 
##             that              are               it             with 
##             7558             7416             7092             6558 
##               we            trump              our             have 
##             6354             6339             6080             5865 
##              amp 
##             5682
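
To see which words quanteda treats as English stopwords, we can peek at the built-in list (a quick sketch):

head(stopwords("english"), 20)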

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(toks, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
## Creating a dfm from a tokens input...
##  ...lowercasing
##  ...found 56,571 documents, 50,086 features
## Warning: 'remove' is deprecated; use dfm_remove() instead
## ...
## removed 180 features
##  ...complete, elapsed time: 0.822 seconds.
## Finished constructing a 56,571 x 49,906 sparse dfm.
textplot_wordcloud(twdfm, rotation=0, min_size=2, max_size=5, max_words=50)
## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## @realdonaldtrump could not be fit on page. It will not be plotted.
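
Since the remove argument of dfm is deprecated (note the warning above), a minimal sketch of the currently recommended alternative is to drop these features at the tokens stage with tokens_remove (or from the dfm with dfm_remove) before building the dfm:

# same feature removal, but done on the tokens object
toks_nostop <- tokens_remove(toks, c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"))
twdfm <- dfm(toks_nostop, verbose=TRUE)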