Preprocessing text with quanteda

As we discussed earlier, automated text analysis requires several “preprocessing” steps before the text can be passed to a statistical model. We’ll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
tweets <- parseTweets("../data/trump-tweets.json")
## 3866 tweets have been parsed.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
## Corpus consisting of 3866 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences
##   text1    40     54         3
##   text2    20     23         3
##   text3    20     22         3
##   text4    32     41         4
##   text5    48     56         4
##   text6    12     14         2
##   text7    20     22         2
##   text8    29     31         2
##   text9    44     50         3
##  text10    22     24         2
## 
## Source: /Users/pablobarbera/git/social-media-upf/code/* on x86_64 by pablobarbera
## Created: Mon Jul  2 10:41:58 2018
## Notes:

A very useful feature of corpus objects is keywords in context (kwic), which returns every appearance of a word (or combination of words) together with its immediate context. The window argument sets how many words are shown on each side of the match.

kwic(twcorpus, "immigration", window=10)[1:5,]
##                                                                          
##   [text1, 14] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
##  [text10, 17] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
##  [text14, 11]                               .... If this is done, illegal
##   [text15, 9]           HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR
##   [text41, 6]                                                    .... Our
##                 
##  | IMMIGRATION |
##  | IMMIGRATION |
##  | immigration |
##  | IMMIGRATION |
##  | Immigration |
##                                                                    
##  . These are the American Citizens permanently separated from their
##  . These are the American Citize…                                  
##  will be stopped in it's tracks- and at very                       
##  BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON                   
##  policy, laughed at all over the world, is
kwic(twcorpus, "healthcare", window=10)[1:5,]
##                                                                         
##   [text46, 17]         help to me on Cutting Taxes, creating great new |
##  [text182, 37]             He is tough on Crime and Strong on Borders, |
##  [text507, 48] Warren lines, loves sanctuary cities, bad and expensive |
##   [text530, 6]                           The American people deserve a |
##  [text554, 27]          will be a great Governor with a heavy focus on |
##                                                                      
##  healthcare | programs at low cost, fighting for Border Security,    
##  Healthcare | , the Military and our great Vets. Henry has           
##  healthcare | ...                                                    
##  healthcare | system that takes care of them- not one that           
##  HealthCare | and Jobs. His Socialist opponent in November should not
kwic(twcorpus, "clinton", window=10)[1:5,]
##                                                                         
##  [text141, 23]                the Bush Dynasty, then I had to beat the |
##  [text161, 20]                the Bush Dynasty, then I had to beat the |
##   [text204, 9]                  FBI Agent Peter Strzok, who headed the |
##  [text216, 13] :.@jasoninthehouse: All of this started because Hillary |
##  [text252, 10]                          .... Schneiderman, who ran the |
##                                                             
##  Clinton | Dynasty, and now I…                              
##  Clinton | Dynasty, and now I have to beat a phony          
##  Clinton | & amp; Russia investigations, texted to his lover
##  Clinton | set up her private server https:// t.co          
##  Clinton | campaign in New York, never had the guts to
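
To make the idea behind keywords in context more concrete, here is a minimal base R sketch. It is not how quanteda implements kwic: the helper name toy_kwic and the example sentence are made up for illustration, and tokens are split on whitespace only.

```r
# Minimal keyword-in-context sketch in base R (not quanteda's implementation).
# `toy_kwic` is a hypothetical helper used only for illustration.
toy_kwic <- function(text, keyword, window = 3) {
  # Split on whitespace; a real tokenizer would also handle punctuation.
  toks <- strsplit(text, "\\s+")[[1]]
  # Case-insensitive exact match on each token.
  hits <- which(tolower(toks) == tolower(keyword))
  lapply(hits, function(i) {
    before <- toks[seq_len(i - 1)]
    after  <- toks[seq_len(length(toks) - i) + i]
    list(
      pre     = paste(tail(before, window), collapse = " "),
      keyword = toks[i],
      post    = paste(head(after, window), collapse = " ")
    )
  })
}

res <- toy_kwic(
  "We need strong IMMIGRATION laws and fair immigration policy now",
  "immigration", window = 2
)
# One row per match: left context, keyword as written, right context.
do.call(rbind, lapply(res, as.data.frame))
```

As in the quanteda output above, matching ignores case but the keyword is shown as it was originally written.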

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 9,930 features
##    ... created a 3,866 x 9,930 sparse dfm
##    ... complete. 
## Elapsed time: 0.24 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 9,930 features (99.7% sparse).

Each cell of the dfm holds the number of times a word appears in a document (here, a tweet):

twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (72% sparse).
## 5 x 10 sparse Matrix of class "dfm"
##        features
## docs    we are gathered today to hear directly from the american
##   text1  1   3        1     1  1    1        1    2   4        2
##   text2  0   0        0     0  0    0        0    0   0        0
##   text3  0   0        0     0  0    0        0    0   0        0
##   text4  0   0        0     0  2    0        0    0   2        0
##   text5  0   0        0     0  2    0        0    0   2        0
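
If the structure of a document-feature matrix is unfamiliar, the following base R sketch builds a tiny one by hand. The three documents are invented for illustration, and the result is an ordinary dense matrix, whereas quanteda stores the dfm in a sparse format.

```r
# Toy document-feature matrix in base R (illustrative only).
docs <- c(text1 = "we are here and we are ready",
          text2 = "great jobs great economy",
          text3 = "we want jobs")

# Lowercase and split on whitespace, keeping the document names.
toks <- strsplit(tolower(docs), "\\s+")
names(toks) <- names(docs)

# The vocabulary is the set of distinct tokens across all documents.
vocab <- unique(unlist(toks))

# Count each vocabulary item in each document:
# one row per document, one column per feature.
toy_dfm <- t(vapply(
  toks,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dfm) <- vocab
toy_dfm

# Share of zero cells, which is the sparsity figure quanteda reports.
mean(toy_dfm == 0)
```

Even in this small example most cells are zero, which is why sparse storage pays off once the vocabulary grows to thousands of features.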

dfm has many useful options (check out ?dfm for more information). Let’s use them to lowercase and stem the text, extract n-grams, and remove punctuation and URLs:

twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 128,909 features
##    ... stemming features (English)
## , trimmed 5431 feature variants
##    ... created a 3,866 x 123,478 sparse dfm
##    ... complete. 
## Elapsed time: 5.52 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 123,478 features (99.9% sparse).

Note that here we use ngrams: this extracts all sequences of one, two, and three consecutive words (e.g. “human”, “rights”, and “human rights” will each be a feature in the matrix). This is why the number of features grows so much compared to the first dfm.
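
As a rough illustration of what n-gram extraction does, here is a base R sketch. The helper toy_ngrams is hypothetical, and joining the words with an underscore is a choice made for this example.

```r
# Hypothetical helper: build n-grams by joining runs of consecutive tokens.
toy_ngrams <- function(toks, n = 1:3, sep = "_") {
  unlist(lapply(n, function(k) {
    # A document shorter than k tokens has no k-grams.
    if (length(toks) < k) return(character(0))
    starts <- seq_len(length(toks) - k + 1)
    vapply(
      starts,
      function(i) paste(toks[i:(i + k - 1)], collapse = sep),
      character(1)
    )
  }))
}

grams <- toy_ngrams(c("protect", "human", "rights"))
grams
```

Three tokens already yield six features (three unigrams, two bigrams, one trigram), which gives a sense of how quickly the feature space expands.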

Stemming relies on the SnowballC package’s implementation of the Porter stemmer:

example <- tolower(tweets$text[1])
tokens(example)
## tokens from 1 document.
## text1 :
##  [1] "we"          "are"         "gathered"    "today"       "to"         
##  [6] "hear"        "directly"    "from"        "the"         "american"   
## [11] "victims"     "of"          "illegal"     "immigration" "."          
## [16] "these"       "are"         "the"         "american"    "citizens"   
## [21] "permanently" "separated"   "from"        "their"       "loved"      
## [26] "ones"        "b"           "/"           "c"           "they"       
## [31] "were"        "killed"      "by"          "criminal"    "illegal"    
## [36] "aliens"      "."           "these"       "are"         "the"        
## [41] "families"    "the"         "media"       "ignores"     "."          
## [46] "."           "."           "https"       ":"           "/"          
## [51] "/"           "t.co"        "/"           "zjxesyacjy"
tokens_wordstem(tokens(example))
## tokens from 1 document.
## text1 :
##  [1] "we"         "are"        "gather"     "today"      "to"        
##  [6] "hear"       "direct"     "from"       "the"        "american"  
## [11] "victim"     "of"         "illeg"      "immigr"     "."         
## [16] "these"      "are"        "the"        "american"   "citizen"   
## [21] "perman"     "separ"      "from"       "their"      "love"      
## [26] "one"        "b"          "/"          "c"          "they"      
## [31] "were"       "kill"       "by"         "crimin"     "illeg"     
## [36] "alien"      "."          "these"      "are"        "the"       
## [41] "famili"     "the"        "media"      "ignor"      "."         
## [46] "."          "."          "https"      ":"          "/"         
## [51] "/"          "t.co"       "/"          "zjxesyacji"

In a large corpus like this, many features appear in only one or two documents. In some cases it’s a good idea to remove those features, either to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 documents: 112,440
##   Total features removed: 112,440 (91.1%).
twdfm
## Document-feature matrix of: 3,866 documents, 11,038 features (99.7% sparse).
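
The logic of trimming by document frequency can be sketched in base R on a small dense matrix. The counts below are invented, and min_docfreq here is just a local variable named after the argument used above.

```r
# Toy matrix: rows are documents, columns are features, cells are counts.
m <- matrix(c(1, 0, 2,
              1, 0, 1,
              3, 1, 0,
              0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("text", 1:4), c("the", "rare", "jobs")))

# Document frequency: in how many documents does each feature occur at all?
# A feature used three times in one tweet still counts as one document.
docfreq <- colSums(m > 0)
docfreq

# Keep only features found in at least `min_docfreq` documents.
min_docfreq <- 2
trimmed <- m[, docfreq >= min_docfreq, drop = FALSE]
trimmed
```

Note that the threshold applies to the number of documents containing a feature, not to its total count across the corpus.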

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)

What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as http here. Words that carry little substantive meaning are called stopwords; they are typically common connectors in a given language (e.g. “a”, “the”, “is”). We can also see the most frequent features using topfeatures:

topfeatures(twdfm, 25)
##   the    to   and    of     a    in    is   for    on   our    be  will 
##  4580  2697  2493  1945  1549  1456  1299  1088   920   894   846   842 
## great  with   are    we     i  that    it   amp  have    at   you   was 
##   836   815   793   764   735   733   729   637   573   523   520   492 
##  they 
##   474

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct = TRUE, remove_url = TRUE,
             remove = c(stopwords("english"), "t.co", "https", "rt", "amp",
                        "http", "t.c", "can", "u"),
             verbose = TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 8,456 features
##    ... removed 165 features
##    ... created a 3,866 x 8,291 sparse dfm
##    ... complete. 
## Elapsed time: 0.315 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
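
Conceptually, removing stopwords is just filtering tokens against a list. The base R sketch below uses a short hand-written list (toy_stopwords, a made-up name) rather than quanteda’s much longer stopwords("english"), plus a few corpus-specific terms like the ones we removed above.

```r
toks <- c("we", "will", "make", "the", "economy", "great",
          "https", "t.co", "amp")

# Small hand-written list for illustration only.
toy_stopwords <- c("we", "will", "the", "a", "is")

# Corpus-specific debris: URL fragments and the leftover HTML entity "amp".
extra <- c("https", "t.co", "amp")

# Keep only the tokens that appear in neither list.
kept <- toks[!toks %in% c(toy_stopwords, extra)]
kept
```

Which terms belong in the extra list depends on the corpus, so it is worth checking the most frequent features again after each round of removal.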