Preprocessing text with quanteda

As we discussed earlier, before we can run any type of automated text analysis, we need to put the raw text through several “preprocessing” steps so that it can be passed to a statistical model. We’ll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Warning: package 'rjson' was built under R version 3.4.4
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
## Corpus consisting of 3866 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences
##   text1    40     54         3
##   text2    20     23         3
##   text3    20     22         3
##   text4    32     41         4
##   text5    48     56         4
##   text6    12     14         2
##   text7    20     22         2
##   text8    29     31         2
##   text9    44     50         3
##  text10    22     24         2
## 
## Source: /Users/pablobarbera/git/ECPR-SC105/code/* on x86_64 by pablobarbera
## Created: Thu Aug  9 11:18:56 2018
## Notes:
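
The corpus we just created contains only the tweet text. If we also wanted to keep tweet-level metadata alongside the text, we could attach it as document-level variables. A minimal sketch, not part of the original code, assuming parseTweets returned a created_at column:

# attach tweet timestamps as a document-level variable (assumes this column exists)
docvars(twcorpus, "created_at") <- tweets$created_at
head(docvars(twcorpus))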

A very useful feature of corpus objects is keywords in context (kwic), which returns all the appearances of a word (or combination of words) in its immediate context.

kwic(twcorpus, "immigration", window=10)[1:5,]
##                                                                          
##   [text1, 14] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
##  [text10, 17] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
##  [text14, 11]                               .... If this is done, illegal
##   [text15, 9]           HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR
##   [text41, 6]                                                    .... Our
##                 
##  | IMMIGRATION |
##  | IMMIGRATION |
##  | immigration |
##  | IMMIGRATION |
##  | Immigration |
##                                                                    
##  . These are the American Citizens permanently separated from their
##  . These are the American Citize…                                  
##  will be stopped in it's tracks- and at very                       
##  BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON                   
##  policy, laughed at all over the world, is
kwic(twcorpus, "healthcare", window=10)[1:5,]
##                                                                         
##   [text46, 17]         help to me on Cutting Taxes, creating great new |
##  [text182, 37]             He is tough on Crime and Strong on Borders, |
##  [text507, 48] Warren lines, loves sanctuary cities, bad and expensive |
##   [text530, 6]                           The American people deserve a |
##  [text554, 27]          will be a great Governor with a heavy focus on |
##                                                                      
##  healthcare | programs at low cost, fighting for Border Security,    
##  Healthcare | , the Military and our great Vets. Henry has           
##  healthcare | ...                                                    
##  healthcare | system that takes care of them- not one that           
##  HealthCare | and Jobs. His Socialist opponent in November should not
kwic(twcorpus, "clinton", window=10)[1:5,]
##                                                                         
##  [text141, 23]                the Bush Dynasty, then I had to beat the |
##  [text161, 20]                the Bush Dynasty, then I had to beat the |
##   [text204, 9]                  FBI Agent Peter Strzok, who headed the |
##  [text216, 13] :.@jasoninthehouse: All of this started because Hillary |
##  [text252, 10]                          .... Schneiderman, who ran the |
##                                                             
##  Clinton | Dynasty, and now I…                              
##  Clinton | Dynasty, and now I have to beat a phony          
##  Clinton | & amp; Russia investigations, texted to his lover
##  Clinton | set up her private server https:// t.co          
##  Clinton | campaign in New York, never had the guts to
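
kwic also accepts multi-word patterns wrapped in phrase(). A quick sketch, not in the original code, assuming the expression actually occurs in the corpus:

# keywords in context for a two-word expression
head(kwic(twcorpus, phrase("fake news"), window=10))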

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 9,930 features
##    ... created a 3,866 x 9,930 sparse dfm
##    ... complete. 
## Elapsed time: 0.332 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 9,930 features (99.7% sparse).

The dfm shows the number of times each word appears in each document (tweet):

twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (72% sparse).
## 5 x 10 sparse Matrix of class "dfm"
##        features
## docs    we are gathered today to hear directly from the american
##   text1  1   3        1     1  1    1        1    2   4        2
##   text2  0   0        0     0  0    0        0    0   0        0
##   text3  0   0        0     0  0    0        0    0   0        0
##   text4  0   0        0     0  2    0        0    0   2        0
##   text5  0   0        0     0  2    0        0    0   2        0

dfm has many useful options (check out ?dfm for more information). Let’s use some of them to stem the text, extract n-grams, remove punctuation, drop URLs, and keep Twitter features such as hashtags and handles…

twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 128,909 features
##    ... stemming features (English)
## , trimmed 5431 feature variants
##    ... created a 3,866 x 123,478 sparse dfm
##    ... complete. 
## Elapsed time: 6.38 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 123,478 features (99.9% sparse).

Note that here we use ngrams=1:3: this will extract all combinations of one, two, and three words (e.g. “human”, “rights”, and “human_rights” will each appear as features in the matrix).
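
A toy illustration of how the n-grams are constructed (not part of the original code):

toks <- tokens("we must defend human rights")
# unigrams, bigrams ("we_must", ..., "human_rights"), and trigrams, joined with "_"
tokens_ngrams(toks, n=1:3)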

Stemming relies on the SnowballC package’s implementation of the Porter stemmer:

example <- tolower(tweets$text[1])
tokens(example)
## tokens from 1 document.
## text1 :
##  [1] "we"          "are"         "gathered"    "today"       "to"         
##  [6] "hear"        "directly"    "from"        "the"         "american"   
## [11] "victims"     "of"          "illegal"     "immigration" "."          
## [16] "these"       "are"         "the"         "american"    "citizens"   
## [21] "permanently" "separated"   "from"        "their"       "loved"      
## [26] "ones"        "b"           "/"           "c"           "they"       
## [31] "were"        "killed"      "by"          "criminal"    "illegal"    
## [36] "aliens"      "."           "these"       "are"         "the"        
## [41] "families"    "the"         "media"       "ignores"     "."          
## [46] "."           "."           "https"       ":"           "/"          
## [51] "/"           "t.co"        "/"           "zjxesyacjy"
tokens_wordstem(tokens(example))
## tokens from 1 document.
## text1 :
##  [1] "we"         "are"        "gather"     "today"      "to"        
##  [6] "hear"       "direct"     "from"       "the"        "american"  
## [11] "victim"     "of"         "illeg"      "immigr"     "."         
## [16] "these"      "are"        "the"        "american"   "citizen"   
## [21] "perman"     "separ"      "from"       "their"      "love"      
## [26] "one"        "b"          "/"          "c"          "they"      
## [31] "were"       "kill"       "by"         "crimin"     "illeg"     
## [36] "alien"      "."          "these"      "are"        "the"       
## [41] "famili"     "the"        "media"      "ignor"      "."         
## [46] "."          "."          "https"      ":"          "/"         
## [51] "/"          "t.co"       "/"          "zjxesyacji"

In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 documents: 112,440
##   Total features removed: 112,440 (91.1%).
twdfm
## Document-feature matrix of: 3,866 documents, 11,038 features (99.7% sparse).

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)

What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as http here. We also want to remove stopwords: words that work as common connectors in a given language (e.g. “a”, “the”, “is”) and carry little meaning on their own. We can see the most frequent features using topfeatures:

topfeatures(twdfm, 25)
##   the    to   and    of     a    in    is   for    on   our    be  will 
##  4580  2697  2493  1945  1549  1456  1299  1088   920   894   846   842 
## great  with   are    we     i  that    it   amp  have    at   you   was 
##   836   815   793   764   735   733   729   637   573   523   520   492 
##  they 
##   474
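
To see what the built-in English stopword list looks like (a quick check, not in the original code):

# first 20 entries of quanteda's English stopword list
head(stopwords("english"), 20)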

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), remove_url=TRUE, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 8,456 features
##    ... removed 165 features
##    ... created a 3,866 x 8,291 sparse dfm
##    ... complete. 
## Elapsed time: 0.463 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)

Dictionary methods

One of the most common applications of dictionary methods is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document.
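
A toy example of the logic, not part of the original code: with a small dictionary, the score of a document is simply the number of positive matches minus the number of negative matches.

toy <- corpus(c(d1 = "what a great and beautiful day",
                d2 = "a sad, terrible day"))
toydict <- dictionary(list(positive = c("great", "beautiful"),
                           negative = c("sad", "terrible")))
# d1 should count 2 positive / 0 negative; d2 should count 0 positive / 2 negative
dfm(toy, dictionary = toydict)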

Let’s apply this technique to tweets by the four leading candidates in the 2016 Presidential primaries.

library(quanteda)
tweets <- read.csv('~/data/candidate-tweets.csv', stringsAsFactors=F)

We will use the LIWC dictionary to measure the extent to which these candidates adopted a positive or negative tone during the election campaign. (Note: LIWC is provided here for teaching purposes only and will not be distributed publicly.) LIWC has many other categories, but for now let’s just use positive and negative:

liwc <- read.csv("~/data/liwc-dictionary.csv",
                 stringsAsFactors = FALSE)
table(liwc$class)
## 
##     adjective        affect         anger       anxiety         cause 
##           235           445            46            92            46 
##     cognition       compare        differ   discrepancy        female 
##           252           101            46            92            46 
##        future       insight interrogation          male        negate 
##            46            92            47            46            47 
##      negative        number          past      positive         power 
##           230            36           123           211           184 
##       present         quant        reward          risk        social 
##           138            47            46            46           230 
##     tentative          verb 
##            23           329
pos.words <- liwc$word[liwc$class=="positive"]
neg.words <- liwc$word[liwc$class=="negative"]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
##  [1] "proudly"   "admir*"    "kind"      "wealthy"   "sexy"     
##  [6] "respect"   "excelled"  "wellness"  "kindly"    "excellent"
sample(neg.words, 10)
##  [1] "saddest"    "ugliest"    "annoy"      "immoral*"   "anxious"   
##  [6] "anxiously"  "fake"       "distrust*"  "upset"      "uncontrol*"
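
Note that several LIWC entries end in a wildcard (e.g. “admir*”). By default quanteda applies glob-style pattern matching when looking up a dictionary, so these entries match any word starting with that stem. A quick check, not in the original code:

# "admire", "admirable", and "admired" should all match the "admir*" pattern
dfm(corpus("we admire an admirable and admired leader"),
    dictionary = dictionary(list(positive = "admir*")))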

As we did earlier today, we will convert our text to a corpus object.

twcorpus <- corpus(tweets)
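
Because tweets is a data frame with a text column, corpus() uses that column as the document text and stores the remaining columns (screen_name, datetime, etc.) as document-level variables. A quick check, not in the original code:

head(docvars(twcorpus))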

Now we’re ready to run the sentiment analysis! First we will construct a dictionary object.

mydict <- dictionary(list(positive = pos.words,
                          negative = neg.words))

And now we apply it to the corpus in order to count the number of words that appear in each category:

sent <- dfm(twcorpus, dictionary = mydict)

We can then extract the score and add it to the data frame as a new variable:

tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])

And now we can start answering some descriptive questions…

# what is the average sentiment score?
mean(tweets$score)
## [1] 0.2056106
# what is the most positive and most negative tweet?
tweets[which.max(tweets$score),]
##          screen_name
## 3125 realDonaldTrump
##                                                                                                                                              text
## 3125 .@robertjeffress I greatly appreciate your kind words last night on @FoxNews. Have great love for the evangelicals -- great respect for you.
##                 datetime
## 3125 2015-09-11 19:24:44
##                                                                  source
## 3125 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##      lang score
## 3125   en     5
tweets[which.min(tweets$score),]
##          screen_name
## 6642 realDonaldTrump
##                                                                                                                                           text
## 6642 Lindsey Graham is all over T.V., much like failed 47% candidate Mitt Romney. These nasty, angry, jealous  failures have ZERO credibility!
##                 datetime
## 6642 2016-03-07 13:03:59
##                                                                                    source
## 6642 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##      lang score
## 6642   en    -4
# what is the proportion of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
## 
## negative  neutral positive 
##     1265    19602     5868
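
The comment above asks for proportions; to get them rather than raw counts (a small addition, not in the original code):

# share of negative, neutral, and positive tweets
round(prop.table(table(tweets$sentiment)), 3)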

We can also disaggregate by groups of tweets, for example according to the candidate who posted them.

# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")

for (cand in candidates){
  message(cand, " -- average sentiment: ",
      round(mean(tweets$score[tweets$screen_name==cand]), 4)
    )
}
## realDonaldTrump -- average sentiment: 0.2911
## HillaryClinton -- average sentiment: 0.1736
## tedcruz -- average sentiment: 0.1853
## BernieSanders -- average sentiment: 0.1384
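
The same breakdown can be computed without a loop, for example with base R’s aggregate (an equivalent sketch, not in the original code):

aggregate(score ~ screen_name, data = tweets, FUN = mean)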

But what happens if we now run the analysis excluding a single word?

pos.words <- pos.words[-which(pos.words=="great")]

mydict <- dictionary(list(positive = pos.words,
                          negative = neg.words))
sent <- dfm(twcorpus, dictionary = mydict)
tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])

for (cand in candidates){
  message(cand, " -- average sentiment: ",
      round(mean(tweets$score[tweets$screen_name==cand]), 4)
    )
}
## realDonaldTrump -- average sentiment: 0.1431
## HillaryClinton -- average sentiment: 0.1547
## tedcruz -- average sentiment: 0.1573
## BernieSanders -- average sentiment: 0.1265

How would we normalize by text length? (Maybe not necessary here given that tweets have roughly the same length.)

# collapse by account into 4 documents
twdfm <- dfm(twcorpus, groups = "screen_name")
twdfm
## Document-feature matrix of: 4 documents, 43,426 features (66.9% sparse).
# turn word counts into proportions
twdfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30% sparse).
## 4 x 10 sparse Matrix of class "dfm"
##                  features
## docs                rt @geraldorivera     : recruit @realdonaldtrump   to
##   BernieSanders   1018              0  4186       0               11 2407
##   HillaryClinton  1449              0  7800       0               33 3389
##   realDonaldTrump  607              8  7138       2             2278 2537
##   tedcruz         4464              0 18871       3              203 4045
##                  features
## docs              finish that horrid eyesore
##   BernieSanders        0  747      0       0
##   HillaryClinton       5  561      0       0
##   realDonaldTrump      7  714      2       1
##   tedcruz              6  429      0       0
twdfm <- dfm_weight(twdfm, scheme="prop")
twdfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30% sparse).
## 4 x 10 sparse Matrix of class "dfm"
##                  features
## docs                       rt @geraldorivera          :      recruit
##   BernieSanders   0.010252175   0            0.04215678 0           
##   HillaryClinton  0.009177857   0            0.04940461 0           
##   realDonaldTrump 0.003413027   4.498223e-05 0.04013540 1.124556e-05
##   tedcruz         0.018250652   0            0.07715234 1.226522e-05
##                  features
## docs              @realdonaldtrump         to       finish        that
##   BernieSanders       0.0001107799 0.02424065 0            0.007522962
##   HillaryClinton      0.0002090195 0.02146567 3.166962e-05 0.003553332
##   realDonaldTrump     0.0128086906 0.01426499 3.935945e-05 0.004014664
##   tedcruz             0.0008299468 0.01653761 2.453045e-05 0.001753927
##                  features
## docs                    horrid      eyesore
##   BernieSanders   0            0           
##   HillaryClinton  0            0           
##   realDonaldTrump 1.124556e-05 5.622779e-06
##   tedcruz         0            0
# Apply dictionary using `dfm_lookup()` function:
sent <- dfm_lookup(twdfm, dictionary = mydict)
sent
## Document-feature matrix of: 4 documents, 2 features (0% sparse).
## 4 x 2 sparse Matrix of class "dfm"
##                  features
## docs                 positive    negative
##   BernieSanders   0.008237995 0.003111908
##   HillaryClinton  0.007467697 0.001868508
##   realDonaldTrump 0.010553956 0.004486978
##   tedcruz         0.007494051 0.001418677
(sent[,1]-sent[,2])*100
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                  features
## docs               positive
##   BernieSanders   0.5126088
##   HillaryClinton  0.5599189
##   realDonaldTrump 0.6066979
##   tedcruz         0.6075374
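
An alternative is to normalize at the tweet level, dividing each tweet’s dictionary score by its number of tokens. A sketch, not in the original code, that re-applies the dictionary to the tweet-level corpus:

# dictionary counts per tweet
sent_tweet <- dfm(twcorpus, dictionary = mydict)
# net sentiment divided by tweet length in tokens
tweets$score_norm <- (as.numeric(sent_tweet[,1]) - as.numeric(sent_tweet[,2])) / ntoken(twcorpus)
summary(tweets$score_norm)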