Sentiment detection using dictionary methods

One of the most common applications of dictionary methods is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document.
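As a minimal sketch of the idea, with a made-up two-word dictionary (the documents and dictionary entries here are purely illustrative), the score is simply the count of positive matches minus the count of negative matches:

```r
library(quanteda)

# two toy documents and a tiny illustrative dictionary
toy <- c(d1 = "What a great, wonderful day", d2 = "A sad, terrible loss")
toydict <- dictionary(list(positive = c("great", "wonderful"),
                           negative = c("sad", "terrible")))

# count dictionary matches per document
toysent <- dfm_lookup(dfm(tokens(toy)), dictionary = toydict)

# sentiment score = positive matches - negative matches
as.numeric(toysent[, "positive"]) - as.numeric(toysent[, "negative"])
```

Here d1 scores +2 (two positive matches) and d2 scores -2 (two negative matches).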

Let’s apply this technique to tweets by the four leading candidates in the 2016 Presidential primaries. Which candidate was using positive rhetoric most frequently? Which candidate was most negative in their public messages on Twitter?

library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tweets <- read.csv('../data/candidate-tweets.csv', stringsAsFactors=F)

We will use the positive and negative categories in the augmented General Inquirer dictionary to measure the extent to which these candidates adopted a positive or negative tone during the election campaign.

Note that you will first need to install the quanteda.dictionaries and quanteda.sentiment packages from GitHub:

library(devtools)
install_github("kbenoit/quanteda.dictionaries")
install_github("quanteda/quanteda.sentiment")

First, we load the dictionary object. Note that we can apply the dictionary object directly (as we will see later), but for now let’s learn how we would do this if we had lists of positive and negative words in a separate file.
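For instance, if the word lists lived in plain-text files with one word per line (the filenames below are hypothetical), we could read them in with readLines() and build the dictionary ourselves:

```r
# hypothetical files containing one word per line
pos.words <- readLines("positive-words.txt")
neg.words <- readLines("negative-words.txt")
mydict <- dictionary(list(positive = pos.words,
                          negative = neg.words))
```

Here, instead of files, we take the word lists from the dictionary object that ships with quanteda.dictionaries: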

library(quanteda.dictionaries)
library(quanteda.sentiment)
## 
## Attaching package: 'quanteda.sentiment'
## The following object is masked from 'package:quanteda':
## 
##     data_dictionary_LSD2015
data(data_dictionary_geninqposneg)

pos.words <- data_dictionary_geninqposneg[['positive']]
neg.words <- data_dictionary_geninqposneg[['negative']]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
##  [1] "signify"     "accommodate" "refinement"  "amenable"    "befit"      
##  [6] "romance"     "sweeten"     "mediate"     "poetic"      "thrift"
sample(neg.words, 10)
##  [1] "ignorant"      "bewilder"      "strain"        "flimsy"       
##  [5] "overrun"       "mourn"         "unemployed"    "nuts"         
##  [9] "strike"        "superstitious"

As earlier in the course, we will convert our text to a corpus object, tokenize the documents, and then create the DFM.

twcorpus <- corpus(tweets)
toks <- tokens(twcorpus)
tdfm <- dfm(toks)

Now we’re ready to run the sentiment analysis! First we will construct a dictionary object.

mydict <- dictionary(list(positive = pos.words,
                          negative = neg.words))

And now we apply it to the corpus in order to count the number of words that appear in each category:

sent <- dfm_lookup(tdfm, dictionary = mydict)

We can then extract the score and add it to the data frame as a new variable:

tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])

And now we can start answering some descriptive questions…

# what is the average sentiment score?
mean(tweets$score)
## [1] 0.4775762
# what are the most positive and most negative tweets?
tweets[which.max(tweets$score),]
##       screen_name
## 22177     tedcruz
##                                                                                                                                                       text
## 22177 We will restore our spirit. We will free our minds &amp; imagination. We will create a better world. We will bring back jobs, freedom &amp; security
##                  datetime
## 22177 2016-04-20 02:42:28
##                                                                   source lang
## 22177 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>   en
##       score
## 22177     8
tweets[which.min(tweets$score),]
##       screen_name
## 13843     tedcruz
##                                                                                                                                          text
## 13843 Can't win battle against radical Islamic terrorism if you're unwilling to utter words radical Islamic terrorism https://t.co/HASzpU0kuy
##                  datetime
## 13843 2014-06-18 03:02:50
##                                                                                    source
## 13843 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
##       lang score
## 13843   en    -7
# what is the distribution of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
## 
## negative  neutral positive 
##     5255     8974    12506

We can also compute it at the candidate level by taking the average of the sentiment scores:

# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")

for (cand in candidates){
  message(cand, " -- average sentiment: ",
      round(mean(tweets$score[tweets$screen_name==cand]), 4)
    )
}
## realDonaldTrump -- average sentiment: 0.5392
## HillaryClinton -- average sentiment: 0.3785
## tedcruz -- average sentiment: 0.5278
## BernieSanders -- average sentiment: 0.3848
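
The loop above can also be written as a one-liner with base R’s tapply(), which splits the scores by screen name and averages each group:

```r
# average sentiment score by candidate, without an explicit loop
round(tapply(tweets$score, tweets$screen_name, mean), 4)
```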

But what happens if we now run the analysis excluding a single word, “great”?

pos.words <- pos.words[-which(pos.words=="great")]

mydict <- dictionary(list(positive = pos.words,
                          negative = neg.words))
sent <- dfm_lookup(tdfm, dictionary = mydict)
tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])

for (cand in candidates){
  message(cand, " -- average sentiment: ",
      round(mean(tweets$score[tweets$screen_name==cand]), 4)
    )
}
## realDonaldTrump -- average sentiment: 0.3912
## HillaryClinton -- average sentiment: 0.3596
## tedcruz -- average sentiment: 0.4997
## BernieSanders -- average sentiment: 0.3729

How would we normalize by text length? (It may not be strictly necessary here, since tweets are all roughly the same length, but it matters when documents vary in length.)
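
If we did want a tweet-level normalization, one simple sketch is to divide each tweet’s net sentiment count by its length in tokens via ntoken() (rel.score is a variable name introduced here for illustration):

```r
# net sentiment matches per token, at the tweet level
tweets$rel.score <- (as.numeric(sent[, 1]) - as.numeric(sent[, 2])) / ntoken(tdfm)
```

An alternative, which we take below for the candidate-level analysis, is to weight the DFM itself into proportions with dfm_weight().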

# collapse by account into 4 documents
grouped_dfm <- dfm_group(tdfm, groups = docvars(tdfm, "screen_name"))
grouped_dfm
## Document-feature matrix of: 4 documents, 43,753 features (66.94% sparse) and 1 docvar.
##                  features
## docs                rt @geraldorivera    : recruit @realdonaldtrump   to finish
##   BernieSanders   1018              0 2030       0               13 2407      0
##   HillaryClinton  1449              0 3322       0               35 3389      5
##   realDonaldTrump  607              8 4487       2             2296 2537      7
##   tedcruz         4464              0 9303       3              210 4045      6
##                  features
## docs              that horrid eyesore
##   BernieSanders    747      0       0
##   HillaryClinton   561      0       0
##   realDonaldTrump  714      2       1
##   tedcruz          429      0       0
## [ reached max_nfeat ... 43,743 more features ]
# word counts before weighting
grouped_dfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30.00% sparse) and 1 docvar.
##                  features
## docs                rt @geraldorivera    : recruit @realdonaldtrump   to finish
##   BernieSanders   1018              0 2030       0               13 2407      0
##   HillaryClinton  1449              0 3322       0               35 3389      5
##   realDonaldTrump  607              8 4487       2             2296 2537      7
##   tedcruz         4464              0 9303       3              210 4045      6
##                  features
## docs              that horrid eyesore
##   BernieSanders    747      0       0
##   HillaryClinton   561      0       0
##   realDonaldTrump  714      2       1
##   tedcruz          429      0       0
# turn word counts into proportions
grouped_dfm <- dfm_weight(grouped_dfm, scheme="prop")
grouped_dfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30.00% sparse) and 1 docvar.
##                  features
## docs                       rt @geraldorivera          :      recruit
##   BernieSanders   0.011783089   0            0.02349673 0           
##   HillaryClinton  0.011074764   0            0.02539018 0           
##   realDonaldTrump 0.003759468   4.954818e-05 0.02779034 1.238705e-05
##   tedcruz         0.023844753   0            0.04969259 1.602470e-05
##                  features
## docs              @realdonaldtrump         to       finish        that
##   BernieSanders       0.0001504717 0.02786041 0            0.008646334
##   HillaryClinton      0.0002675064 0.02590226 3.821520e-05 0.004287745
##   realDonaldTrump     0.0142203284 0.01571297 4.335466e-05 0.004422175
##   tedcruz             0.0011217290 0.02160664 3.204940e-05 0.002291532
##                  features
## docs                    horrid      eyesore
##   BernieSanders   0            0           
##   HillaryClinton  0            0           
##   realDonaldTrump 1.238705e-05 6.193523e-06
##   tedcruz         0            0
# apply dictionary using the dfm_lookup() function
sent <- dfm_lookup(grouped_dfm, dictionary = mydict)
sent
## Document-feature matrix of: 4 documents, 2 features (0.00% sparse) and 1 docvar.
##                  features
## docs                positive   negative
##   BernieSanders   0.05460964 0.03723595
##   HillaryClinton  0.04787600 0.03216955
##   realDonaldTrump 0.05266352 0.03438025
##   tedcruz         0.04935073 0.02412786
(sent[,1]-sent[,2])*100
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                  features
## docs              positive
##   BernieSanders   1.737369
##   HillaryClinton  1.570645
##   realDonaldTrump 1.828328
##   tedcruz         2.522288
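
To manipulate these scores as a regular data frame rather than a sparse matrix, quanteda’s convert() helper is convenient (net is a column name introduced here for illustration):

```r
# dfm -> data.frame, then net sentiment in percentage points
sent.df <- convert(sent, to = "data.frame")
sent.df$net <- (sent.df$positive - sent.df$negative) * 100
sent.df
```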

Finally, let’s apply a different dictionary so that we can practice with dictionaries in different formats:

data(data_dictionary_MFD)
# dictionary keys
names(data_dictionary_MFD)
##  [1] "care.virtue"      "care.vice"        "fairness.virtue"  "fairness.vice"   
##  [5] "loyalty.virtue"   "loyalty.vice"     "authority.virtue" "authority.vice"  
##  [9] "sanctity.virtue"  "sanctity.vice"
# looking at words within first key
data_dictionary_MFD$care.virtue[1:10]
##  [1] "alleviate"   "alleviated"  "alleviates"  "alleviating" "alleviation"
##  [6] "altruism"    "altruist"    "beneficence" "beneficiary" "benefit"
# applying dictionary
# 1) collapse by account
grouped_dfm <- dfm_group(tdfm, groups = docvars(tdfm, "screen_name"))
# 2) turn words into proportions
twdfm <- dfm_weight(grouped_dfm, scheme="prop")
# 3) apply dictionary
moral <- dfm_lookup(twdfm, dictionary = data_dictionary_MFD)

# are liberals more sensitive to care and fairness?
moral[,'care.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
##                  features
## docs              care.virtue
##   BernieSanders     0.7870826
##   HillaryClinton    0.8155123
##   realDonaldTrump   0.3183471
##   tedcruz           0.3060718
moral[,'fairness.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
##                  features
## docs              fairness.virtue
##   BernieSanders        0.27200648
##   HillaryClinton       0.39438084
##   realDonaldTrump      0.09166414
##   tedcruz              0.16505440
# are conservatives more sensitive to sanctity and authority?
moral[,'sanctity.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
##                  features
## docs              sanctity.virtue
##   BernieSanders         0.1064876
##   HillaryClinton        0.1146456
##   realDonaldTrump       0.1151995
##   tedcruz               0.2569293
moral[,'authority.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
##                  features
## docs              authority.virtue
##   BernieSanders          0.5000289
##   HillaryClinton         0.5747566
##   realDonaldTrump        0.6125394
##   tedcruz                0.7227139
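
To compare several foundations side by side instead of one column at a time, one option is again to convert the result to a data frame and take virtue-minus-vice differences (a sketch; the column names come from the dictionary keys listed above):

```r
m <- convert(moral, to = "data.frame")
# net (virtue - vice) per foundation, in percentage points
data.frame(doc_id   = m$doc_id,
           care     = (m$care.virtue - m$care.vice) * 100,
           fairness = (m$fairness.virtue - m$fairness.vice) * 100,
           loyalty  = (m$loyalty.virtue - m$loyalty.vice) * 100)
```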