One of the most common applications of dictionary methods is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document.
Let’s apply this technique to tweets by the four leading candidates in the 2016 U.S. presidential primaries. Which candidate used positive rhetoric most frequently? Which candidate was most negative in their public messages on Twitter?
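Before turning to quanteda, the core computation can be sketched in a few lines of base R. The miniature word lists below are illustrative only, not the dictionary we use later:

```r
# Toy positive/negative word lists (illustrative only)
pos <- c("great", "win", "free")
neg <- c("terror", "flimsy", "mourn")

# Score = number of positive matches minus number of negative matches
score_text <- function(text) {
  toks <- tolower(unlist(strsplit(text, "\\W+")))
  sum(toks %in% pos) - sum(toks %in% neg)
}

score_text("A great win against terror")  # 2 positive - 1 negative = 1
```

The rest of the section does exactly this, but at scale and with a much larger dictionary.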
library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tweets <- read.csv('../data/candidate-tweets.csv', stringsAsFactors=F)
We will use the positive and negative categories in the augmented General Inquirer dictionary to measure the extent to which these candidates adopted a positive or negative tone during the election campaign.
Note that you will first need to install the quanteda.dictionaries and quanteda.sentiment packages from GitHub:
library(devtools)
install_github("kbenoit/quanteda.dictionaries")
install_github("quanteda/quanteda.sentiment")
First, we load the dictionary object. Note that we can apply the dictionary object directly (as we will see later), but for now let’s learn how we would proceed if we had a list of positive and negative words in a separate file.
library(quanteda.dictionaries)
library(quanteda.sentiment)
##
## Attaching package: 'quanteda.sentiment'
## The following object is masked from 'package:quanteda':
##
## data_dictionary_LSD2015
data(data_dictionary_geninqposneg)
pos.words <- data_dictionary_geninqposneg[['positive']]
neg.words <- data_dictionary_geninqposneg[['negative']]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
## [1] "signify" "accommodate" "refinement" "amenable" "befit"
## [6] "romance" "sweeten" "mediate" "poetic" "thrift"
sample(neg.words, 10)
## [1] "ignorant" "bewilder" "strain" "flimsy"
## [5] "overrun" "mourn" "unemployed" "nuts"
## [9] "strike" "superstitious"
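If the positive and negative words instead lived in plain text files (one word per line), readLines() would do the job. A self-contained sketch, using temporary files to stand in for the hypothetical word-list files:

```r
# Write tiny stand-in word lists to temporary files
pos_file <- tempfile(fileext = ".txt")
neg_file <- tempfile(fileext = ".txt")
writeLines(c("good", "great", "win"), pos_file)
writeLines(c("bad", "awful"), neg_file)

# Read them back, one word per line
pos.words.file <- readLines(pos_file)
neg.words.file <- readLines(neg_file)
length(pos.words.file)  # 3
```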
As earlier in the course, we will convert our text to a corpus object, tokenize the documents, and then create the document-feature matrix (DFM).
twcorpus <- corpus(tweets)
toks <- tokens(twcorpus)
tdfm <- dfm(toks)
Now we’re ready to run the sentiment analysis! First we will construct a dictionary object.
mydict <- dictionary(list(positive = pos.words,
negative = neg.words))
And now we apply it to the DFM in order to count the number of words that appear in each category:
sent <- dfm_lookup(tdfm, dictionary = mydict)
We can then extract the score (positive minus negative counts) and add it to the data frame as a new variable:
tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])
And now we can start answering some descriptive questions:
# what is the average sentiment score?
mean(tweets$score)
## [1] 0.4775762
# what are the most positive and most negative tweets?
tweets[which.max(tweets$score),]
## screen_name
## 22177 tedcruz
## text
## 22177 We will restore our spirit. We will free our minds & imagination. We will create a better world. We will bring back jobs, freedom & security
## datetime
## 22177 2016-04-20 02:42:28
## source lang
## 22177 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> en
## score
## 22177 8
tweets[which.min(tweets$score),]
## screen_name
## 13843 tedcruz
## text
## 13843 Can't win battle against radical Islamic terrorism if you're unwilling to utter words radical Islamic terrorism https://t.co/HASzpU0kuy
## datetime
## 13843 2014-06-18 03:02:50
## source
## 13843 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
## lang score
## 13843 en -7
# what is the proportion of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
##
## negative neutral positive
## 5255 8974 12506
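The table above reports raw counts; for the proportions the comment asks about, prop.table() can be chained onto the same table. A minimal sketch with toy labels:

```r
# prop.table() turns a table of counts into shares that sum to 1
sent_label <- c("positive", "neutral", "positive", "negative")
prop.table(table(sent_label))  # negative 0.25, neutral 0.25, positive 0.50
```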
We can also compute it at the candidate level by taking the average of the sentiment scores:
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")
for (cand in candidates){
message(cand, " -- average sentiment: ",
round(mean(tweets$score[tweets$screen_name==cand]), 4)
)
}
## realDonaldTrump -- average sentiment: 0.5392
## HillaryClinton -- average sentiment: 0.3785
## tedcruz -- average sentiment: 0.5278
## BernieSanders -- average sentiment: 0.3848
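The same grouped averages can be computed in a single call with base R’s tapply() instead of a loop; a toy sketch:

```r
# tapply() computes the mean score within each group
score <- c(1, -1, 2, 0)
name  <- c("A", "A", "B", "B")
tapply(score, name, mean)  # A = 0, B = 1
```

With the tweet data, the equivalent would be tapply(tweets$score, tweets$screen_name, mean).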
But what happens if we now run the analysis excluding a single word ("great")?
pos.words <- pos.words[-which(pos.words=="great")]
mydict <- dictionary(list(positive = pos.words,
negative = neg.words))
sent <- dfm_lookup(tdfm, dictionary = mydict)
tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])
for (cand in candidates){
message(cand, " -- average sentiment: ",
round(mean(tweets$score[tweets$screen_name==cand]), 4)
)
}
## realDonaldTrump -- average sentiment: 0.3912
## HillaryClinton -- average sentiment: 0.3596
## tedcruz -- average sentiment: 0.4997
## BernieSanders -- average sentiment: 0.3729
How would we normalize by text length? (Maybe not necessary here given that tweets have roughly the same length.)
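One simple option is to divide each document’s net score by its token count (with quanteda, something like tweets$score / ntoken(toks)). The arithmetic, sketched with toy counts:

```r
# Toy net scores and document lengths
net  <- c(3, -1, 2)    # positive minus negative matches per document
lens <- c(30, 10, 40)  # tokens per document
net / lens             # net sentiment per token: 0.10, -0.10, 0.05
```

Below we take a different route to the same end: grouping the DFM by candidate and converting counts to proportions before applying the dictionary.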
# collapse by account into 4 documents
grouped_dfm <- dfm_group(tdfm, groups = docvars(tdfm, "screen_name"))
grouped_dfm
## Document-feature matrix of: 4 documents, 43,753 features (66.94% sparse) and 1 docvar.
## features
## docs rt @geraldorivera : recruit @realdonaldtrump to finish
## BernieSanders 1018 0 2030 0 13 2407 0
## HillaryClinton 1449 0 3322 0 35 3389 5
## realDonaldTrump 607 8 4487 2 2296 2537 7
## tedcruz 4464 0 9303 3 210 4045 6
## features
## docs that horrid eyesore
## BernieSanders 747 0 0
## HillaryClinton 561 0 0
## realDonaldTrump 714 2 1
## tedcruz 429 0 0
## [ reached max_nfeat ... 43,743 more features ]
# raw counts for the first 10 features, before weighting
grouped_dfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30.00% sparse) and 1 docvar.
## features
## docs rt @geraldorivera : recruit @realdonaldtrump to finish
## BernieSanders 1018 0 2030 0 13 2407 0
## HillaryClinton 1449 0 3322 0 35 3389 5
## realDonaldTrump 607 8 4487 2 2296 2537 7
## tedcruz 4464 0 9303 3 210 4045 6
## features
## docs that horrid eyesore
## BernieSanders 747 0 0
## HillaryClinton 561 0 0
## realDonaldTrump 714 2 1
## tedcruz 429 0 0
# turn word counts into proportions
grouped_dfm <- dfm_weight(grouped_dfm, scheme="prop")
grouped_dfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30.00% sparse) and 1 docvar.
## features
## docs rt @geraldorivera : recruit
## BernieSanders 0.011783089 0 0.02349673 0
## HillaryClinton 0.011074764 0 0.02539018 0
## realDonaldTrump 0.003759468 4.954818e-05 0.02779034 1.238705e-05
## tedcruz 0.023844753 0 0.04969259 1.602470e-05
## features
## docs @realdonaldtrump to finish that
## BernieSanders 0.0001504717 0.02786041 0 0.008646334
## HillaryClinton 0.0002675064 0.02590226 3.821520e-05 0.004287745
## realDonaldTrump 0.0142203284 0.01571297 4.335466e-05 0.004422175
## tedcruz 0.0011217290 0.02160664 3.204940e-05 0.002291532
## features
## docs horrid eyesore
## BernieSanders 0 0
## HillaryClinton 0 0
## realDonaldTrump 1.238705e-05 6.193523e-06
## tedcruz 0 0
# apply dictionary using the dfm_lookup() function:
sent <- dfm_lookup(grouped_dfm, dictionary = mydict)
sent
## Document-feature matrix of: 4 documents, 2 features (0.00% sparse) and 1 docvar.
## features
## docs positive negative
## BernieSanders 0.05460964 0.03723595
## HillaryClinton 0.04787600 0.03216955
## realDonaldTrump 0.05266352 0.03438025
## tedcruz 0.04935073 0.02412786
(sent[,1]-sent[,2])*100
## 4 x 1 sparse Matrix of class "dgCMatrix"
## features
## docs positive
## BernieSanders 1.737369
## HillaryClinton 1.570645
## realDonaldTrump 1.828328
## tedcruz 2.522288
Finally, let’s apply a different dictionary so that we can practice with dictionaries in different formats:
data(data_dictionary_MFD)
# dictionary keys
names(data_dictionary_MFD)
## [1] "care.virtue" "care.vice" "fairness.virtue" "fairness.vice"
## [5] "loyalty.virtue" "loyalty.vice" "authority.virtue" "authority.vice"
## [9] "sanctity.virtue" "sanctity.vice"
# looking at words within first key
data_dictionary_MFD$care.virtue[1:10]
## [1] "alleviate" "alleviated" "alleviates" "alleviating" "alleviation"
## [6] "altruism" "altruist" "beneficence" "beneficiary" "benefit"
# applying dictionary
# 1) collapse by account
grouped_dfm <- dfm_group(tdfm, groups = docvars(tdfm, "screen_name"))
# 2) turn words into proportions
twdfm <- dfm_weight(grouped_dfm, scheme="prop")
# 3) apply dictionary
moral <- dfm_lookup(twdfm, dictionary = data_dictionary_MFD)
# are liberals more sensitive to care and fairness?
moral[,'care.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
## features
## docs care.virtue
## BernieSanders 0.7870826
## HillaryClinton 0.8155123
## realDonaldTrump 0.3183471
## tedcruz 0.3060718
moral[,'fairness.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
## features
## docs fairness.virtue
## BernieSanders 0.27200648
## HillaryClinton 0.39438084
## realDonaldTrump 0.09166414
## tedcruz 0.16505440
# are conservatives more sensitive to sanctity and authority?
moral[,'sanctity.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
## features
## docs sanctity.virtue
## BernieSanders 0.1064876
## HillaryClinton 0.1146456
## realDonaldTrump 0.1151995
## tedcruz 0.2569293
moral[,'authority.virtue']*100
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 1 docvar.
## features
## docs authority.virtue
## BernieSanders 0.5000289
## HillaryClinton 0.5747566
## realDonaldTrump 0.6125394
## tedcruz 0.7227139