A common and simple type of automated text analysis is the application of dictionary methods, or lexicon-based approaches, to the measurement of tone or the prediction of different categories related to the content of the text.
One of these applications is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document.
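The idea can be sketched with a toy example (the words below are illustrative, not drawn from the actual lexicon we load next): the score for a document is simply the number of its tokens that match the positive list minus the number that match the negative list.

```r
# toy lexicon and a toy tokenized document (illustrative only)
pos <- c("great", "win")
neg <- c("sad", "terrible")
doc <- c("what", "a", "great", "terrible", "great", "day")

# sentiment score = positive matches minus negative matches
score <- sum(doc %in% pos) - sum(doc %in% neg)
score  # 2 positive ("great" twice) minus 1 negative ("terrible") = 1
```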
Let’s apply this technique to tweets by the four leading candidates in the 2016 Presidential primaries, which I collected from Twitter’s REST API.
library(quanteda)
## quanteda version 0.99.22
## Using 3 of 4 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv('../data/candidate-tweets.csv', stringsAsFactors=F)
# loading lexicon of positive and negative words (from Neal Caren)
lexicon <- read.csv("../data/lexicon.csv", stringsAsFactors=F)
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
## [1] "miraculous" "authentic" "placid" "poignant" "sensational"
## [6] "loyalty" "goodwill" "ingenuously" "intimate" "pacifists"
sample(neg.words, 10)
## [1] "hallucination" "glib" "contamination" "tepid"
## [5] "deteriorate" "truant" "bleak" "infringements"
## [9] "audacious" "accursed"
We will use the quanteda package to convert our text to a corpus object and detect whether each document mentions the words in the dictionary.
twcorpus <- corpus(tweets$text)
# first we construct a dictionary object
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
# apply it to our corpus
sent <- dfm(twcorpus, dictionary = mydict)
# and add the net score as a new variable: positive counts (column 2)
# minus negative counts (column 1)
tweets$score <- as.numeric(sent[,2]) - as.numeric(sent[,1])
We’re now ready to start analyzing the results:
# what is the average sentiment score?
mean(tweets$score)
## [1] 0.4887226
# what is the most positive and most negative tweet?
tweets[which.max(tweets$score),]
## screen_name
## 22177 tedcruz
## text
## 22177 We will restore our spirit. We will free our minds & imagination. We will create a better world. We will bring back jobs, freedom & security
## datetime
## 22177 2016-04-20 02:42:28
## source
## 22177 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## lang score
## 22177 en 9
tweets[which.min(tweets$score),]
## screen_name
## 14386 tedcruz
## text
## 14386 You can't win a war against radical Islamic terrorism with an Admin thats unwilling to utter the words radical Islamic terrorism" #Opp4All
## datetime
## 14386 2015-01-12 22:41:14
## source
## 14386 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
## lang score
## 14386 en -8
# what is the proportion of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
##
## negative neutral positive
## 4044 10524 12167
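The same breakdown reads more easily as proportions, which `prop.table()` computes directly from the table above (values rounded to three decimals):

```r
# share of negative, neutral, and positive tweets
round(prop.table(table(tweets$sentiment)), 3)
##
## negative  neutral positive
##    0.151    0.394    0.455
```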
We can also disaggregate by groups of tweets, for example according to the candidate who posted them.
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")
for (cand in candidates){
  message(cand, " -- average sentiment: ",
          round(mean(tweets$score[tweets$screen_name==cand]), 4))
}
## realDonaldTrump -- average sentiment: 0.5872
## HillaryClinton -- average sentiment: 0.4282
## tedcruz -- average sentiment: 0.4961
## BernieSanders -- average sentiment: 0.3727
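The loop above can also be written as a single call to base R's `aggregate()`, which computes the group means directly and returns them as a data frame (same numbers, different presentation):

```r
# average sentiment score per candidate, as a data frame
aggregate(score ~ screen_name, data = tweets, FUN = mean)
```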
This raises a somewhat interesting question: what happens when we replicate the sentiment analysis above excluding the word “great”?
# remove word "great" from dictionary
lexicon <- lexicon[-which(lexicon$word=="great"),]
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
twcorpus <- corpus(tweets$text)
# first we construct a dictionary object
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
# apply it to our corpus
sent <- dfm(twcorpus, dictionary = mydict)
# and add the net score as a new variable: positive counts (column 2)
# minus negative counts (column 1)
tweets$score <- as.numeric(sent[,2]) - as.numeric(sent[,1])
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")
for (cand in candidates){
  message(cand, " -- average sentiment: ",
          round(mean(tweets$score[tweets$screen_name==cand]), 4))
}
## realDonaldTrump -- average sentiment: 0.4392
## HillaryClinton -- average sentiment: 0.4093
## tedcruz -- average sentiment: 0.4681
## BernieSanders -- average sentiment: 0.3607