Dictionary methods

A different type of keyword analysis consists of applying dictionary methods, or lexicon-based approaches, to measure the tone of a text or to predict different categories related to its content.

The most common application is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document, here the number of positive matches minus the number of negative matches.
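
As a minimal, self-contained sketch of that computation (using made-up word lists and a made-up sentence, not the lexicon we load below):

# toy word lists and a toy sentence (illustrative only)
toy.pos <- c("great", "wonderful", "better")
toy.neg <- c("terrible", "awful", "worse")
toy.words <- strsplit(tolower("A great rally, but a terrible, awful debate"), "[^a-z]+")[[1]]
# score = number of positive matches minus number of negative matches (here 1 - 2 = -1)
sum(toy.words %in% toy.pos) - sum(toy.words %in% toy.neg)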

Let’s apply this technique to tweets by the four leading candidates in the 2016 Presidential primaries.

library(quanteda)
## quanteda version 0.9.9.65
## Using 3 of 4 cores for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
tweets <- read.csv('data/candidate-tweets.csv', stringsAsFactors=F)
# loading lexicon of positive and negative words (from Neal Caren)
lexicon <- read.csv("data/lexicon.csv", stringsAsFactors=F)
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
##  [1] "soften"       "wide-ranging" "awe"          "punctual"    
##  [5] "indulgent"    "verifiable"   "enchanted"    "adherent"    
##  [9] "great"        "manageable"
sample(neg.words, 10)
##  [1] "misconceptions" "decrease"       "battle-lines"   "incorrigibly"  
##  [5] "inexplainable"  "inappropriate"  "underlings"     "dishonesty"    
##  [9] "unpredictable"  "deteriorating"

As we did earlier today, we will convert our text to a corpus object.

twcorpus <- corpus(tweets$text)
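
As a quick sanity check, the corpus should contain one document per tweet (ndoc() reports the number of documents in a quanteda corpus):

# number of documents in the corpus -- should equal nrow(tweets)
ndoc(twcorpus)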

Now we’re ready to run the sentiment analysis!

# first we construct a dictionary object
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
# apply it to our corpus
sent <- dfm(twcorpus, dictionary = mydict)
# and add the score as a new variable: positive matches (column 2) minus negative matches (column 1)
tweets$score <- as.numeric(sent[,2]) - as.numeric(sent[,1])
# what is the average sentiment score?
mean(tweets$score)
## [1] 0.4908921
# what are the most positive and most negative tweets?
tweets[which.max(tweets$score),]
##       screen_name
## 22177     tedcruz
##                                                                                                                                                       text
## 22177 We will restore our spirit. We will free our minds &amp; imagination. We will create a better world. We will bring back jobs, freedom &amp; security
##                  datetime
## 22177 2016-04-20 02:42:28
##                                                                   source
## 22177 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##       lang score
## 22177   en     9
tweets[which.min(tweets$score),]
##       screen_name
## 14386     tedcruz
##                                                                                                                                             text
## 14386 You can't win a war against radical Islamic terrorism with an Admin thats unwilling to utter the words radical Islamic terrorism" #Opp4All
##                  datetime
## 14386 2015-01-12 22:41:14
##                                                                                    source
## 14386 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
##       lang score
## 14386   en    -8
# what is the proportion of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
## 
## negative  neutral positive 
##     4062    10475    12198
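
The table above reports raw counts; to express them as shares of all tweets, we can wrap the same call in prop.table() (a small addition to the script above):

# same breakdown, expressed as proportions rather than counts
round(prop.table(table(tweets$sentiment)), 3)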

We can also disaggregate by groups of tweets, for example according to the candidate who sent them.

# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")

for (cand in candidates){
  message(cand, " -- average sentiment: ",
      round(mean(tweets$score[tweets$screen_name==cand]), 4)
    )
}
## realDonaldTrump -- average sentiment: 0.5883
## HillaryClinton -- average sentiment: 0.4276
## tedcruz -- average sentiment: 0.4994
## BernieSanders -- average sentiment: 0.3781
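
The same breakdown can be computed without an explicit loop, for example with tapply() (an equivalent alternative, not part of the original script):

# average sentiment score by candidate, without a loop
round(tapply(tweets$score, tweets$screen_name, mean), 4)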

One important note: dictionary methods can be very sensitive to specific words that appear very often. Let's see one example with the word "great".
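
Before removing it from the dictionary, a quick (and admittedly crude) check counts how many of each candidate's tweets contain the word "great" in their raw text:

# number of tweets per candidate whose raw text contains the word "great"
has.great <- grepl("\\bgreat\\b", tweets$text, ignore.case=TRUE)
tapply(has.great, tweets$screen_name, sum)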

# remove word "great" from dictionary
lexicon <- lexicon[-which(lexicon$word=="great"),]
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
# construct dictionary object again
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
# apply it to our corpus
sent <- dfm(twcorpus, dictionary = mydict)
# and recompute the score: positive matches minus negative matches
tweets$score <- as.numeric(sent[,2]) - as.numeric(sent[,1])
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")

for (cand in candidates){
  message(cand, " -- average sentiment: ",
      round(mean(tweets$score[tweets$screen_name==cand]), 4)
    )
}
## realDonaldTrump -- average sentiment: 0.44
## HillaryClinton -- average sentiment: 0.4086
## tedcruz -- average sentiment: 0.4714
## BernieSanders -- average sentiment: 0.3662
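
Removing a single word changes the averages noticeably: Donald Trump's drops from 0.59 to 0.44, the largest change among the four candidates. To look for other high-frequency dictionary words that might be driving the scores, one option (a sketch using the same quanteda calls as above plus dfm_select() and topfeatures(), not part of the original script) is:

# most frequent lexicon words in the corpus (a quick diagnostic)
dfmat <- dfm(twcorpus)
topfeatures(dfm_select(dfmat, c(pos.words, neg.words)), 20)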