As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several “preprocessing” steps so that it can be passed to a statistical model. We’ll use the quanteda package here.
The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Warning: package 'rjson' was built under R version 3.4.4
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
## Corpus consisting of 3866 documents, showing 10 documents:
##
## Text Types Tokens Sentences
## text1 40 54 3
## text2 20 23 3
## text3 20 22 3
## text4 32 41 4
## text5 48 56 4
## text6 12 14 2
## text7 20 22 2
## text8 29 31 2
## text9 44 50 3
## text10 22 24 2
##
## Source: /Users/pablobarbera/git/ECPR-SC105/code/* on x86_64 by pablobarbera
## Created: Thu Aug 9 11:18:56 2018
## Notes:
A very useful feature of corpus objects is keywords-in-context (kwic), which returns all the appearances of a word (or combination of words) together with its immediate context.
kwic(twcorpus, "immigration", window=10)[1:5,]
##
## [text1, 14] today to hear directly from the AMERICAN VICTIMS of ILLEGAL | IMMIGRATION | . These are the American Citizens permanently separated from their
## [text10, 17] today to hear directly from the AMERICAN VICTIMS of ILLEGAL | IMMIGRATION | . These are the American Citize…
## [text14, 11] .... If this is done, illegal | immigration | will be stopped in it's tracks- and at very
## [text15, 9] HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR | IMMIGRATION | BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON
## [text41, 6] .... Our | Immigration | policy, laughed at all over the world, is
kwic(twcorpus, "healthcare", window=10)[1:5,]
##
## [text46, 17] help to me on Cutting Taxes, creating great new | healthcare | programs at low cost, fighting for Border Security,
## [text182, 37] He is tough on Crime and Strong on Borders, | Healthcare | , the Military and our great Vets. Henry has
## [text507, 48] Warren lines, loves sanctuary cities, bad and expensive | healthcare | ...
## [text530, 6] The American people deserve a | healthcare | system that takes care of them- not one that
## [text554, 27] will be a great Governor with a heavy focus on | HealthCare | and Jobs. His Socialist opponent in November should not
kwic(twcorpus, "clinton", window=10)[1:5,]
##
## [text141, 23] the Bush Dynasty, then I had to beat the | Clinton | Dynasty, and now I…
## [text161, 20] the Bush Dynasty, then I had to beat the | Clinton | Dynasty, and now I have to beat a phony
## [text204, 9] FBI Agent Peter Strzok, who headed the | Clinton | & amp; Russia investigations, texted to his lover
## [text216, 13] :.@jasoninthehouse: All of this started because Hillary | Clinton | set up her private server https:// t.co
## [text252, 10] .... Schneiderman, who ran the | Clinton | campaign in New York, never had the guts to
We can then convert a corpus into a document-feature matrix using the dfm function.
twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 9,930 features
## ... created a 3,866 x 9,930 sparse dfm
## ... complete.
## Elapsed time: 0.332 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 9,930 features (99.7% sparse).
The dfm will show the count of times each word appears in each document (tweet):
twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (72% sparse).
## 5 x 10 sparse Matrix of class "dfm"
## features
## docs we are gathered today to hear directly from the american
## text1 1 3 1 1 1 1 1 2 4 2
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 2 0 0 0 2 0
## text5 0 0 0 0 2 0 0 0 2 0
dfm has many useful options (check out ?dfm for more information). Let’s use it to stem the text, extract n-grams, remove punctuation, and keep Twitter features:
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 128,909 features
## ... stemming features (English)
## , trimmed 5431 feature variants
## ... created a 3,866 x 123,478 sparse dfm
## ... complete.
## Elapsed time: 6.38 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 123,478 features (99.9% sparse).
Note that here we use ngrams: this will extract all combinations of one, two, and three words (e.g. it will consider “human”, “rights”, and “human rights” all as tokens in the matrix).
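Under the hood, n-gram extraction just slides a window over the token sequence and pastes adjacent tokens together (quanteda uses "_" as the default concatenator). A minimal base-R sketch, with a hypothetical make_ngrams helper (not a quanteda function), illustrates the idea:

```r
# Toy version of ngrams=1:2: collect all windows of length 1 and 2
# over the token vector, joining multi-word windows with "_".
make_ngrams <- function(tokens, n) {
  unlist(lapply(n, function(k) {
    if (length(tokens) < k) return(character(0))
    sapply(seq_len(length(tokens) - k + 1),
           function(i) paste(tokens[i:(i + k - 1)], collapse = "_"))
  }))
}
make_ngrams(c("human", "rights", "matter"), 1:2)
## [1] "human"         "rights"        "matter"        "human_rights"
## [5] "rights_matter"
```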
Stemming relies on the SnowballC package’s implementation of the Porter stemmer:
example <- tolower(tweets$text[1])
tokens(example)
## tokens from 1 document.
## text1 :
## [1] "we" "are" "gathered" "today" "to"
## [6] "hear" "directly" "from" "the" "american"
## [11] "victims" "of" "illegal" "immigration" "."
## [16] "these" "are" "the" "american" "citizens"
## [21] "permanently" "separated" "from" "their" "loved"
## [26] "ones" "b" "/" "c" "they"
## [31] "were" "killed" "by" "criminal" "illegal"
## [36] "aliens" "." "these" "are" "the"
## [41] "families" "the" "media" "ignores" "."
## [46] "." "." "https" ":" "/"
## [51] "/" "t.co" "/" "zjxesyacjy"
tokens_wordstem(tokens(example))
## tokens from 1 document.
## text1 :
## [1] "we" "are" "gather" "today" "to"
## [6] "hear" "direct" "from" "the" "american"
## [11] "victim" "of" "illeg" "immigr" "."
## [16] "these" "are" "the" "american" "citizen"
## [21] "perman" "separ" "from" "their" "love"
## [26] "one" "b" "/" "c" "they"
## [31] "were" "kill" "by" "crimin" "illeg"
## [36] "alien" "." "these" "are" "the"
## [41] "famili" "the" "media" "ignor" "."
## [46] "." "." "https" ":" "/"
## [51] "/" "t.co" "/" "zjxesyacji"
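You can also call the Porter stemmer directly via SnowballC (installed whenever quanteda is, since quanteda depends on it); the stems match those produced by tokens_wordstem above:

```r
# Stem a few words from the tweet directly with the SnowballC backend
library(SnowballC)
wordStem(c("gathered", "directly", "victims", "immigration"), language = "english")
## [1] "gather" "direct" "victim" "immigr"
```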
In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
## - in fewer than 3 documents: 112,440
## Total features removed: 112,440 (91.1%).
twdfm
## Document-feature matrix of: 3,866 documents, 11,038 features (99.7% sparse).
It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
What is going on? We probably want to remove words and symbols that are not of interest to our analysis, such as http here. Words that are common connectors in a given language (e.g. “a”, “the”, “is”) and carry little meaning on their own are called stopwords. We can inspect the most frequent features using topfeatures:
topfeatures(twdfm, 25)
## the to and of a in is for on our be will
## 4580 2697 2493 1945 1549 1456 1299 1088 920 894 846 842
## great with are we i that it amp have at you was
## 836 815 793 764 735 733 729 637 573 523 520 492
## they
## 474
We can remove the stopwords when we create the dfm object:
twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), remove_url=TRUE, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 8,456 features
## ... removed 165 features
## ... created a 3,866 x 8,291 sparse dfm
## ... complete.
## Elapsed time: 0.463 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
One of the most common applications of dictionary methods is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document.
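The core idea can be sketched in a few lines of base R, with toy word lists and a hypothetical score_text helper (real dictionaries like the one below also contain wildcard patterns such as “admir*”, which quanteda’s dictionary() expands but this sketch ignores):

```r
# Toy dictionary-based sentiment scoring:
# score = number of positive matches - number of negative matches
pos <- c("great", "love", "kind")
neg <- c("nasty", "angry", "failed")

score_text <- function(text) {
  toks <- unlist(strsplit(tolower(text), "\\W+"))  # crude tokenizer
  sum(toks %in% pos) - sum(toks %in% neg)
}

score_text("Great rally, great crowd!")        # 2
score_text("Nasty, angry, jealous failures!")  # -2
```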
Let’s apply this technique to tweets by the four leading candidates in the 2016 Presidential primaries.
library(quanteda)
tweets <- read.csv('~/data/candidate-tweets.csv', stringsAsFactors=F)
We will use the LIWC dictionary to measure the extent to which these candidates adopted a positive or negative tone during the election campaign. (Note: LIWC is provided here for teaching purposes only and will not be distributed publicly.) LIWC has many other categories, but for now let’s just use positive and negative:
liwc <- read.csv("~/data/liwc-dictionary.csv",
stringsAsFactors = FALSE)
table(liwc$class)
##
## adjective affect anger anxiety cause
## 235 445 46 92 46
## cognition compare differ discrepancy female
## 252 101 46 92 46
## future insight interrogation male negate
## 46 92 47 46 47
## negative number past positive power
## 230 36 123 211 184
## present quant reward risk social
## 138 47 46 46 230
## tentative verb
## 23 329
pos.words <- liwc$word[liwc$class=="positive"]
neg.words <- liwc$word[liwc$class=="negative"]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
## [1] "proudly" "admir*" "kind" "wealthy" "sexy"
## [6] "respect" "excelled" "wellness" "kindly" "excellent"
sample(neg.words, 10)
## [1] "saddest" "ugliest" "annoy" "immoral*" "anxious"
## [6] "anxiously" "fake" "distrust*" "upset" "uncontrol*"
As we did earlier, we convert our text to a corpus object.
twcorpus <- corpus(tweets)
Now we’re ready to run the sentiment analysis! First we will construct a dictionary object.
mydict <- dictionary(list(positive = pos.words,
negative = neg.words))
And now we apply it to the corpus in order to count the number of words that appear in each category:
sent <- dfm(twcorpus, dictionary = mydict)
We can then extract the score and add it to the data frame as a new variable:
tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])
And now we can start answering some descriptive questions…
# what is the average sentiment score?
mean(tweets$score)
## [1] 0.2056106
# what is the most positive and most negative tweet?
tweets[which.max(tweets$score),]
## screen_name
## 3125 realDonaldTrump
## text
## 3125 .@robertjeffress I greatly appreciate your kind words last night on @FoxNews. Have great love for the evangelicals -- great respect for you.
## datetime
## 3125 2015-09-11 19:24:44
## source
## 3125 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## lang score
## 3125 en 5
tweets[which.min(tweets$score),]
## screen_name
## 6642 realDonaldTrump
## text
## 6642 Lindsey Graham is all over T.V., much like failed 47% candidate Mitt Romney. These nasty, angry, jealous failures have ZERO credibility!
## datetime
## 6642 2016-03-07 13:03:59
## source
## 6642 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## lang score
## 6642 en -4
# what is the proportion of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
##
## negative neutral positive
## 1265 19602 5868
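To turn counts like these into proportions, base R’s prop.table() works directly on the table; shown here on a toy vector rather than the full dataset:

```r
# Share of each sentiment label in a small toy sample
x <- c("neutral", "neutral", "positive", "negative", "positive", "neutral")
round(prop.table(table(x)), 2)
## x
## negative  neutral positive
##     0.17     0.50     0.33
```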
We can also disaggregate by groups of tweets, for example according to the candidate who posted them.
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")
for (cand in candidates){
message(cand, " -- average sentiment: ",
round(mean(tweets$score[tweets$screen_name==cand]), 4)
)
}
## realDonaldTrump -- average sentiment: 0.2911
## HillaryClinton -- average sentiment: 0.1736
## tedcruz -- average sentiment: 0.1853
## BernieSanders -- average sentiment: 0.1384
But what happens if we now run the analysis excluding a single word?
pos.words <- pos.words[-which(pos.words=="great")]
mydict <- dictionary(list(positive = pos.words,
negative = neg.words))
sent <- dfm(twcorpus, dictionary = mydict)
tweets$score <- as.numeric(sent[,1]) - as.numeric(sent[,2])
for (cand in candidates){
message(cand, " -- average sentiment: ",
round(mean(tweets$score[tweets$screen_name==cand]), 4)
)
}
## realDonaldTrump -- average sentiment: 0.1431
## HillaryClinton -- average sentiment: 0.1547
## tedcruz -- average sentiment: 0.1573
## BernieSanders -- average sentiment: 0.1265
How would we normalize by text length? (Maybe not necessary here given that tweets have roughly the same length.)
# collapse by account into 4 documents
twdfm <- dfm(twcorpus, groups = "screen_name")
twdfm
## Document-feature matrix of: 4 documents, 43,426 features (66.9% sparse).
# turn word counts into proportions
twdfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30% sparse).
## 4 x 10 sparse Matrix of class "dfm"
## features
## docs rt @geraldorivera : recruit @realdonaldtrump to
## BernieSanders 1018 0 4186 0 11 2407
## HillaryClinton 1449 0 7800 0 33 3389
## realDonaldTrump 607 8 7138 2 2278 2537
## tedcruz 4464 0 18871 3 203 4045
## features
## docs finish that horrid eyesore
## BernieSanders 0 747 0 0
## HillaryClinton 5 561 0 0
## realDonaldTrump 7 714 2 1
## tedcruz 6 429 0 0
twdfm <- dfm_weight(twdfm, scheme="prop")
twdfm[1:4, 1:10]
## Document-feature matrix of: 4 documents, 10 features (30% sparse).
## 4 x 10 sparse Matrix of class "dfm"
## features
## docs rt @geraldorivera : recruit
## BernieSanders 0.010252175 0 0.04215678 0
## HillaryClinton 0.009177857 0 0.04940461 0
## realDonaldTrump 0.003413027 4.498223e-05 0.04013540 1.124556e-05
## tedcruz 0.018250652 0 0.07715234 1.226522e-05
## features
## docs @realdonaldtrump to finish that
## BernieSanders 0.0001107799 0.02424065 0 0.007522962
## HillaryClinton 0.0002090195 0.02146567 3.166962e-05 0.003553332
## realDonaldTrump 0.0128086906 0.01426499 3.935945e-05 0.004014664
## tedcruz 0.0008299468 0.01653761 2.453045e-05 0.001753927
## features
## docs horrid eyesore
## BernieSanders 0 0
## HillaryClinton 0 0
## realDonaldTrump 1.124556e-05 5.622779e-06
## tedcruz 0 0
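What dfm_weight(scheme="prop") is doing is simply dividing each row of the matrix by its row total, so that each document’s features sum to 1. In base-R terms (toy matrix with made-up document and feature names):

```r
# Toy document-feature counts
m <- matrix(c(2, 0, 1,
              1, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("great", "nasty", "win")))
m / rowSums(m)  # proportional weighting: each row now sums to 1
```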
# Apply dictionary using `dfm_lookup()` function:
sent <- dfm_lookup(twdfm, dictionary = mydict)
sent
## Document-feature matrix of: 4 documents, 2 features (0% sparse).
## 4 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## BernieSanders 0.008237995 0.003111908
## HillaryClinton 0.007467697 0.001868508
## realDonaldTrump 0.010553956 0.004486978
## tedcruz 0.007494051 0.001418677
(sent[,1]-sent[,2])*100
## 4 x 1 sparse Matrix of class "dgCMatrix"
## features
## docs positive
## BernieSanders 0.5126088
## HillaryClinton 0.5599189
## realDonaldTrump 0.6066979
## tedcruz 0.6075374