When faced with a new corpus of social media text whose characteristics are unknown, it’s a good idea to start by conducting some descriptive analysis to understand how documents are similar or different.
Let’s learn some of those techniques with our previous example, which contains tweets by the four leading candidates in the 2016 U.S. presidential primaries.
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("../data/candidate-tweets.csv", stringsAsFactors=F)
# extract month data and subset only data during campaign
tweets$month <- substr(tweets$datetime, 1, 7)
tweets <- tweets[tweets$month>"2015-06",]
# create corpus object
twcorpus <- corpus(tweets)
Before we start our analysis, let’s pause for a second to discuss the “preprocessing” steps we need to take before we can analyze the data. We’ll use the quanteda package here.
The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.
twcorpus <- corpus(tweets)
summary(twcorpus, n=10)
## Corpus consisting of 22022 documents, showing 10 documents:
##
## Text Types Tokens Sentences screen_name datetime
## 1618 15 19 1 realDonaldTrump 2015-07-01 00:21:55
## 1619 11 11 1 realDonaldTrump 2015-07-01 00:52:18
## 1620 24 28 2 realDonaldTrump 2015-07-01 17:21:53
## 1621 24 29 2 realDonaldTrump 2015-07-01 18:38:27
## 1622 25 25 1 realDonaldTrump 2015-07-01 18:59:51
## 1623 24 25 2 realDonaldTrump 2015-07-01 19:00:19
## 1624 20 21 2 realDonaldTrump 2015-07-01 21:33:55
## 1625 24 27 2 realDonaldTrump 2015-07-01 21:45:29
## 1626 24 27 2 realDonaldTrump 2015-07-01 22:14:37
## 1627 21 23 2 realDonaldTrump 2015-07-01 23:09:35
## source
## <a href="http://www.twitter.com" rel="nofollow">Twitter for BlackBerry</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://instagram.com" rel="nofollow">Instagram</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## lang month
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
## en 2015-07
##
## Source: /Users/pablobarbera/git/eitm/code/* on x86_64 by pablobarbera
## Created: Fri Jul 6 12:35:43 2018
## Notes:
A very useful feature of corpus objects is keywords-in-context (kwic), which returns all the appearances of a word (or combination of words) in its immediate context.
kwic(twcorpus, "immigration", window=10)[1:5,]
##
## [1620, 19] We must have strong borders& amp; stop illegal |
## [1622, 11] Those who believe in tight border security, stopping illegal |
## [1623, 24] are weak on border security& amp; stopping illegal |
## [1632, 8] Make our borders strong and stop illegal |
## [1666, 5] TRUMP DECLARES VICTORY ON |
##
## immigration | now! https:// t.co/ HLCboRmTbl
## immigration | & amp; SMART trade deals w/ other countries
## immigration | .
## immigration | . Even President Obama agrees- https://
## IMMIGRATION | AS OBAMA ADMITS SOME ILLEGALS ARE GANG BANGERS http:
kwic(twcorpus, "healthcare", window=10)[1:5,]
##
## [2546, 6] RT@ericbolling:@realDonaldTrump on | Healthcare |
## [4228, 8] Just left Virginia where I unveiled my | healthcare |
## [6065, 22] , home not worth what I paid for it, | healthcare |
## [6133, 27] We will save$' s and have much better | healthcare |
## [6292, 5] I was asked about | healthcare |
##
## .." repeal and replace Obamacare"..
## and other plans for our great Veterans! They will
## is a joke Obama is a liar. TRUMP 2016
## !
## by Anderson Cooper& amp; have been consistent-
kwic(twcorpus, "clinton", window=10)[1:5,]
##
## [1657, 10] Via@trscoop: Mark Levin DEFENDS Trump: Hillary |
## [1707, 25] You know he's not doing it to enrich himself like |
## [1845, 10] Via@businessinsider by@hunterw: TRUMP UNLOADS: Hillary |
## [1950, 8] Can you envision Jeb Bush or Hillary |
## [1985, 4] Response to Hillary |
##
## Clinton | is a CROOK and a FRAUD and shes not treated
## Clinton |
## Clinton | was' the worst' and is' extremely bad
## Clinton | negotiating with' El Chapo', the Mexican drug
## Clinton | - http:// t.co/ nzYfehyURa
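As noted above, kwic also works for combinations of words, which just need to be wrapped in phrase(). A quick illustration (the query is chosen arbitrarily; output not shown here):
head(kwic(twcorpus, phrase("illegal immigration"), window=10), n=5)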
We can then convert a corpus into a document-feature matrix using the dfm function.
twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 22,022 documents, 36,423 features
## ... created a 22,022 x 36,423 sparse dfm
## ... complete.
## Elapsed time: 1.56 seconds.
twdfm
## Document-feature matrix of: 22,022 documents, 36,423 features (99.9% sparse).
The dfm will show the count of times each word appears in each document (tweet):
twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (76% sparse).
## 5 x 10 sparse Matrix of class "dfm"
## features
## docs via @thesharktank1 : " donald trump's controversial mexican
## 1618 1 1 2 2 1 1 1 1
## 1619 0 0 0 0 0 0 0 0
## 1620 0 0 2 0 0 0 0 0
## 1621 0 0 2 0 0 0 0 0
## 1622 0 0 0 0 0 0 0 0
## features
## docs comments are
## 1618 1 1
## 1619 0 0
## 1620 0 0
## 1621 0 0
## 1622 0 0
dfm has many useful options (check out ?dfm for more information). Let’s actually use it to stem the text, extract n-grams, remove punctuation, and keep Twitter features:
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 22,022 documents, 399,616 features
## ... stemming features (English)
## , trimmed 17477 feature variants
## ... created a 22,022 x 382,139 sparse dfm
## ... complete.
## Elapsed time: 22 seconds.
twdfm
## Document-feature matrix of: 22,022 documents, 382,139 features (100% sparse).
Note that here we use ngrams=1:3, which will extract all combinations of one, two, and three consecutive words (e.g. “human”, “rights”, and “human rights” will all appear as features in the matrix).
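To see what this means in practice, here is a minimal sketch on a made-up sentence (the sentence is purely illustrative and not part of the corpus):
# toy example: n-grams of order 1 to 3 from a short invented sentence
toy <- tokens("we must protect human rights")
tokens_ngrams(toy, n=1:3)
# among the resulting tokens are "human", "rights", and "human_rights"
# (quanteda joins the words of an n-gram with "_")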
In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
## - in fewer than 3 documents: 340,886
## Total features removed: 340,886 (89.2%).
twdfm
## Document-feature matrix of: 22,022 documents, 41,253 features (99.9% sparse).
It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
What is going on? We probably want to remove words and symbols that are not of interest to our analysis, such as http here. Words that act as common connectors in a given language and carry little meaning on their own (e.g. “a”, “the”, “is”) are called stopwords. We can also see the most frequent features using topfeatures:
topfeatures(twdfm, 25)
## the to rt in a u and of
## 11928 9972 6742 6294 6128 5926 5604 5055
## for is on you we i it @tedcruz
## 4974 4414 3787 3555 3544 3129 2738 2580
## that be with will at this 2026 u_2026
## 2337 2292 2089 2075 2043 2030 1982 1982
## our
## 1897
We can remove the stopwords when we create the dfm object:
twdfm <- dfm(twcorpus, remove_punct=TRUE,
             remove=c(stopwords("english"), "t.co", "https", "rt", "amp",
                      "http", "t.c", "can", "u"),
             remove_url=TRUE, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 22,022 documents, 22,971 features
## ... removed 178 features
## ... created a 22,022 x 22,793 sparse dfm
## ... complete.
## Elapsed time: 1.3 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
Keyness is a measure of the extent to which some features are specific to a (group of) document(s) in comparison to the rest of the corpus, taking into account that some features may be rare overall.
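To build some intuition first, here is a tiny invented example (two made-up documents, not the tweets), where textstat_keyness compares how often each feature occurs in the target document relative to the rest:
# toy illustration of keyness with two invented documents
toy <- dfm(corpus(c(d1 = "apples apples apples oranges pears",
                    d2 = "pears grapes grapes bananas bananas")))
textstat_keyness(toy, target = "d1", measure = "chi2")
Applying this to the candidates, we group all tweets by author and compare each candidate against the other three: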
twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
head(textstat_keyness(twdfm, target="realDonaldTrump",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="HillaryClinton",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="tedcruz",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="BernieSanders",
measure="chi2"), n=20)
twdfm <- dfm(twcorpus, groups=c("screen_name"), remove_punct=TRUE,
remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target="realDonaldTrump",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="HillaryClinton",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="tedcruz",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="BernieSanders",
measure="chi2"), n=20)
trump <- corpus_subset(twcorpus, screen_name=="realDonaldTrump")
twdfm <- dfm(trump, remove_punct=TRUE,
remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month<"2016-01",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month>"2016-03",
measure="chi2"), n=20)
clinton <- corpus_subset(twcorpus, screen_name=="HillaryClinton")
twdfm <- dfm(clinton, remove_punct=TRUE,
remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month<"2016-01",
measure="chi2"), n=20)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month>"2016-03",
measure="chi2"), n=20)
We can use textplot_xray to visualize where some words appear in the corpus.
trump <- paste(
tweets$text[tweets$screen_name=="realDonaldTrump"], collapse=" ")
textplot_xray(kwic(trump, "hillary"), scale="relative")
textplot_xray(kwic(trump, "crooked"), scale="relative")
textplot_xray(kwic(trump, "mexic*"), scale="relative")
textplot_xray(kwic(trump, "fake"), scale="relative")
textplot_xray(kwic(trump, "immigr*"), scale="relative")
textplot_xray(kwic(trump, "muslim*"), scale="relative")
clinton <- paste(
tweets$text[tweets$screen_name=="HillaryClinton"], collapse=" ")
textplot_xray(kwic(clinton, "trump"), scale="relative")
textplot_xray(kwic(clinton, "sanders"), scale="relative")
textplot_xray(kwic(clinton, "gun*"), scale="relative")
We can identify documents that are similar to one another based on the frequency of words, using textstat_simil. There are different metrics to compute similarity; here we explore two of them: Jaccard similarity and cosine similarity.
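For intuition, both measures can be reproduced by hand on two tiny made-up count vectors (the numbers below are invented for illustration; quanteda’s own implementation operates on the rows of the dfm):
# invented feature counts for two short documents
a <- c(2, 0, 1, 3, 0)
b <- c(1, 1, 0, 2, 0)
# cosine similarity: dot product divided by the product of the vector norms
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# Jaccard similarity, in its classic binary form: features present in both
# documents divided by features present in at least one of them
sum(a > 0 & b > 0) / sum(a > 0 | b > 0)
Applying textstat_simil to a dfm grouped by candidate: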
twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 22,022 documents, 36,423 features
## ... grouping texts
## ... created a 4 x 36,423 sparse dfm
## ... complete.
## Elapsed time: 1.36 seconds.
docnames(twdfm)
## [1] "BernieSanders" "HillaryClinton" "realDonaldTrump" "tedcruz"
textstat_simil(twdfm, margin="documents", method="jaccard")
## BernieSanders HillaryClinton realDonaldTrump
## HillaryClinton 0.2007540
## realDonaldTrump 0.1704245 0.1587010
## tedcruz 0.1455431 0.1393852 0.1584593
textstat_simil(twdfm, margin="documents", method="cosine")
## BernieSanders HillaryClinton realDonaldTrump
## HillaryClinton 0.9115943
## realDonaldTrump 0.8754789 0.8391635
## tedcruz 0.8782221 0.9307062 0.7829703
And the opposite: term similarity, based on the frequency with which terms appear in the same documents:
twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 22,022 documents, 36,423 features
## ... created a 22,022 x 36,423 sparse dfm
## ... complete.
## Elapsed time: 1.35 seconds.
# term similarities
simils <- textstat_simil(twdfm, "america", margin="features", method="cosine")
# most similar features
df <- data.frame(
featname = rownames(simils),
simil = as.numeric(simils),
stringsAsFactors=F
)
head(df[order(simils, decreasing=TRUE),], n=5)
## featname simil
## 119 america 1.0000000
## 141 again 0.3868689
## 185 make 0.3202199
## 121 great 0.2750211
## 20757 reignite 0.2111842
# another example
simils <- textstat_simil(twdfm, "immigration", margin="features", method="cosine")
# most similar features
df <- data.frame(
featname = rownames(simils),
simil = as.numeric(simils),
stringsAsFactors=F
)
head(df[order(simils, decreasing=TRUE),], n=5)
## featname simil
## 42 immigration 1.0000000
## 41 illegal 0.4369491
## 12710 comprehensive 0.2698504
## 3336 reform 0.2561071
## 87 weak 0.2152661
We can then use these distances to create a dendrogram that can help us cluster documents.
twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 22,022 documents, 36,423 features
## ... grouping texts
## ... created a 4 x 36,423 sparse dfm
## ... complete.
## Elapsed time: 1.32 seconds.
# compute distances
distances <- textstat_dist(twdfm, margin="documents")
as.matrix(distances)
## BernieSanders HillaryClinton realDonaldTrump tedcruz
## BernieSanders 0.000 8643.537 7253.074 23081.70
## HillaryClinton 8643.537 0.000 9546.685 17224.74
## realDonaldTrump 7253.074 9546.685 0.000 22754.95
## tedcruz 23081.702 17224.742 22754.952 0.00
# clustering
cluster <- hclust(distances)
plot(cluster)
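If we wanted a discrete grouping rather than the full tree, base R’s cutree can cut the dendrogram at a chosen number of clusters, for example:
# cut the dendrogram into two groups of candidates (illustrative choice of k)
cutree(cluster, k=2)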