Exploring large-scale text datasets

When faced with a new corpus of social media text whose characteristics are unknown, it’s a good idea to start by conducting some descriptive analysis to understand how documents are similar or different.

Let’s learn some of these techniques using our previous example, a dataset containing tweets by the four leading candidates in the 2016 Presidential primaries.

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
tweets <- read.csv("../data/candidate-tweets.csv", stringsAsFactors=F)
# extract month and keep only tweets sent during the campaign
tweets$month <- substr(tweets$datetime, 1, 7)
tweets <- tweets[tweets$month>"2015-06",]
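
As a quick sanity check, we can count how many tweets from each candidate remain after subsetting:

table(tweets$screen_name)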

Preprocessing text with quanteda

Before we start our analysis, let’s pause for a second to discuss the “preprocessing” steps we need to take to prepare the data. We’ll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

twcorpus <- corpus(tweets)
summary(twcorpus, n=10)
## Corpus consisting of 22022 documents, showing 10 documents:
## 
##  Text Types Tokens Sentences     screen_name            datetime
##  1618    15     19         1 realDonaldTrump 2015-07-01 00:21:55
##  1619    11     11         1 realDonaldTrump 2015-07-01 00:52:18
##  1620    24     28         2 realDonaldTrump 2015-07-01 17:21:53
##  1621    24     29         2 realDonaldTrump 2015-07-01 18:38:27
##  1622    25     25         1 realDonaldTrump 2015-07-01 18:59:51
##  1623    24     25         2 realDonaldTrump 2015-07-01 19:00:19
##  1624    20     21         2 realDonaldTrump 2015-07-01 21:33:55
##  1625    24     27         2 realDonaldTrump 2015-07-01 21:45:29
##  1626    24     27         2 realDonaldTrump 2015-07-01 22:14:37
##  1627    21     23         2 realDonaldTrump 2015-07-01 23:09:35
##                                                                      source
##  <a href="http://www.twitter.com" rel="nofollow">Twitter for BlackBerry</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##                 <a href="http://instagram.com" rel="nofollow">Instagram</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##          <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##  lang   month
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
##    en 2015-07
## 
## Source: /Users/pablobarbera/git/eitm/code/* on x86_64 by pablobarbera
## Created: Fri Jul  6 12:35:43 2018
## Notes:
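
The document-level metadata shown above (screen_name, datetime, source, lang, and the month variable we created) can also be accessed directly with docvars:

head(docvars(twcorpus))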

A very useful feature of corpus objects is keywords-in-context search (the kwic function), which returns all the appearances of a word (or combination of words) in its immediate context.

kwic(twcorpus, "immigration", window=10)[1:5,]
##                                                                           
##  [1620, 19]               We must have strong borders& amp; stop illegal |
##  [1622, 11] Those who believe in tight border security, stopping illegal |
##  [1623, 24]           are weak on border security& amp; stopping illegal |
##   [1632, 8]                     Make our borders strong and stop illegal |
##   [1666, 5]                                    TRUMP DECLARES VICTORY ON |
##                                                                    
##  immigration | now! https:// t.co/ HLCboRmTbl                      
##  immigration | & amp; SMART trade deals w/ other countries         
##  immigration | .                                                   
##  immigration | . Even President Obama agrees- https://             
##  IMMIGRATION | AS OBAMA ADMITS SOME ILLEGALS ARE GANG BANGERS http:
kwic(twcorpus, "healthcare", window=10)[1:5,]
##                                                                  
##   [2546, 6]     RT@ericbolling:@realDonaldTrump on | Healthcare |
##   [4228, 8] Just left Virginia where I unveiled my | healthcare |
##  [6065, 22]   , home not worth what I paid for it, | healthcare |
##  [6133, 27]  We will save$' s and have much better | healthcare |
##   [6292, 5]                      I was asked about | healthcare |
##                                                   
##  .." repeal and replace Obamacare"..              
##  and other plans for our great Veterans! They will
##  is a joke Obama is a liar. TRUMP 2016            
##  !                                                
##  by Anderson Cooper& amp; have been consistent-
kwic(twcorpus, "clinton", window=10)[1:5,]
##                                                                     
##  [1657, 10]         Via@trscoop: Mark Levin DEFENDS Trump: Hillary |
##  [1707, 25]      You know he's not doing it to enrich himself like |
##  [1845, 10] Via@businessinsider by@hunterw: TRUMP UNLOADS: Hillary |
##   [1950, 8]                   Can you envision Jeb Bush or Hillary |
##   [1985, 4]                                    Response to Hillary |
##                                                         
##  Clinton | is a CROOK and a FRAUD and shes not treated  
##  Clinton |                                              
##  Clinton | was' the worst' and is' extremely bad        
##  Clinton | negotiating with' El Chapo', the Mexican drug
##  Clinton | - http:// t.co/ nzYfehyURa
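
To search for a multi-word expression rather than a single word, we can wrap the pattern in phrase():

head(kwic(twcorpus, phrase("illegal immigration"), window=10), n=5)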

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 22,022 documents, 36,423 features
##    ... created a 22,022 x 36,423 sparse dfm
##    ... complete. 
## Elapsed time: 1.56 seconds.
twdfm
## Document-feature matrix of: 22,022 documents, 36,423 features (99.9% sparse).

The dfm shows the number of times each word appears in each document (tweet):

twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (76% sparse).
## 5 x 10 sparse Matrix of class "dfm"
##       features
## docs   via @thesharktank1 : " donald trump's controversial mexican
##   1618   1              1 2 2      1       1             1       1
##   1619   0              0 0 0      0       0             0       0
##   1620   0              0 2 0      0       0             0       0
##   1621   0              0 2 0      0       0             0       0
##   1622   0              0 0 0      0       0             0       0
##       features
## docs   comments are
##   1618        1   1
##   1619        0   0
##   1620        0   0
##   1621        0   0
##   1622        0   0
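
We can also index the dfm by feature name to look up the counts of specific words:

twdfm[1:5, c("donald", "mexican")]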

dfm has many useful options (check out ?dfm for more information). Let’s use some of them to stem the text, extract n-grams, remove punctuation and URLs, and keep Twitter features such as hashtags and handles:

twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 22,022 documents, 399,616 features
##    ... stemming features (English)
## , trimmed 17477 feature variants
##    ... created a 22,022 x 382,139 sparse dfm
##    ... complete. 
## Elapsed time: 22 seconds.
twdfm
## Document-feature matrix of: 22,022 documents, 382,139 features (100% sparse).

Note that here we use ngrams: this will extract all combinations of one, two, and three words (e.g. it will consider “human”, “rights”, and “human rights” as separate features in the matrix).
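
As a quick illustration, here is the same n-gram construction applied to a toy sentence using tokens_ngrams, which joins words with "_" by default:

toks <- tokens("we must protect human rights")
tokens_ngrams(toks, n=1:3)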

In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 documents: 340,886
##   Total features removed: 340,886 (89.2%).
twdfm
## Document-feature matrix of: 22,022 documents, 41,253 features (99.9% sparse).
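
We could also trim by overall feature frequency rather than document frequency. A sketch, assuming a quanteda version (1.3 or later) where dfm_trim accepts a min_termfreq argument; we assign to a new object so the dfm above is left unchanged:

# keep only features that appear at least 5 times in the whole corpus
twdfm_freq <- dfm_trim(twdfm, min_termfreq=5, verbose=TRUE)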

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)

What is going on? We probably want to remove words and symbols that are not of interest to our analysis, such as http here. We also see many stopwords: words that are common connectors in a given language (e.g. “a”, “the”, “is”) and carry little meaning on their own. We can inspect the most frequent features with topfeatures:

topfeatures(twdfm, 25)
##      the       to       rt       in        a        u      and       of 
##    11928     9972     6742     6294     6128     5926     5604     5055 
##      for       is       on      you       we        i       it @tedcruz 
##     4974     4414     3787     3555     3544     3129     2738     2580 
##     that       be     with     will       at     this     2026   u_2026 
##     2337     2292     2089     2075     2043     2030     1982     1982 
##      our 
##     1897
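
The stopword list that quanteda uses for a given language can be inspected directly:

head(stopwords("english"), 20)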

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct=TRUE, remove_url=TRUE,
             remove=c(stopwords("english"), "t.co", "https", "rt", "amp",
                      "http", "t.c", "can", "u"),
             verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 22,022 documents, 22,971 features
##    ... removed 178 features
##    ... created a 22,022 x 22,793 sparse dfm
##    ... complete. 
## Elapsed time: 1.3 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)

Identifying the most distinctive features of documents

Keyness measures the extent to which some features are specific to a document (or group of documents) in comparison to the rest of the corpus, taking into account that some features may be too rare to be informative. To compare the four candidates, we first group the tweets by author using the groups argument of dfm:

twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)

head(textstat_keyness(twdfm, target="realDonaldTrump",
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="HillaryClinton",
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="tedcruz",
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="BernieSanders",
                      measure="chi2"), n=20)

Stopwords and punctuation can crowd out more interesting features, so let’s repeat the exercise after removing them:

twdfm <- dfm(twcorpus, groups=c("screen_name"), remove_punct=TRUE,
             remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target="realDonaldTrump",
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="HillaryClinton",
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="tedcruz",
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="BernieSanders",
                      measure="chi2"), n=20)
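
We can also visualize keyness results with textplot_keyness, which plots the most distinctive features of the target group against the reference group; a minimal sketch, where n controls how many features are shown on each side:

tstat <- textstat_keyness(twdfm, target="realDonaldTrump", measure="chi2")
textplot_keyness(tstat, n=20)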

We can also apply keyness within a single candidate, comparing their tweets from early in the campaign to those sent later on:

trump <- corpus_subset(twcorpus, screen_name=="realDonaldTrump")
twdfm <- dfm(trump, remove_punct=TRUE,
             remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month<"2016-01", 
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month>"2016-03", 
                      measure="chi2"), n=20)

And the same over-time comparison for Hillary Clinton:

clinton <- corpus_subset(twcorpus, screen_name=="HillaryClinton")
twdfm <- dfm(clinton, remove_punct=TRUE,
             remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month<"2016-01", 
                      measure="chi2"), n=20)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month>"2016-03", 
                      measure="chi2"), n=20)

We can use textplot_xray to visualize where some words appear in the corpus. Here we first collapse each candidate’s tweets into a single long document, so that position in the plot roughly corresponds to when the word was used:

trump <- paste(
  tweets$text[tweets$screen_name=="realDonaldTrump"], collapse=" ")
textplot_xray(kwic(trump, "hillary"), scale="relative")

textplot_xray(kwic(trump, "crooked"), scale="relative")

textplot_xray(kwic(trump, "mexic*"), scale="relative")

textplot_xray(kwic(trump, "fake"), scale="relative")

textplot_xray(kwic(trump, "immigr*"), scale="relative")

textplot_xray(kwic(trump, "muslim*"), scale="relative")

clinton <- paste(
  tweets$text[tweets$screen_name=="HillaryClinton"], collapse=" ")
textplot_xray(kwic(clinton, "trump"), scale="relative")

textplot_xray(kwic(clinton, "sanders"), scale="relative")

textplot_xray(kwic(clinton, "gun*"), scale="relative")
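
Several kwic objects can also be passed to textplot_xray at once to compare keywords in a single plot:

textplot_xray(
  kwic(clinton, "trump"),
  kwic(clinton, "sanders"),
  scale="relative")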

Clustering documents and features

We can identify documents that are similar to one another based on the frequency of words, using textstat_simil. There are different metrics to compute similarity; here we explore two of them: Jaccard and cosine similarity.

twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 22,022 documents, 36,423 features
##    ... grouping texts
##    ... created a 4 x 36,423 sparse dfm
##    ... complete. 
## Elapsed time: 1.36 seconds.
docnames(twdfm)
## [1] "BernieSanders"   "HillaryClinton"  "realDonaldTrump" "tedcruz"
textstat_simil(twdfm, margin="documents", method="jaccard")
##                 BernieSanders HillaryClinton realDonaldTrump
## HillaryClinton      0.2007540                               
## realDonaldTrump     0.1704245      0.1587010                
## tedcruz             0.1455431      0.1393852       0.1584593
textstat_simil(twdfm, margin="documents", method="cosine")
##                 BernieSanders HillaryClinton realDonaldTrump
## HillaryClinton      0.9115943                               
## realDonaldTrump     0.8754789      0.8391635                
## tedcruz             0.8782221      0.9307062       0.7829703
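
Similarities computed on raw counts can be dominated by a handful of very frequent words. One common variant is to weight the dfm by tf-idf before computing similarities; a sketch using dfm_tfidf (twdfm_w is just a new name, so the grouped dfm above stays unchanged):

twdfm_w <- dfm_tfidf(twdfm)
textstat_simil(twdfm_w, margin="documents", method="cosine")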

And the reverse: we can compute the similarity between features (terms), based on the frequency with which they appear across documents:

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 22,022 documents, 36,423 features
##    ... created a 22,022 x 36,423 sparse dfm
##    ... complete. 
## Elapsed time: 1.35 seconds.
# term similarities
simils <- textstat_simil(twdfm, "america", margin="features", method="cosine")
# most similar features
df <- data.frame(
  featname = rownames(simils),
  simil = as.numeric(simils),
  stringsAsFactors=F
)
head(df[order(simils, decreasing=TRUE),], n=5)
##       featname     simil
## 119    america 1.0000000
## 141      again 0.3868689
## 185       make 0.3202199
## 121      great 0.2750211
## 20757 reignite 0.2111842
# another example
simils <- textstat_simil(twdfm, "immigration", margin="features", method="cosine")
# most similar features
df <- data.frame(
  featname = rownames(simils),
  simil = as.numeric(simils),
  stringsAsFactors=F
)
head(df[order(simils, decreasing=TRUE),], n=5)
##            featname     simil
## 42      immigration 1.0000000
## 41          illegal 0.4369491
## 12710 comprehensive 0.2698504
## 3336         reform 0.2561071
## 87             weak 0.2152661

We can then use these distances to create a dendrogram that can help us cluster documents.

twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 22,022 documents, 36,423 features
##    ... grouping texts
##    ... created a 4 x 36,423 sparse dfm
##    ... complete. 
## Elapsed time: 1.32 seconds.
# compute distances
distances <- textstat_dist(twdfm, margin="documents")
as.matrix(distances)
##                 BernieSanders HillaryClinton realDonaldTrump  tedcruz
## BernieSanders           0.000       8643.537        7253.074 23081.70
## HillaryClinton       8643.537          0.000        9546.685 17224.74
## realDonaldTrump      7253.074       9546.685           0.000 22754.95
## tedcruz             23081.702      17224.742       22754.952     0.00
# clustering
cluster <- hclust(distances)
plot(cluster)
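
Note that raw counts partly reflect how much each candidate tweets, so the distances above are sensitive to differences in volume. A variant worth trying is to normalize each row into proportions before computing distances; a sketch, assuming a quanteda version where dfm_weight accepts scheme="prop" (the *_prop names are ours):

# normalize rows so each document's counts sum to 1
twdfm_prop <- dfm_weight(twdfm, scheme="prop")
dist_prop <- textstat_dist(twdfm_prop, margin="documents")
# as.dist() ensures an object that hclust accepts
cluster_prop <- hclust(as.dist(as.matrix(dist_prop)))
plot(cluster_prop)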