## Exploring large-scale text datasets

When faced with a new corpus of social media text whose characteristics are unknown, it's a good idea to start with some descriptive analysis to understand how documents are similar or different. Let's learn some of these techniques with our previous example containing tweets by the four leading candidates in the 2016 presidential primaries.

```{r}
library(quanteda)
tweets <- read.csv("../data/candidate-tweets.csv", stringsAsFactors=F)
# extract month and keep only tweets sent during the campaign
tweets$month <- substr(tweets$datetime, 1, 7)
tweets <- tweets[tweets$month>"2015-06",]
```

### Preprocessing text with quanteda

Before we start our analysis, let's pause for a second to discuss the "preprocessing" steps we need to take before we can analyze the data. We'll use the [quanteda](https://github.com/kbenoit/quanteda) package here.

The basic unit of work for the `quanteda` package is called a `corpus`, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use `summary` to get some information about your corpus.

```{r}
twcorpus <- corpus(tweets)
summary(twcorpus, n=10)
```

A very useful feature of corpus objects is _keywords in context_, which returns all the appearances of a word (or combination of words) in its immediate context.

```{r}
kwic(twcorpus, "immigration", window=10)[1:5,]
kwic(twcorpus, "healthcare", window=10)[1:5,]
kwic(twcorpus, "clinton", window=10)[1:5,]
```

We can then convert a corpus into a document-feature matrix using the `dfm` function.

```{r}
twdfm <- dfm(twcorpus, verbose=TRUE)
twdfm
```

The `dfm` will show the count of times each word appears in each document (tweet):

```{r}
twdfm[1:5, 1:10]
```

`dfm` has many useful options (check out `?dfm` for more information). Let's actually use some of them to stem the text, extract n-grams, remove punctuation, keep Twitter features...

```{r}
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE,
             remove_url=TRUE, ngrams=1:3, verbose=TRUE)
twdfm
```

Note that here we use n-grams -- this will extract all combinations of one, two, and three words (e.g. it will consider "human", "rights", and "human rights" all as tokens in the matrix).

In a large corpus like this, many features often appear in only one or two documents. In some cases it's a good idea to remove those features, to speed up the analysis or because they're not relevant. We can `trim` the dfm:

```{r}
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
twdfm
```

It's often a good idea to take a look at a wordcloud of the most frequent features to see if there's anything weird.

```{r}
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
```

What is going on? We probably want to remove words and symbols that are not of substantive interest, such as "http" here. Words that are common connectors in a given language (e.g. "a", "the", "is") and carry little meaning on their own are called stopwords.
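To see exactly which words will be dropped, we can take a quick look at the built-in list -- a minimal check using the same `stopwords("english")` vector we pass to `dfm` below.

```{r}
# inspect the built-in English stopword list: how many words, and the first few
length(stopwords("english"))
head(stopwords("english"), 20)
```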
We can also inspect the most frequent features with `topfeatures`:

```{r}
topfeatures(twdfm, 25)
```

We can remove the stopwords (plus a few Twitter-specific tokens) when we create the `dfm` object:

```{r}
twdfm <- dfm(twcorpus, remove_punct = TRUE,
             remove=c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"),
             remove_url=TRUE, verbose=TRUE)
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
```

### Identifying most unique features of documents

_Keyness_ is a measure of the extent to which some features are specific to a (group of) document(s) in comparison to the rest of the corpus, taking into account that some features may be too rare.

```{r, eval=FALSE}
# keyness by candidate, using the raw dfm
twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
head(textstat_keyness(twdfm, target="realDonaldTrump", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="HillaryClinton", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="tedcruz", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="BernieSanders", measure="chi2"), n=20)

# same comparison after removing punctuation and stopwords
twdfm <- dfm(twcorpus, groups=c("screen_name"), remove_punct=TRUE,
             remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target="realDonaldTrump", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="HillaryClinton", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="tedcruz", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target="BernieSanders", measure="chi2"), n=20)

# keyness over time within each candidate's tweets: first tweets sent before 2016,
# then tweets sent after March 2016, each compared to the rest of that candidate's tweets
trump <- corpus_subset(twcorpus, screen_name=="realDonaldTrump")
twdfm <- dfm(trump, remove_punct=TRUE,
             remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month<"2016-01", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month>"2016-03", measure="chi2"), n=20)

clinton <- corpus_subset(twcorpus, screen_name=="HillaryClinton")
twdfm <- dfm(clinton, remove_punct=TRUE,
             remove=c(stopwords("english"), 'rt', 'u', 's'), verbose=TRUE)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month<"2016-01", measure="chi2"), n=20)
head(textstat_keyness(twdfm, target=docvars(twdfm)$month>"2016-03", measure="chi2"), n=20)
```

We can use `textplot_xray` to visualize where some words appear in the corpus.

```{r}
trump <- paste(tweets$text[tweets$screen_name=="realDonaldTrump"], collapse=" ")
textplot_xray(kwic(trump, "hillary"), scale="relative")
textplot_xray(kwic(trump, "crooked"), scale="relative")
textplot_xray(kwic(trump, "mexic*"), scale="relative")
textplot_xray(kwic(trump, "fake"), scale="relative")
textplot_xray(kwic(trump, "immigr*"), scale="relative")
textplot_xray(kwic(trump, "muslim*"), scale="relative")

clinton <- paste(tweets$text[tweets$screen_name=="HillaryClinton"], collapse=" ")
textplot_xray(kwic(clinton, "trump"), scale="relative")
textplot_xray(kwic(clinton, "sanders"), scale="relative")
textplot_xray(kwic(clinton, "gun*"), scale="relative")
```

### Clustering documents and features

We can identify documents that are similar to one another based on the frequency of words, using `textstat_simil`. There are different metrics to compute similarity; here we explore two of them: [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
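Before applying these metrics to the tweet data, here is a toy illustration of what they compute -- a minimal sketch with made-up word counts for two hypothetical documents, not part of our corpus.

```{r}
# hypothetical word counts for two short documents over a vocabulary of 4 words
a <- c(2, 1, 0, 3)
b <- c(1, 1, 1, 0)
# cosine similarity: dot product divided by the product of the vector norms
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# Jaccard similarity: shared vocabulary over combined vocabulary (presence/absence)
sum(a > 0 & b > 0) / sum(a > 0 | b > 0)
```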
```{r}
twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
docnames(twdfm)
textstat_simil(twdfm, margin="documents", method="jaccard")
textstat_simil(twdfm, margin="documents", method="cosine")
```

And the other way around: similarity between terms, based on the frequency with which they appear in the same documents:

```{r}
twdfm <- dfm(twcorpus, verbose=TRUE)
# similarities between "america" and all other features
simils <- textstat_simil(twdfm, "america", margin="features", method="cosine")
# most similar features
df <- data.frame(
  featname = rownames(simils),
  simil = as.numeric(simils),
  stringsAsFactors=F)
head(df[order(simils, decreasing=TRUE),], n=5)

# another example: features most similar to "immigration"
simils <- textstat_simil(twdfm, "immigration", margin="features", method="cosine")
df <- data.frame(
  featname = rownames(simils),
  simil = as.numeric(simils),
  stringsAsFactors=F)
head(df[order(simils, decreasing=TRUE),], n=5)
```

We can also compute distances between documents and use them to build a dendrogram that can help us cluster documents.

```{r}
twdfm <- dfm(twcorpus, groups=c("screen_name"), verbose=TRUE)
# compute distances between documents
distances <- textstat_dist(twdfm, margin="documents")
as.matrix(distances)
# hierarchical clustering
cluster <- hclust(distances)
plot(cluster)
```
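If we want to assign each candidate to a discrete group rather than just inspect the dendrogram, one option is a quick sketch using base R's `cutree` on the `hclust` object created above; the choice of `k=2` here is arbitrary.

```{r}
# cut the dendrogram into k groups and return the cluster membership of each document
cutree(cluster, k=2)
```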