As we discussed earlier, before we can do any type of automated text analysis, the raw text needs to go through several “preprocessing” steps before it can be passed to a statistical model. We’ll use the quanteda package here.
The basic unit of work in quanteda is the corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Warning: package 'rjson' was built under R version 3.4.4
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
## Corpus consisting of 3866 documents, showing 10 documents:
##
## Text Types Tokens Sentences
## text1 40 54 3
## text2 20 23 3
## text3 20 22 3
## text4 32 41 4
## text5 48 56 4
## text6 12 14 2
## text7 20 22 2
## text8 29 31 2
## text9 44 50 3
## text10 22 24 2
##
## Source: /Users/pablobarbera/git/text-analysis-vienna/code/* on x86_64 by pablobarbera
## Created: Tue Oct 16 00:19:00 2018
## Notes:
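If you want to inspect the full text of an individual document, you can extract it with texts() (a quick check; output not shown):
# look at the raw text of the first document in the corpus
texts(twcorpus)[1]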
A very useful feature of corpus objects is keywords-in-context (kwic), which returns every appearance of a word (or combination of words) together with its immediate context.
kwic(twcorpus, "immigration", window=10)[1:5,]
##
##  [text1, 14] today to hear directly from the AMERICAN VICTIMS of ILLEGAL | IMMIGRATION | . These are the American Citizens permanently separated from their
## [text10, 17] today to hear directly from the AMERICAN VICTIMS of ILLEGAL | IMMIGRATION | . These are the American Citize…
## [text14, 11] .... If this is done, illegal | immigration | will be stopped in it's tracks- and at very
##  [text15, 9] HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR | IMMIGRATION | BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON
##  [text41, 6] .... Our | Immigration | policy, laughed at all over the world, is
kwic(twcorpus, "healthcare", window=10)[1:5,]
##
##  [text46, 17] help to me on Cutting Taxes, creating great new | healthcare | programs at low cost, fighting for Border Security,
## [text182, 37] He is tough on Crime and Strong on Borders, | Healthcare | , the Military and our great Vets. Henry has
## [text507, 48] Warren lines, loves sanctuary cities, bad and expensive | healthcare | ...
##  [text530, 6] The American people deserve a | healthcare | system that takes care of them- not one that
## [text554, 27] will be a great Governor with a heavy focus on | HealthCare | and Jobs. His Socialist opponent in November should not
kwic(twcorpus, "clinton", window=10)[1:5,]
##
## [text141, 23] the Bush Dynasty, then I had to beat the | Clinton | Dynasty, and now I…
## [text161, 20] the Bush Dynasty, then I had to beat the | Clinton | Dynasty, and now I have to beat a phony
##  [text204, 9] FBI Agent Peter Strzok, who headed the | Clinton | & amp; Russia investigations, texted to his lover
## [text216, 13] :.@jasoninthehouse: All of this started because Hillary | Clinton | set up her private server https:// t.co
## [text252, 10] .... Schneiderman, who ran the | Clinton | campaign in New York, never had the guts to
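kwic also works with multi-word patterns if you wrap them in phrase(). A quick sketch, using “fake news” purely as an illustrative query (output not shown):
# search for a multi-word expression; phrase() treats it as a sequence of tokens
head(kwic(twcorpus, phrase("fake news"), window=10))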
We can then convert a corpus into a document-feature matrix using the dfm function.
twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 9,930 features
## ... created a 3,866 x 9,930 sparse dfm
## ... complete.
## Elapsed time: 0.235 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 9,930 features (99.7% sparse).
The dfm shows how many times each word appears in each document (tweet):
twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (72% sparse).
## 5 x 10 sparse Matrix of class "dfm"
## features
## docs we are gathered today to hear directly from the american
## text1 1 3 1 1 1 1 1 2 4 2
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 2 0 0 0 2 0
## text5 0 0 0 0 2 0 0 0 2 0
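If the sparse-matrix print is hard to read, a small slice can be converted to an ordinary dense matrix (only do this for small slices; output not shown):
# densify a small slice of the dfm for easier inspection
as.matrix(twdfm[1:5, 1:10])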
dfm has many useful options (check out ?dfm for more information). Let’s use it to stem the text, extract n-grams of length one to three, remove punctuation, drop URLs, and keep Twitter features such as hashtags and mentions:
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 128,909 features
## ... stemming features (English)
## , trimmed 5431 feature variants
## ... created a 3,866 x 123,478 sparse dfm
## ... complete.
## Elapsed time: 5.09 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 123,478 features (99.9% sparse).
Note the ngrams option here: it extracts all combinations of one, two, and three consecutive words, so for example the unigrams “human” and “rights” and the bigram “human rights” (stored as the feature “human_rights”) all become tokens in the matrix.
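To see exactly which n-grams are generated, here is a minimal sketch on a made-up phrase; by default the words of an n-gram are joined with an underscore:
# unigrams, bigrams, and trigrams from a toy example
tokens_ngrams(tokens("human rights are universal"), n=1:3)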
Stemming relies on the SnowballC package’s implementation of the Porter stemmer:
example <- tolower(tweets$text[1])
tokens(example)
## tokens from 1 document.
## text1 :
## [1] "we" "are" "gathered" "today" "to"
## [6] "hear" "directly" "from" "the" "american"
## [11] "victims" "of" "illegal" "immigration" "."
## [16] "these" "are" "the" "american" "citizens"
## [21] "permanently" "separated" "from" "their" "loved"
## [26] "ones" "b" "/" "c" "they"
## [31] "were" "killed" "by" "criminal" "illegal"
## [36] "aliens" "." "these" "are" "the"
## [41] "families" "the" "media" "ignores" "."
## [46] "." "." "https" ":" "/"
## [51] "/" "t.co" "/" "zjxesyacjy"
tokens_wordstem(tokens(example))
## tokens from 1 document.
## text1 :
## [1] "we" "are" "gather" "today" "to"
## [6] "hear" "direct" "from" "the" "american"
## [11] "victim" "of" "illeg" "immigr" "."
## [16] "these" "are" "the" "american" "citizen"
## [21] "perman" "separ" "from" "their" "love"
## [26] "one" "b" "/" "c" "they"
## [31] "were" "kill" "by" "crimin" "illeg"
## [36] "alien" "." "these" "are" "the"
## [41] "famili" "the" "media" "ignor" "."
## [46] "." "." "https" ":" "/"
## [51] "/" "t.co" "/" "zjxesyacji"
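tokens_wordstem uses SnowballC under the hood; assuming the SnowballC package is installed, you can also call the stemmer directly on a character vector:
# stem a few words directly with the Snowball (Porter) English stemmer
SnowballC::wordStem(c("gathered", "permanently", "families"), language="english")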
In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm with dfm_trim:
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
## - in fewer than 3 documents: 112,440
## Total features removed: 112,440 (91.1%).
twdfm
## Document-feature matrix of: 3,866 documents, 11,038 features (99.7% sparse).
It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as “http” here. Another class of words we usually drop are stopwords: very common words that act as connectors in a given language (e.g. “a”, “the”, “is”) but carry little meaning on their own. We can see the most frequent features with topfeatures:
topfeatures(twdfm, 25)
## the to and of a in is for on our be will
## 4580 2697 2493 1945 1549 1456 1299 1088 920 894 846 842
## great with are we i that it amp have at you was
## 836 815 793 764 735 733 729 637 573 523 520 492
## they
## 474
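To see exactly which words count as English stopwords in quanteda, you can inspect the built-in list (output not shown):
# first 20 entries of the English stopword list
head(stopwords("english"), 20)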
We can remove the stopwords, along with a few Twitter-specific tokens, when we create the dfm object:
twdfm <- dfm(twcorpus, remove_punct = TRUE,
             remove=c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"),
             remove_url=TRUE, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 8,456 features
## ... removed 165 features
## ... created a 3,866 x 8,291 sparse dfm
## ... complete.
## Elapsed time: 0.249 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
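As a quick sanity check, the most frequent features of the cleaned dfm should now be substantive words rather than stopwords or URL fragments (output not shown):
# top features after removing stopwords, punctuation, and URLs
topfeatures(twdfm, 25)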