### Preprocessing text with quanteda

As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several "preprocessing" steps so that it can be passed to a statistical model. We'll use the [quanteda](https://github.com/kbenoit/quanteda) package here.

The basic unit of work for the `quanteda` package is called a `corpus`, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use `summary` to get some information about your corpus.

```{r}
library(quanteda)
library(streamR)
tweets <- parseTweets("../data/trump-tweets.json")
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
```

A very useful feature of corpus objects is _keywords in context_, which returns all the appearances of a word (or combination of words) in its immediate context.

```{r}
kwic(twcorpus, "immigration", window=10)[1:5,]
kwic(twcorpus, "healthcare", window=10)[1:5,]
kwic(twcorpus, "clinton", window=10)[1:5,]
```

We can then convert a corpus into a document-feature matrix using the `dfm` function.

```{r}
twdfm <- dfm(twcorpus, verbose=TRUE)
twdfm
```

The `dfm` will show the count of times each word appears in each document (tweet):

```{r}
twdfm[1:5, 1:10]
```

`dfm` has many useful options (check out `?dfm` for more information). Let's actually use it to stem the text, extract n-grams, remove punctuation, keep Twitter features...

```{r}
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct=TRUE,
             remove_url=TRUE, ngrams=1:3, verbose=TRUE)
twdfm
```

Note that here we use `ngrams` -- this will extract all combinations of one, two, and three words (e.g. it will consider "human", "rights", and "human rights" all as tokens in the matrix).

Stemming relies on the `SnowballC` package's implementation of the Porter stemmer:

```{r}
example <- tolower(tweets$text[1])
tokens(example)
tokens_wordstem(tokens(example))
```

In a large corpus like this, many features often appear in only one or two documents. In some cases it's a good idea to remove those features, to speed up the analysis or because they're not relevant. We can `trim` the dfm:

```{r}
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
twdfm
```

It's often a good idea to take a look at a wordcloud of the most frequent features to see if there's anything weird.

```{r}
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
```

What is going on? We probably want to remove words and symbols that are not of interest to our analysis, such as "http" here. Words that are common connectors in a given language (e.g. "a", "the", "is") but carry little meaning on their own are called stopwords. We can inspect the most frequent features with `topfeatures`:

```{r}
topfeatures(twdfm, 25)
```

We can remove the stopwords when we create the `dfm` object:

```{r}
twdfm <- dfm(twcorpus, remove_punct=TRUE,
             remove=c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"),
             remove_url=TRUE, verbose=TRUE)
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
```
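
Stopwords can also be dropped after the dfm has been created, rather than at creation time. Below is a minimal sketch of that alternative, assuming the `twdfm` object from the chunks above; it uses `dfm_remove()` and then re-checks the most frequent features with `topfeatures()`.

```{r}
# Sketch: drop stopwords and Twitter-specific tokens from an existing dfm
# (dfm_remove() deletes the matching features; the dfm itself is otherwise unchanged)
twdfm <- dfm_remove(twdfm, c(stopwords("english"), "t.co", "https", "rt", "amp", "http"))
# Re-inspect the most frequent features to confirm the noise is gone
topfeatures(twdfm, 25)
```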