String manipulation with R

We will start with basic string manipulation with R.

Our running example will be a random sample of 10,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We’ll save the text of these tweets as a vector called `text`:

tweets <- read.csv("data/EP-elections-tweets.csv", stringsAsFactors=F)
text <- tweets$text

R stores strings in character vectors. length returns the number of elements in the vector, while nchar returns the number of characters in each element.

length(text)
## [1] 10000
text[1]
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."
nchar(text[1])
## [1] 132

Note that we can work with multiple strings at once.

nchar(text[1:10])
##  [1] 132 102  54  30  43  79 140 137  94 139
sum(nchar(text[1:10]))
## [1] 950
max(nchar(text[1:10]))
## [1] 140
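
As a minimal sketch of this vectorised behaviour, the example below uses a small made-up vector (toy is hypothetical and not part of the tweets data), so the results can be checked by hand:

```r
# Hypothetical toy vector, used only for illustration
toy <- c("vote", "Brexit now", "")

n_items <- length(toy)         # number of elements in the vector
n_chars <- nchar(toy)          # one character count per element
total   <- sum(n_chars)        # total characters across all elements
longest <- toy[which.max(n_chars)]  # the element with the most characters
```

Note that the empty string still counts as an element of the vector, even though it has zero characters.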

We can merge different strings into one using paste:

paste(text[1], text[2], sep='--')
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.--@Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?"
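
paste is also vectorised, and it distinguishes between sep (placed between the arguments) and collapse (used to join the elements of a vector into a single string). A short sketch with made-up names:

```r
# Hypothetical example vectors
first <- c("Nigel", "Nick")
last  <- c("Farage", "Clegg")

full   <- paste(first, last)               # element-wise; sep = " " by default
joined <- paste(full, collapse = "; ")     # one string from the whole vector
handle <- paste0("@", first, "_", last)    # paste0 is paste with sep = ""
```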

Character vectors can be compared using the == and %in% operators:

## [1] TRUE
"DavidCoburnUKip" %in% tweets$screen_name
## [1] TRUE

As we will see later, it is often convenient to convert all words to lowercase or uppercase.

tolower(text[1])
## [1] "@nsinclairemep knew that lib dems getting into bed with tories would end like this. they might never get another bite of the cherry."

We can grab substrings with substr. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.

substr(text[1], 1, 2)
## [1] "@N"
substr(text[1], 1, 10)
## [1] "@NSinclair"

This is useful when working with date strings as well:

dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"
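
substr works here because every date has the same fixed layout. An alternative sketch, assuming the year/month/day format shown above, is to parse the strings with as.Date and then format the pieces you need:

```r
dates <- c("2015/01/01", "2014/12/01")

# Parse using the layout of the strings above (year/month/day)
parsed <- as.Date(dates, format = "%Y/%m/%d")

years  <- format(parsed, "%Y")   # four-digit year as a string
months <- format(parsed, "%m")   # two-digit month as a string
```

Parsed dates can also be sorted and compared, which is harder to do reliably with raw strings.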

We can split up strings by a separator using strsplit. If we choose space as the separator, this is in most cases equivalent to splitting into words.

strsplit(text[1], " ")
## [[1]]
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dems"           "getting"        "into"           "bed"           
##  [9] "with"           "Tories"         "would"          "end"           
## [13] "like"           "this."          "They"           "might"         
## [17] "never"          "get"            "another"        "bite"          
## [21] "of"             "the"            "cherry."
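
strsplit returns a list with one element per input string, and the separator is treated as a regular expression. The sketch below, on two made-up strings, shows why splitting on a single space is only approximately the same as splitting into words:

```r
# Hypothetical strings; note the double space in the first one
toy <- c("They might  never get another bite.", "hi Steven")

by_space <- strsplit(toy, " ")              # a double space yields an empty token
by_blank <- strsplit(toy, "[[:space:]]+")   # one or more whitespace characters

n_space <- lengths(by_space)   # tokens per string, including empty ones
n_blank <- lengths(by_blank)   # tokens per string, split on whitespace runs
```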

Let’s dig into the data a little bit more. Given the construction of the dataset, we can expect that there will be many tweets mentioning the names of the candidates, such as @Nigel_Farage. We can use the grep command to identify these. grep returns the indices of the elements in which the pattern occurs.

grep('@Nigel_Farage', text[1:10])
## [1]  3  6  9 10

grepl returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.

grepl('@Nigel_Farage', text[1:10])
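
The relationship between grep and grepl can be sketched on a small made-up vector (toy is hypothetical). The value argument of grep returns the matching strings themselves rather than their positions:

```r
# Hypothetical toy tweets
toy <- c("@Nigel_Farage hello", "no handle here", "thanks @Nigel_Farage")

idx <- grep("@Nigel_Farage", toy)                # positions of matching elements
hit <- grepl("@Nigel_Farage", toy)               # one TRUE/FALSE per element
txt <- grep("@Nigel_Farage", toy, value = TRUE)  # the matching strings

n_hits <- sum(hit)   # TRUE counts as 1, so this is the number of matches
```

Because grepl always returns a vector of the same length as its input, it is usually the more convenient choice for creating new columns in a data frame.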

Going back to the full dataset, we can use the results of grep to get particular rows. First, check how many tweets mention the handle “@Nigel_Farage”.

nrow(tweets)
## [1] 10000
grep('@Nigel_Farage', tweets$text[1:10])
## [1]  3  6  9 10
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512

It is important to note that matching is case-sensitive. You can use the ignore.case argument to match regardless of case.

nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512
length(grep('@Nigel_Farage', tweets$text, ignore.case = TRUE))
## [1] 1535
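
A small sketch of the same idea on made-up handles (toy is hypothetical). Lowercasing the text first is an equivalent approach for plain strings like these:

```r
# Hypothetical handles with different capitalisation
toy <- c("@Nigel_Farage", "@nigel_farage", "@NIGEL_FARAGE", "@UKIP")

exact   <- sum(grepl("@Nigel_Farage", toy))                       # case-sensitive
anycase <- sum(grepl("@Nigel_Farage", toy, ignore.case = TRUE))   # any capitalisation
lowered <- sum(grepl("@nigel_farage", tolower(toy), fixed = TRUE))  # lowercase first
```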

Regular expressions

Another useful tool for working with text data is the regular expression (see ?regex in R for the supported syntax). Regular expressions let us develop complicated rules for both matching strings and extracting elements from them.

For example, we could look at tweets that mention any of several handles using the operator “|” (equivalent to “OR”):

nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage|@UKIP', tweets$text, ignore.case=TRUE))
## [1] 1739

We can also use question marks to indicate optional characters.

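nrow(tweets)
## [1] 10000
length(grep('MEP?', tweets$text, ignore.case=TRUE))
## [1] 4461

The question mark applies only to the character immediately before it, so this pattern matches “ME” optionally followed by “P”. It therefore matches MEP and MEPs, but also any tweet that contains “me”, which likely explains the high count. To make only the final “s” optional, the pattern would be 'MEPs?'.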
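
The behaviour of the question mark can be checked on a handful of made-up strings (toy is hypothetical):

```r
# Hypothetical strings to test the ? quantifier
toy <- c("MEP", "MEPs", "ME", "some text", "MP")

p_optional <- grepl("MEP?", toy)    # "P" is optional, so "ME" alone matches
s_optional <- grepl("MEPs?", toy)   # "s" is optional, "MEP" is required
loose      <- grepl("MEP?", toy, ignore.case = TRUE)  # "some" contains "me"
```

Here p_optional is TRUE for the first three strings, s_optional only for the first two, and loose additionally matches "some text".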

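Other common expression patterns are: “.” matches any character; “^” and “$” match the beginning and the end of a string; “*” and “+” match the preceding character zero or more times and one or more times; “[0-9]” and “[A-Za-z]” match any digit and any letter; and special characters such as “.” or “(” must be preceded by a backslash (written as a double backslash inside an R string) to be matched literally.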

For example, how many tweets are direct replies to @Nigel_Farage? How many tweets are retweets? How many tweets mention any username?

length(grep('^@Nigel_Farage', tweets$text, ignore.case=TRUE))
## [1] 376
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 47
length(grep('@[A-Za-z0-9]+ ', tweets$text, ignore.case=TRUE))
## [1] 7834
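
To see what each of these patterns does and does not catch, here is a sketch on four made-up tweets (toy is hypothetical). The last two lines compare the username pattern used above with a variant that allows underscores and does not require a trailing space:

```r
# Hypothetical toy tweets
toy <- c("@Nigel_Farage well said",
         "RT @UKIP: vote",
         "I like @LBC radio",
         "no mention")

reply   <- grepl("^@Nigel_Farage", toy)    # starts with the handle
retweet <- grepl("^RT @", toy)             # starts with "RT @"
strict  <- grepl("@[A-Za-z0-9]+ ", toy)    # letters/digits only, then a space
mention <- grepl("@[A-Za-z0-9_]+", toy)    # also allows "_", no space needed
```

In this toy data the stricter pattern finds only the third tweet, because the handles in the first two contain an underscore or are followed by a colon. The count reported above may therefore understate the number of tweets with a mention.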

Another function that we will use is gsub, which replaces a pattern (or a regular expression) with another string:

gsub('@[0-9_A-Za-z]+', 'USERNAME', text[1])
## [1] "USERNAME Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."

To extract a pattern, and not just replace it, wrap the part you want in parentheses and refer to that captured group as "\\1" in the replacement:

gsub('.*@([0-9_A-Za-z]+) .*', '\\1', text[1])
## [1] "NSinclaireMEP"
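
One caveat with this approach is that when the pattern does not match, the input string is returned unchanged. A minimal sketch on made-up strings (toy is hypothetical, and the pattern below drops the trailing space required above) that returns NA in that case:

```r
# Hypothetical strings: one with a handle, one without
toy <- c("thanks @LBC for the call", "no handle here")

# Replace the whole string with the captured group
raw <- sub(".*@([0-9_A-Za-z]+).*", "\\1", toy)

# Keep the extracted value only where a handle was actually present
handle <- ifelse(grepl("@[0-9_A-Za-z]+", toy), raw, NA_character_)
```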

You can make this a bit more complex using gregexpr, which finds the location of every match in each string, and then regmatches, which extracts the matched text:

handles <- gregexpr('@([0-9_A-Za-z]+)', text)
handles <- regmatches(text, handles)
handles <- unlist(handles)
head(sort(table(handles), decreasing=TRUE), n=25)
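
The same steps can be followed on a small made-up vector (toy is hypothetical), where the intermediate objects are easy to inspect:

```r
# Hypothetical toy tweets
toy <- c("@UKIP and @Nigel_Farage", "hello", "@UKIP again")

m      <- gregexpr("@[0-9_A-Za-z]+", toy)   # match positions, one entry per string
found  <- regmatches(toy, m)                # list of matched handles per string
counts <- table(unlist(found))              # frequency of each handle overall
ranked <- sort(counts, decreasing = TRUE)   # most frequent handle first
```

Strings without any match contribute an empty character vector, so they simply disappear after unlist.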

Now let’s try to identify which tweets are related to UKIP and try to extract them. How would we do it? First, let’s add a new column to the data frame that has value TRUE for tweets that mention these keywords and FALSE otherwise. Then, we can keep the rows with value TRUE.

tweets$ukip <- grepl('ukip|farage', tweets$text, ignore.case=TRUE)
table(tweets$ukip)
## 
## FALSE  TRUE 
##  6968  3032
ukip.tweets <- tweets[tweets$ukip==TRUE, ]

Preprocessing text with quanteda

As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several “preprocessing” steps before it can be passed to a statistical model. We’ll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## quanteda version
## Using 3 of 4 cores for parallel computing
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##     View
twcorpus <- corpus(tweets$text)

A useful feature of corpus objects is keywords in context, which returns all the appearances of a word (or combination of words) in its immediate context.

kwic(twcorpus, "brexit", window=10)

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Document-feature matrix of: 10,000 documents, 16,513 features (99.9% sparse).
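
To make the idea of a document-feature matrix concrete, here is a base R sketch on two made-up documents. It is not how quanteda builds the matrix internally; the names docs, vocab and toy_dfm are hypothetical and only illustrate the structure: one row per document, one column per feature, and counts in the cells.

```r
# Two hypothetical documents
docs <- c(d1 = "vote ukip vote", d2 = "vote labour")

# Split each document into tokens
toks <- strsplit(docs, " ", fixed = TRUE)
names(toks) <- names(docs)

# The vocabulary is the set of distinct tokens across all documents
vocab <- sort(unique(unlist(toks, use.names = FALSE)))

# Count how often each vocabulary item occurs in each document
counts <- vapply(
  toks,
  function(x) tabulate(match(x, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
toy_dfm <- t(counts)          # documents in rows, features in columns
colnames(toy_dfm) <- vocab

sparsity <- mean(toy_dfm == 0)  # share of cells that are zero
```

With real text, most documents use only a tiny fraction of the vocabulary, which is why the matrix above is reported as 99.9% sparse.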

dfm has many useful options. Let’s actually use it to stem the text, extract n-grams, remove punctuation, keep Twitter features…

twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, ngrams=1:3, verbose=TRUE)

Note that here we use ngrams: this will extract all sequences of one, two, and three words (e.g. it will consider “human”, “rights”, and “human rights” as separate features in the matrix).
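
What extracting n-grams means can be sketched in base R. The helper ngrams_toy below is hypothetical and written only for illustration; it joins adjacent words with an underscore, which is also the default concatenator in quanteda:

```r
# Hypothetical helper: all sequences of n adjacent words
ngrams_toy <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

words <- c("human", "rights", "now")

unigrams <- ngrams_toy(words, 1)
bigrams  <- ngrams_toy(words, 2)
trigrams <- ngrams_toy(words, 3)
```

Even this three-word example produces six features, which shows how quickly the number of columns grows when n-grams are included.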

Stemming relies on the SnowballC package’s implementation of the Porter stemmer.

In a large corpus like this, many features often only appear in one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
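
The logic of trimming by document frequency can be sketched in base R on a small made-up count matrix (m is hypothetical): a feature's document frequency is the number of documents in which it appears at least once, and features below the threshold are dropped.

```r
# Hypothetical count matrix: 3 documents, 3 features
m <- matrix(
  c(2, 0, 1,
    1, 0, 1,
    0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("d", 1:3), c("ukip", "rare", "vote"))
)

# Number of documents in which each feature occurs
docfreq <- colSums(m > 0)

# Keep only features that occur in at least two documents
trimmed <- m[, docfreq >= 2, drop = FALSE]
```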

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)

What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as http here. Words of this kind are called stopwords; the term usually refers to common connectors in a given language (e.g. “a”, “the”, “is”). We can also see the most frequent features using topfeatures:

topfeatures(twdfm, 25)
##          the           to            a          you           of 
##         3731         3090         2259         2136         1950 
##           in         http          and 
##         1950         1863         1744         1744         1718 
##          for            i @nigel_farag           is           it 
##         1706         1616         1535         1475         1452 
##           on         that           be        thank          not 
##         1277         1099          919          849          828 
##          are         have         with         ukip         vote 
##          800          790          719          690          684

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
  stopwords("english"), "", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
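
The dfm calls in this section follow the interface of older quanteda releases. In version 3 and later, to the best of our knowledge, dfm expects a tokens object, options such as stemming, n-grams and feature removal are applied with separate tokens_* functions, and textplot_wordcloud lives in the quanteda.textplots package. A minimal sketch of the equivalent steps under that assumption, using two made-up texts:

```r
# Sketch for quanteda version 3 or later; the two texts are made up.
# The guard lets the script run even where quanteda is not installed.
if (requireNamespace("quanteda", quietly = TRUE)) {
  library(quanteda)

  toy <- c(d1 = "Vote UKIP on Thursday! http://t.co/abc",
           d2 = "The Lib Dems and the Tories")

  # Tokenise first, dropping punctuation and URLs
  toks <- tokens(corpus(toy), remove_punct = TRUE, remove_url = TRUE)
  toks <- tokens_tolower(toks)

  # Remove English stopwords at the token level
  toks <- tokens_remove(toks, stopwords("english"))

  # Build the document-feature matrix from the cleaned tokens
  toy_dfm <- dfm(toks)
  print(topfeatures(toy_dfm, 5))
}
```

Stemming and n-grams would be added in the same way, with tokens_wordstem and tokens_ngrams, before the call to dfm.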

One nice feature of quanteda is that we can easily add metadata to the corpus object.

docvars(twcorpus) <- data.frame(screen_name=tweets$screen_name, polite=tweets$polite)

We can then use this metadata to subset the dataset:

impolite.tweets <- corpus_subset(twcorpus, polite=="impolite")

And then extract the text:

mytexts <- texts(impolite.tweets)

We’ll come back later to this dataset.

Importing text with quanteda

There are different ways to read text into R and create a corpus object with quanteda. We have already seen the most common way, importing the text from a csv file and then adding the metadata, but the readtext package, a companion to quanteda, has a function to help with this:

library(readtext)
tweets <- readtext(file='data/EP-elections-tweets.csv')
twcorpus <- corpus(tweets)

This function will also work with text in multiple files. To do this, we use the readtext command with the “glob” operator “*” to indicate that we want to load multiple files:

myCorpus <- readtext(file='data/inaugural/*.txt')
inaugCorpus <- corpus(myCorpus)