String manipulation with R

We will start with basic string manipulation with R.

Our running example will be a random sample of 10,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We’ll save the text of these tweets as a vector called `text`.

tweets <- read.csv("../data/EP-elections-tweets.csv", stringsAsFactors=F)
head(tweets)
##                                                                                                                                   text
## 1 @NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.
## 2                               @Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?
## 3                                                                               @TrevorWAllman @Nigel_Farage The clock is ticking ....
## 4                                                                                                       @llexanderaamb got the badges.
## 5                                                                                          @Peebi @AnujaPrashar @Angel4theNorth thanks
## 6                                                      Well said @Nigel_Farage , poor Paxo had to change the subject.  #Gettingstuffed
##      screen_name           id   polite
## 1    martinwedge 4.730558e+17 impolite
## 2        WillGav 4.696599e+17   polite
## 3    CathyWood55 4.676486e+17   polite
## 4   CStephenssnp 4.701751e+17   polite
## 5 sanchia4europe 4.693720e+17   polite
## 6    EnglandsAce 4.685071e+17   polite
text <- tweets$text

R stores strings in character vectors. length gives the number of elements in a vector, while nchar gives the number of characters in each string.

length(text)
## [1] 10000
text[1]
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."
nchar(text[1])
## [1] 132

Note that we can work with multiple strings at once.

nchar(text[1:10])
##  [1] 132 102  54  30  43  79 140 137  94 139
sum(nchar(text[1:10]))
## [1] 950
max(nchar(text[1:10]))
## [1] 140

We can merge different strings into one using paste, choosing the separator with the sep argument. The default is to add a space between strings; there’s also paste0, which leaves no space:

paste(text[1], text[2], sep='--')
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.--@Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?"
paste("one", "two")
## [1] "one two"
paste0("one", "two")
## [1] "onetwo"

Character vectors can be compared using the == and %in% operators:

tweets$screen_name[1]=="martinwedge"
## [1] TRUE
"DavidCoburnUKip" %in% tweets$screen_name
## [1] TRUE
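
Both operators are vectorized, so we can compare a whole vector of strings at once. A quick toy example:

fruits <- c("apple", "pear", "apple")
fruits == "apple"
## [1]  TRUE FALSE  TRUE
sum(fruits == "apple") # TRUE counts as 1 when summing
## [1] 2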

For more advanced string manipulation, we will use the stringr library, created by Hadley Wickham, which standardizes most of the techniques we want to employ. For example, this is how we would convert text to lowercase, uppercase, or title case.

library(stringr)
str_to_lower(text[1])
## [1] "@nsinclairemep knew that lib dems getting into bed with tories would end like this. they might never get another bite of the cherry."
str_to_upper(text[1])
## [1] "@NSINCLAIREMEP KNEW THAT LIB DEMS GETTING INTO BED WITH TORIES WOULD END LIKE THIS. THEY MIGHT NEVER GET ANOTHER BITE OF THE CHERRY."
str_to_title(text[1])
## [1] "@Nsinclairemep Knew That Lib Dems Getting Into Bed With Tories Would End Like This. They Might Never Get Another Bite Of The Cherry."

We can grab substrings with str_sub. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.

str_sub(text[1], 1, 2)
## [1] "@N"
str_sub(text[1], 1, 10)
## [1] "@NSinclair"

This is useful when working with date strings as well:

dates <- c("2015/01/01", "2014/12/01")
str_sub(dates, 1, 4) # years
## [1] "2015" "2014"
str_sub(dates, 6, 7) # months
## [1] "01" "12"

We can split up strings by a separator using str_split. If we choose the space as separator, this is in most cases equivalent to splitting into words.

str_split(text[1], " ")
## [[1]]
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dems"           "getting"        "into"           "bed"           
##  [9] "with"           "Tories"         "would"          "end"           
## [13] "like"           "this."          "They"           "might"         
## [17] "never"          "get"            "another"        "bite"          
## [21] "of"             "the"            "cherry."

Let’s dig into the data a little bit more. Given the construction of the dataset, we can expect that there will be many tweets mentioning the names of the candidates, such as @Nigel_Farage. We can use the grep command to identify these. grep returns the indices of the elements where the pattern occurs.

grep('@Nigel_Farage', text[1:10])
## [1]  3  6  9 10

grepl returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.

grepl('@Nigel_Farage', text[1:10])
##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
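
Since TRUE counts as 1 and FALSE as 0, summing the output of grepl gives the number of matching tweets:

sum(grepl('@Nigel_Farage', text[1:10]))
## [1] 4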

Going back to the full dataset, we can use the results of grep to get particular rows. First, check how many tweets mention the handle “@Nigel_Farage”.

nrow(tweets)
## [1] 10000
grep('@Nigel_Farage', tweets$text[1:10])
## [1]  3  6  9 10
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512

It is important to note that matching is case-sensitive by default. You can use the ignore.case argument to make matching case-insensitive.

nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512
length(grep('@Nigel_Farage', tweets$text, ignore.case = TRUE))
## [1] 1535
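
A toy example makes the difference clear:

grepl('ukip', c("UKIP", "Ukip", "ukip"))
## [1] FALSE FALSE  TRUE
grepl('ukip', c("UKIP", "Ukip", "ukip"), ignore.case=TRUE)
## [1] TRUE TRUE TRUE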

Regular expressions

Another useful tool to work with text data is “regular expressions”, which let us develop complicated rules for both matching strings and extracting elements from them. There are many online tutorials and references where you can learn more about them.

For example, we could look at tweets that mention either of two handles using the operator “|” (equivalent to “OR”):

nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage|@UKIP', tweets$text, ignore.case=TRUE))
## [1] 1739

We can also use a question mark to indicate that the preceding character is optional.

nrow(tweets)
## [1] 10000
length(grep('MEP?', tweets$text, ignore.case=TRUE))
## [1] 4461

Since the question mark makes the final P optional, this will match MEP, MEPs, and also just ME (and, with ignore.case=TRUE, “me” anywhere inside a word).
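
We can check this with a few made-up strings:

grepl('MEP?', c("ME", "MEP", "MEPs", "MP"))
## [1]  TRUE  TRUE  TRUE FALSE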

Other common regular expression patterns are:

- ^ matches the beginning of a string; $ matches the end.
- [A-Za-z0-9] is a character class that matches any single letter or digit.
- + matches one or more instances of the preceding expression; * matches zero or more.
- . matches any single character.

For example, how many tweets are direct replies to @Nigel_Farage? How many tweets are retweets? How many tweets mention any username?

length(grep('^@Nigel_Farage', tweets$text, ignore.case=TRUE))
## [1] 376
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 47
length(grep('@[A-Za-z0-9]+ ', tweets$text, ignore.case=TRUE))
## [1] 7834

Another function that we will use is str_replace, which replaces a pattern (or a regular expression) with another string:

str_replace(text[1], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "USERNAME Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."

To extract a pattern, and not just replace it, use str_extract, which returns the first match. If there may be multiple instances, choose str_extract_all instead:

str_extract(text[1], '@[0-9_A-Za-z]+')
## [1] "@NSinclaireMEP"
str_extract_all("one user is @one and another user is @another", '@[0-9_A-Za-z]+')
## [[1]]
## [1] "@one"     "@another"

Here’s a more complex example which we already saw yesterday:

handles <- str_extract_all(text, '@[0-9_A-Za-z]+')
handles <- unlist(handles)
head(sort(table(handles), decreasing=TRUE), n=25)
## handles
##    @Nigel_Farage  @nickgriffinmep            @UKIP @DavidCoburnUKip 
##             1514              375              357              356 
##      @JaniceUKIP  @RogerHelmerMEP    @DanHannanMEP @ClaudeMoraesMEP 
##              241              234              228              123 
##        @SebDance     @marcuschown        @Lucy4MEP @IvanaBartoletti 
##              119              110              102               98 
##  @Michael_Heaver   @TasminaSheikh   @maryhoneyball    @GreenJeanMEP 
##               98               89               88               82 
##   @TheGreenParty        @Tim_Aker     @JimAllister @MEPStandingUp4U 
##               78               72               71               66 
##  @GlenisWillmott  @sanchia4europe        @SLATUKIP  @davidmartinmep 
##               62               62               61               57 
##  @KamaljeetJandu 
##               54
# now with hashtags...
hashtags <- str_extract_all(text, "#(\\d|\\w)+")
hashtags <- unlist(hashtags)
head(sort(table(hashtags), decreasing=TRUE), n=25)
## hashtags
##              #UKIP            #EP2014     #VoteGreen2014 
##                261                146                 75 
##                #EU             #bbcqt              #ukip 
##                 57                 49                 44 
##    #labourdoorstep #europeanelections     #votegreen2014 
##                 40                 20                 20 
##         #VoteLab14   #WhyImVotingUkip            #ep2014 
##                 17                 17                 16 
##        #Eurovision           #indyref          #Vote2014 
##                 15                 15                 15 
##             #bbcsp            #Labour        #votelabour 
##                 14                 14                 14 
##              #TUSC         #votegreen          #voteUKIP 
##                 13                 13                 13 
##            #London        #VoteLabour             #youth 
##                 12                 12                 12 
##               #SNP 
##                 11

Now let’s try to identify which tweets are related to UKIP and extract them. How would we do it? First, let’s add a new column to the data frame that has value TRUE for tweets that mention this keyword and FALSE otherwise. Then, we can keep only the rows with value TRUE.

tweets$ukip <- grepl('ukip|farage', tweets$text, ignore.case=TRUE)
table(tweets$ukip)
## 
## FALSE  TRUE 
##  6968  3032
ukip.tweets <- tweets[tweets$ukip==TRUE, ]
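
As a sanity check, the subset should have exactly as many rows as there were TRUE values in the table above:

nrow(ukip.tweets)
## [1] 3032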

Preprocessing text with quanteda

As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several “preprocessing” steps before it can be passed to a statistical model. We’ll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## quanteda version 0.9.9.65
## Using 3 of 4 cores for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 10000 documents, showing 100 documents.
## 
##     Text Types Tokens Sentences
##    text1    24     25         2
##    text2    22     22         1
##    text3     7     10         1
##    text4     5      5         1
##    text5     4      4         1
##    text6    13     13         2
##    text7    25     26         3
##    text8    21     23         1
##    text9    16     18         2
##   text10    26     29         1
##   text11    22     24         3
##   text12     7      8         1
##   text13    23     24         1
##   text14    17     18         1
##   text15    26     27         3
##   text16    24     26         2
##   text17    20     29         1
##   text18    15     16         1
##   text19    20     22         2
##   text20    20     20         2
##   text21    27     31         2
##   text22    21     23         1
##   text23    10     10         1
##   text24    20     22         2
##   text25    13     13         1
##   text26    12     12         1
##   text27     9     11         1
##   text28     3      3         1
##   text29    16     27         1
##   text30    10     10         2
##   text31     4      4         1
##   text32    22     24         6
##   text33     3      3         1
##   text34     9      9         1
##   text35    12     14         1
##   text36    17     18         2
##   text37    14     14         2
##   text38    20     25         1
##   text39    17     18         1
##   text40    11     12         2
##   text41     2      2         1
##   text42    23     28         2
##   text43    17     17         1
##   text44     7      7         1
##   text45    24     27         1
##   text46    14     15         1
##   text47    22     24         2
##   text48    19     21         1
##   text49    16     17         1
##   text50    15     15         1
##   text51    16     18         2
##   text52     5      5         1
##   text53     9      9         1
##   text54    15     19         1
##   text55    16     21         2
##   text56    13     14         1
##   text57    18     20         1
##   text58    23     31         3
##   text59     9     11         1
##   text60     3      3         1
##   text61    13     15         1
##   text62    20     23         4
##   text63     4      4         1
##   text64    15     16         2
##   text65    21     21         1
##   text66    10     12         1
##   text67    12     13         1
##   text68    24     25         3
##   text69    10     10         1
##   text70    12     12         1
##   text71    12     13         1
##   text72    19     23         1
##   text73     9      9         1
##   text74    12     12         1
##   text75    14     18         1
##   text76    16     16         1
##   text77    15     15         1
##   text78    22     24         1
##   text79    16     19         1
##   text80    12     15         1
##   text81    26     27         2
##   text82    15     16         2
##   text83     9     12         1
##   text84    22     25         1
##   text85    15     15         1
##   text86    10     10         1
##   text87    15     18         1
##   text88    20     25         2
##   text89    27     32         1
##   text90    15     16         1
##   text91    20     23         1
##   text92    19     23         2
##   text93    16     22         2
##   text94    25     29         2
##   text95    20     22         1
##   text96    22     24         4
##   text97    13     17         2
##   text98    16     18         2
##   text99    25     28         3
##  text100    23     26         1
## 
## Source:  /Users/pablobarbera/git/ECPR-SC103/day4/* on x86_64 by pablobarbera
## Created: Thu Aug  3 12:31:53 2017
## Notes:
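
So far the corpus contains only the text of the tweets, but we can also attach the other columns of the data frame as document-level metadata. A minimal sketch, assuming the docvars replacement function works here as in recent quanteda versions:

# attach document-level variables (docvars) to the corpus
docvars(twcorpus, "screen_name") <- tweets$screen_name
docvars(twcorpus, "polite") <- tweets$polite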

A useful feature of corpus objects is keywords in context, which returns all the appearances of a word (or combination of words) in its immediate context.

kwic(twcorpus, "brexit", window=10)
##                                                            
##  [text7905, 7] @2cvdolly1 I'll make the case for | Brexit |
##                                                
##  to the best of my ability. I genuinely believe
kwic(twcorpus, "merkel", window=10)
##                                                                     
##  [text6945, 5]                         @ggbenedetto Good will from |
##  [text9761, 7] @MartinSelmayr@Juncker_JC@sikorskiradek I hope that |
##                                                              
##  Merkel | appears to be in short supply- Juncker nomination a
##  Merkel | + Cameron will see sense on that. cc@jonworth
kwic(twcorpus, "eu referendum", window=10)
##                   
##  [text5316, 18:19]
##  [text5756, 18:19]
##  [text6906, 12:13]
##  [text9038, 12:13]
##                                                                      
##                                  , why wait for 2017 for an in/ out |
##                                  , why wait for 2017 for an in/ out |
##  @Nigel_Farage What happened to#Cameron's cast iron guarantee of an |
##  @Nigel_Farage What happened to#Cameron's cast iron guarantee of an |
##                                                   
##  EU referendum | - If EU renegotiation impossible?
##  EU referendum | - If EU renegotiation impossible?
##  EU referendum | before the last election?#bbcsp  
##  EU referendum | before the last election?#bbcsp

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus ...
##    ... tokenizing texts
##    ... lowercasing
##    ... found 10,000 documents, 16,513 features
##    ... created a 10,000 x 16,513 sparse dfm
##    ... complete. 
## Elapsed time: 0.001 seconds.
twdfm
## Document-feature matrix of: 10,000 documents, 16,513 features (99.9% sparse).

dfm has many useful options. Let’s actually use it to stem the text, extract n-grams, remove punctuation, keep Twitter features…

twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, ngrams=1:3, verbose=TRUE,
             remove_twitter=FALSE)
## Creating a dfm from a corpus ...
##    ... tokenizing texts
##    ... lowercasing
##    ... found 10,000 documents, 154,722 features
## ... stemming features (English)
## , trimmed 5111 feature variants
##    ... created a 10,000 x 149,611 sparse dfm
##    ... complete. 
## Elapsed time: 4.22 seconds.
twdfm
## Document-feature matrix of: 10,000 documents, 149,611 features (100% sparse).

Note that here we use ngrams: this will extract all combinations of one, two, and three words (e.g. it will consider “human”, “rights”, and “human rights” all as separate tokens in the matrix).
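
To see how the n-grams are constructed, we can tokenize a short example ourselves. A small sketch (function names may vary slightly across quanteda versions):

# unigrams and bigrams of a toy sentence; bigrams are joined with "_"
tokens_ngrams(tokens("human rights are universal"), n=1:2)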

Stemming relies on the SnowballC package’s implementation of the Porter stemmer:

tokenize(tweets$text[1])
## tokenizedTexts from 1 document.
## Component 1 :
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dems"           "getting"        "into"           "bed"           
##  [9] "with"           "Tories"         "would"          "end"           
## [13] "like"           "this"           "."              "They"          
## [17] "might"          "never"          "get"            "another"       
## [21] "bite"           "of"             "the"            "cherry"        
## [25] "."
tokens_wordstem(tokenize(tweets$text[1]))
## tokenizedTexts from 1 document.
## Component 1 :
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dem"            "get"            "into"           "bed"           
##  [9] "with"           "Tori"           "would"          "end"           
## [13] "like"           "thi"            "."              "Thei"          
## [17] "might"          "never"          "get"            "anoth"         
## [21] "bite"           "of"             "the"            "cherri"        
## [25] "."
char_wordstem(c("win", "winning", "wins", "won", "winner"))
## [1] "win"    "win"    "win"    "won"    "winner"

Note that the stemmer is rule-based: it strips common suffixes, but does not handle irregular forms such as “won” above. Stemming is also available in multiple languages:

tokens_wordstem(tokenize("esto es un ejemplo"), language="es")
## tokenizedTexts from 1 document.
## Component 1 :
## [1] "esto"   "es"     "un"     "ejempl"
tokens_wordstem(tokenize("ceci est un exemple"), language="fr")
## tokenizedTexts from 1 document.
## Component 1 :
## [1] "cec"    "est"    "un"     "exempl"
tokens_wordstem(tokenize("это пример"), language="ru")
## tokenizedTexts from 1 document.
## Component 1 :
## [1] "эт"     "пример"
tokens_wordstem(tokenize("dies ist ein Beispiel"), language="fr")
## tokenizedTexts from 1 document.
## Component 1 :
## [1] "di"       "ist"      "ein"      "Beispiel"
# full list:
SnowballC::getStemLanguages()
##  [1] "danish"     "dutch"      "english"    "finnish"    "french"    
##  [6] "german"     "hungarian"  "italian"    "norwegian"  "porter"    
## [11] "portuguese" "romanian"   "russian"    "spanish"    "swedish"   
## [16] "turkish"

In a large corpus like this, many features often only appear in one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 document: 132,761
##   Total features removed: 132,761 (88.7%).
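
We can confirm the effect by looking at the dimensions of the trimmed matrix: the 149,611 original features minus the 132,761 removed leaves 16,850.

dim(twdfm) # documents x remaining features
## [1] 10000 16850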

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)

What is going on? We probably want to remove words and symbols which are not of interest to our data, such as the http and t.co fragments left over from URLs. We also see many stopwords: words that are common connectors in a given language (e.g. “a”, “the”, “is”) and carry little meaning on their own. We can see the list of most frequent features using topfeatures:

topfeatures(twdfm, 25)
##          the           to            a          you           of 
##         3731         3090         2259         2136         1950 
##           in         t.co         http    http_t.co          and 
##         1950         1863         1744         1744         1718 
##          for            i @nigel_farag           is           it 
##         1706         1616         1535         1475         1452 
##           on         that           be        thank          not 
##         1277         1099          919          849          828 
##          are         have         with         ukip         vote 
##          800          790          719          690          684
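
quanteda ships with stopword lists for several languages; we can inspect the beginning of the English list before removing these words:

# the list starts with common function words ("i", "me", "my", ...)
head(stopwords("english"), 20)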

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
## Creating a dfm from a corpus ...
##    ... tokenizing texts
##    ... lowercasing
##    ... found 10,000 documents, 16,634 features
## ...
## dfm_select removed 177 features and 0 documents, padding 0s for 0 features and 0 documents.
##    ... created a 10,000 x 16,457 sparse dfm
##    ... complete. 
## Elapsed time: 0.148 seconds.
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)