String manipulation with R

We will start with basic string manipulation in R.

Our running example will be a random sample of 10,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We'll save the text of these tweets as a vector called `text`.

tweets <- read.csv("data/EP-elections-tweets.csv", stringsAsFactors=F)
head(tweets)
##                                                                                                                                   text
## 1 @NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.
## 2                               @Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?
## 3                                                                               @TrevorWAllman @Nigel_Farage The clock is ticking ....
## 4                                                                                                       @llexanderaamb got the badges.
## 5                                                                                          @Peebi @AnujaPrashar @Angel4theNorth thanks
## 6                                                      Well said @Nigel_Farage , poor Paxo had to change the subject.  #Gettingstuffed
##      screen_name           id   polite
## 1    martinwedge 4.730558e+17 impolite
## 2        WillGav 4.696599e+17   polite
## 3    CathyWood55 4.676486e+17   polite
## 4   CStephenssnp 4.701751e+17   polite
## 5 sanchia4europe 4.693720e+17   polite
## 6    EnglandsAce 4.685071e+17   polite
text <- tweets$text

R stores strings in character vectors. length returns the number of elements in the vector, while nchar returns the number of characters in each element.

length(text)
## [1] 10000
text[1]
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."
nchar(text[1])
## [1] 132

Note that we can work with multiple strings at once.

nchar(text[1:10])
##  [1] 132 102  54  30  43  79 140 137  94 139
sum(nchar(text[1:10]))
## [1] 950
max(nchar(text[1:10]))
## [1] 140

We can merge different strings into one using paste:

paste(text[1], text[2], sep='--')
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.--@Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?"

Character vectors can be compared using the == and %in% operators:

tweets$screen_name[1]=="martinwedge"
## [1] TRUE
"DavidCoburnUKip" %in% tweets$screen_name
## [1] TRUE
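
%in% also accepts a vector of values, which makes it easy to count tweets written by any user in a set of handles (the handles here are taken from the rows shown above; the result depends on the full dataset, so no output is shown):

sum(tweets$screen_name %in% c("martinwedge", "WillGav"))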

As we will see later, it is often convenient to convert all words to lowercase or uppercase.

tolower(text[1])
## [1] "@nsinclairemep knew that lib dems getting into bed with tories would end like this. they might never get another bite of the cherry."
toupper(text[1])
## [1] "@NSINCLAIREMEP KNEW THAT LIB DEMS GETTING INTO BED WITH TORIES WOULD END LIKE THIS. THEY MIGHT NEVER GET ANOTHER BITE OF THE CHERRY."

We can grab substrings with substr. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.

substr(text[1], 1, 2)
## [1] "@N"
substr(text[1], 1, 10)
## [1] "@NSinclair"

This is useful when working with date strings as well:

dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"

We can split up strings by a separator using strsplit. If we choose space as the separator, this is in most cases equivalent to splitting into words.

strsplit(text[1], " ")
## [[1]]
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dems"           "getting"        "into"           "bed"           
##  [9] "with"           "Tories"         "would"          "end"           
## [13] "like"           "this."          "They"           "might"         
## [17] "never"          "get"            "another"        "bite"          
## [21] "of"             "the"            "cherry."

Let's dig into the data a little bit more. Given the construction of the dataset, we can expect that there will be many tweets mentioning the names of the candidates, such as @Nigel_Farage. We can use the grep command to identify these: grep returns the indices of the elements of the vector where the pattern occurs.

grep('@Nigel_Farage', text[1:10])
## [1]  3  6  9 10

grepl returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.

grepl('@Nigel_Farage', text[1:10])
##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
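
Since TRUE counts as 1 and FALSE as 0, summing the output of grepl is a quick way of counting matching tweets:

sum(grepl('@Nigel_Farage', text[1:10]))
## [1] 4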

Going back to the full dataset, we can use the results of grep to get particular rows. First, check how many tweets mention the handle “@Nigel_Farage”.

nrow(tweets)
## [1] 10000
grep('@Nigel_Farage', tweets$text[1:10])
## [1]  3  6  9 10
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512

It is important to note that matching is case-sensitive by default. You can use the ignore.case argument to make the match case-insensitive.

nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512
length(grep('@Nigel_Farage', tweets$text, ignore.case = TRUE))
## [1] 1535

Regular expressions

Another useful tool for working with text data is regular expressions. Regular expressions let us develop complicated rules for both matching strings and extracting elements from them; there are many tutorials available online where you can learn more about them.

For example, we could look for tweets that mention either of two handles using the operator "|" (equivalent to "OR"):

nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage|@UKIP', tweets$text, ignore.case=TRUE))
## [1] 1739

We can also use question marks to indicate optional characters.

nrow(tweets)
## [1] 10000
length(grep('MEP?', tweets$text, ignore.case=TRUE))
## [1] 4461

This will match ME followed by an optional P, so it matches MEP and MEPs, but also any occurrence of "me" (which is why the count above is so high). To match MEP or MEPs specifically, MEPs? is a more precise pattern, as the quick check below shows.
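
A minimal check of how the optional quantifier behaves, using made-up strings and the default case-sensitive matching:

grepl('MEPs?', c("MEP", "MEPs", "member"))
## [1]  TRUE  TRUE FALSE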

Other common regular expression patterns include:

- ^ matches the beginning of a string
- $ matches the end of a string
- [A-Za-z0-9] matches any single character inside the brackets (here, any letter or digit)
- + matches one or more occurrences of the preceding expression

For example, how many tweets are direct replies to @Nigel_Farage? How many tweets are retweets? How many tweets mention any username?

length(grep('^@Nigel_Farage', tweets$text, ignore.case=TRUE))
## [1] 376
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 47
length(grep('@[A-Za-z0-9]+ ', tweets$text, ignore.case=TRUE))
## [1] 7834
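
The first two examples anchor the pattern at the beginning of the string with ^; the $ anchor works the same way at the end. For instance, to check which of the first two tweets end with a question mark (the ? must be escaped, since it is a metacharacter):

grepl('\\?$', text[1:2])
## [1] FALSE  TRUE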

Another function that we will use is gsub, which replaces all instances of a pattern (which can be a regular expression) with another string:

gsub('@[0-9_A-Za-z]+', 'USERNAME', text[1])
## [1] "USERNAME Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."

To extract a pattern, and not just replace it, use parentheses around the part you want to capture and set the replacement to "\\1" (a backreference to the first captured group):

gsub('.*@([0-9_A-Za-z]+) .*', text[1], repl="\\1")
## [1] "NSinclaireMEP"

You can extract all matches, not just the first, using gregexpr, which returns the locations of the matches, combined with regmatches:

handles <- gregexpr('@([0-9_A-Za-z]+)', text)
handles <- regmatches(text, handles)
handles <- unlist(handles)
head(sort(table(handles), decreasing=TRUE), n=25)
## handles
##    @Nigel_Farage  @nickgriffinmep            @UKIP @DavidCoburnUKip 
##             1514              375              357              356 
##      @JaniceUKIP  @RogerHelmerMEP    @DanHannanMEP @ClaudeMoraesMEP 
##              241              234              228              123 
##        @SebDance     @marcuschown        @Lucy4MEP @IvanaBartoletti 
##              119              110              102               98 
##  @Michael_Heaver   @TasminaSheikh   @maryhoneyball    @GreenJeanMEP 
##               98               89               88               82 
##   @TheGreenParty        @Tim_Aker     @JimAllister @MEPStandingUp4U 
##               78               72               71               66 
##  @GlenisWillmott  @sanchia4europe        @SLATUKIP  @davidmartinmep 
##               62               62               61               57 
##  @KamaljeetJandu 
##               54
# now with hashtags...
hashtags <- regmatches(text, gregexpr("#(\\d|\\w)+",text))
hashtags <- unlist(hashtags)
head(sort(table(hashtags), decreasing=TRUE), n=25)
## hashtags
##              #UKIP            #EP2014     #VoteGreen2014 
##                261                146                 75 
##                #EU             #bbcqt              #ukip 
##                 57                 49                 44 
##    #labourdoorstep #europeanelections     #votegreen2014 
##                 40                 20                 20 
##         #VoteLab14   #WhyImVotingUkip            #ep2014 
##                 17                 17                 16 
##        #Eurovision           #indyref          #Vote2014 
##                 15                 15                 15 
##             #bbcsp            #Labour        #votelabour 
##                 14                 14                 14 
##              #TUSC         #votegreen          #voteUKIP 
##                 13                 13                 13 
##            #London        #VoteLabour             #youth 
##                 12                 12                 12 
##               #SNP 
##                 11

Now let's try to identify which tweets are related to UKIP and extract them. How would we do it? First, let's create a new column in the data frame that takes the value TRUE for tweets that mention this keyword and FALSE otherwise. Then, we can keep the rows where that value is TRUE.

tweets$ukip <- grepl('ukip|farage', tweets$text, ignore.case=TRUE)
table(tweets$ukip)
## 
## FALSE  TRUE 
##  6968  3032
ukip.tweets <- tweets[tweets$ukip==TRUE, ]
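
A quick sanity check: the new data frame should have as many rows as there were TRUE values in the table above.

nrow(ukip.tweets)
## [1] 3032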

Preprocessing text with quanteda

As we discussed earlier, before we can do any type of automated text analysis, we need to go through several "preprocessing" steps before the text can be passed to a statistical model. We'll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## quanteda version 0.9.9.65
## Using 3 of 4 cores for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 10000 documents, showing 100 documents.
## 
##     Text Types Tokens Sentences
##    text1    24     25         2
##    text2    22     22         1
##    text3     7     10         1
##    text4     5      5         1
##    text5     4      4         1
##    text6    13     13         2
##    text7    25     26         3
##    text8    21     23         1
##    text9    16     18         2
##   text10    26     29         1
##   text11    22     24         3
##   text12     7      8         1
##   text13    23     24         1
##   text14    17     18         1
##   text15    26     27         3
##   text16    24     26         2
##   text17    20     29         1
##   text18    15     16         1
##   text19    20     22         2
##   text20    20     20         2
##   text21    27     31         2
##   text22    21     23         1
##   text23    10     10         1
##   text24    20     22         2
##   text25    13     13         1
##   text26    12     12         1
##   text27     9     11         1
##   text28     3      3         1
##   text29    16     27         1
##   text30    10     10         2
##   text31     4      4         1
##   text32    22     24         6
##   text33     3      3         1
##   text34     9      9         1
##   text35    12     14         1
##   text36    17     18         2
##   text37    14     14         2
##   text38    20     25         1
##   text39    17     18         1
##   text40    11     12         2
##   text41     2      2         1
##   text42    23     28         2
##   text43    17     17         1
##   text44     7      7         1
##   text45    24     27         1
##   text46    14     15         1
##   text47    22     24         2
##   text48    19     21         1
##   text49    16     17         1
##   text50    15     15         1
##   text51    16     18         2
##   text52     5      5         1
##   text53     9      9         1
##   text54    15     19         1
##   text55    16     21         2
##   text56    13     14         1
##   text57    18     20         1
##   text58    23     31         3
##   text59     9     11         1
##   text60     3      3         1
##   text61    13     15         1
##   text62    20     23         4
##   text63     4      4         1
##   text64    15     16         2
##   text65    21     21         1
##   text66    10     12         1
##   text67    12     13         1
##   text68    24     25         3
##   text69    10     10         1
##   text70    12     12         1
##   text71    12     13         1
##   text72    19     23         1
##   text73     9      9         1
##   text74    12     12         1
##   text75    14     18         1
##   text76    16     16         1
##   text77    15     15         1
##   text78    22     24         1
##   text79    16     19         1
##   text80    12     15         1
##   text81    26     27         2
##   text82    15     16         2
##   text83     9     12         1
##   text84    22     25         1
##   text85    15     15         1
##   text86    10     10         1
##   text87    15     18         1
##   text88    20     25         2
##   text89    27     32         1
##   text90    15     16         1
##   text91    20     23         1
##   text92    19     23         2
##   text93    16     22         2
##   text94    25     29         2
##   text95    20     22         1
##   text96    22     24         4
##   text97    13     17         2
##   text98    16     18         2
##   text99    25     28         3
##  text100    23     26         1
## 
## Source:  /Users/pablobarbera/git/big-data-upf/* on x86_64 by pablobarbera
## Created: Wed Jun 28 21:33:43 2017
## Notes:

A useful feature of corpus objects is keywords in context (the kwic function), which returns all the appearances of a word (or combination of words) in its immediate context.

kwic(twcorpus, "brexit", window=10)
##                                                            
##  [text7905, 7] @2cvdolly1 I'll make the case for | Brexit |
##                                                
##  to the best of my ability. I genuinely believe
kwic(twcorpus, "miliband", window=10)
##                
##   [text1072, 2]
##   [text2859, 6]
##   [text3757, 4]
##   [text4300, 3]
##   [text4863, 2]
##   [text5150, 6]
##   [text5251, 4]
##   [text5294, 4]
##  [text5376, 10]
##   [text6042, 1]
##   [text6089, 4]
##   [text6291, 2]
##  [text7458, 12]
##  [text7988, 12]
##  [text8075, 28]
##   [text9746, 2]
##                                                                                                                  
##                                                                                                                Ed
##                                                                                         RT"@steverichards14: It's
##                                                                                                      Watch out Ed
##                                                                                                       Here's what
##                                                                                                                Ed
##                                                                   @petercoles44@DavidCoburnUKip@Ed_Miliband To me
##                                                                                                      Watch out Ed
##                                                                                                      Watch out Ed
##  @PrzSkwirczynski@State_Control@WinstonMcK@croydon_oldtown@suzanneshine@hopenothate@SLATUKIP@SquareBiz_T@bieneosa
##                                                                                                                  
##                                                                                                      Watch out Ed
##                                                                                                                Ed
##                                                  @UKIP_Bolton@Nigel_Farage UKIP ain't racist Mr's Cameron/ Clegg/
##                                                  @UKIP_Bolton@Nigel_Farage UKIP ain't racist Mr's Cameron/ Clegg/
##                                                                           imagine what he can do to cameron& amp;
##                                                                                                                Ed
##                                                                        
##  | Miliband | , please cap gym prices http:// t.co                     
##  | Miliband | , not Farage, who's breaking with tradition and upsetting
##  | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
##  | Miliband | had to say about standing up to UKIP- http               
##  | Miliband | , please cap gym prices http:// t.co                     
##  | Miliband | looks a shifty character with normal people              
##  | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
##  | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
##  | Miliband | apologise!                                               
##  | Miliband | saying he wants to hear where#UKIP stands on key         
##  | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
##  | Miliband | is heading for disaster as Labour MPs say party leaders  
##  | Miliband | Sticks& amp; Stones Two VOTES ukip http:                 
##  | Miliband | Sticks& amp; Stones Two VOTES ukip http:                 
##  | miliband |                                                          
##  | Miliband | is heading for disaster as Labour MPs say party leaders
kwic(twcorpus, "eu referendum", window=10)
##                   
##  [text5316, 18:19]
##  [text5756, 18:19]
##  [text6906, 12:13]
##  [text9038, 12:13]
##                                                                      
##                                  , why wait for 2017 for an in/ out |
##                                  , why wait for 2017 for an in/ out |
##  @Nigel_Farage What happened to#Cameron's cast iron guarantee of an |
##  @Nigel_Farage What happened to#Cameron's cast iron guarantee of an |
##                                                   
##  EU referendum | - If EU renegotiation impossible?
##  EU referendum | - If EU renegotiation impossible?
##  EU referendum | before the last election?#bbcsp  
##  EU referendum | before the last election?#bbcsp

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus ...
##    ... tokenizing texts
##    ... lowercasing
##    ... found 10,000 documents, 16,513 features
##    ... created a 10,000 x 16,513 sparse dfm
##    ... complete. 
## Elapsed time: 0.001 seconds.
twdfm
## Document-feature matrix of: 10,000 documents, 16,513 features (99.9% sparse).

dfm has many useful options. Let's use some of them to stem the text, extract n-grams, remove punctuation, and keep Twitter features:

?dfm
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus ...
##    ... tokenizing texts
##    ... lowercasing
##    ... found 10,000 documents, 154,722 features
## ... stemming features (English)
## , trimmed 5111 feature variants
##    ... created a 10,000 x 149,611 sparse dfm
##    ... complete. 
## Elapsed time: 5.82 seconds.
twdfm
## Document-feature matrix of: 10,000 documents, 149,611 features (100% sparse).

Note that here we use ngrams: this will extract all combinations of one, two, and three words (e.g. it will consider "human", "rights", and "human rights" as separate features in the matrix).
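
As a quick illustration of how n-grams are constructed, using tokens_ngrams on a toy sentence (note that function names may differ slightly across quanteda versions):

tokens_ngrams(tokens("human rights matter"), n=1:2)
# returns the unigrams and bigrams:
# "human" "rights" "matter" "human_rights" "rights_matter"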

Stemming relies on the SnowballC package’s implementation of the Porter stemmer:

tokenize(tweets$text[1])
## tokenizedTexts from 1 document.
## Component 1 :
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dems"           "getting"        "into"           "bed"           
##  [9] "with"           "Tories"         "would"          "end"           
## [13] "like"           "this"           "."              "They"          
## [17] "might"          "never"          "get"            "another"       
## [21] "bite"           "of"             "the"            "cherry"        
## [25] "."
tokens_wordstem(tokenize(tweets$text[1]))
## tokenizedTexts from 1 document.
## Component 1 :
##  [1] "@NSinclaireMEP" "Knew"           "that"           "Lib"           
##  [5] "Dem"            "get"            "into"           "bed"           
##  [9] "with"           "Tori"           "would"          "end"           
## [13] "like"           "thi"            "."              "Thei"          
## [17] "might"          "never"          "get"            "anoth"         
## [21] "bite"           "of"             "the"            "cherri"        
## [25] "."

In a large corpus like this, many features often appear in only one or two documents. In some cases it's a good idea to remove those features, to speed up the analysis or because they're not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 document: 132,761
##   Total features removed: 132,761 (88.7%).

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)

What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as http here. In addition, words that are common connectors in a given language (e.g. "a", "the", "is") but carry little meaning are called stopwords, and are usually removed. We can inspect the most frequent features using topfeatures:

topfeatures(twdfm, 25)
##          the           to            a          you           of 
##         3731         3090         2259         2136         1950 
##           in         t.co         http    http_t.co          and 
##         1950         1863         1744         1744         1718 
##          for            i @nigel_farag           is           it 
##         1706         1616         1535         1475         1452 
##           on         that           be        thank          not 
##         1277         1099          919          849          828 
##          are         have         with         ukip         vote 
##          800          790          719          690          684
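
The stopwords function returns this list for a given language, so we can inspect what would be removed:

head(stopwords("english"))
## [1] "i"      "me"     "my"     "myself" "we"     "our"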

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
## Creating a dfm from a corpus ...
##    ... tokenizing texts
##    ... lowercasing
##    ... found 10,000 documents, 16,634 features
## ...
## dfm_select removed 177 features and 0 documents, padding 0s for 0 features and 0 documents.
##    ... created a 10,000 x 16,457 sparse dfm
##    ... complete. 
## Elapsed time: 0.107 seconds.
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)

One nice feature of quanteda is that we can easily add metadata to the corpus object.

docvars(twcorpus) <- data.frame(screen_name=tweets$screen_name, polite=tweets$polite)
summary(twcorpus)
## Corpus consisting of 10000 documents, showing 100 documents.
## 
##     Text Types Tokens Sentences     screen_name   polite
##    text1    24     25         2     martinwedge impolite
##    text2    22     22         1         WillGav   polite
##    text3     7     10         1     CathyWood55   polite
##    text4     5      5         1    CStephenssnp   polite
##    text5     4      4         1  sanchia4europe   polite
##    text6    13     13         2     EnglandsAce   polite
##    text7    25     26         3   MikeGreenUKIP   polite
##    text8    21     23         1    Anothergreen   polite
##    text9    16     18         2         kell901   polite
##   text10    26     29         1 BranimiraMachev   polite
##   text11    22     24         3      NorseFired   polite
##   text12     7      8         1  CharlesTannock   polite
##   text13    23     24         1    GoodallGiles   polite
##   text14    17     18         1 francisdolarhy2   polite
##   text15    26     27         3    CuinnUiNeill   polite
##   text16    24     26         2   HenryMcMorrow   polite
##   text17    20     29         1 DavidCoburnUKip   polite
##   text18    15     16         1        ajcdeane   polite
##   text19    20     22         2      jackbuckby   polite
##   text20    20     20         2       kvmarthur   polite
##   text21    27     31         2  YOURvoiceParty   polite
##   text22    21     23         1       101flyboy   polite
##   text23    10     10         1  CharlesTannock   polite
##   text24    20     22         2    DuncanRights   polite
##   text25    13     13         1  skepticalvoter   polite
##   text26    12     12         1 DavidCoburnUKip   polite
##   text27     9     11         1  scrapperduncan   polite
##   text28     3      3         1       zander469   polite
##   text29    16     27         1   DavidWickham3   polite
##   text30    10     10         2  Green_Caroline   polite
##   text31     4      4         1    GucciAirbag_   polite
##   text32    22     24         6       Comrade58 impolite
##   text33     3      3         1 DugaldMacMillan   polite
##   text34     9      9         1        Shyman33   polite
##   text35    12     14         1      jackbuckby   polite
##   text36    17     18         2    cymroynewrop   polite
##   text37    14     14         2 danielrhamilton   polite
##   text38    20     25         1    Green_DannyB   polite
##   text39    17     18         1    GoodallGiles   polite
##   text40    11     12         2     PascaleLamb   polite
##   text41     2      2         1   helena_pigott   polite
##   text42    23     28         2     EnzaFerreri   polite
##   text43    17     17         1   NSinclaireMEP   polite
##   text44     7      7         1   NSinclaireMEP   polite
##   text45    24     27         1         garrodt   polite
##   text46    14     15         1 DavidCoburnUKip   polite
##   text47    22     24         2    dannyyoung35   polite
##   text48    19     21         1 DavidCoburnUKip   polite
##   text49    16     17         1    GreggatQuest   polite
##   text50    15     15         1          Wise64   polite
##   text51    16     18         2      FionaRadic   polite
##   text52     5      5         1    CStephenssnp   polite
##   text53     9      9         1    Kevinmorosky   polite
##   text54    15     19         1  sanchia4europe   polite
##   text55    16     21         2      ScrumpyNed   polite
##   text56    13     14         1   JosephMcShane   polite
##   text57    18     20         1 SarahLudfordMEP   polite
##   text58    23     31         3     GinaDowding   polite
##   text59     9     11         1   DavidWickham3   polite
##   text60     3      3         1 katrinamurray71   polite
##   text61    13     15         1      IainMcGill   polite
##   text62    20     23         4   SchaduwStaten   polite
##   text63     4      4         1     Rory_Palmer   polite
##   text64    15     16         2 PercyBlakeney63   polite
##   text65    21     21         1    DanHannanMEP   polite
##   text66    10     12         1 jennyknight2014   polite
##   text67    12     13         1   Steven_Woolfe   polite
##   text68    24     25         3  JamesJimCarver   polite
##   text69    10     10         1 DavidCoburnUKip   polite
##   text70    12     12         1    waddesdonbaz   polite
##   text71    12     13         1   Cumpedelibero   polite
##   text72    19     23         1    Green_DannyB   polite
##   text73     9      9         1         F1andyD   polite
##   text74    12     12         1  graham_pointer   polite
##   text75    14     18         1   veganfishcake   polite
##   text76    16     16         1    peterlfoster   polite
##   text77    15     15         1 DavidCoburnUKip   polite
##   text78    22     24         1    londonstatto   polite
##   text79    16     19         1     TurfShifter   polite
##   text80    12     15         1    suzanneshine   polite
##   text81    26     27         2    GoodallGiles   polite
##   text82    15     16         2        Mauginog   polite
##   text83     9     12         1   rivermagic123   polite
##   text84    22     25         1          SHKMEP   polite
##   text85    15     15         1 GrillingKippers   polite
##   text86    10     10         1       Zoidybear   polite
##   text87    15     18         1      CulliganPA   polite
##   text88    20     25         2   globalrichard   polite
##   text89    27     32         1     davenellist   polite
##   text90    15     16         1    AlynSmithMEP   polite
##   text91    20     23         1    suzanneshine   polite
##   text92    19     23         2   ssilverwavess   polite
##   text93    16     22         2   GlobalYawning   polite
##   text94    25     29         2      CulliganPA   polite
##   text95    20     22         1   CllrChrisPain   polite
##   text96    22     24         4    dennisterrey impolite
##   text97    13     17         2    Anothergreen   polite
##   text98    16     18         2      FionaRadic   polite
##   text99    25     28         3    cristian7897   polite
##  text100    23     26         1 jennyknight2014   polite
## 
## Source:  /Users/pablobarbera/git/big-data-upf/* on x86_64 by pablobarbera
## Created: Wed Jun 28 21:33:43 2017
## Notes:

We can then use this metadata to subset the dataset:

polite.tweets <- corpus_subset(twcorpus, polite=="impolite")

And then extract the text:

mytexts <- texts(polite.tweets)

We'll come back to this dataset later.

Importing text with quanteda

There are different ways to read text into R and create a corpus object with quanteda. We have already seen the most common way, importing the text from a csv file and then adding the metadata, but the readtext package, a companion to quanteda, provides a helper function for this:

library(readtext)
tweets <- readtext(file='data/EP-elections-tweets.csv')
twcorpus <- corpus(tweets)

This function will also work with text in multiple files. To do this, we pass readtext a path containing the 'glob' operator '*' to indicate that we want to load multiple files:

myCorpus <- readtext(file='data/inaugural/*.txt')
inaugCorpus <- corpus(myCorpus)