This RMarkdown file provides additional information about the basic tools of text analysis that we will use in this course to clean text data scraped from the web.

String manipulation with R

We will start with basic string manipulation with R.

Our running example will be the set of tweets posted by Donald Trump’s Twitter account since January 1st, 2018.

library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ndjson
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
head(tweets)
##                                                                                                                                                                                                                                                                                       text
## 1 We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY
## 2                                                                                                                                                                             Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer
## 3                                                                                                                                                                                Thank you South Carolina. Now let’s get out tomorrow and VOTE for @HenryMcMaster! https://t.co/5xlz0wfMfu
## 4                                                                                     Just watched @SharkGregNorman on @foxandfriends. Said “President is doing a great job. All over the world, people want to come back to the U.S.” Thank you Greg, and you’re looking and doing great!
## 5   Russia continues to say they had nothing to do with Meddling in our Election! Where is the DNC Server, and why didn’t Shady James Comey and the now disgraced FBI agents take and closely examine it? Why isn’t Hillary/Russia being looked at? So many questions, so much corruption!
## 6                                                                                                                                                                                                                    Statement on Justice Anthony Kennedy. #SCOTUS https://t.co/8aWJ6fWemA
##   retweet_count favorite_count favorited truncated              id_str
## 1         30514          89162     FALSE     FALSE 1010246126820347906
## 2          9382          50425     FALSE     FALSE 1012297599431401474
## 3         13631          56997     FALSE     FALSE 1011422555947712513
## 4         12007          62025     FALSE     FALSE 1012299239207198721
## 5         23077          92661     FALSE     FALSE 1012295859072126977
## 6         11138          47234     FALSE     FALSE 1012051330591023107
##   in_reply_to_screen_name
## 1                      NA
## 2                      NA
## 3                      NA
## 4                      NA
## 5                      NA
## 6                      NA
##                                                                               source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 4 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##   retweeted                     created_at in_reply_to_status_id_str
## 1     FALSE Fri Jun 22 19:40:20 +0000 2018                        NA
## 2     FALSE Thu Jun 28 11:32:09 +0000 2018                        NA
## 3     FALSE Tue Jun 26 01:35:03 +0000 2018                        NA
## 4     FALSE Thu Jun 28 11:38:40 +0000 2018                        NA
## 5     FALSE Thu Jun 28 11:25:15 +0000 2018                        NA
## 6     FALSE Wed Jun 27 19:13:34 +0000 2018                        NA
##   in_reply_to_user_id_str lang listed_count verified       location
## 1                      NA   en        89807     TRUE Washington, DC
## 2                      NA   en        89807     TRUE Washington, DC
## 3                      NA   en        89807     TRUE Washington, DC
## 4                      NA   en        89807     TRUE Washington, DC
## 5                      NA   en        89807     TRUE Washington, DC
## 6                      NA   en        89807     TRUE Washington, DC
##   user_id_str                                      description geo_enabled
## 1    25073877 45th President of the United States of America🇺🇸        TRUE
## 2    25073877 45th President of the United States of America🇺🇸        TRUE
## 3    25073877 45th President of the United States of America🇺🇸        TRUE
## 4    25073877 45th President of the United States of America🇺🇸        TRUE
## 5    25073877 45th President of the United States of America🇺🇸        TRUE
## 6    25073877 45th President of the United States of America🇺🇸        TRUE
##                  user_created_at statuses_count followers_count
## 1 Wed Mar 18 13:46:38 +0000 2009          38073        53101783
## 2 Wed Mar 18 13:46:38 +0000 2009          38073        53101783
## 3 Wed Mar 18 13:46:38 +0000 2009          38073        53101783
## 4 Wed Mar 18 13:46:38 +0000 2009          38073        53101783
## 5 Wed Mar 18 13:46:38 +0000 2009          38073        53101783
## 6 Wed Mar 18 13:46:38 +0000 2009          38073        53101783
##   favourites_count protected                user_url            name
## 1               25     FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 2               25     FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 3               25     FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 4               25     FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 5               25     FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 6               25     FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
##   time_zone user_lang utc_offset friends_count     screen_name
## 1        NA        en         NA            47 realDonaldTrump
## 2        NA        en         NA            47 realDonaldTrump
## 3        NA        en         NA            47 realDonaldTrump
## 4        NA        en         NA            47 realDonaldTrump
## 5        NA        en         NA            47 realDonaldTrump
## 6        NA        en         NA            47 realDonaldTrump
##   country_code country place_type full_name place_name place_id place_lat
## 1         <NA>      NA         NA      <NA>       <NA>     <NA>       NaN
## 2         <NA>      NA         NA      <NA>       <NA>     <NA>       NaN
## 3         <NA>      NA         NA      <NA>       <NA>     <NA>       NaN
## 4         <NA>      NA         NA      <NA>       <NA>     <NA>       NaN
## 5         <NA>      NA         NA      <NA>       <NA>     <NA>       NaN
## 6         <NA>      NA         NA      <NA>       <NA>     <NA>       NaN
##   place_lon lat lon
## 1       NaN  NA  NA
## 2       NaN  NA  NA
## 3       NaN  NA  NA
## 4       NaN  NA  NA
## 5       NaN  NA  NA
## 6       NaN  NA  NA
##                                                                                                        expanded_url
## 1 https://www.pscp.tv/w/bf1GFzFvTlFsTFJub1dwUXd8MWpNSmdFVll5ZUFLTAWuHc0BMMKeCOoDRCPmtIftVLaFLQVwfSLoC_C0SbzX?t=9m9s
## 2                                                                                                              <NA>
## 3  https://www.pscp.tv/w/bgGOtTFvTlFsTFJub1dwUXd8MXlvSk1WZHJWQm54Uf-J8fPu1RO4E84ax-LuK1bAbiCpnzBBZmdPfI9FAhGV?t=11s
## 4                                                                                                              <NA>
## 5                                                                                                              <NA>
## 6                                                                                                              <NA>
##                       url
## 1 https://t.co/ZjXESYAcjY
## 2                    <NA>
## 3 https://t.co/5xlz0wfMfu
## 4                    <NA>
## 5                    <NA>
## 6                    <NA>

R stores strings in character vectors. length returns the number of elements in the vector, while nchar returns the number of characters in each string.

length(tweets$text)
## [1] 3866
tweets$text[1]
## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY"
nchar(tweets$text[1])
## [1] 280

Note that these functions are vectorized, so we can work with multiple strings at once.

nchar(tweets$text[1:10])
##  [1] 280 108 105 196 278  69 104 187 230 140
sum(nchar(tweets$text[1:10]))
## [1] 1697
max(nchar(tweets$text[1:10]))
## [1] 280

We can merge different strings into one using paste:

paste(tweets$text[1], tweets$text[2], sep='--')
## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY--Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer"

As we will see later, it is often convenient to convert all words to lowercase or uppercase.

tolower(tweets$text[1])
## [1] "we are gathered today to hear directly from the american victims of illegal immigration. these are the american citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. these are the families the media ignores...https://t.co/zjxesyacjy"
toupper(tweets$text[1])
## [1] "WE ARE GATHERED TODAY TO HEAR DIRECTLY FROM THE AMERICAN VICTIMS OF ILLEGAL IMMIGRATION. THESE ARE THE AMERICAN CITIZENS PERMANENTLY SEPARATED FROM THEIR LOVED ONES B/C THEY WERE KILLED BY CRIMINAL ILLEGAL ALIENS. THESE ARE THE FAMILIES THE MEDIA IGNORES...HTTPS://T.CO/ZJXESYACJY"

We can grab substrings with substr. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.

substr(tweets$text[1], 1, 2)
## [1] "We"
substr(tweets$text[1], 1, 10)
## [1] "We are gat"

This is useful when working with date strings as well:

dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"

Let’s dig into the data a little bit more. Given the source of the dataset, we can expect that there will be many tweets mentioning topics such as immigration or health care. We can use the grep command to identify these. grep returns the indices of the elements that contain the pattern.

grep('immigration', tweets$text[1:25])
## [1] 14

grepl returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.

grepl("immigration", tweets$text[1:25])
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE
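
Because grepl returns a logical vector, it combines naturally with sum and mean to count matches or compute proportions:

sum(grepl("immigration", tweets$text[1:25]))  # number of matching tweets
mean(grepl("immigration", tweets$text[1:25])) # proportion of matching tweets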

Going back to the full dataset, we can use the results of grep to get particular rows. First, check how many tweets mention the word “immigration”.

nrow(tweets)
## [1] 3866
grep('immigration', tweets$text)
##  [1]   14   60   74   75   79   92  102  108  109  121  122  125  133  151
## [15]  166  185  193  531  605  614  621  789  827  834  835 1024 1111 1136
## [29] 1142 1162 1179 1183 1200 1202 1212 1229 1246 1296 1301 1308 1539 1649
## [43] 1785 1970 2348 2550 2899 2908 3234 3289 3347 3380 3547 3637 3684
length(grep('immigration', tweets$text))
## [1] 55

It is important to note that matching is case-sensitive. You can set the ignore.case argument to TRUE to make the match case-insensitive.

nrow(tweets)
## [1] 3866
length(grep('immigration', tweets$text))
## [1] 55
length(grep('immigration', tweets$text, ignore.case = TRUE))
## [1] 77

Now let’s identify which tweets are related to immigration and store them in a smaller data frame. How would we do it?

immi_tweets <- tweets[grep('immigration', tweets$text, ignore.case=TRUE),]
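
An equivalent approach uses grepl, whose logical output is easy to combine with other conditions, such as metadata filters; a quick sketch (popular_immi is just an illustrative name):

# logical subsetting with grepl gives the same rows as grep above
immi_tweets <- tweets[grepl('immigration', tweets$text, ignore.case=TRUE), ]
# logical vectors can be combined, e.g. widely retweeted immigration tweets:
popular_immi <- tweets[grepl('immigration', tweets$text, ignore.case=TRUE) &
                         tweets$retweet_count > 10000, ]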

Regular expressions

Another useful tool for working with text data is regular expressions, which let us develop complicated rules for both matching strings and extracting elements from them.

For example, we could look at tweets that mention either of two related words using the operator “|” (equivalent to “OR”):

nrow(tweets)
## [1] 3866
length(grep('immigration|immigrant', tweets$text, ignore.case=TRUE))
## [1] 91

We can also use question marks to indicate optional characters.

nrow(tweets)
## [1] 3866
length(grep('immigr?', tweets$text, ignore.case=TRUE))
## [1] 91

This will match immigration, immigrant, immigrants, etc. Note that the “?” makes only the preceding character optional, so “immigr?” matches any text containing “immig”.

Other common regular expression patterns are: “^” (the beginning of the string), “$” (the end of the string), “.” (any single character), “+” (one or more repetitions of the previous expression), and “[A-Za-z0-9_]+” (one or more letters, digits, or underscores).

For example, how many tweets end with an exclamation mark? How many are retweets? How many mention a username? And how many contain a hashtag?

length(grep('!$', tweets$text, ignore.case=TRUE))
## [1] 1528
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 419
length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 1018
length(grep('#[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 581

More complex examples of regular expressions using stringr

stringr is an R package that extends the capabilities of base R for manipulating text. Let’s say, for example, that we want to replace a pattern (or a regular expression) with another string:

library(stringr)
str_replace(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! @AmyKremer"

Note that this will only replace the first instance. To replace all instances, use str_replace_all:

str_replace_all(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! USERNAME"

To extract a pattern we can use str_extract, and again we can extract one or all instances of the pattern:

str_extract(tweets$text[2], '@[0-9_A-Za-z]+')
## [1] "@foxandfriends"
str_extract_all(tweets$text[2], '@[0-9_A-Za-z]+')
## [[1]]
## [1] "@foxandfriends" "@AmyKremer"

This function is vectorized, which means we can apply it to all elements of a character vector simultaneously. That will give us a list, which we can then flatten into a vector with unlist to get a frequency table of the most frequently mentioned handles or hashtags:

handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles[1:3]
## [[1]]
## character(0)
## 
## [[2]]
## [1] "@foxandfriends" "@AmyKremer"    
## 
## [[3]]
## [1] "@HenryMcMaster"
handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)
## handles_vector
##   @foxandfriends @realDonaldTrump      @WhiteHouse         @FoxNews 
##              122              109              106               79 
##           @POTUS          @FLOTUS         @nytimes       @Scavino45 
##               50               48               34               32 
##     @IvankaTrump       @EricTrump 
##               31               27
# now with hashtags...
hashtags <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
hashtags[1:3]
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
hashtags_vector <- unlist(hashtags)
head(sort(table(hashtags_vector), decreasing = TRUE), n=10)
## hashtags_vector
##                  #MAGA                   #USA          #AmericaFirst 
##                     77                     32                     19 
##              #FakeNews #MakeAmericaGreatAgain             #TaxReform 
##                     17                     13                     12 
##                  #UNGA       #HurricaneHarvey                 #ICYMI 
##                     12                     11                     10 
##            #PuertoRico 
##                      8
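
Relatedly, str_count returns how many times a pattern appears in each string, which is useful for per-tweet summaries; a quick sketch:

# number of handles mentioned in each tweet
mentions_per_tweet <- str_count(tweets$text, '@[0-9_A-Za-z]+')
summary(mentions_per_tweet)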

Preprocessing text with quanteda

Before we can do any type of automated text analysis, we need to go through several “preprocessing” steps to prepare the text to be passed to a statistical model. We’ll use the quanteda package here.

The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.

library(quanteda)
## Package version: 1.3.4
## Parallel computing: 2 of 2 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(streamR)
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
## Corpus consisting of 3866 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences
##   text1    40     54         3
##   text2    20     23         3
##   text3    20     22         3
##   text4    32     41         4
##   text5    48     56         4
##   text6    12     14         2
##   text7    20     22         2
##   text8    29     31         2
##   text9    44     50         3
##  text10    22     24         2
## 
## Source: /home/ecpr40/code/* on x86_64 by ecpr40
## Created: Mon Jul 30 10:11:47 2018
## Notes:
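
So far the corpus contains only the text of the tweets, but we can attach document-level metadata (“docvars”) that quanteda will keep associated with each document; a quick sketch using fields from the tweets data frame:

# attach tweet metadata as document-level variables
docvars(twcorpus, "retweet_count") <- tweets$retweet_count
docvars(twcorpus, "created_at") <- tweets$created_at
head(docvars(twcorpus))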

A very useful feature of corpus objects is keywords-in-context (the kwic function), which returns all appearances of a word (or combination of words) in its immediate context.

kwic(twcorpus, "immigration", window=10)[1:5,]
##                                                                          
##   [text1, 14] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
##  [text10, 17] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
##  [text14, 11]                               .... If this is done, illegal
##   [text15, 9]           HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR
##   [text41, 6]                                                    .... Our
##                 
##  | IMMIGRATION |
##  | IMMIGRATION |
##  | immigration |
##  | IMMIGRATION |
##  | Immigration |
##                                                                    
##  . These are the American Citizens permanently separated from their
##  . These are the American Citize…                                  
##  will be stopped in it's tracks- and at very                       
##  BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON                   
##  policy, laughed at all over the world, is
kwic(twcorpus, "healthcare", window=10)[1:5,]
##                                                                         
##   [text46, 17]         help to me on Cutting Taxes, creating great new |
##  [text182, 37]             He is tough on Crime and Strong on Borders, |
##  [text507, 48] Warren lines, loves sanctuary cities, bad and expensive |
##   [text530, 6]                           The American people deserve a |
##  [text554, 27]          will be a great Governor with a heavy focus on |
##                                                                      
##  healthcare | programs at low cost, fighting for Border Security,    
##  Healthcare | , the Military and our great Vets. Henry has           
##  healthcare | ...                                                    
##  healthcare | system that takes care of them- not one that           
##  HealthCare | and Jobs. His Socialist opponent in November should not
kwic(twcorpus, "clinton", window=10)[1:5,]
##                                                                         
##  [text141, 23]                the Bush Dynasty, then I had to beat the |
##  [text161, 20]                the Bush Dynasty, then I had to beat the |
##   [text204, 9]                  FBI Agent Peter Strzok, who headed the |
##  [text216, 13] :.@jasoninthehouse: All of this started because Hillary |
##  [text252, 10]                          .... Schneiderman, who ran the |
##                                                             
##  Clinton | Dynasty, and now I…                              
##  Clinton | Dynasty, and now I have to beat a phony          
##  Clinton | & amp; Russia investigations, texted to his lover
##  Clinton | set up her private server https:// t.co          
##  Clinton | campaign in New York, never had the guts to
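
To search for a multi-word expression rather than a single token, we can wrap it in phrase() so that kwic treats the words as a sequence; a sketch:

# multi-word keywords-in-context
head(kwic(twcorpus, phrase("fake news"), window=5))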

We can then convert a corpus into a document-feature matrix using the dfm function.

twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 9,930 features
##    ... created a 3,866 x 9,930 sparse dfm
##    ... complete. 
## Elapsed time: 0.251 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 9,930 features (99.7% sparse).

The dfm records the number of times each word appears in each document (here, each tweet):

twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (72% sparse).
## 5 x 10 sparse Matrix of class "dfm"
##        features
## docs    we are gathered today to hear directly from the american
##   text1  1   3        1     1  1    1        1    2   4        2
##   text2  0   0        0     0  0    0        0    0   0        0
##   text3  0   0        0     0  0    0        0    0   0        0
##   text4  0   0        0     0  2    0        0    0   2        0
##   text5  0   0        0     0  2    0        0    0   2        0

dfm has many useful options (check out ?dfm for more information). Let’s actually use it to extract n-grams, remove punctuation, and keep Twitter features such as handles and hashtags:

twdfm <- dfm(twcorpus, tolower=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 128,909 features
##    ... created a 3,866 x 128,909 sparse dfm
##    ... complete. 
## Elapsed time: 0.953 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 128,909 features (99.9% sparse).

Note that here we use ngrams – this will extract all sequences of one, two, and three consecutive words (e.g. it will consider “human”, “rights”, and “human rights” all as tokens in the matrix).
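
To see what the n-gram tokenization does, we can run it on a toy sentence; a minimal sketch using tokens and tokens_ngrams (the underscore-joined forms are the bigram and trigram tokens):

toks <- tokens("human rights matter")
tokens_ngrams(toks, n=1:3)
# yields: human, rights, matter, human_rights, rights_matter,
#         human_rights_matter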

In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:

twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 3 documents: 117,710
##   Total features removed: 117,710 (91.3%).
twdfm
## Document-feature matrix of: 3,866 documents, 11,199 features (99.7% sparse).

It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.

textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)

What is going on? We probably want to remove words and symbols that are not of interest to our analysis, such as “http” here. Words that are common connectors in a given language (e.g. “a”, “the”, “is”) are called stopwords, and they are usually not informative either. We can inspect the most frequent features with topfeatures:

topfeatures(twdfm, 25)
##   the    to   and    of     a    in    is   for    on   our  will great 
##  4580  2697  2493  1945  1549  1455  1299  1088   920   887   836   825 
##  with   are    we     i    be  that   amp    it  have    at   you   was 
##   815   793   764   735   714   707   637   601   536   523   520   492 
##  they 
##   474
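
To see which words count as stopwords, we can inspect the list quanteda provides:

head(stopwords("english"), 10)
# i, me, my, myself, we, our, ours, ourselves, you, your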

We can remove the stopwords when we create the dfm object:

twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), remove_url=TRUE, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,866 documents, 8,456 features
##    ... removed 165 features
##    ... created a 3,866 x 8,291 sparse dfm
##    ... complete. 
## Elapsed time: 0.27 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
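
If the dfm has already been created, we can also strip features after the fact with dfm_remove; a sketch of an alternative to passing remove= at creation time:

# remove stopwords and Twitter artifacts from an existing dfm
twdfm <- dfm_remove(twdfm, c(stopwords("english"),
                             "t.co", "https", "rt", "amp", "http"))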