Basics of text analysis

This RMarkdown file provides additional information about the basic tools of text analysis that we will use in this course to clean text data scraped from the web.

String manipulation with R

We will start with basic string manipulation with R.

Our running example will be the set of tweets posted by Donald Trump’s Twitter account since January 1st, 2018.

library(streamR)

## Loading required package: RCurl

## Loading required package: bitops

## Loading required package: rjson

## Warning: package 'rjson' was built under R version 3.4.4

## Loading required package: ndjson

## Warning: package 'ndjson' was built under R version 3.4.4

tweets <- parseTweets("~/data/trump-tweets.json")

## 3866 tweets have been parsed.

head(tweets)

##                                                                                                                                                                                                                                                                                       text
## 1 We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY
## 2                                                                                                                                                                             Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer
## 3                                                                                                                                                                                Thank you South Carolina. Now let’s get out tomorrow and VOTE for @HenryMcMaster! https://t.co/5xlz0wfMfu
## 4                                                                                     Just watched @SharkGregNorman on @foxandfriends. Said “President is doing a great job. All over the world, people want to come back to the U.S.” Thank you Greg, and you’re looking and doing great!
## 5   Russia continues to say they had nothing to do with Meddling in our Election! Where is the DNC Server, and why didn’t Shady James Comey and the now disgraced FBI agents take and closely examine it? Why isn’t Hillary/Russia being looked at? So many questions, so much corruption!
## 6                                                                                                                                                                                                                    Statement on Justice Anthony Kennedy. #SCOTUS https://t.co/8aWJ6fWemA
##   retweet_count favorite_count favorited truncated              id_str
## 1         30514          89162     FALSE     FALSE 1010246126820347906
## 2          9382          50425     FALSE     FALSE 1012297599431401474
## 3         13631          56997     FALSE     FALSE 1011422555947712513
## 4         12007          62025     FALSE     FALSE 1012299239207198721
## 5         23077          92661     FALSE     FALSE 1012295859072126977
## 6         11138          47234     FALSE     FALSE 1012051330591023107
##   in_reply_to_screen_name
## 1                      NA
## 2                      NA
## 3                      NA
## 4                      NA
## 5                      NA
## 6                      NA
##                                                                               source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 4 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##   retweeted                     created_at in_reply_to_status_id_str
## 1     FALSE Fri Jun 22 19:40:20 +0000 2018                        NA
## 2     FALSE Thu Jun 28 11:32:09 +0000 2018                        NA
## 3     FALSE Tue Jun 26 01:35:03 +0000 2018                        NA
## 4     FALSE Thu Jun 28 11:38:40 +0000 2018                        NA
## 5     FALSE Thu Jun 28 11:25:15 +0000 2018                        NA
## 6     FALSE Wed Jun 27 19:13:34 +0000 2018                        NA
##   in_reply_to_user_id_str lang listed_count verified       location
## 1                      NA   en        89807     TRUE Washington, DC
## 2                      NA   en        89807     TRUE Washington, DC
## 3                      NA   en        89807     TRUE Washington, DC
## 4                      NA   en        89807     TRUE Washington, DC
## 5                      NA   en        89807     TRUE Washington, DC
## 6                      NA   en        89807     TRUE Washington, DC
##   user_id_str
## 1    25073877
## 2    25073877
## 3    25073877
## 4    25073877
## 5    25073877
## 6    25073877
##                                                          description
## 1 45th President of the United States of America\U0001f1fa\U0001f1f8
## 2 45th President of the United States of America\U0001f1fa\U0001f1f8
## 3 45th President of the United States of America\U0001f1fa\U0001f1f8
## 4 45th President of the United States of America\U0001f1fa\U0001f1f8
## 5 45th President of the United States of America\U0001f1fa\U0001f1f8
## 6 45th President of the United States of America\U0001f1fa\U0001f1f8
##   geo_enabled                user_created_at statuses_count
## 1        TRUE Wed Mar 18 13:46:38 +0000 2009          38073
## 2        TRUE Wed Mar 18 13:46:38 +0000 2009          38073
## 3        TRUE Wed Mar 18 13:46:38 +0000 2009          38073
## 4        TRUE Wed Mar 18 13:46:38 +0000 2009          38073
## 5        TRUE Wed Mar 18 13:46:38 +0000 2009          38073
## 6        TRUE Wed Mar 18 13:46:38 +0000 2009          38073
##   followers_count favourites_count protected                user_url
## 1        53101783               25     FALSE https://t.co/OMxB0x7xC5
## 2        53101783               25     FALSE https://t.co/OMxB0x7xC5
## 3        53101783               25     FALSE https://t.co/OMxB0x7xC5
## 4        53101783               25     FALSE https://t.co/OMxB0x7xC5
## 5        53101783               25     FALSE https://t.co/OMxB0x7xC5
## 6        53101783               25     FALSE https://t.co/OMxB0x7xC5
##              name time_zone user_lang utc_offset friends_count
## 1 Donald J. Trump        NA        en         NA            47
## 2 Donald J. Trump        NA        en         NA            47
## 3 Donald J. Trump        NA        en         NA            47
## 4 Donald J. Trump        NA        en         NA            47
## 5 Donald J. Trump        NA        en         NA            47
## 6 Donald J. Trump        NA        en         NA            47
##       screen_name country_code country place_type full_name place_name
## 1 realDonaldTrump         <NA>      NA         NA      <NA>       <NA>
## 2 realDonaldTrump         <NA>      NA         NA      <NA>       <NA>
## 3 realDonaldTrump         <NA>      NA         NA      <NA>       <NA>
## 4 realDonaldTrump         <NA>      NA         NA      <NA>       <NA>
## 5 realDonaldTrump         <NA>      NA         NA      <NA>       <NA>
## 6 realDonaldTrump         <NA>      NA         NA      <NA>       <NA>
##   place_id place_lat place_lon lat lon
## 1     <NA>       NaN       NaN  NA  NA
## 2     <NA>       NaN       NaN  NA  NA
## 3     <NA>       NaN       NaN  NA  NA
## 4     <NA>       NaN       NaN  NA  NA
## 5     <NA>       NaN       NaN  NA  NA
## 6     <NA>       NaN       NaN  NA  NA
##                                                                                                        expanded_url
## 1 https://www.pscp.tv/w/bf1GFzFvTlFsTFJub1dwUXd8MWpNSmdFVll5ZUFLTAWuHc0BMMKeCOoDRCPmtIftVLaFLQVwfSLoC_C0SbzX?t=9m9s
## 2                                                                                                              <NA>
## 3  https://www.pscp.tv/w/bgGOtTFvTlFsTFJub1dwUXd8MXlvSk1WZHJWQm54Uf-J8fPu1RO4E84ax-LuK1bAbiCpnzBBZmdPfI9FAhGV?t=11s
## 4                                                                                                              <NA>
## 5                                                                                                              <NA>
## 6                                                                                                              <NA>
##                       url
## 1 https://t.co/ZjXESYAcjY
## 2                    <NA>
## 3 https://t.co/5xlz0wfMfu
## 4                    <NA>
## 5                    <NA>
## 6                    <NA>

R stores strings in character vectors. length returns the number of elements in the vector, while nchar returns the number of characters in each element.

length(tweets$text)

## [1] 3866

tweets$text[1]

## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY"

nchar(tweets$text[1])

## [1] 280

Note that we can work with multiple strings at once.

nchar(tweets$text[1:10])

##  [1] 280 108 105 196 278  69 104 187 230 140

sum(nchar(tweets$text[1:10]))

## [1] 1697

max(nchar(tweets$text[1:10]))

## [1] 280
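To make the difference between length and nchar concrete, here is a minimal sketch on a small made-up vector. The strings in x are illustrative only and are not taken from the tweets data.

```r
# Hypothetical toy vector; not part of the tweets dataset
x <- c("immigration", "health care", "")

n_items <- length(x)   # number of elements in the vector
n_chars <- nchar(x)    # number of characters in each element

print(n_items)
print(n_chars)
print(sum(n_chars))    # total characters across all elements
```

Note that the empty string still counts as one element for length, even though nchar reports zero characters for it.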

We can merge different strings into one using paste:

paste(tweets$text[1], tweets$text[2], sep='--')

## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY--Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer"
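As a small sketch of how paste behaves, the example below contrasts the sep and collapse arguments on made-up strings (a, b and the hashtag names are placeholders chosen for illustration).

```r
# Hypothetical strings used only for illustration
a <- "first tweet"
b <- "second tweet"

# sep is placed between the arguments passed to paste
joined <- paste(a, b, sep = "--")

# collapse merges all elements of a single vector into one string
merged <- paste(c(a, b, "third tweet"), collapse = " | ")

# paste0 is shorthand for paste(..., sep = "")
glued <- paste0("#", c("MAGA", "USA"))

print(joined)
print(merged)
print(glued)
```

sep controls how separate arguments are joined, while collapse is what turns a whole vector into a single string.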

As we will see later, it is often convenient to convert all words to lowercase or uppercase.

tolower(tweets$text[1])

## [1] "we are gathered today to hear directly from the american victims of illegal immigration. these are the american citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. these are the families the media ignores...https://t.co/zjxesyacjy"

toupper(tweets$text[1])

## [1] "WE ARE GATHERED TODAY TO HEAR DIRECTLY FROM THE AMERICAN VICTIMS OF ILLEGAL IMMIGRATION. THESE ARE THE AMERICAN CITIZENS PERMANENTLY SEPARATED FROM THEIR LOVED ONES B/C THEY WERE KILLED BY CRIMINAL ILLEGAL ALIENS. THESE ARE THE FAMILIES THE MEDIA IGNORES...HTTPS://T.CO/ZJXESYACJY"

We can grab substrings with substr. The first argument is the string, the second is the start position (counting from 1), and the third is the end position (inclusive).

substr(tweets$text[1], 1, 2)

## [1] "We"

substr(tweets$text[1], 1, 10)

## [1] "We are gat"

This is useful when working with date strings as well:

dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years

## [1] "2015" "2014"

substr(dates, 6, 7) # months

## [1] "01" "12"
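The same idea applies to the created_at timestamps shown in the head(tweets) output. The sketch below assumes every timestamp follows exactly that fixed-width layout; the two values are copied from the output above.

```r
# Timestamps in the format of the created_at column shown earlier
created <- c("Fri Jun 22 19:40:20 +0000 2018",
             "Thu Jun 28 11:32:09 +0000 2018")

month <- substr(created, 5, 7)                 # characters 5 to 7
year  <- substr(created, 27, 30)               # characters 27 to 30
hour  <- as.integer(substr(created, 12, 13))   # characters 12 to 13, as a number

print(month)
print(year)
print(hour)
```

Position-based extraction only works when the layout never varies; if it might, a proper date parser is the safer choice.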

Let’s dig into the data a little bit more. Given the source of the dataset, we can expect that there will be many tweets mentioning topics such as immigration or health care. We can use the grep command to identify these. grep returns the indices of the elements in which the pattern occurs.

grep('immigration', tweets$text[1:25])

## [1] 14

grepl returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.

grepl("immigration", tweets$text[1:25])

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE
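A minimal sketch of how grep and grepl relate to each other, using a made-up vector (the strings are illustrative, not from the dataset):

```r
# Hypothetical toy vector
x <- c("tax reform", "immigration policy", "jobs", "legal immigration")

idx  <- grep("immigration", x)                 # positions of matching elements
hits <- grepl("immigration", x)                # one TRUE/FALSE per element
vals <- grep("immigration", x, value = TRUE)   # the matching strings themselves

print(idx)
print(hits)
print(vals)
print(sum(hits))   # number of matching elements
```

which(hits) gives the same positions as grep, and sum(hits) is a compact way to count matches.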

Going back to the full dataset, we can use the results of grep to get particular rows. First, check how many tweets mention the word “immigration”.

nrow(tweets)

## [1] 3866

grep('immigration', tweets$text)

##  [1]   14   60   74   75   79   92  102  108  109  121  122  125  133  151
## [15]  166  185  193  531  605  614  621  789  827  834  835 1024 1111 1136
## [29] 1142 1162 1179 1183 1200 1202 1212 1229 1246 1296 1301 1308 1539 1649
## [43] 1785 1970 2348 2550 2899 2908 3234 3289 3347 3380 3547 3637 3684

length(grep('immigration', tweets$text))

## [1] 55

It is important to note that matching is case-sensitive by default. Set the ignore.case argument to TRUE to match regardless of case.

nrow(tweets)

## [1] 3866

length(grep('immigration', tweets$text))

## [1] 55

length(grep('immigration', tweets$text, ignore.case = TRUE))

## [1] 77
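The effect of ignore.case can be seen on a small made-up vector. This is only a sketch; lowercasing the text first with tolower is shown as an alternative that gives the same count here.

```r
# Hypothetical toy vector with mixed capitalization
x <- c("ILLEGAL IMMIGRATION", "immigration reform", "Immigration laws", "trade")

n_sensitive   <- length(grep("immigration", x))                       # exact case only
n_insensitive <- length(grep("immigration", x, ignore.case = TRUE))   # any case
n_lowered     <- length(grep("immigration", tolower(x)))              # lowercase the text first

print(c(n_sensitive, n_insensitive, n_lowered))
```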

Now let’s try to identify which tweets are related to immigration and store them in a smaller data frame. How would we do it?

immi_tweets <- tweets[grep('immigration', tweets$text, ignore.case=TRUE),]
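The same subsetting can be done with a logical vector from grepl, which also makes it easy to keep the non-matching rows. The data frame df below is a made-up stand-in for tweets, with only two of its columns.

```r
# Hypothetical miniature stand-in for the tweets data frame
df <- data.frame(
  text = c("Immigration reform now", "Great rally tonight", "Fix immigration"),
  retweet_count = c(10, 20, 30),
  stringsAsFactors = FALSE
)

keep <- grepl("immigration", df$text, ignore.case = TRUE)

immi <- df[keep, ]    # rows that mention immigration
rest <- df[!keep, ]   # all other rows

print(nrow(immi))
print(nrow(rest))
```

Negating a logical vector is safer than using negative grep indices: when nothing matches, df[-grep(...), ] selects zero rows rather than all of them.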

Regular expressions

Another useful tool for working with text data is the “regular expression”. Regular expressions let us write complex rules for both matching strings and extracting elements from them.

For example, we could look for tweets that mention either of two words using the operator “|” (equivalent to “OR”):

nrow(tweets)

## [1] 3866

length(grep('immigration|immigrant', tweets$text, ignore.case=TRUE))

## [1] 91
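As a small sketch of alternation, the example below applies the same pattern to a made-up vector and shows an equivalent form that uses parentheses to group the alternatives.

```r
# Hypothetical toy vector
x <- c("immigration law", "an immigrant story",
       "migration patterns", "IMMIGRANTS welcome")

# Matches elements containing either word, regardless of case
either <- grepl("immigration|immigrant", x, ignore.case = TRUE)

# Same rule with the shared prefix factored out
grouped <- grepl("immigra(tion|nt)", x, ignore.case = TRUE)

print(either)
print(identical(either, grouped))
```

"migration patterns" is not matched because neither alternative occurs in it as a substring.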

We can also use a question mark to make the preceding character optional.

nrow(tweets)

## [1] 3866

length(grep('immigr?', tweets$text, ignore.case=TRUE))

## [1] 91

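Here the r is optional and grep matches substrings, so the pattern matches any text containing immig or immigr. This covers immigration, immigrant, immigrants, etc.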
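To see what the question mark does and does not do, here is a minimal sketch on made-up words. The word "immigate" is a deliberate misspelling included to show that the optional r widens the match.

```r
# Hypothetical toy vector; "immigate" is an intentional misspelling
x <- c("immigration", "immigrant", "immigrants", "immigate", "migrant")

# "r?" makes only the r optional, so any string containing "immig" matches
loose <- grepl("immigr?", x)

# "s?" makes the final s optional: matches "immigrant" and "immigrants"
plural <- grepl("immigrants?", x)

print(loose)
print(plural)
```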

Other common expression patterns are:

. matches any single character; ^ and $ match the beginning and the end of a string.
A character followed by {3}, * or + is matched exactly 3 times, 0 or more times, or 1 or more times, respectively.
[0-9], [a-zA-Z] and [[:alnum:]] match any digit, any letter, or any digit or letter.
Special characters such as ., \, ( or ) must be preceded by a backslash to be matched literally; inside an R string the backslash itself must be doubled, as in "\\.".
See ?regex for more details.
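The patterns listed above can be tried out on a small made-up vector. This is only a sketch; the strings in x were chosen so that each pattern matches a different subset.

```r
# Hypothetical toy vector
x <- c("abc", "a.c", "aaa1", "2018", "end.")

any_char    <- grepl("a.c", x)              # "." stands for any single character
literal_dot <- grepl("a\\.c", x)            # escaped dot matches only a real "."
starts_a    <- grepl("^a", x)               # string begins with "a"
ends_dot    <- grepl("\\.$", x)             # string ends with a literal "."
three_a     <- grepl("a{3}", x)             # "a" repeated exactly three times in a row
digits_only <- grepl("^[0-9]+$", x)         # one or more digits and nothing else
alnum_only  <- grepl("^[[:alnum:]]+$", x)   # letters and digits only

print(rbind(any_char, literal_dot, starts_a, ends_dot,
            three_a, digits_only, alnum_only))
```

Note the double brackets in [[:alnum:]]: the class name [:alnum:] is only valid inside a bracket expression.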

For example, how many tweets end with an exclamation mark? How many are retweets? How many mention a username? How many contain a hashtag?

length(grep('!$', tweets$text, ignore.case=TRUE))

## [1] 1528

length(grep('^RT @', tweets$text, ignore.case=TRUE))

## [1] 419

length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))

## [1] 1018

length(grep('#[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))

## [1] 581
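The same four counts can be checked by hand on a few made-up tweets. This sketch uses sum(grepl(...)), which gives the same number as length(grep(...)); the last string is included to show that the handle pattern also matches an e-mail address.

```r
# Hypothetical toy tweets
x <- c("RT @WhiteHouse: Big news!",
       "Thank you @FLOTUS #MAGA",
       "No tags here.",
       "Email me at a@b.com!")

n_exclaim  <- sum(grepl("!$", x))               # ends with "!"
n_retweet  <- sum(grepl("^RT @", x))            # starts with "RT @"
n_handles  <- sum(grepl("@[A-Za-z0-9_]+", x))   # contains "@" followed by a name
n_hashtags <- sum(grepl("#[A-Za-z0-9_]+", x))   # contains "#" followed by a tag

print(c(n_exclaim, n_retweet, n_handles, n_hashtags))
```

The e-mail address is counted as a mention because the pattern only looks for "@" followed by word characters, so counts like these are upper bounds.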

More complex examples of regular expressions using stringr

stringr is an R package that extends R’s capabilities for text manipulation. Suppose, for example, that we want to replace a pattern (a regular expression) with another string:

library(stringr)
str_replace(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')

## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! @AmyKremer"

Note that this replaces only the first instance. To replace all instances, use str_replace_all:

str_replace_all(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')

## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! USERNAME"
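If stringr is not available, base R's sub and gsub do roughly the same job as str_replace and str_replace_all. The sketch below uses an abridged version of the second tweet as its input.

```r
# Abridged version of the second tweet, used only for illustration
txt <- "Amy Kremer was so great on @foxandfriends. Thank you Amy! @AmyKremer"
pattern <- "@[0-9_A-Za-z]+"

first_only <- sub(pattern, "USERNAME", txt)    # replaces the first match only
all_hits   <- gsub(pattern, "USERNAME", txt)   # replaces every match

print(first_only)
print(all_hits)
```

Note the argument order: sub and gsub take the pattern first and the text last, whereas the stringr functions take the text first.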

To extract a pattern we can use str_extract, and again we can extract one or all instances of the pattern:

str_extract(tweets$text[2], '@[0-9_A-Za-z]+')

## [1] "@foxandfriends"

str_extract_all(tweets$text[2], '@[0-9_A-Za-z]+')

## [[1]]
## [1] "@foxandfriends" "@AmyKremer"
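For comparison, a base R sketch of the same extraction uses regmatches together with regexpr (first match) or gregexpr (all matches). The input is an abridged version of the second tweet, and mixed is a made-up vector added to show one behavioral difference.

```r
# Abridged version of the second tweet, used only for illustration
txt <- "Amy Kremer was so great on @foxandfriends. Thank you Amy! @AmyKremer"
pattern <- "@[0-9_A-Za-z]+"

first_match <- regmatches(txt, regexpr(pattern, txt))          # first handle
all_matches <- regmatches(txt, gregexpr(pattern, txt))[[1]]    # every handle

# With regexpr, elements without a match are dropped from the result
mixed <- c(txt, "no handle here")
first_per_element <- regmatches(mixed, regexpr(pattern, mixed))

print(first_match)
print(all_matches)
print(length(first_per_element))
```

Unlike str_extract, which returns NA for an element without a match, the regexpr version silently returns a shorter vector, so the result no longer lines up with the input.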

These functions are vectorized, which means we can apply them to all elements of a vector at once. str_extract_all returns a list with one element per tweet, which we can flatten into a vector with unlist to build a frequency table of the most frequently mentioned handles or hashtags:

handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles[1:3]

## [[1]]
## character(0)
## 
## [[2]]
## [1] "@foxandfriends" "@AmyKremer"    
## 
## [[3]]
## [1] "@HenryMcMaster"

handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)

## handles_vector
##   @foxandfriends @realDonaldTrump      @WhiteHouse         @FoxNews 
##              122              109              106               79 
##           @POTUS          @FLOTUS         @nytimes       @Scavino45 
##               50               48               34               32 
##     @IvankaTrump       @EricTrump 
##               31               27

# now with hashtags...
hashtags <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
hashtags[1:3]

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)

hashtags_vector <- unlist(hashtags)
head(sort(table(hashtags_vector), decreasing = TRUE), n=10)

## hashtags_vector
##                  #MAGA                   #USA          #AmericaFirst 
##                     77                     32                     19 
##              #FakeNews #MakeAmericaGreatAgain             #TaxReform 
##                     17                     13                     12 
##                  #UNGA       #HurricaneHarvey                 #ICYMI 
##                     12                     11                     10 
##            #PuertoRico 
##                      8
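Twitter handles are not case-sensitive, so one possible refinement is to lowercase them before tabulating. The sketch below assumes a small made-up vector and uses base R's regmatches and gregexpr in place of str_extract_all, so that it runs without extra packages.

```r
# Hypothetical toy tweets with the same handle written in different cases
x <- c("Thanks @FoxNews and @foxnews", "@WhiteHouse update",
       "Watch @FOXNEWS tonight", "no mentions")

# One list element per tweet, holding every handle found in it
handles <- regmatches(x, gregexpr("@[0-9_A-Za-z]+", x))

# Flatten, normalize the case, then count and sort
counts <- sort(table(tolower(unlist(handles))), decreasing = TRUE)

print(lengths(handles))
print(counts)
```

Without tolower, "@FoxNews", "@foxnews" and "@FOXNEWS" would each be counted as a separate handle.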

Pablo Barbera

October 16, 2017
