This RMarkdown file provides additional information about the basic tools of text analysis that we will use in this course to clean text data scraped from the web.
We will start with basic string manipulation with R.
Our running example will be the set of tweets posted by Donald Trump’s Twitter account since January 1st, 2018.
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Warning: package 'rjson' was built under R version 3.4.4
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
head(tweets)
## text
## 1 We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY
## 2 Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer
## 3 Thank you South Carolina. Now let’s get out tomorrow and VOTE for @HenryMcMaster! https://t.co/5xlz0wfMfu
## 4 Just watched @SharkGregNorman on @foxandfriends. Said “President is doing a great job. All over the world, people want to come back to the U.S.” Thank you Greg, and you’re looking and doing great!
## 5 Russia continues to say they had nothing to do with Meddling in our Election! Where is the DNC Server, and why didn’t Shady James Comey and the now disgraced FBI agents take and closely examine it? Why isn’t Hillary/Russia being looked at? So many questions, so much corruption!
## 6 Statement on Justice Anthony Kennedy. #SCOTUS https://t.co/8aWJ6fWemA
## retweet_count favorite_count favorited truncated id_str
## 1 30514 89162 FALSE FALSE 1010246126820347906
## 2 9382 50425 FALSE FALSE 1012297599431401474
## 3 13631 56997 FALSE FALSE 1011422555947712513
## 4 12007 62025 FALSE FALSE 1012299239207198721
## 5 23077 92661 FALSE FALSE 1012295859072126977
## 6 11138 47234 FALSE FALSE 1012051330591023107
## in_reply_to_screen_name
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 4 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 1 FALSE Fri Jun 22 19:40:20 +0000 2018 NA
## 2 FALSE Thu Jun 28 11:32:09 +0000 2018 NA
## 3 FALSE Tue Jun 26 01:35:03 +0000 2018 NA
## 4 FALSE Thu Jun 28 11:38:40 +0000 2018 NA
## 5 FALSE Thu Jun 28 11:25:15 +0000 2018 NA
## 6 FALSE Wed Jun 27 19:13:34 +0000 2018 NA
## in_reply_to_user_id_str lang listed_count verified location
## 1 NA en 89807 TRUE Washington, DC
## 2 NA en 89807 TRUE Washington, DC
## 3 NA en 89807 TRUE Washington, DC
## 4 NA en 89807 TRUE Washington, DC
## 5 NA en 89807 TRUE Washington, DC
## 6 NA en 89807 TRUE Washington, DC
## user_id_str
## 1 25073877
## 2 25073877
## 3 25073877
## 4 25073877
## 5 25073877
## 6 25073877
## description
## 1 45th President of the United States of America\U0001f1fa\U0001f1f8
## 2 45th President of the United States of America\U0001f1fa\U0001f1f8
## 3 45th President of the United States of America\U0001f1fa\U0001f1f8
## 4 45th President of the United States of America\U0001f1fa\U0001f1f8
## 5 45th President of the United States of America\U0001f1fa\U0001f1f8
## 6 45th President of the United States of America\U0001f1fa\U0001f1f8
## geo_enabled user_created_at statuses_count
## 1 TRUE Wed Mar 18 13:46:38 +0000 2009 38073
## 2 TRUE Wed Mar 18 13:46:38 +0000 2009 38073
## 3 TRUE Wed Mar 18 13:46:38 +0000 2009 38073
## 4 TRUE Wed Mar 18 13:46:38 +0000 2009 38073
## 5 TRUE Wed Mar 18 13:46:38 +0000 2009 38073
## 6 TRUE Wed Mar 18 13:46:38 +0000 2009 38073
## followers_count favourites_count protected user_url
## 1 53101783 25 FALSE https://t.co/OMxB0x7xC5
## 2 53101783 25 FALSE https://t.co/OMxB0x7xC5
## 3 53101783 25 FALSE https://t.co/OMxB0x7xC5
## 4 53101783 25 FALSE https://t.co/OMxB0x7xC5
## 5 53101783 25 FALSE https://t.co/OMxB0x7xC5
## 6 53101783 25 FALSE https://t.co/OMxB0x7xC5
## name time_zone user_lang utc_offset friends_count
## 1 Donald J. Trump NA en NA 47
## 2 Donald J. Trump NA en NA 47
## 3 Donald J. Trump NA en NA 47
## 4 Donald J. Trump NA en NA 47
## 5 Donald J. Trump NA en NA 47
## 6 Donald J. Trump NA en NA 47
## screen_name country_code country place_type full_name place_name
## 1 realDonaldTrump <NA> NA NA <NA> <NA>
## 2 realDonaldTrump <NA> NA NA <NA> <NA>
## 3 realDonaldTrump <NA> NA NA <NA> <NA>
## 4 realDonaldTrump <NA> NA NA <NA> <NA>
## 5 realDonaldTrump <NA> NA NA <NA> <NA>
## 6 realDonaldTrump <NA> NA NA <NA> <NA>
## place_id place_lat place_lon lat lon
## 1 <NA> NaN NaN NA NA
## 2 <NA> NaN NaN NA NA
## 3 <NA> NaN NaN NA NA
## 4 <NA> NaN NaN NA NA
## 5 <NA> NaN NaN NA NA
## 6 <NA> NaN NaN NA NA
## expanded_url
## 1 https://www.pscp.tv/w/bf1GFzFvTlFsTFJub1dwUXd8MWpNSmdFVll5ZUFLTAWuHc0BMMKeCOoDRCPmtIftVLaFLQVwfSLoC_C0SbzX?t=9m9s
## 2 <NA>
## 3 https://www.pscp.tv/w/bgGOtTFvTlFsTFJub1dwUXd8MXlvSk1WZHJWQm54Uf-J8fPu1RO4E84ax-LuK1bAbiCpnzBBZmdPfI9FAhGV?t=11s
## 4 <NA>
## 5 <NA>
## 6 <NA>
## url
## 1 https://t.co/ZjXESYAcjY
## 2 <NA>
## 3 https://t.co/5xlz0wfMfu
## 4 <NA>
## 5 <NA>
## 6 <NA>
R stores strings in a character vector. `length` returns the number of elements in the vector, while `nchar` returns the number of characters in each element.
length(tweets$text)
## [1] 3866
tweets$text[1]
## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY"
nchar(tweets$text[1])
## [1] 280
Note that we can work with multiple strings at once.
nchar(tweets$text[1:10])
## [1] 280 108 105 196 278 69 104 187 230 140
sum(nchar(tweets$text[1:10]))
## [1] 1697
max(nchar(tweets$text[1:10]))
## [1] 280
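The difference between `length` and `nchar` is easy to confuse, so here is a minimal sketch on a toy vector. The vector `x` is hypothetical and is not taken from the tweet file.

```r
# Toy character vector (hypothetical data, not from the tweet file)
x <- c("hello", "text analysis", "")

n_elements <- length(x)   # number of elements in the vector: 3
n_chars    <- nchar(x)    # characters in each element: 5 13 0
total      <- sum(n_chars) # total characters across all elements: 18

n_elements
n_chars
total
```

Note that the empty string still counts as one element, even though it has zero characters.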
We can merge different strings into one using `paste`:
paste(tweets$text[1], tweets$text[2], sep='--')
## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY--Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer"
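As a small sketch of the options `paste` offers, the example below uses two made-up strings (`first` and `second` are hypothetical names). It shows the default separator, a custom separator, `paste0`, and the `collapse` argument, which joins the elements of a single vector.

```r
first  <- "text"
second <- "analysis"

with_space  <- paste(first, second)              # default separator is a single space
with_dashes <- paste(first, second, sep = "--")  # custom separator
no_sep      <- paste0(first, second)             # paste0 uses no separator

# collapse joins the elements of one vector into a single string
collapsed <- paste(c("a", "b", "c"), collapse = "+")

with_space
with_dashes
no_sep
collapsed
```

`sep` controls what goes between the arguments, while `collapse` controls what goes between the elements of the result.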
As we will see later, it is often convenient to convert all words to lowercase or uppercase.
tolower(tweets$text[1])
## [1] "we are gathered today to hear directly from the american victims of illegal immigration. these are the american citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. these are the families the media ignores...https://t.co/zjxesyacjy"
toupper(tweets$text[1])
## [1] "WE ARE GATHERED TODAY TO HEAR DIRECTLY FROM THE AMERICAN VICTIMS OF ILLEGAL IMMIGRATION. THESE ARE THE AMERICAN CITIZENS PERMANENTLY SEPARATED FROM THEIR LOVED ONES B/C THEY WERE KILLED BY CRIMINAL ILLEGAL ALIENS. THESE ARE THE FAMILIES THE MEDIA IGNORES...HTTPS://T.CO/ZJXESYACJY"
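One reason case conversion is convenient is that it lets differently capitalized spellings of the same word be counted together. The sketch below uses a hypothetical vector `words` to illustrate this.

```r
# Three spellings of the same word (hypothetical data)
words <- c("Immigration", "IMMIGRATION", "immigration")

distinct_raw   <- unique(words)           # three distinct strings
distinct_lower <- unique(tolower(words))  # a single string after lowercasing

length(distinct_raw)
length(distinct_lower)
table(tolower(words))
```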
We can grab substrings with `substr`. The first argument is the string, the second is the start index (counting from 1), and the third is the end index (inclusive).
substr(tweets$text[1], 1, 2)
## [1] "We"
substr(tweets$text[1], 1, 10)
## [1] "We are gat"
This is useful when working with date strings as well:
dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"
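The same idea can be sketched a bit further: the extracted pieces are still strings, so they usually need converting before they can be used as numbers. The example reuses the `dates` vector from above and one `created_at` value from the output shown earlier; `as.Date` with an explicit `format` is included as an alternative, not as the approach this tutorial relies on.

```r
dates <- c("2015/01/01", "2014/12/01")

# substr returns character strings; convert them to integers if needed
years  <- as.integer(substr(dates, 1, 4))
months <- as.integer(substr(dates, 6, 7))

# Fixed positions also work on Twitter's created_at format
created <- "Fri Jun 22 19:40:20 +0000 2018"
month_abbr <- substr(created, 5, 7)    # "Jun"
year_str   <- substr(created, 27, 30)  # "2018"

# Alternative: parse the whole string as a date, then format it
parsed <- as.Date(dates, format = "%Y/%m/%d")
parsed_years <- format(parsed, "%Y")

years
months
month_abbr
year_str
parsed_years
```

Position-based extraction only works when every string follows exactly the same layout, which is why parsing into a date object is often the safer choice.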
Let’s dig into the data a little bit more. Given the source of the dataset, we can expect that there will be many tweets mentioning topics such as immigration or health care. We can use the `grep` function to identify these. `grep` returns the indices of the elements in which the pattern occurs.
grep('immigration', tweets$text[1:25])
## [1] 14
`grepl` returns `TRUE` or `FALSE`, indicating whether each element of the character vector contains that particular pattern.
grepl("immigration", tweets$text[1:25])
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE
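To make the contrast between the two functions concrete, here is a minimal sketch on a hypothetical four-element vector `x`. It also shows `value = TRUE`, which makes `grep` return the matching strings instead of their positions.

```r
# Toy vector (hypothetical data)
x <- c("tax reform", "immigration policy", "trade", "illegal immigration")

idx     <- grep("immigration", x)                # positions of matches: 2 4
matches <- grep("immigration", x, value = TRUE)  # the matching strings themselves
flags   <- grepl("immigration", x)               # one TRUE/FALSE per element
n_match <- sum(flags)                            # TRUE counts as 1, so this is 2

idx
matches
flags
n_match
```

Since `which(flags)` gives the same positions as `grep`, the logical vector from `grepl` carries all the information and is often easier to combine with other conditions.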
Going back to the full dataset, we can use the results of `grep` to get particular rows. First, check how many tweets mention the word “immigration”.
nrow(tweets)
## [1] 3866
grep('immigration', tweets$text)
## [1] 14 60 74 75 79 92 102 108 109 121 122 125 133 151
## [15] 166 185 193 531 605 614 621 789 827 834 835 1024 1111 1136
## [29] 1142 1162 1179 1183 1200 1202 1212 1229 1246 1296 1301 1308 1539 1649
## [43] 1785 1970 2348 2550 2899 2908 3234 3289 3347 3380 3547 3637 3684
length(grep('immigration', tweets$text))
## [1] 55
It is important to note that matching is case-sensitive. You can use the `ignore.case` argument to make the match case-insensitive.
nrow(tweets)
## [1] 3866
length(grep('immigration', tweets$text))
## [1] 55
length(grep('immigration', tweets$text, ignore.case = TRUE))
## [1] 77
Now let’s try to identify which tweets are related to immigration and store them in a smaller data frame. How would we do it?
immi_tweets <- tweets[grep('immigration', tweets$text, ignore.case=TRUE),]
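The same subsetting can be sketched with `grepl` on a small, made-up data frame (`df` and its columns are hypothetical). A logical index has one practical advantage: it can be negated safely, whereas negating the result of `grep` returns zero rows when nothing matches.

```r
# Toy data frame standing in for the tweets (hypothetical data)
df <- data.frame(
  id   = 1:4,
  text = c("Immigration reform now", "Tax cuts",
           "Border and IMMIGRATION", "Trade deal"),
  stringsAsFactors = FALSE
)

is_immi <- grepl("immigration", df$text, ignore.case = TRUE)

immi  <- df[is_immi, ]   # rows that mention immigration
other <- df[!is_immi, ]  # all remaining rows

# Pitfall: no row mentions "healthcare", so grep returns integer(0)
# and negative indexing drops every row instead of keeping every row
dropped_all <- df[-grep("healthcare", df$text), ]
kept_all    <- df[!grepl("healthcare", df$text), ]

nrow(immi)
nrow(other)
nrow(dropped_all)
nrow(kept_all)
```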
Another useful tool for working with text data is the regular expression. You can learn more about regular expressions here. Regular expressions let us develop complicated rules for both matching strings and extracting elements from them.
For example, we could look for tweets that match any one of several patterns using the operator “|” (equivalent to “OR”):
nrow(tweets)
## [1] 3866
length(grep('immigration|immigrant', tweets$text, ignore.case=TRUE))
## [1] 91
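A minimal sketch of alternation on a hypothetical vector `x` follows. The second pattern uses parentheses to share the common prefix; it is offered as an equivalent way of writing the same rule, not as something taken from the original analysis.

```r
# Toy vector (hypothetical data)
x <- c("immigration law", "an immigrant family", "migration", "Immigrants welcome")

# Either alternative may match
either <- grepl("immigration|immigrant", x, ignore.case = TRUE)

# Same rule, with the shared prefix factored out using a group
grouped <- grepl("immigra(tion|nt)", x, ignore.case = TRUE)

either
grouped
```

The third element is not matched because "migration" lacks the leading "im" that both alternatives require.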
We can also use a question mark to make the preceding character optional.
nrow(tweets)
## [1] 3866
length(grep('immigr?', tweets$text, ignore.case=TRUE))
## [1] 91
Because the final “r” is optional, this pattern matches any text containing “immig”, which includes immigration, immigrant, immigrants, etc.
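To see exactly what the question mark does, the sketch below compares two patterns on a hypothetical vector `x`. The `?` applies only to the character (or group) immediately before it.

```r
# Toy vector (hypothetical data)
x <- c("immigrant", "immigrants", "immigration", "immigrate", "immig", "emigrate")

# The final "r" is optional, so anything containing "immig" matches
optional_r <- grepl("immigr?", x)

# Here only the final "s" is optional: matches "immigrant" and "immigrants"
optional_s <- grepl("immigrants?", x)

optional_r
optional_s
```

A pattern such as `immigr?` is therefore no more restrictive than `immig` on its own when used for detection.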
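Other common expression patterns are:

- `.` matches any character; `^` and `$` match the beginning and end of a string.
- A character followed by `{3}`, `*`, or `+` is matched exactly 3 times, 0 or more times, or 1 or more times.
- `[0-9]`, `[a-zA-Z]`, and `[:alnum:]` match any digit, any letter, or any digit or letter. In R, `[:alnum:]` must itself appear inside a bracket expression, as in `[[:alnum:]]`.
- Special characters such as `.`, `\`, `(` or `)` must be preceded by a backslash to be matched literally. Inside an R string the backslash itself is doubled, as in `'\\.'`.
- See `?regex` for more details.

For example, how many tweets end with an exclamation mark? How many tweets are retweets? How many tweets mention any username? And how many contain a hashtag?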
length(grep('!$', tweets$text, ignore.case=TRUE))
## [1] 1528
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 419
length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 1018
length(grep('#[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 581
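The pattern elements listed above can be tried out on a small hypothetical vector before running them on the full dataset. This is only a sketch; the strings in `x` are made up.

```r
# Toy vector (hypothetical data)
x <- c("Great job!", "RT @user: thanks", "Call 555-1234", "v1.0 released", "#MAGA")

ends_bang  <- grepl("!$", x)                # ends with an exclamation mark
is_rt      <- grepl("^RT @", x)             # starts with "RT @"
three_dig  <- grepl("[0-9]{3}", x)          # three digits in a row
only_tag   <- grepl("^#[A-Za-z0-9_]+$", x)  # the whole string is one hashtag
has_alnum  <- grepl("[[:alnum:]]", x)       # at least one letter or digit

# An unescaped dot matches any character; an escaped dot matches only "."
dot_any     <- grepl("1.0", "150")     # TRUE: "5" stands in for the dot
dot_literal <- grepl("1\\.0", "150")   # FALSE: there is no literal "."
dot_in_x    <- grepl("1\\.0", x)       # TRUE only for "v1.0 released"

ends_bang
is_rt
three_dig
only_tag
dot_in_x
```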
`stringr` is an R package that extends R’s capabilities for text manipulation. For example, say we want to replace a pattern (or a regular expression) with another string:
library(stringr)
str_replace(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! @AmyKremer"
Note this will only replace the first instance. For all the instances, do:
str_replace_all(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! USERNAME"
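For readers without `stringr` installed, base R offers `sub` and `gsub`, which behave analogously for this task: `sub` replaces the first match and `gsub` replaces every match. The sketch below uses a shortened, made-up string rather than the actual tweet.

```r
# Shortened example string (hypothetical, modeled on the tweet above)
tweet   <- "thanks @foxandfriends and @AmyKremer"
pattern <- "@[0-9_A-Za-z]+"

first_only <- sub(pattern, "USERNAME", tweet)   # replaces the first handle only
all_hits   <- gsub(pattern, "USERNAME", tweet)  # replaces every handle

first_only
all_hits
```

Note the argument order: the base R functions take the pattern first and the string last, while the `stringr` functions take the string first.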
To extract a pattern we can use `str_extract`, and again we can extract one or all instances of the pattern:
str_extract(tweets$text[2], '@[0-9_A-Za-z]+')
## [1] "@foxandfriends"
str_extract_all(tweets$text[2], '@[0-9_A-Za-z]+')
## [[1]]
## [1] "@foxandfriends" "@AmyKremer"
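Extraction can also be sketched in base R with `regmatches` combined with `regexpr` (first match) or `gregexpr` (all matches). This is an alternative to the `stringr` calls above, shown on a hypothetical two-element vector; one difference worth knowing is that the `regexpr` form silently drops elements with no match.

```r
# Toy vector (hypothetical data): the first element has no handle
x       <- c("no handles here", "thanks @foxandfriends and @AmyKremer")
pattern <- "@[0-9_A-Za-z]+"

# First match per element; elements without a match are dropped
first_match <- regmatches(x, regexpr(pattern, x))

# All matches, returned as a list with one entry per element
all_matches <- regmatches(x, gregexpr(pattern, x))

first_match
all_matches
```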
This function is vectorized, which means we can apply it to all elements of a vector simultaneously. That will give us a list, which we can then turn into a vector to get a frequency table of the most frequently mentioned handles or hashtags:
handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles[1:3]
## [[1]]
## character(0)
##
## [[2]]
## [1] "@foxandfriends" "@AmyKremer"
##
## [[3]]
## [1] "@HenryMcMaster"
handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)
## handles_vector
## @foxandfriends @realDonaldTrump @WhiteHouse @FoxNews
## 122 109 106 79
## @POTUS @FLOTUS @nytimes @Scavino45
## 50 48 34 32
## @IvankaTrump @EricTrump
## 31 27
# now with hashtags...
hashtags <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
hashtags[1:3]
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
hashtags_vector <- unlist(hashtags)
head(sort(table(hashtags_vector), decreasing = TRUE), n=10)
## hashtags_vector
## #MAGA #USA #AmericaFirst
## 77 32 19
## #FakeNews #MakeAmericaGreatAgain #TaxReform
## 17 13 12
## #UNGA #HurricaneHarvey #ICYMI
## 12 11 10
## #PuertoRico
## 8
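The extract, flatten, and tabulate steps can be sketched end to end on a few made-up strings. The example uses base R `regmatches` and `gregexpr` in place of `str_extract_all` so that it runs without extra packages, and it adds a lowercased count as an assumption about what one might want, since hashtags that differ only in capitalization are otherwise counted separately.

```r
# Toy texts (hypothetical data)
texts <- c("#MAGA rally tonight", "Tax day #TaxReform #MAGA",
           "no tags", "#maga again")

# Step 1: extract all hashtags, one list entry per text
tags <- regmatches(texts, gregexpr("#[A-Za-z0-9_]+", texts))

# Step 2: flatten the list into a single character vector
tags_vector <- unlist(tags)

# Step 3: tabulate and sort by frequency
counts <- sort(table(tags_vector), decreasing = TRUE)

# Variant: merge capitalization differences before counting
counts_lower <- sort(table(tolower(tags_vector)), decreasing = TRUE)

counts
counts_lower
```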