This RMarkdown offers an overview of the basic tools of text analysis that we will use in this course.
We will start with basic string manipulation with R.
Our running example will be the set of tweets posted by Donald Trump’s Twitter account, downloaded from https://www.thetrumparchive.com/.
tweets <- read.csv("../data/trump-tweets.csv", stringsAsFactors = FALSE)
head(tweets)
## id
## 1 9.845497e+16
## 2 1.234653e+18
## 3 1.218011e+18
## 4 1.304875e+18
## 5 1.218160e+18
## 6 1.217963e+18
## text
## 1 Republicans and Democrats have both created our economic problems.
## 2 I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST! Thank you for a wonderful evening!! #KAG2020 https://t.co/dNJZfRsl9y
## 3 RT @CBS_Herridge: READ: Letter to surveillance court obtained by CBS News questions where there will be further disciplinary action and cho…
## 4 The Unsolicited Mail In Ballot Scam is a major threat to our Democracy, & the Democrats know it. Almost all recent elections using this system, even though much smaller & with far fewer Ballots to count, have ended up being a disaster. Large numbers of missing Ballots & Fraud!
## 5 RT @MZHemingway: Very friendly telling of events here about Comey's apparent leaking to compliant media. If you read those articles and tho…
## 6 RT @WhiteHouse: President @realDonaldTrump announced historic steps to protect the Constitutional right to pray in public schools! https://…
## isRetweet isDeleted device favorites retweets date
## 1 f f TweetDeck 49 255 2011-08-02 18:07:48
## 2 f f Twitter for iPhone 73748 17404 2020-03-03 01:34:50
## 3 t f Twitter for iPhone 0 7396 2020-01-17 03:22:47
## 4 f f Twitter for iPhone 80527 23502 2020-09-12 20:10:58
## 5 t f Twitter for iPhone 0 9081 2020-01-17 13:13:59
## 6 t f Twitter for iPhone 0 25048 2020-01-17 00:11:56
## isFlagged
## 1 f
## 2 f
## 3 f
## 4 f
## 5 f
## 6 f
# let's order by date
tweets <- tweets[order(tweets$date),]
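The `date` column is stored as a character string in “YYYY-MM-DD HH:MM:SS” format, which happens to sort correctly as plain text. As a sketch (the `datetime` column name and the UTC timezone are our own assumptions, not part of the original data), we could also parse it into a proper date-time before ordering:
# parse the date string into a POSIXct date-time (format assumed from the output above)
tweets$datetime <- as.POSIXct(tweets$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
tweets <- tweets[order(tweets$datetime), ]  # gives the same ordering as sorting the strings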
R stores strings in character vectors. `length` gives the number of elements in the vector, while `nchar` gives the number of characters in each element.
length(tweets$text)
## [1] 56571
tweets$text[1]
## [1] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!"
nchar(tweets$text[1])
## [1] 117
Note that we can work with multiple strings at once.
nchar(tweets$text[1:10])
## [1] 117 131 116 103 113 111 114 108 118 115
sum(nchar(tweets$text[1:10]))
## [1] 1146
max(nchar(tweets$text[1:10]))
## [1] 131
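To locate the longest of these tweets, here is a small sketch using base R’s `which.max`:
# position of the longest tweet among the first ten
which.max(nchar(tweets$text[1:10]))  # the second tweet, with 131 characters, per the output above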
We can merge different strings into one using `paste`:
paste(tweets$text[1], tweets$text[2], sep='--')
## [1] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!--Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion!"
As we will see later, it is often convenient to convert all words to lowercase or uppercase.
tolower(tweets$text[1])
## [1] "be sure to tune in and watch donald trump on late night with david letterman as he presents the top ten list tonight!"
toupper(tweets$text[1])
## [1] "BE SURE TO TUNE IN AND WATCH DONALD TRUMP ON LATE NIGHT WITH DAVID LETTERMAN AS HE PRESENTS THE TOP TEN LIST TONIGHT!"
We can grab substrings with `substr`. The first argument is the string, the second is the starting index (counting from 1), and the third is the final index.
substr(tweets$text[1], 1, 2)
## [1] "Be"
substr(tweets$text[1], 1, 10)
## [1] "Be sure to"
This is useful when working with date strings as well:
dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"
paste(substr(dates, 1, 4),
substr(dates, 6, 7), sep="-")
## [1] "2015-01" "2014-12"
Let’s dig into the data a little bit more. Given the source of the dataset, we can expect that many tweets will mention “Trump”. We can use the `grep` command to identify these. `grep` returns the indices of the elements that contain the pattern.
grep('Trump', tweets$text[1:10])
## [1] 1 2 3 5 6 7 8 10
`grepl` returns `TRUE` or `FALSE`, indicating whether each element of the character vector contains that particular pattern.
grepl("Trump", tweets$text[1:10])
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
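Since `grepl` returns a logical vector, summing it is a quick way to count matches; a brief sketch:
# TRUE counts as 1 and FALSE as 0, so the sum is the number of matches
sum(grepl("Trump", tweets$text[1:10]))  # 8, the number of TRUE values above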
Going back to the full dataset, we can use the results of `grep` to select particular rows. First, check how many tweets mention the word “Trump”.
nrow(tweets)
## [1] 56571
head(grep('Trump', tweets$text))
## [1] 1 2 3 5 6 7
length(grep('Trump', tweets$text))
## [1] 17824
It is important to note that matching is case-sensitive by default. You can use the `ignore.case` argument to make the match case-insensitive.
nrow(tweets)
## [1] 56571
length(grep('Trump', tweets$text))
## [1] 17824
length(grep('Trump', tweets$text, ignore.case = TRUE))
## [1] 18356
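Equivalently, tying back to `tolower` above, we could lowercase the text first and match a lowercase pattern; this sketch should give the same count:
# lowercasing the text first is equivalent to ignore.case = TRUE here
length(grep("trump", tolower(tweets$text)))  # should also be 18356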
Now let’s identify which tweets mention the substring “Trump” and store them in a smaller data frame. How would we do it?
self_tweets <- tweets[grep('Trump', tweets$text, ignore.case=TRUE),]
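An equivalent approach, sketched below, uses `grepl` to build a logical index; the number of rows should match the count computed above (18,356):
# logical subsetting with grepl selects the same rows
self_tweets <- tweets[grepl("Trump", tweets$text, ignore.case = TRUE), ]
nrow(self_tweets)  # should equal 18356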
Another useful tool for working with text data is regular expressions. Regular expressions let us develop complex rules for matching strings and extracting elements from them; see R’s own documentation (`?regex`) to learn more.
For example, we can look for tweets that mention either of two words using the “|” operator (equivalent to “OR”):
nrow(tweets)
## [1] 56571
length(grep('immigration|immigrant', tweets$text, ignore.case=TRUE))
## [1] 433
We can also use a question mark to make the preceding character optional.
nrow(tweets)
## [1] 56571
length(grep('immigrants?', tweets$text, ignore.case=TRUE))
## [1] 107
length(grep('immigrant|immigrants', tweets$text, ignore.case=TRUE))
## [1] 107
Both patterns match either “immigrant” or “immigrants”.
Other common regular expression patterns are:

- `.` matches any character; `^` and `$` match the beginning and end of a string.
- `{3}`, `*`, and `+` match the preceding character exactly 3 times, 0 or more times, or 1 or more times, respectively.
- `[0-9]`, `[a-zA-Z]`, and `[:alnum:]` match any digit, any letter, or any digit or letter, respectively.
- Special characters such as `.`, `\`, `(`, or `)` must be preceded by a backslash to be matched literally.
- See `?regex` for more details.
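As a minimal sketch, here is how some of these patterns behave on a few made-up strings (the toy vector `x` is our own example, not part of the dataset):
x <- c("cat", "caat", "ct", "cost 3.50")
grepl("c.t", x)    # . matches any single character between c and t
grepl("ca*t", x)   # "a" repeated zero or more times
grepl("ca+t", x)   # "a" repeated one or more times
grepl("a{2}", x)   # two consecutive "a"s
grepl("^cat$", x)  # the entire string is exactly "cat"
grepl("\\.", x)    # a literal period; in R strings the backslash itself must be doubled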
For example, how many tweets end with an exclamation mark? How many tweets are retweets? How many tweets mention a username? And a hashtag?
length(grep('!$', tweets$text, ignore.case=TRUE))
## [1] 11954
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 9701
length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 32850
length(grep('#[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 7397
`stringr` is an R package that extends R’s capabilities for manipulating text. Let’s say, for example, that we want to replace a pattern (or a regular expression) with another string:
library(stringr)
tweets$text[10000]
## [1] "\"\"\"@KevinMartinRI: I'm a big fan of the new @realDonaldTrump ties. http://t.co/Fka9s2D0e8\"\" Thanks, selling great at.Macy's!\""
str_replace(tweets$text[10000], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "\"\"\"USERNAME: I'm a big fan of the new @realDonaldTrump ties. http://t.co/Fka9s2D0e8\"\" Thanks, selling great at.Macy's!\""
Note that this only replaces the first instance. To replace all instances, use `str_replace_all`:
str_replace_all(tweets$text[10000], '@[:alnum:]+', 'USERNAME')
## [1] "\"\"\"USERNAME: I'm a big fan of the new USERNAME ties. http://t.co/Fka9s2D0e8\"\" Thanks, selling great at.Macy's!\""
To extract a pattern we can use `str_extract`, and again we can extract either the first or all instances of the pattern:
str_extract(tweets$text[10000], '@[0-9_A-Za-z]+')
## [1] "@KevinMartinRI"
str_extract_all(tweets$text[10000], '@[0-9_A-Za-z]+')
## [[1]]
## [1] "@KevinMartinRI" "@realDonaldTrump"
These functions are vectorized, which means we can apply them to all elements of a character vector simultaneously. `str_extract_all` returns a list, which we can then flatten into a vector to get a frequency table of the most frequently mentioned handles or hashtags:
handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles[1:3]
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)
## handles_vector
## @realDonaldTrump @FoxNews @WhiteHouse @BarackObama
## 10955 938 840 738
## @foxandfriends @CNN @ApprenticeNBC @IvankaTrump
## 703 395 393 326
## @TeamTrump @MittRomney
## 323 318
# now with hashtags...
hashtags <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
hashtags[1:3]
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
hashtags_vector <- unlist(hashtags)
head(sort(table(hashtags_vector), decreasing = TRUE), n=10)
## hashtags_vector
## #Trump2016 #MakeAmericaGreatAgain #MAGA
## 761 557 524
## #CelebApprentice #1 #CelebrityApprentice
## 289 144 137
## #AmericaFirst #TimeToGetTough #Trump
## 107 95 81
## #DrainTheSwamp
## 78