Introduction to text analysis

This RMarkdown document offers an overview of the basic text analysis tools that we will use in this course.

String manipulation with R

We will start with basic string manipulation with R.

Our running example will be the set of tweets posted by Donald Trump’s Twitter account, downloaded from https://www.thetrumparchive.com/.

tweets <- read.csv("../data/trump-tweets.csv", stringsAsFactors = FALSE)
head(tweets)

##             id
## 1 9.845497e+16
## 2 1.234653e+18
## 3 1.218011e+18
## 4 1.304875e+18
## 5 1.218160e+18
## 6 1.217963e+18
##                                                                                                                                                                                                                                                                                                 text
## 1                                                                                                                                                                                                                                 Republicans and Democrats have both created our economic problems.
## 2            I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST! Thank you for a wonderful evening!! #KAG2020 https://t.co/dNJZfRsl9y
## 3                                                                                                                                                       RT @CBS_Herridge: READ: Letter to surveillance court obtained by CBS News questions where there will be further disciplinary action and cho…
## 4 The Unsolicited Mail In Ballot Scam is a major threat to our Democracy, &amp; the Democrats know it. Almost all recent elections using this system, even though much smaller &amp;  with far fewer Ballots to count, have ended up being a disaster. Large numbers of missing Ballots &amp; Fraud!
## 5                                                                                                                                                       RT @MZHemingway: Very friendly telling of events here about Comey's apparent leaking to compliant media. If you read those articles and tho…
## 6                                                                                                                                                       RT @WhiteHouse: President @realDonaldTrump announced historic steps to protect the Constitutional right to pray in public schools! https://…
##   isRetweet isDeleted             device favorites retweets                date
## 1         f         f          TweetDeck        49      255 2011-08-02 18:07:48
## 2         f         f Twitter for iPhone     73748    17404 2020-03-03 01:34:50
## 3         t         f Twitter for iPhone         0     7396 2020-01-17 03:22:47
## 4         f         f Twitter for iPhone     80527    23502 2020-09-12 20:10:58
## 5         t         f Twitter for iPhone         0     9081 2020-01-17 13:13:59
## 6         t         f Twitter for iPhone         0    25048 2020-01-17 00:11:56
##   isFlagged
## 1         f
## 2         f
## 3         f
## 4         f
## 5         f
## 6         f

# let's order by date
tweets <- tweets[order(tweets$date),]

R stores strings in character vectors. length returns the number of elements in the vector, while nchar returns the number of characters in each element.

length(tweets$text)

## [1] 56571

tweets$text[1]

## [1] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!"

nchar(tweets$text[1])

## [1] 117

Note that we can work with multiple strings at once.

nchar(tweets$text[1:10])

##  [1] 117 131 116 103 113 111 114 108 118 115

sum(nchar(tweets$text[1:10]))

## [1] 1146

max(nchar(tweets$text[1:10]))

## [1] 131
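
The difference between length and nchar is easiest to see on a small vector. The following is a minimal sketch; the vector x and the variable names are made up for illustration and are not part of the tweet data.

```r
# Toy vector, made up for illustration (not taken from the tweet data)
x <- c("a", "bcd", "")

n_elements <- length(x)      # number of strings in the vector: 3
n_chars <- nchar(x)          # number of characters in each string: 1 3 0
total_chars <- sum(n_chars)  # total characters across all strings: 4
```

Note that the empty string still counts as one element of the vector, even though it has zero characters.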

We can merge different strings into one using paste:

paste(tweets$text[1], tweets$text[2], sep='--')

## [1] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!--Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion!"
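
paste has two separator arguments that are easy to confuse: sep goes between the arguments, while collapse joins the elements of a single vector. A short sketch, using made-up strings rather than the tweets:

```r
# Toy strings, made up for illustration
x <- c("first tweet", "second tweet", "third tweet")

# sep is placed between the arguments, element by element
joined_pair <- paste(x[1], x[2], sep = "--")
# "first tweet--second tweet"

# collapse joins all elements of one vector into a single string
joined_all <- paste(x, collapse = " | ")
# "first tweet | second tweet | third tweet"

# paste0 is shorthand for paste(..., sep = "")
ids <- paste0("tweet_", 1:3)
# "tweet_1" "tweet_2" "tweet_3"
```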

As we will see later, it is often convenient to convert all words to lowercase or uppercase.

tolower(tweets$text[1])

## [1] "be sure to tune in and watch donald trump on late night with david letterman as he presents the top ten list tonight!"

toupper(tweets$text[1])

## [1] "BE SURE TO TUNE IN AND WATCH DONALD TRUMP ON LATE NIGHT WITH DAVID LETTERMAN AS HE PRESENTS THE TOP TEN LIST TONIGHT!"

We can grab substrings with substr. The first argument is the string, the second is the starting index (counting from 1), and the third is the final index (inclusive).

substr(tweets$text[1], 1, 2)

## [1] "Be"

substr(tweets$text[1], 1, 10)

## [1] "Be sure to"

This is useful when working with date strings as well:

dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years

## [1] "2015" "2014"

substr(dates, 6, 7) # months

## [1] "01" "12"

paste(substr(dates, 1, 4), 
      substr(dates, 6, 7), sep="-")

## [1] "2015-01" "2014-12"
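
Cutting substrings works when every date has exactly the same layout. As an alternative sketch (not the approach used above), the strings can be parsed into Date objects with base R, assuming the year/month/day format of the toy dates:

```r
# Same toy dates as in the substr example
dates <- c("2015/01/01", "2014/12/01")

# Parse the strings into Date objects, stating the input format explicitly
parsed <- as.Date(dates, format = "%Y/%m/%d")

# Format back to year-month strings
year_month <- format(parsed, "%Y-%m")
# "2015-01" "2014-12"

# Date objects support arithmetic, which plain substrings do not
days_apart <- as.numeric(parsed[1] - parsed[2])
# 31
```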

Let’s dig into the data a little bit more. Given the source of the dataset, we can expect that there will be many tweets mentioning “Trump”. We can use the grep command to identify these. grep returns the indices of the elements in which the pattern occurs.

grep('Trump', tweets$text[1:10])

## [1]  1  2  3  5  6  7  8 10

grepl returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.

grepl("Trump", tweets$text[1:10])

##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
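
The two functions carry the same information in different shapes. The sketch below uses a made-up vector x to show how they relate, and also the value argument of grep, which returns the matching strings themselves:

```r
# Toy vector, made up for illustration
x <- c("Donald Trump", "the apprentice", "TRUMP tower", "Trump ties")

idx <- grep("Trump", x)    # integer positions of the matches: 1 4
flag <- grepl("Trump", x)  # one logical per element: TRUE FALSE FALSE TRUE

# which() turns the logical vector into the same positions grep returns
same_positions <- identical(which(flag), idx)

# Summing the logical vector counts the matches
n_matches <- sum(flag)     # 2

# value = TRUE returns the matching elements instead of their positions
matched <- grep("Trump", x, value = TRUE)
# "Donald Trump" "Trump ties"
```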

Going back to the full dataset, we can use the results of grep to get particular rows. First, check how many tweets mention the word “Trump”.

nrow(tweets)

## [1] 56571

head(grep('Trump', tweets$text))

## [1] 1 2 3 5 6 7

length(grep('Trump', tweets$text))

## [1] 17824

It is important to note that matching is case-sensitive by default. You can set the ignore.case argument to TRUE to match regardless of case.

nrow(tweets)

## [1] 56571

length(grep('Trump', tweets$text))

## [1] 17824

length(grep('Trump', tweets$text, ignore.case = TRUE))

## [1] 18356
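
On a small made-up vector the effect is easy to verify by eye. The sketch also shows a common alternative, lowercasing the text first and then matching a lowercase pattern:

```r
# Toy vector, made up for illustration
x <- c("Trump Tower", "TRUMP", "trump card", "no match")

n_sensitive <- sum(grepl("Trump", x))                        # 1
n_insensitive <- sum(grepl("Trump", x, ignore.case = TRUE))  # 3

# Alternative: normalize the case of the text, then match a lowercase pattern
n_lowered <- sum(grepl("trump", tolower(x)))                 # 3
```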

Now let’s identify which tweets mention the substring “Trump” and store them in a smaller data frame. How would we do it?

self_tweets <- tweets[grep('Trump', tweets$text, ignore.case=TRUE),]
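
The same subsetting can be done with the logical vector from grepl. A minimal sketch on a made-up data frame (toy and its columns are hypothetical stand-ins for the tweets data):

```r
# Toy data frame, made up for illustration
toy <- data.frame(
  text = c("Trump on Letterman", "Nothing here", "trump ties"),
  retweets = c(10, 5, 7),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row
keep <- grepl("Trump", toy$text, ignore.case = TRUE)

mentions <- toy[keep, ]   # rows that match
others <- toy[!keep, ]    # rows that do not match
```

A logical index is the safer choice when selecting the rows that do not match: with integer positions, toy[-grep(...), ] returns zero rows whenever the pattern matches nothing, whereas toy[!grepl(...), ] correctly returns every row.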

Regular expressions

Another useful tool for working with text data is the regular expression. You can learn more about regular expressions in R's ?regex help page. Regular expressions let us develop complicated rules for both matching strings and extracting elements from them.

For example, we could look at tweets that mention any of several words using the operator “|” (equivalent to “OR”):

nrow(tweets)

## [1] 56571

length(grep('immigration|immigrant', tweets$text, ignore.case=TRUE))

## [1] 433

We can also use question marks to indicate optional characters.

nrow(tweets)

## [1] 56571

length(grep('immigrants?', tweets$text, ignore.case=TRUE))

## [1] 107

length(grep('immigrant|immigrants', tweets$text, ignore.case=TRUE))

## [1] 107

Both patterns match any tweet containing “immigrant”, whether or not it is followed by an “s”, so the two counts are identical.

Other common expression patterns are:

. matches any character; ^ and $ match the beginning and end of a string.
A character (or group) followed by {3}, * or + is matched exactly 3 times, 0 or more times, or 1 or more times, respectively.
[0-9], [a-zA-Z] and [[:alnum:]] match any digit, any letter, or any digit or letter.
Special characters such as ., \, ( or ) must be preceded by a backslash to be matched literally; inside an R string the backslash itself must be doubled, as in "\\.".
See ?regex for more details.
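
Each of these patterns can be tried on a small made-up vector. The sketch below uses base R's grepl; note that in base R functions a POSIX class such as [:alnum:] has to be wrapped in an outer pair of brackets, and that a literal dot is written with a doubled backslash inside an R string.

```r
# Toy vector, made up for illustration
x <- c("abc", "a.c", "aaa", "tweet 2020", "2020", "go!")

# "." is a wildcard, so it matches both "abc" and "a.c"
any_char <- which(grepl("a.c", x))            # 1 2

# An escaped dot matches only a literal "."
literal_dot <- which(grepl("a\\.c", x))       # 2

# ^ and $ anchor the pattern: the whole string must be digits
only_digits <- which(grepl("^[0-9]+$", x))    # 5

# {3} requires exactly three consecutive repetitions of "a"
three_a <- which(grepl("a{3}", x))            # 3

# One or more letters or digits followed by "!" at the end of the string
ends_bang <- which(grepl("[[:alnum:]]+!$", x))  # 6
```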

For example, how many tweets end with an exclamation mark? How many tweets are retweets? How many tweets mention any username? And a hashtag?

length(grep('!$', tweets$text, ignore.case=TRUE))

## [1] 11954

length(grep('^RT @', tweets$text, ignore.case=TRUE))

## [1] 9701

length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))

## [1] 32850

length(grep('#[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))

## [1] 7397

More complex examples of regular expressions using stringr

stringr is an R package that extends R's capabilities for text manipulation. For example, suppose we want to replace a pattern (or a regular expression) with another string:

library(stringr)
tweets$text[10000]

## [1] "\"\"\"@KevinMartinRI: I'm a big fan of the new @realDonaldTrump ties. http://t.co/Fka9s2D0e8\"\"  Thanks, selling great at.Macy's!\""

str_replace(tweets$text[10000], '@[0-9_A-Za-z]+', 'USERNAME')

## [1] "\"\"\"USERNAME: I'm a big fan of the new @realDonaldTrump ties. http://t.co/Fka9s2D0e8\"\"  Thanks, selling great at.Macy's!\""

Note that this replaces only the first instance. To replace all instances, use str_replace_all:

str_replace_all(tweets$text[10000], '@[0-9_A-Za-z]+', 'USERNAME')

## [1] "\"\"\"USERNAME: I'm a big fan of the new USERNAME ties. http://t.co/Fka9s2D0e8\"\"  Thanks, selling great at.Macy's!\""
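
For comparison, base R offers sub (first match only) and gsub (all matches), which behave like str_replace and str_replace_all here. The sketch below uses a made-up string and also shows why the character class for handles should include the underscore:

```r
# Toy string, made up for illustration
x <- "@user_one thanks @user2 for the ties"

# sub replaces only the first match
first_only <- sub("@[0-9_A-Za-z]+", "USERNAME", x)
# "USERNAME thanks @user2 for the ties"

# gsub replaces every match
all_matches <- gsub("@[0-9_A-Za-z]+", "USERNAME", x)
# "USERNAME thanks USERNAME for the ties"

# A class without the underscore stops at "_" and leaves part of the handle behind
truncated <- gsub("@[[:alnum:]]+", "USERNAME", x)
# "USERNAME_one thanks USERNAME for the ties"
```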

To extract a pattern we can use str_extract, and again we can extract one or all instances of the pattern:

str_extract(tweets$text[10000], '@[0-9_A-Za-z]+')

## [1] "@KevinMartinRI"

str_extract_all(tweets$text[10000], '@[0-9_A-Za-z]+')

## [[1]]
## [1] "@KevinMartinRI"   "@realDonaldTrump"

This function is vectorized, which means we can apply it to all elements of a vector simultaneously. That will give us a list, which we can then turn into a vector to get a frequency table of the most frequently mentioned handles or hashtags:

handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles[1:3]

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)

handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)

## handles_vector
## @realDonaldTrump         @FoxNews      @WhiteHouse     @BarackObama 
##            10955              938              840              738 
##   @foxandfriends             @CNN   @ApprenticeNBC     @IvankaTrump 
##              703              395              393              326 
##       @TeamTrump      @MittRomney 
##              323              318

# now with hashtags...
hashtags <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
hashtags[1:3]

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)

hashtags_vector <- unlist(hashtags)
head(sort(table(hashtags_vector), decreasing = TRUE), n=10)

## hashtags_vector
##             #Trump2016 #MakeAmericaGreatAgain                  #MAGA 
##                    761                    557                    524 
##       #CelebApprentice                     #1   #CelebrityApprentice 
##                    289                    144                    137 
##          #AmericaFirst        #TimeToGetTough                 #Trump 
##                    107                     95                     81 
##         #DrainTheSwamp 
##                     78
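
The extract, unlist and tabulate steps can also be sketched in base R, with regmatches and gregexpr standing in for str_extract_all. The vector x below is made up for illustration; this is an alternative sketch, not the code used to produce the tables above.

```r
# Toy tweets, made up for illustration
x <- c("Thanks @FoxNews and @CNN", "no handles here", "@FoxNews again")

# One list element per tweet; tweets without a match give character(0)
found <- regmatches(x, gregexpr("@[0-9_A-Za-z]+", x))

# Number of handles mentioned in each tweet
per_tweet <- lengths(found)   # 2 0 1

# Flatten the list and count how often each handle appears
handles_vec <- unlist(found)  # "@FoxNews" "@CNN" "@FoxNews"
counts <- sort(table(handles_vec), decreasing = TRUE)
top_handle <- names(counts)[1]  # "@FoxNews"
```

Keeping the list form around is useful when the question is about tweets (how many handles does each one mention?), while the flattened vector answers questions about handles (which ones are mentioned most often?).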

Pablo Barbera