This RMarkdown file provides additional information about the basic tools of text analysis that we will use in this course to clean text data scraped from the web.
We will start with basic string manipulation with R.
Our running example will be the set of tweets posted by Donald Trump’s Twitter account since January 1st, 2018.
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ndjson
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
head(tweets)
## text
## 1 We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY
## 2 Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer
## 3 Thank you South Carolina. Now let’s get out tomorrow and VOTE for @HenryMcMaster! https://t.co/5xlz0wfMfu
## 4 Just watched @SharkGregNorman on @foxandfriends. Said “President is doing a great job. All over the world, people want to come back to the U.S.” Thank you Greg, and you’re looking and doing great!
## 5 Russia continues to say they had nothing to do with Meddling in our Election! Where is the DNC Server, and why didn’t Shady James Comey and the now disgraced FBI agents take and closely examine it? Why isn’t Hillary/Russia being looked at? So many questions, so much corruption!
## 6 Statement on Justice Anthony Kennedy. #SCOTUS https://t.co/8aWJ6fWemA
## retweet_count favorite_count favorited truncated id_str
## 1 30514 89162 FALSE FALSE 1010246126820347906
## 2 9382 50425 FALSE FALSE 1012297599431401474
## 3 13631 56997 FALSE FALSE 1011422555947712513
## 4 12007 62025 FALSE FALSE 1012299239207198721
## 5 23077 92661 FALSE FALSE 1012295859072126977
## 6 11138 47234 FALSE FALSE 1012051330591023107
## in_reply_to_screen_name
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 4 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 1 FALSE Fri Jun 22 19:40:20 +0000 2018 NA
## 2 FALSE Thu Jun 28 11:32:09 +0000 2018 NA
## 3 FALSE Tue Jun 26 01:35:03 +0000 2018 NA
## 4 FALSE Thu Jun 28 11:38:40 +0000 2018 NA
## 5 FALSE Thu Jun 28 11:25:15 +0000 2018 NA
## 6 FALSE Wed Jun 27 19:13:34 +0000 2018 NA
## in_reply_to_user_id_str lang listed_count verified location
## 1 NA en 89807 TRUE Washington, DC
## 2 NA en 89807 TRUE Washington, DC
## 3 NA en 89807 TRUE Washington, DC
## 4 NA en 89807 TRUE Washington, DC
## 5 NA en 89807 TRUE Washington, DC
## 6 NA en 89807 TRUE Washington, DC
## user_id_str description geo_enabled
## 1 25073877 45th President of the United States of America🇺🇸 TRUE
## 2 25073877 45th President of the United States of America🇺🇸 TRUE
## 3 25073877 45th President of the United States of America🇺🇸 TRUE
## 4 25073877 45th President of the United States of America🇺🇸 TRUE
## 5 25073877 45th President of the United States of America🇺🇸 TRUE
## 6 25073877 45th President of the United States of America🇺🇸 TRUE
## user_created_at statuses_count followers_count
## 1 Wed Mar 18 13:46:38 +0000 2009 38073 53101783
## 2 Wed Mar 18 13:46:38 +0000 2009 38073 53101783
## 3 Wed Mar 18 13:46:38 +0000 2009 38073 53101783
## 4 Wed Mar 18 13:46:38 +0000 2009 38073 53101783
## 5 Wed Mar 18 13:46:38 +0000 2009 38073 53101783
## 6 Wed Mar 18 13:46:38 +0000 2009 38073 53101783
## favourites_count protected user_url name
## 1 25 FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 2 25 FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 3 25 FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 4 25 FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 5 25 FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## 6 25 FALSE https://t.co/OMxB0x7xC5 Donald J. Trump
## time_zone user_lang utc_offset friends_count screen_name
## 1 NA en NA 47 realDonaldTrump
## 2 NA en NA 47 realDonaldTrump
## 3 NA en NA 47 realDonaldTrump
## 4 NA en NA 47 realDonaldTrump
## 5 NA en NA 47 realDonaldTrump
## 6 NA en NA 47 realDonaldTrump
## country_code country place_type full_name place_name place_id place_lat
## 1 <NA> NA NA <NA> <NA> <NA> NaN
## 2 <NA> NA NA <NA> <NA> <NA> NaN
## 3 <NA> NA NA <NA> <NA> <NA> NaN
## 4 <NA> NA NA <NA> <NA> <NA> NaN
## 5 <NA> NA NA <NA> <NA> <NA> NaN
## 6 <NA> NA NA <NA> <NA> <NA> NaN
## place_lon lat lon
## 1 NaN NA NA
## 2 NaN NA NA
## 3 NaN NA NA
## 4 NaN NA NA
## 5 NaN NA NA
## 6 NaN NA NA
## expanded_url
## 1 https://www.pscp.tv/w/bf1GFzFvTlFsTFJub1dwUXd8MWpNSmdFVll5ZUFLTAWuHc0BMMKeCOoDRCPmtIftVLaFLQVwfSLoC_C0SbzX?t=9m9s
## 2 <NA>
## 3 https://www.pscp.tv/w/bgGOtTFvTlFsTFJub1dwUXd8MXlvSk1WZHJWQm54Uf-J8fPu1RO4E84ax-LuK1bAbiCpnzBBZmdPfI9FAhGV?t=11s
## 4 <NA>
## 5 <NA>
## 6 <NA>
## url
## 1 https://t.co/ZjXESYAcjY
## 2 <NA>
## 3 https://t.co/5xlz0wfMfu
## 4 <NA>
## 5 <NA>
## 6 <NA>
R stores strings in character vectors. length() returns the number of elements in the vector, while nchar() returns the number of characters in each element.
length(tweets$text)
## [1] 3866
tweets$text[1]
## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY"
nchar(tweets$text[1])
## [1] 280
Note that we can work with multiple strings at once.
nchar(tweets$text[1:10])
## [1] 280 108 105 196 278 69 104 187 230 140
sum(nchar(tweets$text[1:10]))
## [1] 1697
max(nchar(tweets$text[1:10]))
## [1] 280
We can merge different strings into one using paste():
paste(tweets$text[1], tweets$text[2], sep='--')
## [1] "We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY--Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer"
As we will see later, it is often convenient to convert all words to lowercase or uppercase.
tolower(tweets$text[1])
## [1] "we are gathered today to hear directly from the american victims of illegal immigration. these are the american citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. these are the families the media ignores...https://t.co/zjxesyacjy"
toupper(tweets$text[1])
## [1] "WE ARE GATHERED TODAY TO HEAR DIRECTLY FROM THE AMERICAN VICTIMS OF ILLEGAL IMMIGRATION. THESE ARE THE AMERICAN CITIZENS PERMANENTLY SEPARATED FROM THEIR LOVED ONES B/C THEY WERE KILLED BY CRIMINAL ILLEGAL ALIENS. THESE ARE THE FAMILIES THE MEDIA IGNORES...HTTPS://T.CO/ZJXESYACJY"
We can grab substrings with substr(). The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.
substr(tweets$text[1], 1, 2)
## [1] "We"
substr(tweets$text[1], 1, 10)
## [1] "We are gat"
This is useful when working with date strings as well:
dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"
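As a quick aside (not part of the original example), date strings in a fixed format can also be parsed directly into R’s Date class with as.Date(), which is often safer than slicing character positions by hand:

```r
dates <- c("2015/01/01", "2014/12/01")
# parse the full string instead of slicing fixed positions
as.Date(dates, format = "%Y/%m/%d")
## [1] "2015-01-01" "2014-12-01"
```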
Let’s dig into the data a little bit more. Given the source of the dataset, we can expect that there will be many tweets mentioning topics such as immigration or health care. We can use the grep() command to identify these. grep() returns the indices of the elements where the pattern occurs.
grep('immigration', tweets$text[1:25])
## [1] 14
grepl() returns TRUE or FALSE, indicating whether each element of the character vector contains that particular pattern.
grepl("immigration", tweets$text[1:25])
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE
Going back to the full dataset, we can use the results of grep() to get particular rows. First, check how many tweets mention the word “immigration”.
nrow(tweets)
## [1] 3866
grep('immigration', tweets$text)
## [1] 14 60 74 75 79 92 102 108 109 121 122 125 133 151
## [15] 166 185 193 531 605 614 621 789 827 834 835 1024 1111 1136
## [29] 1142 1162 1179 1183 1200 1202 1212 1229 1246 1296 1301 1308 1539 1649
## [43] 1785 1970 2348 2550 2899 2908 3234 3289 3347 3380 3547 3637 3684
length(grep('immigration', tweets$text))
## [1] 55
It is important to note that matching is case-sensitive. You can set the ignore.case argument to TRUE to make the matching case-insensitive.
nrow(tweets)
## [1] 3866
length(grep('immigration', tweets$text))
## [1] 55
length(grep('immigration', tweets$text, ignore.case = TRUE))
## [1] 77
Now let’s identify which tweets are related to immigration and store them in a smaller data frame. How would we do it?
immi_tweets <- tweets[grep('immigration', tweets$text, ignore.case=TRUE),]
Another useful tool for working with text data is regular expressions, which let us develop complicated rules for both matching strings and extracting elements from them (see ?regex in R for an overview).
For example, we could match more than one pattern at once using the operator “|” (equivalent to “OR”):
nrow(tweets)
## [1] 3866
length(grep('immigration|immigrant', tweets$text, ignore.case=TRUE))
## [1] 91
We can also use a question mark to make the preceding character optional.
nrow(tweets)
## [1] 3866
length(grep('immigr?', tweets$text, ignore.case=TRUE))
## [1] 91
Because the question mark makes the final “r” optional, this pattern matches “immig” or “immigr”, and thus immigration, immigrant, immigrants, etc.
Other common expression patterns are:

- . matches any character; ^ and $ match the beginning and end of a string.
- {3}, *, and + indicate that the preceding character is matched exactly 3 times, 0 or more times, or 1 or more times, respectively.
- [0-9], [a-zA-Z], and [:alnum:] match any digit, any letter, or any digit or letter.
- Special characters such as ., \, (, or ) must be preceded by a backslash.
- See ?regex for more details.

For example, how many tweets end with an exclamation mark? How many tweets are retweets? How many tweets mention any username? And a hashtag?
length(grep('!$', tweets$text, ignore.case=TRUE))
## [1] 1528
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 419
length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 1018
length(grep('#[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
## [1] 581
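The remaining patterns from the list above can be illustrated on a small toy vector (a minimal sketch; the vector x is made up for this example):

```r
x <- c("aaa", "abc", "a.c", "123")
grepl("a{3}", x)     # "a" repeated exactly 3 times
## [1]  TRUE FALSE FALSE FALSE
grepl("^a.c$", x)    # starts with "a", any single character, ends with "c"
## [1] FALSE  TRUE  TRUE FALSE
grepl("\\.", x)      # a literal period must be escaped with a backslash
## [1] FALSE FALSE  TRUE FALSE
grepl("[0-9]+", x)   # one or more digits
## [1] FALSE FALSE FALSE  TRUE
```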
stringr is an R package that extends the capabilities of R for text manipulation. Let’s say, for example, that we want to replace a pattern (or a regular expression) with another string:
library(stringr)
str_replace(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! @AmyKremer"
Note that this will only replace the first instance. To replace all instances:
str_replace_all(tweets$text[2], '@[0-9_A-Za-z]+', 'USERNAME')
## [1] "Amy Kremer, Women for Trump, was so great on USERNAME. Brave and very smart, thank you Amy! USERNAME"
To extract a pattern we can use str_extract(), and again we can extract one or all instances of the pattern:
str_extract(tweets$text[2], '@[0-9_A-Za-z]+')
## [1] "@foxandfriends"
str_extract_all(tweets$text[2], '@[0-9_A-Za-z]+')
## [[1]]
## [1] "@foxandfriends" "@AmyKremer"
This function is vectorized, which means we can apply it to all elements of a vector simultaneously. That will give us a list, which we can then turn into a vector to get a frequency table of the most frequently mentioned handles or hashtags:
handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles[1:3]
## [[1]]
## character(0)
##
## [[2]]
## [1] "@foxandfriends" "@AmyKremer"
##
## [[3]]
## [1] "@HenryMcMaster"
handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)
## handles_vector
## @foxandfriends @realDonaldTrump @WhiteHouse @FoxNews
## 122 109 106 79
## @POTUS @FLOTUS @nytimes @Scavino45
## 50 48 34 32
## @IvankaTrump @EricTrump
## 31 27
# now with hashtags...
hashtags <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
hashtags[1:3]
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
hashtags_vector <- unlist(hashtags)
head(sort(table(hashtags_vector), decreasing = TRUE), n=10)
## hashtags_vector
## #MAGA #USA #AmericaFirst
## 77 32 19
## #FakeNews #MakeAmericaGreatAgain #TaxReform
## 17 13 12
## #UNGA #HurricaneHarvey #ICYMI
## 12 11 10
## #PuertoRico
## 8
Before we can do any type of automated text analysis, we will need to go through several “preprocessing” steps to prepare the text for a statistical model. We’ll use the quanteda package here.
The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary() to get some information about your corpus.
library(quanteda)
## Package version: 1.3.4
## Parallel computing: 2 of 2 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(streamR)
tweets <- parseTweets("~/data/trump-tweets.json")
## 3866 tweets have been parsed.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n=10)
## Corpus consisting of 3866 documents, showing 10 documents:
##
## Text Types Tokens Sentences
## text1 40 54 3
## text2 20 23 3
## text3 20 22 3
## text4 32 41 4
## text5 48 56 4
## text6 12 14 2
## text7 20 22 2
## text8 29 31 2
## text9 44 50 3
## text10 22 24 2
##
## Source: /home/ecpr40/code/* on x86_64 by ecpr40
## Created: Mon Jul 30 10:11:47 2018
## Notes:
A very useful feature of corpus objects is keywords in context, which returns all the appearances of a word (or combination of words) in its immediate context.
kwic(twcorpus, "immigration", window=10)[1:5,]
##
## [text1, 14] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
## [text10, 17] today to hear directly from the AMERICAN VICTIMS of ILLEGAL
## [text14, 11] .... If this is done, illegal
## [text15, 9] HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR
## [text41, 6] .... Our
##
## | IMMIGRATION |
## | IMMIGRATION |
## | immigration |
## | IMMIGRATION |
## | Immigration |
##
## . These are the American Citizens permanently separated from their
## . These are the American Citize…
## will be stopped in it's tracks- and at very
## BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON
## policy, laughed at all over the world, is
kwic(twcorpus, "healthcare", window=10)[1:5,]
##
## [text46, 17] help to me on Cutting Taxes, creating great new |
## [text182, 37] He is tough on Crime and Strong on Borders, |
## [text507, 48] Warren lines, loves sanctuary cities, bad and expensive |
## [text530, 6] The American people deserve a |
## [text554, 27] will be a great Governor with a heavy focus on |
##
## healthcare | programs at low cost, fighting for Border Security,
## Healthcare | , the Military and our great Vets. Henry has
## healthcare | ...
## healthcare | system that takes care of them- not one that
## HealthCare | and Jobs. His Socialist opponent in November should not
kwic(twcorpus, "clinton", window=10)[1:5,]
##
## [text141, 23] the Bush Dynasty, then I had to beat the |
## [text161, 20] the Bush Dynasty, then I had to beat the |
## [text204, 9] FBI Agent Peter Strzok, who headed the |
## [text216, 13] :.@jasoninthehouse: All of this started because Hillary |
## [text252, 10] .... Schneiderman, who ran the |
##
## Clinton | Dynasty, and now I…
## Clinton | Dynasty, and now I have to beat a phony
## Clinton | & amp; Russia investigations, texted to his lover
## Clinton | set up her private server https:// t.co
## Clinton | campaign in New York, never had the guts to
We can then convert a corpus into a document-feature matrix using the dfm() function.
twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 9,930 features
## ... created a 3,866 x 9,930 sparse dfm
## ... complete.
## Elapsed time: 0.251 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 9,930 features (99.7% sparse).
The dfm will show the count of times each word appears in each document (tweet):
twdfm[1:5, 1:10]
## Document-feature matrix of: 5 documents, 10 features (72% sparse).
## 5 x 10 sparse Matrix of class "dfm"
## features
## docs we are gathered today to hear directly from the american
## text1 1 3 1 1 1 1 1 2 4 2
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 2 0 0 0 2 0
## text5 0 0 0 0 2 0 0 0 2 0
dfm() has many useful options (check out ?dfm for more information). Let’s use it to extract n-grams, remove punctuation, and keep Twitter features…
twdfm <- dfm(twcorpus, tolower=TRUE, remove_punct = TRUE, remove_url=TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 128,909 features
## ... created a 3,866 x 128,909 sparse dfm
## ... complete.
## Elapsed time: 0.953 seconds.
twdfm
## Document-feature matrix of: 3,866 documents, 128,909 features (99.9% sparse).
Note that here we use ngrams: this will extract all combinations of one, two, and three words (e.g. it will consider “human”, “rights”, and “human rights” all as tokens in the matrix).
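To see what these n-grams look like, we can tokenize a toy sentence with quanteda’s tokens_ngrams() (a minimal sketch; the sentence is made up for illustration):

```r
library(quanteda)
toks <- tokens("human rights matter")
# n = 1:2 keeps the single words and adds underscore-joined bigrams
# such as "human_rights" and "rights_matter"
tokens_ngrams(toks, n = 1:2)
```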
In a large corpus like this, many features often appear in only one or two documents. In some cases it’s a good idea to remove those features, to speed up the analysis or because they’re not relevant. We can trim the dfm:
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
## - in fewer than 3 documents: 117,710
## Total features removed: 117,710 (91.3%).
twdfm
## Document-feature matrix of: 3,866 documents, 11,199 features (99.7% sparse).
It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)
What is going on? We probably want to remove words and symbols that are not of interest for our analysis, such as “http” here. Words that are common connectors in a given language (e.g. “a”, “the”, “is”) and thus carry little meaning are called stopwords. We can see the most frequent features using topfeatures():
topfeatures(twdfm, 25)
## the to and of a in is for on our will great
## 4580 2697 2493 1945 1549 1455 1299 1088 920 887 836 825
## with are we i be that amp it have at you was
## 815 793 764 735 714 707 637 601 536 523 520 492
## they
## 474
We can remove the stopwords when we create the dfm object:
twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), remove_url=TRUE, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,866 documents, 8,456 features
## ... removed 165 features
## ... created a 3,866 x 8,291 sparse dfm
## ... complete.
## Elapsed time: 0.27 seconds.
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=50)