This is the list of packages we will use today:

install.packages("ROAuth")
install.packages("streamR")
install.packages("devtools")
devtools::install_github("pablobarbera/twitter_ideology/pkg/tweetscores")
install.packages("maps")

Let’s check that they work:

library(streamR)
## Loading required package: RCurl
## Loading required package: rjson
## Loading required package: ndjson
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (USC)
## ## www.tweetscores.com
## ##
library(maps)
library(ROAuth)

We’ll start by loading my OAuth token, which is required to connect to the API. Note that this will not work for you. In order to create your own token, follow the instructions at the end of this script, which will require you to create a developer account first.

load("~/my_oauth")

Collecting data from Twitter’s Streaming API

Collecting tweets that mention a specific keyword:

library(streamR)
filterStream(file.name="../data/biden-streaming-tweets.json", 
    track="biden", 
    timeout=20, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 20 seconds with up to 140 tweets downloaded.

Note the options:
- file.name indicates the file on your disk where the tweets will be downloaded
- track is the keyword(s) that must be mentioned in the tweets we want to capture
- timeout is the number of seconds that the connection will remain open
- oauth is the OAuth token we are using
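
Note also that track accepts a vector of keywords if we want to capture tweets mentioning any of several terms. A minimal sketch (the file name here is just an example):

filterStream(file.name="../data/multiple-keywords-tweets.json", 
    track=c("biden", "harris"), 
    timeout=20, oauth=my_oauth)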

Once it has finished, we can open the file in R as a data frame with the parseTweets function:

tweets <- parseTweets("../data/biden-streaming-tweets.json")
## 140 tweets have been parsed.
tweets[1,]
##                                                                                                                                                                                                                                                           text
## 1 RT @kangoroo17:@RonnyJacksonTX President Biden has also accomplished to reduce the deficit by 1.7 trillion dollars.  \nRepublicans only gave tax cuts for the rich of 1 trillion dollars being paid by the middle class and the poor https://t.co/aQmzXHDnCP
##   retweet_count favorite_count favorited truncated              id_str
## 1             7             10     FALSE     FALSE 1583980261712150533
##   in_reply_to_screen_name
## 1                    <NA>
##                                                                              source
## 1 <a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>
##   retweeted                     created_at in_reply_to_status_id_str
## 1     FALSE Sun Oct 23 00:34:57 +0000 2022                      <NA>
##   in_reply_to_user_id_str lang listed_count verified location user_id_str
## 1                    <NA>   en          155    FALSE      USA   423931126
##                                                                                                                                           description
## 1 Dem since a child and a Tesla lover - Vote Blue - live Green! 💜 #TheResistance 🇺🇸 #BlackLivesMatter #🌻🇺🇦🌻 #SaveUkraine 🌻 dog and cat lover 🐶🐱
##   geo_enabled                user_created_at statuses_count followers_count
## 1       FALSE Tue Nov 29 03:28:05 +0000 2011         344177            3936
##   favourites_count protected user_url               name time_zone user_lang
## 1            65407     FALSE     <NA> (((Mother Earth)))        NA        NA
##   utc_offset friends_count screen_name country_code country place_type
## 1         NA          4974  G8trz4ever         <NA>    <NA>         NA
##   full_name place_name place_id place_lat place_lon lat lon expanded_url  url
## 1      <NA>       <NA>     <NA>       NaN       NaN  NA  NA         <NA> <NA>

If we want, we can also export it to a CSV file that can be opened later with Excel:

write.csv(tweets, file="../data/biden-streaming-tweets.csv", row.names=FALSE)

We can also filter tweets in a specific language:

filterStream(file.name="../data/spanish-tweets.json", 
    track="trump", language='es',
    timeout=20, oauth=my_oauth)

tweets <- parseTweets("../data/spanish-tweets.json")
sample(tweets$text, 10)
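
To double-check that the language filter worked, we can tabulate the lang field of the parsed tweets:

table(tweets$lang)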

And we can filter tweets published by, retweeting, or mentioning a specific user:

filterStream(file.name="../data/trump-follow-tweets.json", 
    follow=25073877, timeout=10, oauth=my_oauth)

tweets <- parseTweets("../data/trump-follow-tweets.json")
sample(tweets$text, 10)
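
As with track, the follow parameter accepts a vector of user IDs, so we could monitor several accounts at the same time. A sketch (the second ID below is a placeholder, not a real account):

filterStream(file.name="../data/multiple-follow-tweets.json", 
    follow=c(25073877, 99999999), timeout=10, oauth=my_oauth)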

We now turn to collecting tweets filtered by location instead. To be able to apply this type of filter, we need to set a geographic bounding box and collect only the tweets that are coming from that area.

For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it’s not (lat, long), but (long, lat).

In the case of the US, it would be approx. (-125,25) and (-66,50). How to find these coordinates? You can use Google Maps, and right-click on the desired location. (Just note that long and lat are reversed here!)

filterStream(file.name="../data/tweets_geo.json", locations=c(-125, 25, -66, 50), 
    timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 321 tweets downloaded.
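
The same recipe works for any other region. For example, a rough bounding box around the United Kingdom would be approximately (-11, 49) for the southwest corner and (2, 61) for the northeast corner. A sketch (coordinates are approximate and the file name is just an example):

filterStream(file.name="../data/uk-tweets-geo.json", locations=c(-11, 49, 2, 61), 
    timeout=30, oauth=my_oauth)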

We can do as before and open the tweets in R:

tweets <- parseTweets("../data/tweets_geo.json")
## 322 tweets have been parsed.

And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: lat/lon (from geolocated tweets) and place_lat and place_lon (from tweets with place information). We will work with whatever is available.

library(maps)
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
tweets <- tweets[!is.na(tweets$lat),]
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
## states
##    california         texas       florida  pennsylvania new york:main 
##            46            39            26            13            12 
##       arizona 
##            11
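
For a quick visual summary of these counts, we could also draw a simple bar plot with base R:

counts <- head(sort(table(states), decreasing=TRUE), n=10)
barplot(counts, las=2, cex.names=0.7, main="Tweets by state")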

We can also prepare a map of the exact locations of the tweets.

library(ggplot2)

## First create a data frame with the map data 
map.data <- map_data("state")

# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90", 
    color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) + 
    # 2) limits for x and y axis
    scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) +
    # 3) adding the dot for each tweet
    geom_point(data = tweets, 
    aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
    # 4) removing unnecessary graph elements
    theme(axis.line = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        axis.title = element_blank(), 
        panel.background = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        plot.background = element_blank()) 
## Warning: Removed 2 rows containing missing values (geom_point).
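
If we want to keep a copy of the map, we can save the last plot to disk with ggsave (the file name and dimensions here are just examples):

ggsave("../data/tweets-map.png", width=8, height=5)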

And here’s how to extract the edges of a network of retweets (at least one possible way of doing it):

tweets <- parseTweets("../data/biden-streaming-tweets.json")
## 140 tweets have been parsed.
# subset only RTs
rts <- tweets[grep("RT @", tweets$text),]
library(stringr)
edges <- data.frame(
  node1 = rts$screen_name,
  node2 = str_extract(rts$text, 'RT @[a-zA-Z0-9_]+'),
  stringsAsFactors=F
)
edges$node2 <- str_replace(edges$node2, 'RT @', '')

# plotting largest connected component
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
g <- graph_from_data_frame(d=edges, directed=FALSE)
comp <- decompose(g, min.vertices=2)
plot(comp[[1]])
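
Using the edge list we just built, we can also check which accounts are retweeted most often in this sample:

head(sort(table(edges$node2), decreasing=TRUE))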

Finally, it’s also possible to collect a random sample of tweets. That’s what the “sampleStream” function does:

sampleStream(file.name="../data/tweets_random.json", timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 1272 tweets downloaded.

Here I’m collecting 30 seconds of tweets. And once again, to open the tweets in R…

tweets <- parseTweets("../data/tweets_random.json")
## 1173 tweets have been parsed.

What is the most retweeted tweet?

tweets[which.max(tweets$retweet_count),]
##                                                                                                                                                                                                                      text
## 545 RT @MhorRitz:ยินดีด้วยกับการ #ว่ายน้ำข้ามโขง ของพี่ #โตโน่ภาคิน ในวันนี้นะครับ ที่ปลอดภัย และได้รับเงินบริจาคจำนวนมาก อย่างแรกต้องขอขอบคุณในน้ำใจและความเสียสละของพี่ที่มีต่อบุคลากรทางการแพทย์ คนที่พร้อมจะเสียสละเพื่อคนอื่นแบบพี่ ไม่ได้หาได้ง่ายเลย นับถือใจจริงๆ (1)
##     retweet_count favorite_count favorited truncated              id_str
## 545         58024          20422     FALSE     FALSE 1583980533004242944
##     in_reply_to_screen_name
## 545                    <NA>
##                                                                                 source
## 545 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##     retweeted                     created_at in_reply_to_status_id_str
## 545     FALSE Sun Oct 23 00:36:02 +0000 2022                      <NA>
##     in_reply_to_user_id_str lang listed_count verified         location
## 545                    <NA>   th            0    FALSE I'm @chonburi ☀☁
##     user_id_str description geo_enabled                user_created_at
## 545   196240071        <NA>        TRUE Tue Sep 28 17:17:42 +0000 2010
##     statuses_count followers_count favourites_count protected
## 545          10252             201              367     FALSE
##                           user_url name time_zone user_lang utc_offset
## 545 https://twitter.com/pployployz   🥳        NA        NA         NA
##     friends_count screen_name country_code country place_type full_name
## 545           354 Ploynatjira         <NA>    <NA>         NA      <NA>
##     place_name place_id place_lat place_lon lat lon expanded_url  url
## 545       <NA>     <NA>       NaN       NaN  NA  NA         <NA> <NA>

What are the most popular hashtags at the moment? We’ll use regular expressions to extract hashtags.

library(stringr)
ht <- str_extract_all(tweets$text, '#[A-Za-z0-9_]+')
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
## ht
##       #kapakl       #aliaga         #foca      #menemen #whatsappshow 
##            13            11            11            11            11 
##      #bornova 
##             8

And who are the most frequently mentioned users?

handles <- str_extract_all(tweets$text, '@[0-9_A-Za-z]+')
handles_vector <- unlist(handles)
head(sort(table(handles_vector), decreasing = TRUE), n=10)
## handles_vector
##   @gregeruemolor         @BTS_twt @2UCvzMCivJLIvWX      @wass123451 
##               11                9                6                6 
##    @milephakphum      @Nnattawin1 @Abhijit67300789          @Azteca 
##                5                5                3                3 
##        @elonmusk    @emma_muyingo 
##                3                3

How many tweets mention BTS?

length(grep("bts", tweets$text, ignore.case=TRUE))
## [1] 7
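
If we want to inspect those tweets rather than just count them, we can ask grep to return the matching text:

head(grep("bts", tweets$text, ignore.case=TRUE, value=TRUE), n=3)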

These are toy examples, but for large files with tweets in JSON format, there might be faster ways to parse the data. For example, the ndjson package offers a robust and fast way to parse JSON data:

library(ndjson)
json <- stream_in("../data/tweets_geo.json")
json$text[1:5]
## [1] "@RealLyndaCarter @jcarteraltman 💜💜💜"                                        
## [2] "https://t.co/JFTOwDRpAm"                                                       
## [3] "@longhorndave Sorry bout your Horns though 🤗"                                 
## [4] "no saben cuanto odio en verdad los fines de semana"                            
## [5] "The Cleveland Cavaliers play tonight right now vs Chicago bulls . NBA channel."

Authenticating

Before we can start collecting Twitter data, we need to create an OAuth token that will allow us to authenticate our connection and access our personal data.

NOTE: getting a new token requires submitting an application for a developer account, which may take a few days.

After your application has been approved, you can create a token following these steps:

  1. Go to https://developer.twitter.com/en/apps and sign in.
  2. Click on “Create New App”. You will need to have a phone number associated with your account in order to be able to create a token.
  3. Fill in the name, description, and website (it can be anything, even http://www.google.com). Make sure you leave ‘Callback URL’ empty.
  4. Agree to the user conditions.
  5. From the “Keys and Access Tokens” tab, copy the consumer key and consumer secret and paste them below.
  6. Click on “Create my access token”, then copy and paste your access token and access token secret below.

my_oauth <- list(consumer_key = "CONSUMER_KEY",
   consumer_secret = "CONSUMER_SECRET",
   access_token="ACCESS_TOKEN",
   access_token_secret = "ACCESS_TOKEN_SECRET")
save(my_oauth, file="~/my_oauth")
load("~/my_oauth")

What can go wrong here? Make sure all the consumer and token keys are pasted here as is, without any additional space character. If you don’t see any output in the console after running the code above, that’s a good sign.

Note that I saved the list as a file on my hard drive. That will save us some time later on, but you could also just re-run the code above before connecting to the API in the future.

To check that it worked, try running the line below:

library(tweetscores)
getUsers(screen_names="uscpoir", oauth = my_oauth)[[1]]$screen_name
## [1] "uscpoir"

If this displays uscpoir then we’re good to go!

Some of the functions below will work with more than one token. If you want to save multiple tokens, see the help page for getUsers, for example.
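
A second token could be created and saved the same way as the first one, just under a different file name (a sketch; the key values and file name below are placeholders):

my_oauth2 <- list(consumer_key = "CONSUMER_KEY_2",
   consumer_secret = "CONSUMER_SECRET_2",
   access_token = "ACCESS_TOKEN_2",
   access_token_secret = "ACCESS_TOKEN_SECRET_2")
save(my_oauth2, file="~/my_oauth2")
# see ?getUsers for how to pass multiple tokens to the package functions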