Before we can start collecting Twitter data, we need to create an OAuth token that will allow us to authenticate our connection and access our personal data.
Follow these steps to create your token:
library(ROAuth)
my_oauth <- list(consumer_key = "CONSUMER_KEY",
                 consumer_secret = "CONSUMER_SECRET",
                 access_token = "ACCESS_TOKEN",
                 access_token_secret = "ACCESS_TOKEN_SECRET")
save(my_oauth, file = "~/my_oauth")
load("~/my_oauth")
What can go wrong here? Make sure all the consumer and access token keys are pasted as is, without any extra space characters. If you don’t see any output in the console after running the code above, that’s a good sign.
Note that I saved the list as a file on my hard drive. That will save us some time later on, but you could also just re-run the code above before connecting to the API in the future.
To check that it worked, try running the line below:
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (LSE)
## ## www.tweetscores.com
## ##
getUsers(screen_names="LSEnews", oauth = my_oauth)[[1]]$screen_name
## [1] "LSEnews"
If this displays LSEnews, then we’re good to go!
Some of the functions below will work with more than one token. If you want to save multiple tokens, see the instructions at the end of the file.
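For example, here is a minimal sketch of one way to manage several tokens, assuming you keep each token as its own file in a dedicated folder (the folder and file names below are hypothetical):

```r
# Hypothetical folder holding one saved token per file
tokens_folder <- "~/credentials"
dir.create(tokens_folder, showWarnings = FALSE)

# Save each token as a separate file (repeat for each set of keys)
my_oauth <- list(consumer_key = "CONSUMER_KEY_1",
                 consumer_secret = "CONSUMER_SECRET_1",
                 access_token = "ACCESS_TOKEN_1",
                 access_token_secret = "ACCESS_TOKEN_SECRET_1")
save(my_oauth, file = file.path(tokens_folder, "token1"))

# Later, pick one of the saved tokens at random before connecting
token_files <- list.files(tokens_folder, full.names = TRUE)
load(sample(token_files, 1))
```

Rotating tokens this way is one simple strategy for spreading requests across rate limits.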
Collecting tweets filtering by keyword:
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Warning: package 'rjson' was built under R version 3.4.4
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
filterStream(file.name="../data/trump-streaming-tweets.json", track="trump",
timeout=20, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 20 seconds with up to 707 tweets downloaded.
Note the options:
- file.name indicates the file on your disk where the tweets will be downloaded
- track is the keyword(s) mentioned in the tweets we want to capture
- timeout is the number of seconds that the connection will remain open
- oauth is the OAuth token we are using
Once it has finished, we can open it in R as a data frame with the parseTweets function:
tweets <- parseTweets("../data/trump-streaming-tweets.json")
## 653 tweets have been parsed.
tweets[1,]
## text
## 1 RT @Iran: Iran tells @realDonaldTrump: Stop tweeting, it’s driving up oil prices\n\n#Iran #US #oilprice #OOTT @VezaratNaft\n\nhttps://t.co/Hqfq…
## retweet_count favorite_count favorited truncated id_str
## 1 2 1 FALSE FALSE 1014828371438694401
## in_reply_to_screen_name
## 1 <NA>
## source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 1 FALSE Thu Jul 05 11:08:33 +0000 2018 <NA>
## in_reply_to_user_id_str lang listed_count verified location user_id_str
## 1 <NA> en 36 FALSE <NA> 2748933258
## description geo_enabled user_created_at statuses_count
## 1 <NA> FALSE Wed Aug 20 13:01:35 +0000 2014 16270
## followers_count favourites_count protected user_url name time_zone
## 1 402 4586 FALSE <NA> Mr Khan NA
## user_lang utc_offset friends_count screen_name country_code country
## 1 en-gb NA 5000 evergreatkhan <NA> <NA>
## place_type full_name place_name place_id place_lat place_lon lat lon
## 1 NA <NA> <NA> <NA> NaN NaN NA NA
## expanded_url url
## 1 <NA> <NA>
If we want, we could also export it to a CSV file that can be opened later with Excel:
write.csv(tweets, file="../data/trump-streaming-tweets.csv", row.names=FALSE)
And this is how we would capture tweets mentioning multiple keywords:
filterStream(file.name="../data/politics-tweets.json",
track=c("graham", "sessions", "trump", "clinton"),
tweets=20, oauth=my_oauth)
Note that here I choose a different option, tweets, which indicates how many tweets (approximately) the function should capture before we close the connection to the Twitter API.
This second example shows how to collect tweets filtering by location instead. In other words, we can set a geographical box and collect only the tweets that are coming from that area.
For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it’s not (lat, long), but (long, lat).
In the case of the US, it would be approx. (-125,25) and (-66,50). How to find these coordinates? I use: http://itouchmap.com/latlong.html
filterStream(file.name="../data/tweets_geo.json", locations=c(-125, 25, -66, 50),
timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 152 tweets downloaded.
We can do as before and open the tweets in R
tweets <- parseTweets("../data/tweets_geo.json")
## 304 tweets have been parsed.
And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: lat/lon (from geolocated tweets) and place_lat/place_lon (from tweets with place information). We will work with whatever is available.
library(maps)
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
tweets <- tweets[!is.na(tweets$lat),]
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
## states
## california pennsylvania florida
## 23 23 16
## texas ohio massachusetts:main
## 15 14 12
We can also prepare a map of the exact locations of the tweets.
library(ggplot2)
## First create a data frame with the map data
map.data <- map_data("state")
# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90",
color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
# 2) limits for x and y axis
scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) +
# 3) adding the dot for each tweet
geom_point(data = tweets,
aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
# 4) removing unnecessary graph elements
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
## Warning: Removed 2 rows containing missing values (geom_point).
And here’s how to extract the edges of a network of retweets (at least one possible way of doing it):
tweets <- parseTweets("../data/trump-streaming-tweets.json")
## 653 tweets have been parsed.
# subset only RTs
rts <- tweets[grep("RT @", tweets$text),]
edges <- data.frame(
node1 = rts$screen_name,
  node2 = gsub('.*RT @([a-zA-Z0-9_]+):? ?.*', "\\1", rts$text),
  stringsAsFactors = FALSE
)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g <- graph_from_data_frame(d=edges, directed=TRUE)
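The resulting igraph object can then be summarized with standard network statistics; for instance, in this network the in-degree of a node counts how many times that account was retweeted in our sample. A toy sketch with made-up edges (all names are hypothetical):

```r
library(igraph)

# Made-up edge list with the same structure as the retweet edges above:
# node1 retweeted node2
toy_edges <- data.frame(node1 = c("alice", "bob", "carol"),
                        node2 = c("dave", "dave", "erin"),
                        stringsAsFactors = FALSE)
g_toy <- graph_from_data_frame(d = toy_edges, directed = TRUE)

# In-degree = number of incoming edges = times each account was retweeted;
# here dave (2) and erin (1) come first
sort(degree(g_toy, mode = "in"), decreasing = TRUE)
```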
Finally, it’s also possible to collect a random sample of tweets. That’s what the “sampleStream” function does:
sampleStream(file.name="../data/tweets_random.json", timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 2421 tweets downloaded.
Here I’m collecting 30 seconds of tweets. And once again, to open the tweets in R…
tweets <- parseTweets("../data/tweets_random.json")
## 1269 tweets have been parsed.
What is the most retweeted tweet?
tweets[which.max(tweets$retweet_count),]
## text
## 502 RT @FIFAWorldCup: Turn it up and feel the beat! \n\nFrom the final four, which beat do you want heard inside the stadium? VOTE NOW! \U0001f3b6 \n\n#FIFA…
## retweet_count favorite_count favorited truncated id_str
## 502 359215 370167 FALSE FALSE 1014829334664630279
## in_reply_to_screen_name
## 502 <NA>
## source
## 502 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## retweeted created_at in_reply_to_status_id_str
## 502 FALSE Thu Jul 05 11:12:22 +0000 2018 <NA>
## in_reply_to_user_id_str lang listed_count verified location
## 502 <NA> en 0 FALSE hinahanap ko pa po
## user_id_str
## 502 888357725486198788
## description
## 502 \U0001f430\U0001f348\U0001f351\U0001f439\U0001f984\U0001f427\U0001f985\U0001f42f\U0001f436
## geo_enabled user_created_at statuses_count
## 502 FALSE Fri Jul 21 11:19:21 +0000 2017 5716
## followers_count favourites_count protected
## 502 252 31768 FALSE
## user_url name time_zone user_lang
## 502 https://www.instagram.com/_muylindo/ 雷克萨斯 NA en
## utc_offset friends_count screen_name country_code country place_type
## 502 NA 124 LexusEntienza <NA> <NA> NA
## full_name place_name place_id place_lat place_lon lat lon expanded_url
## 502 <NA> <NA> <NA> NaN NaN NA NA <NA>
## url
## 502 <NA>
What are the most popular hashtags at the moment? We’ll use regular expressions to extract hashtags.
library(stringr)
ht <- str_extract_all(tweets$text, "#(\\d|\\w)+")
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
## ht
## #FIFAFakeLove #FIFAxEXOPower #WorldCupEXOPower #EXO
## 36 27 21 7
## #BTS #방탄소년단
## 6 5
And who are the most frequently mentioned users?
users <- str_extract_all(tweets$text, '@[a-zA-Z0-9_]+')
users <- unlist(users)
head(sort(table(users), decreasing = TRUE))
## users
## @BTS_twt @FIFAWorldCup @kentaro_s_711 @weareoneEXO @YouTube
## 47 25 22 21 6
## @bts_bighit
## 3
How many tweets mention Justin Bieber?
length(grep("bieber", tweets$text, ignore.case=TRUE))
## [1] 0
These are toy examples, but for large files with tweets in JSON format, there might be faster ways to parse the data. For example, the ndjson package offers a robust and fast way to parse JSON data:
library(ndjson)
json <- stream_in("../data/tweets_geo.json")
json
## Source: local data table [304 x 1,174]
##
## # A tibble: 304 x 1,174
## contributors coordinates created_at
## <int> <int> <chr>
## 1 NA NA Thu Jul 05 11:08:53 +0000 2018
## 2 NA NA Thu Jul 05 11:08:54 +0000 2018
## 3 NA NA Thu Jul 05 11:08:54 +0000 2018
## 4 NA NA Thu Jul 05 11:08:54 +0000 2018
## 5 NA NA Thu Jul 05 11:08:54 +0000 2018
## 6 NA NA Thu Jul 05 11:08:54 +0000 2018
## 7 NA NA Thu Jul 05 11:08:54 +0000 2018
## 8 NA NA Thu Jul 05 11:08:55 +0000 2018
## 9 NA NA Thu Jul 05 11:08:55 +0000 2018
## 10 NA NA Thu Jul 05 11:08:55 +0000 2018
## # ... with 294 more rows, and 1171 more variables:
## # display_text_range.0 <dbl>, display_text_range.1 <dbl>,
## # entities.hashtags <int>, entities.symbols <int>,
## # entities.urls.0.display_url <chr>, entities.urls.0.expanded_url <chr>,
## # entities.urls.0.indices.0 <dbl>, entities.urls.0.indices.1 <dbl>,
## # entities.urls.0.url <chr>, entities.user_mentions <int>,
## # favorite_count <dbl>, favorited <lgl>, filter_level <chr>, geo <int>,
## # id <dbl>, id_str <chr>, in_reply_to_screen_name <chr>,
## # in_reply_to_status_id <dbl>, in_reply_to_status_id_str <chr>,
## # in_reply_to_user_id <dbl>, in_reply_to_user_id_str <chr>,
## # is_quote_status <lgl>, lang <chr>, place.attributes <int>,
## # place.bounding_box.coordinates.0.0.0 <dbl>,
## # place.bounding_box.coordinates.0.0.1 <dbl>,
## # place.bounding_box.coordinates.0.1.0 <dbl>,
## # place.bounding_box.coordinates.0.1.1 <dbl>,
## # place.bounding_box.coordinates.0.2.0 <dbl>,
## # place.bounding_box.coordinates.0.2.1 <dbl>,
## # place.bounding_box.coordinates.0.3.0 <dbl>,
## # place.bounding_box.coordinates.0.3.1 <dbl>,
## # place.bounding_box.type <chr>, place.country <chr>,
## # place.country_code <chr>, place.full_name <chr>, place.id <chr>,
## # place.name <chr>, place.place_type <chr>, place.url <chr>,
## # possibly_sensitive <lgl>, quote_count <dbl>,
## # quoted_status.contributors <int>, quoted_status.coordinates <int>,
## # quoted_status.created_at <chr>,
## # quoted_status.display_text_range.0 <dbl>,
## # quoted_status.display_text_range.1 <dbl>,
## # quoted_status.entities.hashtags <int>,
## # quoted_status.entities.symbols <int>,
## # quoted_status.entities.urls <int>,
## # quoted_status.entities.user_mentions.0.id <dbl>,
## # quoted_status.entities.user_mentions.0.id_str <chr>,
## # quoted_status.entities.user_mentions.0.indices.0 <dbl>,
## # quoted_status.entities.user_mentions.0.indices.1 <dbl>,
## # quoted_status.entities.user_mentions.0.name <chr>,
## # quoted_status.entities.user_mentions.0.screen_name <chr>,
## # quoted_status.favorite_count <dbl>, quoted_status.favorited <lgl>,
## # quoted_status.filter_level <chr>, quoted_status.geo <int>,
## # quoted_status.id <dbl>, quoted_status.id_str <chr>,
## # quoted_status.in_reply_to_screen_name <chr>,
## # quoted_status.in_reply_to_status_id <dbl>,
## # quoted_status.in_reply_to_status_id_str <chr>,
## # quoted_status.in_reply_to_user_id <dbl>,
## # quoted_status.in_reply_to_user_id_str <chr>,
## # quoted_status.is_quote_status <lgl>, quoted_status.lang <chr>,
## # quoted_status.place <int>, quoted_status.quote_count <dbl>,
## # quoted_status.reply_count <dbl>, quoted_status.retweet_count <dbl>,
## # quoted_status.retweeted <lgl>, quoted_status.source <chr>,
## # quoted_status.text <chr>, quoted_status.truncated <lgl>,
## # quoted_status.user.contributors_enabled <lgl>,
## # quoted_status.user.created_at <chr>,
## # quoted_status.user.default_profile <lgl>,
## # quoted_status.user.default_profile_image <lgl>,
## # quoted_status.user.description <chr>,
## # quoted_status.user.favourites_count <dbl>,
## # quoted_status.user.follow_request_sent <int>,
## # quoted_status.user.followers_count <dbl>,
## # quoted_status.user.following <int>,
## # quoted_status.user.friends_count <dbl>,
## # quoted_status.user.geo_enabled <lgl>, quoted_status.user.id <dbl>,
## # quoted_status.user.id_str <chr>,
## # quoted_status.user.is_translator <lgl>, quoted_status.user.lang <chr>,
## # quoted_status.user.listed_count <dbl>,
## # quoted_status.user.location <chr>, quoted_status.user.name <chr>,
## # quoted_status.user.notifications <int>,
## # quoted_status.user.profile_background_color <chr>,
## # quoted_status.user.profile_background_image_url <chr>,
## # quoted_status.user.profile_background_image_url_https <chr>,
## # quoted_status.user.profile_background_tile <lgl>, ...
Now it’s your turn to practice! Let’s do our first challenge of today’s workshop.
The code below provides an alternative way to create an OAuth token, which you can then save to disk.
Follow these steps to create your token:
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "YOUR_CONSUMER_KEY"
consumerSecret <- "YOUR_CONSUMER_SECRET"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
What can go wrong here? Make sure the consumer key and consumer secret are pasted as is, without any extra space characters. If you don’t see any output in the console after running the code above, that’s a good sign.
Run the line below and go to the URL that appears on screen. Then, type the PIN into the console (RStudio sometimes doesn’t show what you’re typing, but it’s there!).
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
Now you can save the OAuth token for use in future sessions with tweetscores or streamR. Make sure you save it in a folder where this is the only file.
save(my_oauth, file="../credentials/twitter-token.Rdata")
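In a future session, you can then restore the token before connecting to the API, for example:

```r
# Restore the token saved above; this recreates the my_oauth object
load("../credentials/twitter-token.Rdata")
```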