Scraping web data from Twitter

Authenticating

Follow these steps to create your token:

  1. Go to apps.twitter.com and sign in.
  2. Click on “Create New App”. You will need a phone number associated with your account in order to create a token.
  3. Fill in the name, description, and website (it can be anything, even http://www.google.com). Make sure you leave ‘Callback URL’ empty.
  4. Agree to the user conditions.
  5. From the “Keys and Access Tokens” tab, copy the consumer key and consumer secret and paste them below:
#install.packages("ROAuth")
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "YOUR_CONSUMER_KEY"
consumerSecret <- "YOUR_CONSUMER_SECRET"

my_oauth <- OAuthFactory$new(consumerKey=consumerKey,
  consumerSecret=consumerSecret, requestURL=requestURL,
  accessURL=accessURL, authURL=authURL)

Run the line below and go to the URL that appears on screen. Then type the PIN into the console (RStudio sometimes doesn’t show what you’re typing, but it’s there!).

my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

Now you can save the OAuth token for use in future sessions with netdemR or streamR. Make sure you save it in a folder where it is the only file.

save(my_oauth, file="../credentials/twitter-token.Rdata")
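save and load work with any R object, so the token round-trips between sessions like this. The sketch below uses a temporary file and a dummy list standing in for the real token, just to illustrate the mechanics:

```r
# Dummy object standing in for my_oauth (illustration only)
dummy_token <- list(consumerKey = "abc", consumerSecret = "def")

# Save to a temporary .Rdata file, remove the object, then restore it
token_file <- tempfile(fileext = ".Rdata")
save(dummy_token, file = token_file)
rm(dummy_token)
load(token_file)   # restores dummy_token into the workspace

dummy_token$consumerKey
# [1] "abc"
```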

Collecting data from Twitter’s Streaming API

Collecting tweets filtering by keyword:

library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
load("../credentials/twitter-token.Rdata")
filterStream(file.name="trump-tweets.json", track="trump", 
    timeout=20, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 20 seconds with up to 1029 tweets downloaded.

Note the options:
- file.name indicates the file on your disk where the tweets will be downloaded
- track is the keyword(s) mentioned in the tweets we want to capture
- timeout is the number of seconds that the connection will remain open
- oauth is the OAuth token we are using

Once it has finished, we can open the file in R as a data frame with the parseTweets function.

tweets <- parseTweets("trump-tweets.json")
## Warning in vect[notnulls] <- unlist(lapply(lst[notnulls], function(x)
## x[[field[1]]][[field[2]]][[as.numeric(field[3])]][[field[4]]])): number of
## items to replace is not a multiple of replacement length
## 291 tweets have been parsed.
tweets[1,]
##                                                                                                                                           text
## 1 @WhiteHouse @HouseofCommons @cducsubt @veteranstoday @nytimes @POTUS @tagesschau @ZDF @BBC @polizei_nrw_k @SZ @Zeit… https://t.co/gnUF8eq6w2
##   retweet_count favorited truncated             id_str
## 1             0     FALSE      TRUE 892693802854608897
##   in_reply_to_screen_name
## 1         marcohoffmann67
##                                                                                 source
## 1 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##   retweeted                     created_at in_reply_to_status_id_str
## 1     FALSE Wed Aug 02 10:29:22 +0000 2017        892687928182398977
##   in_reply_to_user_id_str lang listed_count verified        location
## 1              1217700500   de           36    FALSE Veddel, Hamburg
##   user_id_str description geo_enabled                user_created_at
## 1  1217700500    I love u       FALSE Mon Feb 25 08:53:01 +0000 2013
##   statuses_count followers_count favourites_count protected user_url
## 1          20928              48               20     FALSE     <NA>
##             name time_zone user_lang utc_offset friends_count
## 1 Marco Hoffmann    Berlin        de       7200            12
##       screen_name country_code country place_type full_name place_name
## 1 marcohoffmann67         <NA>    <NA>       <NA>      <NA>         NA
##   place_id place_lat place_lon lat lon
## 1       NA       NaN       NaN  NA  NA
##                                          expanded_url
## 1 https://twitter.com/i/web/status/892693802854608897
##                       url
## 1 https://t.co/gnUF8eq6w2

If we want, we could also export it to a CSV file to be opened later with Excel.

write.csv(tweets, file="trump-tweets.csv", row.names=FALSE)

And this is how we would capture tweets mentioning multiple keywords:

filterStream(file.name="politics-tweets.json", 
    track=c("graham", "sessions", "trump", "clinton"),
    tweets=20, oauth=my_oauth)

Note that here I choose a different option, tweets, which indicates how many tweets (approximately) the function should capture before we close the connection to the Twitter API.

This second example shows how to collect tweets filtering by location instead. In other words, we can set a geographical box and collect only the tweets that are coming from that area.

For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it’s not (lat, long), but (long, lat).

In the case of the US, it is approximately (-125, 25) and (-66, 50). How do you find these coordinates? I use: http://itouchmap.com/latlong.html
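As a quick sanity check on the coordinate order, here is a small base R sketch (the helper function and example points are my own, not part of streamR) that tests whether a (longitude, latitude) pair falls inside the box:

```r
# Bounding box in streamR order: sw_lon, sw_lat, ne_lon, ne_lat
us_box <- c(-125, 25, -66, 50)

# Hypothetical helper: TRUE if a (lon, lat) point is inside the box
in_box <- function(lon, lat, box) {
  lon >= box[1] & lon <= box[3] & lat >= box[2] & lat <= box[4]
}

in_box(-77.04, 38.91, us_box)  # Washington, DC: TRUE
in_box(2.35, 48.86, us_box)    # Paris: FALSE
```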

filterStream(file.name="tweets_geo.json", locations=c(-125, 25, -66, 50), 
    timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 258 tweets downloaded.

We can proceed as before and open the tweets in R:

tweets <- parseTweets("tweets_geo.json")
## 144 tweets have been parsed.

And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: lat/lon (from geolocated tweets) and place_lat and place_lon (from tweets with place information). We will work with whatever is available.

library(maps)
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
## states
##    california         texas       florida       georgia  pennsylvania 
##            16             9             8             7             7 
## virginia:main 
##             7

We can also prepare a map of the exact locations of the tweets.

library(ggplot2)

## First create a data frame with the map data 
map.data <- map_data("state")

# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90", 
    color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) + 
    # 2) limits for x and y axis
    scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) +
    # 3) adding the dot for each tweet
    geom_point(data = tweets, 
    aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
    # 4) removing unnecessary graph elements
    theme(axis.line = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        axis.title = element_blank(), 
        panel.background = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        plot.background = element_blank()) 

And here’s how to extract the edges of a network of retweets (at least one possible way of doing it):

tweets <- parseTweets("trump-tweets.json")
## Warning in vect[notnulls] <- unlist(lapply(lst[notnulls], function(x)
## x[[field[1]]][[field[2]]][[as.numeric(field[3])]][[field[4]]])): number of
## items to replace is not a multiple of replacement length
## 291 tweets have been parsed.
# subset only RTs
rts <- tweets[grep("RT @", tweets$text),]

edges <- data.frame(
  node1 = rts$screen_name,
  node2 = gsub('.*RT @([a-zA-Z0-9_]+):? ?.*', "\\1", rts$text),
  stringsAsFactors=F
)
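To see what the regular expression captures, here it is applied to a made-up tweet (both the text and the screen name are invented for illustration):

```r
example <- "RT @some_user: this is the retweeted text"

# The pattern keeps only the first screen name following "RT @"
gsub('.*RT @([a-zA-Z0-9_]+):? ?.*', "\\1", example)
# [1] "some_user"
```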

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
g <- graph_from_data_frame(d=edges, directed=TRUE)
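In the resulting directed graph, an account’s in-degree counts how often it was retweeted. The same ranking can be computed with base R directly on the edge list; the edges below are made up for illustration:

```r
# Hypothetical edge list: node1 retweeted node2
toy_edges <- data.frame(
  node1 = c("alice", "bob", "carol"),
  node2 = c("dave", "dave", "alice"),
  stringsAsFactors = FALSE
)

# Accounts ranked by how often they were retweeted (in-degree of node2)
sort(table(toy_edges$node2), decreasing = TRUE)
```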

Finally, it’s also possible to collect a random sample of tweets. That’s what the “sampleStream” function does:

sampleStream(file.name="tweets_random.json", timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 3743 tweets downloaded.

Here I’m collecting 30 seconds of tweets. And once again, to open the tweets in R…

tweets <- parseTweets("tweets_random.json")
## Warning in vect[notnulls] <- unlist(lapply(lst[notnulls], function(x)
## x[[field[1]]][[field[2]]][[as.numeric(field[3])]][[field[4]]])): number of
## items to replace is not a multiple of replacement length
## 1331 tweets have been parsed.

What is the most retweeted tweet?

tweets[which.max(tweets$retweet_count),]
##                                                                                                                                                                                                                                            text
## 68 RT @kindai_boys: 「任天堂スイッチスプラトゥーン2同梱版」\n1台プレゼントします!\n\n応募方法は\nこのツイートをリツイートのみ!\n\n当選連絡はDMで行うのでフォローまたはDM解放お願いします。\n\n⬇︎因みにユーチューブでは20台プレゼントしてます… 
##    retweet_count favorited truncated             id_str
## 68        161065     FALSE     FALSE 892694033369149440
##    in_reply_to_screen_name
## 68                    <NA>
##                                                                                source
## 68 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##    retweeted                     created_at in_reply_to_status_id_str
## 68     FALSE Wed Aug 02 10:30:17 +0000 2017                      <NA>
##    in_reply_to_user_id_str lang listed_count verified location
## 68                    <NA>   ja            0    FALSE     <NA>
##           user_id_str description geo_enabled
## 68 892394790620282881  Youtuber垢       FALSE
##                   user_created_at statuses_count followers_count
## 68 Tue Aug 01 14:41:12 +0000 2017              2               2
##    favourites_count protected user_url             name time_zone
## 68                1     FALSE     <NA> あんず@YouTube垢      <NA>
##    user_lang utc_offset friends_count screen_name country_code country
## 68        ja         NA            44    uni_0820         <NA>    <NA>
##    place_type full_name place_name place_id place_lat place_lon lat lon
## 68       <NA>      <NA>         NA       NA       NaN       NaN  NA  NA
##            expanded_url url
## 68 http://bit.ly/xlOqWT

What are the most popular hashtags at the moment? We’ll use regular expressions to extract hashtags.

library(stringr)
## 
## Attaching package: 'stringr'
## The following object is masked from 'package:igraph':
## 
##     %>%
ht <- str_extract_all(tweets$text, "#(\\d|\\w)+")
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
## ht
##  #MTVHottest  #bucaescort #izmirescort        #지수  #プレゼント 
##           20           10           10            6            5 
##       #อิมเมจ 
##            5
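If you’d rather not load stringr, the same hashtag extraction works in base R with gregexpr and regmatches; the example texts here are invented:

```r
texts <- c("Loving #rstats and #dataviz", "no hashtags here", "#rstats again")

# Extract all hashtag matches from each text, then flatten into one vector
ht <- unlist(regmatches(texts, gregexpr("#(\\d|\\w)+", texts)))
sort(table(ht), decreasing = TRUE)
```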

And who are the most frequently mentioned users?

users <- str_extract_all(tweets$text, '@[a-zA-Z0-9_]+')
users <- unlist(users)
head(sort(table(users), decreasing = TRUE))
## users
##  @zawabakogesamu           @CG_jp       @pinkdixry @Gurmeetramrahim 
##               10                8                6                5 
##  @IZUMI_Products         @YouTube 
##                4                4

How many tweets mention Justin Bieber?

length(grep("bieber", tweets$text, ignore.case=TRUE))
## [1] 8
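grep returns the positions of the matching elements, so wrapping it in length gives a count; ignore.case = TRUE makes the match case-insensitive, as this toy example (with invented texts) shows:

```r
texts <- c("I love Justin Bieber", "nothing relevant", "BIEBER fever!")

# Counts both "Bieber" and "BIEBER" thanks to ignore.case
length(grep("bieber", texts, ignore.case = TRUE))
# [1] 2
```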