Before we can start collecting Twitter data, we need to create an OAuth token that will allow us to authenticate our connection and access our personal data.
After the new API changes, getting a new token requires submitting an application for a developer account, which may take a few days. For teaching purposes only, I will temporarily share one of my tokens with each of you, so that we can use the API without having to go through the authentication process ourselves.
However, if in the future you want to get your own token, here's how you would do it: once your developer account is approved, create a new app in the Twitter developer portal and generate its consumer key, consumer secret, access token, and access token secret. Then paste those four credentials into the code below to create and save your token:
library(ROAuth)
my_oauth <- list(consumer_key = "CONSUMER_KEY",
                 consumer_secret = "CONSUMER_SECRET",
                 access_token = "ACCESS_TOKEN",
                 access_token_secret = "ACCESS_TOKEN_SECRET")
save(my_oauth, file="~/my_oauth")
load("~/my_oauth")
What can go wrong here? Make sure all the consumer and access token keys are pasted exactly as they appear, without any additional space characters. If you don’t see any output in the console after running the code above, that’s a good sign.
Note that I saved the list as a file on my hard drive. That will save us some time later on, but you could also just re-run the code above that creates my_oauth before connecting to the API in the future.
To check that it worked, try running the line below:
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (LSE)
## ## www.tweetscores.com
## ##
getUsers(screen_names="LSEnews", oauth = my_oauth)[[1]]$screen_name
## [1] "LSEnews"
If this displays LSEnews, then we’re good to go!
Some of the functions below will work with more than one token. If you want to save multiple tokens, see the instructions at the end of the file.
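For reference, here is a minimal sketch of one way to keep several tokens around (the folder name and file layout below are just an illustration, and may differ from the instructions at the end of the file): create each token with the code above and save it to its own file inside a dedicated folder.

# illustration only: one file per token, all inside the same folder
dir.create("~/credentials", showWarnings=FALSE)
save(my_oauth, file="~/credentials/token1")
# rebuild my_oauth with the credentials of a second app, then:
# save(my_oauth, file="~/credentials/token2")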
Collecting tweets filtering by keyword:
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Warning: package 'rjson' was built under R version 3.4.4
## Loading required package: ndjson
## Warning: package 'ndjson' was built under R version 3.4.4
filterStream(file.name="~/data/trump-streaming-tweets.json", track="trump",
             timeout=20, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 20 seconds with up to 550 tweets downloaded.
Note the options:
- file.name indicates the file on your disk where the tweets will be downloaded
- track is the keyword(s) mentioned in the tweets we want to capture
- timeout is the number of seconds that the connection will remain open
- oauth is the OAuth token we are using
Once it has finished, we can open it in R as a data frame with the parseTweets function:
tweets <- parseTweets("~/data/trump-streaming-tweets.json")
## 326 tweets have been parsed.
tweets[1,]
## text
## 1 RT @tonyposnanski: Lebron James has done more for education than Betsy DeVos, more for charity than Donald Trump, and more for inner cities…
## retweet_count favorite_count favorited truncated id_str
## 1 13580 43680 FALSE FALSE 1024241236054441984
## in_reply_to_screen_name
## 1 <NA>
## source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 1 FALSE Tue Jul 31 10:31:54 +0000 2018 <NA>
## in_reply_to_user_id_str lang listed_count verified location
## 1 <NA> en 24 FALSE Palmdale, CA
## user_id_str
## 1 328332576
## description
## 1 J ❤️ B 06/19/15. Jaquelyn Arce is my one & only . Basketball player Video Gamer Future Successor. Jeep Owner Laker Fan
## geo_enabled user_created_at statuses_count
## 1 TRUE Sun Jul 03 05:07:08 +0000 2011 73601
## followers_count favourites_count protected user_url name
## 1 586 9985 FALSE <NA> Brian Dominguez
## time_zone user_lang utc_offset friends_count screen_name country_code
## 1 NA en NA 451 BD23s_ <NA>
## country place_type full_name place_name place_id place_lat place_lon lat
## 1 <NA> NA <NA> <NA> <NA> NaN NaN NA
## lon expanded_url url
## 1 NA <NA> <NA>
If we want, we could also export it to a csv file to be opened later with Excel
write.csv(tweets, file="~/data/trump-streaming-tweets.csv", row.names=FALSE)
And this is how we would capture tweets mentioning multiple keywords:
filterStream(file.name="~/data/politics-tweets.json",
track=c("graham", "sessions", "trump", "clinton"),
tweets=20, oauth=my_oauth)
Note that here I choose a different option, tweets, which indicates how many tweets (approximately) the function should capture before we close the connection to the Twitter API.
This second example shows how to collect tweets filtering by location instead. In other words, we can set a geographical box and collect only the tweets that are coming from that area.
For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it’s not (lat, long), but (long, lat).
In the case of the US, it would be approx. (-125,25) and (-66,50). How to find these coordinates? I use: http://itouchmap.com/latlong.html
filterStream(file.name="~/data/tweets_geo.json", locations=c(-125, 25, -66, 50),
             timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 229 tweets downloaded.
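Just as an illustration of the same call with a different bounding box (the coordinates and file name below are only an example, and the coordinates are approximate), this is how we could collect tweets from the United Kingdom instead:

# SW corner approx. (-11, 49), NE corner approx. (2, 61); note the (long, lat) order
filterStream(file.name="~/data/tweets_uk.json", locations=c(-11, 49, 2, 61),
             timeout=30, oauth=my_oauth)

For the rest of this section, we will keep working with the US file.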
We can do as before and open the tweets in R
tweets <- parseTweets("~/data/tweets_geo.json")
## 199 tweets have been parsed.
And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: lat/lon (from geolocated tweets) and place_lat/place_lon (from tweets with place information). We will work with whatever is available.
library(maps)
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
tweets <- tweets[!is.na(tweets$lat),]
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
## states
## new york:main california new york:long island
## 36 15 11
## texas georgia new jersey
## 11 9 9
We can also prepare a map of the exact locations of the tweets.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
## First create a data frame with the map data
map.data <- map_data("state")
# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) +
  geom_map(aes(map_id = region), map = map.data, fill = "grey90",
           color = "grey50", size = 0.25) +
  expand_limits(x = map.data$long, y = map.data$lat) +
  # 2) limits for x and y axis
  scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) +
  # 3) adding the dot for each tweet
  geom_point(data = tweets,
             aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
  # 4) removing unnecessary graph elements
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_blank())
## Warning: Removed 1 rows containing missing values (geom_point).
And here’s how to extract the edges of a network of retweets (at least one possible way of doing it):
tweets <- parseTweets("~/data/trump-streaming-tweets.json")
## 326 tweets have been parsed.
# subset only RTs
rts <- tweets[grep("RT @", tweets$text),]
edges <- data.frame(
  node1 = rts$screen_name,
  node2 = gsub('.*RT @([a-zA-Z0-9_]+):? ?.*', "\\1", rts$text),
  stringsAsFactors=F
)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g <- graph_from_data_frame(d=edges, directed=TRUE)
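As a quick illustration of one thing we could do with this graph (a sketch, not part of the original pipeline), we could check which accounts were retweeted most often in this sample by looking at their in-degree:

# accounts retweeted by the largest number of users in this sample (highest in-degree)
head(sort(degree(g, mode="in"), decreasing=TRUE))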
Finally, it’s also possible to collect a random sample of tweets. That’s what the “sampleStream” function does:
sampleStream(file.name="~/data/tweets_random.json", timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 1986 tweets downloaded.
Here I’m collecting 30 seconds of tweets. And once again, to open the tweets in R…
tweets <- parseTweets("~/data/tweets_random.json")
## 1096 tweets have been parsed.
What is the most retweeted tweet?
tweets[which.max(tweets$retweet_count),]
## text
## 823 RT @Jon_Christian: Bless this doggo who stole a GoPro https://t.co/tZwVdniJoQ
## retweet_count favorite_count favorited truncated id_str
## 823 293438 791017 FALSE FALSE 1024241541605466112
## in_reply_to_screen_name
## 823 <NA>
## source
## 823 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 823 FALSE Tue Jul 31 10:33:07 +0000 2018 <NA>
## in_reply_to_user_id_str lang listed_count verified location
## 823 <NA> en 2 FALSE <NA>
## user_id_str description geo_enabled user_created_at
## 823 4883255627 <NA> TRUE Sat Feb 06 22:49:15 +0000 2016
## statuses_count followers_count favourites_count protected user_url
## 823 16393 180 10840 FALSE <NA>
## name time_zone user_lang utc_offset friends_count screen_name
## 823 momone NA en NA 178 BibitefFaustin
## country_code country place_type full_name place_name place_id
## 823 <NA> <NA> NA <NA> <NA> <NA>
## place_lat place_lon lat lon expanded_url url
## 823 NaN NaN NA NA <NA> <NA>
What are the most popular hashtags at the moment? We’ll use regular expressions to extract hashtags.
library(stringr)
ht <- str_extract_all(tweets$text, "#(\\d|\\w)+")
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
## ht
## #VRoid #MTVHottest #워너원 #4 #WANNAONE #박우진
## 16 7 4 3 3 3
And who are the most frequently mentioned users?
users <- str_extract_all(tweets$text, '@[a-zA-Z0-9_]+')
users <- unlist(users)
head(sort(table(users), decreasing = TRUE))
## users
## @BTS_twt @weareoneEXO @YouTube @B1A4_gongchan @alhajitekno
## 6 4 4 3 2
## @belldelagua
## 2
How many tweets mention Justin Bieber?
length(grep("bieber", tweets$text, ignore.case=TRUE))
## [1] 0
These are toy examples, but for large files with tweets in JSON format, there might be faster ways to parse the data. For example, the ndjson package offers a robust and fast way to parse JSON data:
library(ndjson)
json <- stream_in("~/data/tweets_geo.json")
json
## Source: local data table [199 x 783]
##
## # A tibble: 199 x 783
## contributors coordinates created_at
## <int> <int> <chr>
## 1 NA NA Tue Jul 31 10:32:15 +0000 2018
## 2 NA NA Tue Jul 31 10:32:16 +0000 2018
## 3 NA NA Tue Jul 31 10:32:16 +0000 2018
## 4 NA NA Tue Jul 31 10:32:16 +0000 2018
## 5 NA NA Tue Jul 31 10:32:16 +0000 2018
## 6 NA NA Tue Jul 31 10:32:16 +0000 2018
## 7 NA NA Tue Jul 31 10:32:16 +0000 2018
## 8 NA NA Tue Jul 31 10:32:17 +0000 2018
## 9 NA NA Tue Jul 31 10:32:17 +0000 2018
## 10 NA NA Tue Jul 31 10:32:17 +0000 2018
## # ... with 189 more rows, and 780 more variables:
## # display_text_range.0 <dbl>, display_text_range.1 <dbl>,
## # entities.hashtags <int>, entities.media.0.display_url <chr>,
## # entities.media.0.expanded_url <chr>, entities.media.0.id <dbl>,
## # entities.media.0.id_str <chr>, entities.media.0.indices.0 <dbl>,
## # entities.media.0.indices.1 <dbl>, entities.media.0.media_url <chr>,
## # entities.media.0.media_url_https <chr>,
## # entities.media.0.sizes.large.h <dbl>,
## # entities.media.0.sizes.large.resize <chr>,
## # entities.media.0.sizes.large.w <dbl>,
## # entities.media.0.sizes.medium.h <dbl>,
## # entities.media.0.sizes.medium.resize <chr>,
## # entities.media.0.sizes.medium.w <dbl>,
## # entities.media.0.sizes.small.h <dbl>,
## # entities.media.0.sizes.small.resize <chr>,
## # entities.media.0.sizes.small.w <dbl>,
## # entities.media.0.sizes.thumb.h <dbl>,
## # entities.media.0.sizes.thumb.resize <chr>,
## # entities.media.0.sizes.thumb.w <dbl>, entities.media.0.type <chr>,
## # entities.media.0.url <chr>, entities.symbols <int>,
## # entities.urls <int>, entities.user_mentions <int>,
## # extended_entities.media.0.display_url <chr>,
## # extended_entities.media.0.expanded_url <chr>,
## # extended_entities.media.0.id <dbl>,
## # extended_entities.media.0.id_str <chr>,
## # extended_entities.media.0.indices.0 <dbl>,
## # extended_entities.media.0.indices.1 <dbl>,
## # extended_entities.media.0.media_url <chr>,
## # extended_entities.media.0.media_url_https <chr>,
## # extended_entities.media.0.sizes.large.h <dbl>,
## # extended_entities.media.0.sizes.large.resize <chr>,
## # extended_entities.media.0.sizes.large.w <dbl>,
## # extended_entities.media.0.sizes.medium.h <dbl>,
## # extended_entities.media.0.sizes.medium.resize <chr>,
## # extended_entities.media.0.sizes.medium.w <dbl>,
## # extended_entities.media.0.sizes.small.h <dbl>,
## # extended_entities.media.0.sizes.small.resize <chr>,
## # extended_entities.media.0.sizes.small.w <dbl>,
## # extended_entities.media.0.sizes.thumb.h <dbl>,
## # extended_entities.media.0.sizes.thumb.resize <chr>,
## # extended_entities.media.0.sizes.thumb.w <dbl>,
## # extended_entities.media.0.type <chr>,
## # extended_entities.media.0.url <chr>,
## # extended_entities.media.0.video_info.aspect_ratio.0 <dbl>,
## # extended_entities.media.0.video_info.aspect_ratio.1 <dbl>,
## # extended_entities.media.0.video_info.variants.0.bitrate <dbl>,
## # extended_entities.media.0.video_info.variants.0.content_type <chr>,
## # extended_entities.media.0.video_info.variants.0.url <chr>,
## # favorite_count <dbl>, favorited <lgl>, filter_level <chr>, geo <int>,
## # id <dbl>, id_str <chr>, in_reply_to_screen_name <chr>,
## # in_reply_to_status_id <dbl>, in_reply_to_status_id_str <chr>,
## # in_reply_to_user_id <dbl>, in_reply_to_user_id_str <chr>,
## # is_quote_status <lgl>, lang <chr>, place.attributes <int>,
## # place.bounding_box.coordinates.0.0.0 <dbl>,
## # place.bounding_box.coordinates.0.0.1 <dbl>,
## # place.bounding_box.coordinates.0.1.0 <dbl>,
## # place.bounding_box.coordinates.0.1.1 <dbl>,
## # place.bounding_box.coordinates.0.2.0 <dbl>,
## # place.bounding_box.coordinates.0.2.1 <dbl>,
## # place.bounding_box.coordinates.0.3.0 <dbl>,
## # place.bounding_box.coordinates.0.3.1 <dbl>,
## # place.bounding_box.type <chr>, place.country <chr>,
## # place.country_code <chr>, place.full_name <chr>, place.id <chr>,
## # place.name <chr>, place.place_type <chr>, place.url <chr>,
## # possibly_sensitive <lgl>, quote_count <dbl>, reply_count <dbl>,
## # retweet_count <dbl>, retweeted <lgl>, source <chr>, text <chr>,
## # timestamp_ms <chr>, truncated <lgl>, user.contributors_enabled <lgl>,
## # user.created_at <chr>, user.default_profile <lgl>,
## # user.default_profile_image <lgl>, user.description <chr>,
## # user.favourites_count <dbl>, ...
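Note that stream_in reads the tweets into a flat table where nested JSON fields become dot-separated column names (e.g. user.created_at above). As a quick sketch, assuming the columns shown in the output above are present, we could then work with individual fields directly:

# e.g. the tweet text and when each tweet was posted
head(json$text)
head(json$created_at)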