Follow these steps to create your token:
#install.packages("ROAuth")
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "YOUR_CONSUMER_KEY"
consumerSecret <- "YOUR_CONSUMER_SECRET"
my_oauth <- OAuthFactory$new(consumerKey=consumerKey,
consumerSecret=consumerSecret, requestURL=requestURL,
accessURL=accessURL, authURL=authURL)
Run the line below and go to the URL that appears on screen. Then, type the PIN into the console (RStudio sometimes doesn’t display what you’re typing, but it’s there!)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
Now you can save the OAuth token for use in future sessions with netdemR or streamR. Make sure you save it in a folder where it is the only file.
save(my_oauth, file="../credentials/twitter-token.Rdata")
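To double-check that the token was saved correctly, a quick sanity check in a fresh R session might look like this (using the same path as above):
load("../credentials/twitter-token.Rdata")
# if everything worked, this should be an object of class "OAuth"
class(my_oauth)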
Collecting tweets filtered by keyword:
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
load("../credentials/twitter-token.Rdata")
filterStream(file.name="trump-tweets.json", track="trump",
timeout=20, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 20 seconds with up to 2433 tweets downloaded.
Note the options:
- file.name indicates the file on your disk where the tweets will be downloaded
- track is the keyword(s) that tweets must mention in order to be captured
- timeout is the number of seconds that the connection will remain open
- oauth is the OAuth token we are using
Once it has finished, we can open the tweets in R as a data frame with the parseTweets function:
tweets <- parseTweets("trump-tweets.json")
## 819 tweets have been parsed.
tweets[1,]
## text
## 1 @RippersZipper Christmas in Trump’s America!..the gift that keeps on giving...laser light penetration ...now that’s penetration!...
## retweet_count favorited truncated id_str
## 1 0 FALSE FALSE 922960852700897280
## in_reply_to_screen_name
## 1 RippersZipper
## source
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 1 FALSE Tue Oct 24 22:59:49 +0000 2017 918544877666881537
## in_reply_to_user_id_str lang listed_count verified location user_id_str
## 1 847949582369771520 en 15 FALSE <NA> 529783017
## description
## 1 daughter/Korean War vet,#MAGA, Conservative Christian, love my cats& dogs#2Amend...workout daily to keep my sanity+ hooked on my own endorphins!
## geo_enabled user_created_at statuses_count
## 1 TRUE Mon Mar 19 22:24:34 +0000 2012 37264
## followers_count favourites_count protected user_url name
## 1 2909 14998 FALSE <NA> Cynthia
## time_zone user_lang utc_offset friends_count
## 1 Pacific Time (US & Canada) en -25200 3319
## screen_name country_code country place_type full_name place_name
## 1 stand4honor <NA> <NA> <NA> <NA> NA
## place_id place_lat place_lon lat lon expanded_url url
## 1 NA NaN NaN NA NA <NA> <NA>
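Before exporting anything, it can be useful to explore the parsed data frame a bit; for example, this sketch counts the languages that Twitter assigned to the tweets (using the lang column shown above):
# most common tweet languages in this sample
head(sort(table(tweets$lang), decreasing=TRUE))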
If we want, we could also export it to a CSV file to be opened later with Excel:
write.csv(tweets, file="trump-tweets.csv", row.names=FALSE)
And this is how we would capture tweets mentioning multiple keywords:
filterStream(file.name="politics-tweets.json",
track=c("graham", "sessions", "trump", "clinton"),
tweets=20, oauth=my_oauth)
Note that here I chose a different option, tweets, which indicates how many tweets (approximately) the function should capture before we close the connection to the Twitter API.
This second example shows how to collect tweets filtering by location instead. In other words, we can set a geographical box and collect only the tweets that are coming from that area.
For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it’s not (lat, long), but (long, lat).
In the case of the US, those are approximately (-125, 25) and (-66, 50). How do we find these coordinates? I use http://itouchmap.com/latlong.html
filterStream(file.name="tweets_geo.json", locations=c(-125, 25, -66, 50),
timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 1381 tweets downloaded.
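The same call works for any bounding box. For example, here is a sketch capturing tweets from around New York City; the file name and coordinates below are just examples (my rough approximation of the NYC box), again as (long, lat) pairs for the southwest and northeast corners:
# southwest corner (-74.3, 40.5) and northeast corner (-73.7, 40.9)
filterStream(file.name="nyc-tweets.json", locations=c(-74.3, 40.5, -73.7, 40.9),
    timeout=30, oauth=my_oauth)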
Going back to the US tweets, we can do as before and open them in R:
tweets <- parseTweets("tweets_geo.json")
## 732 tweets have been parsed.
And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: lat/lon (from geolocated tweets) and place_lat/place_lon (from tweets with place information). We will work with whatever is available.
library(maps)
# prefer exact geolocation (lat/lon); fall back to place coordinates when missing
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
# match each tweet's coordinates to a US state
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
## states
## california texas ohio pennsylvania florida
## 102 74 34 30 27
## illinois
## 27
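If we prefer shares rather than raw counts, prop.table converts the same table into proportions:
# proportion of geolocated tweets coming from each state
head(sort(round(prop.table(table(states)), 3), decreasing=TRUE))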
We can also prepare a map of the exact locations of the tweets.
library(ggplot2)
## First create a data frame with the map data
map.data <- map_data("state")
# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90",
color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
# 2) limits for x and y axis
scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) +
# 3) adding the dot for each tweet
geom_point(data = tweets,
aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
# 4) removing unnecessary graph elements
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
## Warning: Removed 2 rows containing missing values (geom_point).
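If we want to keep the map, ggsave saves the last plot to disk (the file name and dimensions here are just examples):
# save the last ggplot to a PNG file
ggsave("tweets-map.png", width=8, height=6)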
And here’s how to extract the edges of a network of retweets (at least one possible way of doing it):
tweets <- parseTweets("trump-tweets.json")
## 819 tweets have been parsed.
# subset only RTs
rts <- tweets[grep("RT @", tweets$text),]
edges <- data.frame(
    node1 = rts$screen_name,  # the user retweeting
    node2 = gsub('.*RT @([a-zA-Z0-9_]+):? ?.*', "\\1", rts$text),  # the user retweeted
    stringsAsFactors = FALSE
)
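Before turning these edges into a graph, it is worth eyeballing a few rows to check that the regular expression extracted the retweeted users correctly:
# first few retweeter -> retweeted pairs
head(edges, 3)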
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g <- graph_from_data_frame(d=edges, directed=TRUE)
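As a quick illustration of what we can do with this graph, here is a sketch that finds the accounts retweeted most often in our sample (the nodes with the highest in-degree):
# accounts receiving the most retweet edges
head(sort(degree(g, mode="in"), decreasing=TRUE))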
Finally, it’s also possible to collect a random sample of tweets. That’s what the sampleStream function does:
sampleStream(file.name="tweets_random.json", timeout=30, oauth=my_oauth)
## Capturing tweets...
## Connection to Twitter stream was closed after 30 seconds with up to 4936 tweets downloaded.
Here I’m collecting 30 seconds of tweets. And once again, to open the tweets in R…
tweets <- parseTweets("tweets_random.json")
## Warning in readLines(tweets, encoding = "UTF-8"): incomplete final line
## found on 'tweets_random.json'
## 1379 tweets have been parsed.
What is the most retweeted tweet?
tweets[which.max(tweets$retweet_count),]
## text
## 1226 RT @lauranotclaire: I was raped when I was 7, when I had no idea what sex was and while wearing overalls and a long sleeve shirt. Fucki…
## retweet_count favorited truncated id_str
## 1226 147438 FALSE FALSE 922961200572485632
## in_reply_to_screen_name
## 1226 <NA>
## source
## 1226 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## retweeted created_at in_reply_to_status_id_str
## 1226 FALSE Tue Oct 24 23:01:12 +0000 2017 <NA>
## in_reply_to_user_id_str lang listed_count verified location
## 1226 <NA> en 0 FALSE 407
## user_id_str description geo_enabled user_created_at
## 1226 2852793753 <NA> TRUE Fri Oct 31 02:06:06 +0000 2014
## statuses_count followers_count favourites_count protected user_url
## 1226 8347 453 14592 FALSE <NA>
## name time_zone
## 1226 ileana\u2728\xed\xa0\xbd\xed\xb2\x9b Eastern Time (US & Canada)
## user_lang utc_offset friends_count screen_name country_code country
## 1226 en -14400 306 ellielaborde <NA> <NA>
## place_type full_name place_name place_id place_lat place_lon lat lon
## 1226 <NA> <NA> NA NA NaN NaN NA NA
## expanded_url url
## 1226 <NA> <NA>
What are the most popular hashtags at the moment? We’ll use regular expressions to extract hashtags.
library(stringr)
##
## Attaching package: 'stringr'
## The following object is masked from 'package:igraph':
##
## %>%
ht <- str_extract_all(tweets$text, "#(\\d|\\w)+")
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
## ht
## #MelhorClipeTVZAnitta #MelhorClipeTVZLuan
## 11 10
## #MPN #MelhorClipeTVZPablloVittar
## 5 4
## #ادعم_قايمه_الخطيب #سكس
## 4 3
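Note that this counts, say, #MAGA and #maga as two different hashtags; if we want a case-insensitive ranking, one simple option is to lowercase the extracted hashtags before counting:
# collapse case variants into a single entry
head(sort(table(tolower(ht)), decreasing=TRUE))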
And who are the most frequently mentioned users?
users <- str_extract_all(tweets$text, '@[a-zA-Z0-9_]+')
users <- unlist(users)
head(sort(table(users), decreasing = TRUE))
## users
## @realDonaldTrump @hiteffective @mohamed_rageb55 @YouTube
## 5 4 4 4
## @5HBrasil @CentralDeFasLS
## 3 3
How many tweets mention Justin Bieber?
length(grep("bieber", tweets$text, ignore.case=TRUE))
## [1] 0