To scrape data from Facebook’s API, we’ll use the Rfacebook
package.
library(Rfacebook)
## Loading required package: httr
## Loading required package: rjson
## Loading required package: httpuv
##
## Attaching package: 'Rfacebook'
## The following object is masked from 'package:methods':
##
## getGroup
To get access to the Facebook API, you need an OAuth access token. You can get yours by going to the following URL: https://developers.facebook.com/tools/explorer
Once you’re there:
1. Click on “Get Access Token”
2. Copy the long code (“Access Token”) and paste it below, replacing the fake one I wrote:
fb_oauth = 'EAACEdEose0cBAFiPzcXyDLZBVaZCvUR0ZBq3yvKS0IOU01JgYCcYuRKV9xT33pTYZAZCtbdZCEMZAihqlZBGCexN5o7g2ZCgl72cLbJQzrR8ZCFZC8DPaUW5ZCCwHoxsZCRa9IhptCY2P0i3TJwBZC8yN979Mr41gfSF3CeejrRHNxiu6aPuWBcpOfBvp65ASclJ2CjFsZD'
Now try running the following line:
getUsers("me", token=fb_oauth, private_info=TRUE)
## id name username first_name middle_name last_name gender
## 1 557698085 Pablo Barberá NA NA NA NA NA
## locale likes picture birthday location hometown relationship_status
## 1 NA NA NA NA NA NA NA
Does it return your Facebook public information? Yes? Then we’re ready to go. See also ?fbOAuth
for information on how to get a long-lived OAuth token.
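For instance, here is a minimal sketch of creating and saving a long-lived token with fbOAuth(). The app ID and secret below are placeholders; you would first need to create your own app at https://developers.facebook.com/apps.
# sketch: exchange an app ID and secret for a long-lived token (placeholder credentials)
fb_oauth <- fbOAuth(app_id="MY_APP_ID", app_secret="MY_APP_SECRET")
save(fb_oauth, file="fb_oauth.Rdata")
# in later sessions, simply run: load("fb_oauth.Rdata")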
At the moment, the only information that can be scraped from Facebook is the content of public pages.
The following line downloads the 20 most recent posts on the Facebook page of Donald Trump:
page <- getPage("DonaldTrump", token=fb_oauth, n=20, reactions=TRUE, api="v2.9")
## 20 posts
What information is available for each of these posts?
page[1,]
## id likes_count from_id from_name
## 2 153080620724_10160456890500725 30190 153080620724 Donald J. Trump
## message
## 2 Join me for a few minutes, live - at H&K Equipment in Coraopolis, Pennsylvania! TAX CUTS, TAX CUTS, TAX CUTS! Together, WE are all MAKING AMERICA GREAT AGAIN!
## created_time type
## 2 2018-01-18T20:11:27+0000 video
## link
## 2 https://www.facebook.com/DonaldTrump/videos/10160456890500725/
## story comments_count
## 2 Donald J. Trump was live — at H&K Equipment, Inc.. 10895
## shares_count love_count haha_count wow_count sad_count angry_count
## 2 3578 NA NA NA NA NA
Which post got the most likes, the most comments, and the most shares?
page[which.max(page$likes_count),]
## id likes_count from_id from_name
## 10 153080620724_10160465024010725 120389 153080620724 Donald J. Trump
## message
## 10 AMERICA FIRST!\xed\xa0\xbc\xed\xb7\xba\xed\xa0\xbc\xed\xb7\xb8
## created_time type
## 10 2018-01-20T14:38:18+0000 photo
## link
## 10 https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10160465019575725/?type=3
## story comments_count shares_count love_count haha_count wow_count
## 10 <NA> 11343 12721 NA NA NA
## sad_count angry_count
## 10 NA NA
page[which.max(page$comments_count),]
## id likes_count from_id from_name
## 15 153080620724_10160467189520725 68921 153080620724 Donald J. Trump
## message
## 15 Chuck Schumer and the Democrats continue to put the interests of illegal immigrants over those of Americans!\n\n#MAGA #MakeAmericaSafeAgain
## created_time type
## 15 2018-01-20T21:51:14+0000 video
## link story
## 15 https://www.facebook.com/DonaldTrump/videos/10160467189520725/ <NA>
## comments_count shares_count love_count haha_count wow_count sad_count
## 15 18281 28905 NA NA NA NA
## angry_count
## 15 NA
page[which.max(page$shares_count),]
## id likes_count from_id from_name
## 15 153080620724_10160467189520725 68921 153080620724 Donald J. Trump
## message
## 15 Chuck Schumer and the Democrats continue to put the interests of illegal immigrants over those of Americans!\n\n#MAGA #MakeAmericaSafeAgain
## created_time type
## 15 2018-01-20T21:51:14+0000 video
## link story
## 15 https://www.facebook.com/DonaldTrump/videos/10160467189520725/ <NA>
## comments_count shares_count love_count haha_count wow_count sad_count
## 15 18281 28905 NA NA NA NA
## angry_count
## 15 NA
What about other reactions?
page[which.max(page$love_count),]
## [1] id likes_count from_id from_name
## [5] message created_time type link
## [9] story comments_count shares_count love_count
## [13] haha_count wow_count sad_count angry_count
## <0 rows> (or 0-length row.names)
page[which.max(page$haha_count),]
## [1] id likes_count from_id from_name
## [5] message created_time type link
## [9] story comments_count shares_count love_count
## [13] haha_count wow_count sad_count angry_count
## <0 rows> (or 0-length row.names)
page[which.max(page$wow_count),]
## [1] id likes_count from_id from_name
## [5] message created_time type link
## [9] story comments_count shares_count love_count
## [13] haha_count wow_count sad_count angry_count
## <0 rows> (or 0-length row.names)
page[which.max(page$sad_count),]
## [1] id likes_count from_id from_name
## [5] message created_time type link
## [9] story comments_count shares_count love_count
## [13] haha_count wow_count sad_count angry_count
## <0 rows> (or 0-length row.names)
page[which.max(page$angry_count),]
## [1] id likes_count from_id from_name
## [5] message created_time type link
## [9] story comments_count shares_count love_count
## [13] haha_count wow_count sad_count angry_count
## <0 rows> (or 0-length row.names)
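The empty data frames above appear because the reaction counts came back as NA in this call, so which.max() returns a zero-length index. A quick, illustrative way to check which count columns actually contain data (not part of the original code) is:
# count non-missing values in every *_count column of the page data frame
colSums(!is.na(page[, grepl("_count$", names(page))]))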
Let’s do another example, looking at the Facebook page of Political Analysis:
page <- getPage("104544669596569", token=fb_oauth, n=100, reactions=TRUE, api="v2.9")
## 25 posts 50 posts 75 posts 100 posts
# most popular posts
page[which.max(page$likes_count),]
## id likes_count from_id
## 25 104544669596569_1603451839705837 56 104544669596569
## from_name
## 25 Political Analysis
## message
## 25 We just published our final virtual issue as editors of Political Analysis. This virtual issue collects eight papers, from two different areas of methodology, where the editors believe that there has been important progress made since 2010: measurement and causation. \n\nThese eight papers are available free access online until early 2018. Make sure to give them a read, they are excellent examples of the important work begin published in Political Analysis.
## created_time type
## 25 2017-10-23T15:22:38+0000 link
## link
## 25 https://www.cambridge.org/core/journals/political-analysis/special-collections/greatest-hits-2-measurement-and-causation
## story comments_count shares_count love_count haha_count wow_count
## 25 <NA> 1 20 0 0 0
## sad_count angry_count
## 25 0 0
page[which.max(page$comments_count),]
## id likes_count from_id
## 69 104544669596569_388577707859929 1 104544669596569
## from_name
## 69 Political Analysis
## message
## 69 Nominations for the Political Methodology Emerging Scholar Award? Send to the committee chair jackman@stanford.edu.
## created_time type link story comments_count shares_count
## 69 2012-05-23T19:10:31+0000 status <NA> <NA> 5 0
## love_count haha_count wow_count sad_count angry_count
## 69 0 0 0 0 0
page[which.max(page$shares_count),]
## id likes_count from_id
## 20 104544669596569_1422423774475312 43 104544669596569
## from_name
## 20 Political Analysis
## message
## 20 We hit an important milestone. There are now 300 replication studies in the journal's Dataverse. It's a pretty remarkable accomplishment, and since we've been requiring that authors provide replication materials prior to publication, we have had universal compliance. While there's still things to improve in our efforts to improve research transparency and replication, it's clear that our current practices work for authors and readers alike. \n\nWhether you are looking for code and data for a class, or materials that you can use in to learn about a particular methodology for your own research, there's a great deal of useful code and data now in the journal's Dataverse.
## created_time type
## 20 2017-04-23T00:36:57+0000 link
## link story comments_count
## 20 https://dataverse.harvard.edu/dataverse/pan <NA> 3
## shares_count love_count haha_count wow_count sad_count angry_count
## 20 26 1 0 0 0 0
We can also subset by date. For example, imagine we want to get all the posts from early November 2012 on Barack Obama’s Facebook page:
page <- getPage("barackobama", token=fb_oauth, n=100,
since='2012/11/01', until='2012/11/10')
## 25 posts 29 posts
page[which.max(page$likes_count),]
## from_id from_name message created_time type
## 4 6815841748 Barack Obama Four more years. 2012-11-07T04:15:08+0000 photo
## link
## 4 https://www.facebook.com/barackobama/photos/a.53081056748.66806.6815841748/10151255420886749/?type=3
## id story likes_count comments_count
## 4 6815841748_10151255420886749 <NA> 4819658 218575
## shares_count
## 4 658102
And if we need to, we can also extract the specific comments from each post.
post_id <- page$id[which.max(page$likes_count)]
post <- getPost(post_id, token=fb_oauth, n.comments=1000, likes=FALSE)
This is how you can view those comments:
comments <- post$comments
head(comments)
## from_id from_name message created_time
## 1 509226872540260 Jesse Talafili OBAMA ! 2012-11-07T04:15:16+0000
## 2 485613484893917 Zain Ahmed Turk yayy 2012-11-07T04:15:17+0000
## 3 675870897427 Gary D Ploski <3 2012-11-07T04:15:17+0000
## 4 802034289809838 David Furka YES 2012-11-07T04:15:18+0000
## 5 10201918108506766 Pinky Keys :X 2012-11-07T04:15:18+0000
## 6 10102278537299904 Zac Bowling Hell yes! 2012-11-07T04:15:19+0000
## likes_count comments_count id
## 1 18 0 10151255420886749_11954305
## 2 3 0 10151255420886749_11954306
## 3 2 0 10151255420886749_11954307
## 4 5 0 10151255420886749_11954309
## 5 1 0 10151255420886749_11954311
## 6 9 0 10151255420886749_11954315
Also, note that users can like comments! What is the comment that got the most likes?
comments[which.max(comments$likes_count),]
## from_id from_name message created_time
## 1 509226872540260 Jesse Talafili OBAMA ! 2012-11-07T04:15:16+0000
## likes_count comments_count id
## 1 18 0 10151255420886749_11954305
This is how you get nested comments:
page <- getPage("barackobama", token=fb_oauth, n=1)
## 1 posts
post <- getPost(page$id, token=fb_oauth, comments=TRUE, n=100, likes=FALSE)
comment <- getCommentReplies(post$comments$id[1],
token=fb_oauth, n=500, likes=TRUE)
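A quick way to inspect what getCommentReplies() returned (the element name mentioned in the comments is an assumption based on the package’s usual list structure):
# inspect the top-level structure of the returned object
str(comment, max.level=1)
# the replies themselves are typically stored in a data frame element such as comment$replies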
If we want to scrape an entire page that contains many posts, it is a good idea to embed the function within a loop and collect the data in smaller date ranges (here, three-month windows), since the API can sometimes return an error.
# list of dates to sample
dates <- seq(as.Date("2011/01/01"), as.Date("2017/08/01"), by="3 months")
n <- length(dates)-1
df <- list()
# loop over months
for (i in 1:n){
message(as.character(dates[i]))
df[[i]] <- getPage("GameOfThrones", token=fb_oauth, n=1000, since=dates[i],
until=dates[i+1], verbose=FALSE)
Sys.sleep(0.5)
}
df <- do.call(rbind, df)
write.csv(df, file="../data/gameofthrones.csv", row.names=FALSE)
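Because the API can fail for a given window, a variant of the loop above (a sketch, not part of the original code) wraps the call in tryCatch() so that one failed chunk does not stop the whole download:
# variant of the loop that skips a window if the API call errors out
df <- list()
for (i in 1:n){
  message(as.character(dates[i]))
  df[[i]] <- tryCatch(
    getPage("GameOfThrones", token=fb_oauth, n=1000, since=dates[i],
            until=dates[i+1], verbose=FALSE),
    error = function(e){ message("  request failed, skipping this window"); NULL })
  Sys.sleep(0.5)
}
df <- do.call(rbind, df)  # NULL elements from failed windows are dropped by rbind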
And we can then look at the popularity over time:
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (LSE)
## ## www.tweetscores.com
## ##
##
## Attaching package: 'tweetscores'
## The following objects are masked from 'package:Rfacebook':
##
## getFriends, getUsers
library(magrittr) # provides the %>% pipe used below
library(stringr)
library(reshape2)
df <- read.csv("../data/gameofthrones.csv", stringsAsFactors=FALSE)
# parse date into month
df$month <- df$created_time %>% str_sub(1, 7) %>% paste0("-01") %>% as.Date()
# computing average by month
metrics <- aggregate(cbind(likes_count, comments_count, shares_count) ~ month,
data=df, FUN=mean)
# reshaping into long format
metrics <- melt(metrics, id.vars="month")
# visualize evolution in metric
library(ggplot2)
library(scales)
ggplot(metrics, aes(x = month, y = value, group = variable)) +
geom_line(aes(color = variable)) +
scale_x_date(date_breaks = "years", labels = date_format("%Y")) +
scale_y_log10("Average count per post",
breaks = c(10, 100, 1000, 10000, 100000, 200000), labels=scales::comma) +
theme_bw() + theme(axis.title.x = element_blank())
Just as with public Facebook pages, the data from public groups can also be easily downloaded, using the getGroup function.
group <- getGroup("150048245063649", token=fb_oauth, n=50)
## 25 posts 50 posts
Now let’s turn to our last challenge of the day…