Scraping web data from Facebook

To scrape data from Facebook’s API, we’ll use the Rfacebook package.

library(Rfacebook)
## Loading required package: httr
## Loading required package: rjson
## Loading required package: httpuv
## 
## Attaching package: 'Rfacebook'
## The following object is masked from 'package:methods':
## 
##     getGroup

To get access to the Facebook API, you need an OAuth token. You can get yours by going to the following URL: https://developers.facebook.com/tools/explorer

Once you’re there:
1. Click on “Get Access Token”
2. Copy the long code (“Access Token”) and paste it below, replacing the fake one I wrote:

fb_oauth = 'EAACEdEose0cBAFiPzcXyDLZBVaZCvUR0ZBq3yvKS0IOU01JgYCcYuRKV9xT33pTYZAZCtbdZCEMZAihqlZBGCexN5o7g2ZCgl72cLbJQzrR8ZCFZC8DPaUW5ZCCwHoxsZCRa9IhptCY2P0i3TJwBZC8yN979Mr41gfSF3CeejrRHNxiu6aPuWBcpOfBvp65ASclJ2CjFsZD'

Now try running the following line:

getUsers("me", token=fb_oauth, private_info=TRUE)
##          id          name username first_name middle_name last_name gender
## 1 557698085 Pablo Barberá       NA         NA          NA        NA     NA
##   locale likes picture birthday location hometown relationship_status
## 1     NA    NA      NA       NA       NA       NA                  NA

Does it return your public Facebook information? Yes? Then we’re ready to go. See also ?fbOAuth for information on how to get a long-lived OAuth token.
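For example, here is a minimal sketch of creating and saving a long-lived token with fbOAuth, assuming you have registered your own Facebook app (the app ID and secret below are placeholders):

fb_oauth <- fbOAuth(app_id="123456789012345", app_secret="1a2b3c4d5e6f")
# save the token so it can be reused in later sessions with load("fb_oauth")
save(fb_oauth, file="fb_oauth")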

At the moment, the only information that can be scraped from Facebook is the content of public pages and groups.

The following line downloads the ~20 most recent posts on the Facebook page of Donald Trump:

page <- getPage("DonaldTrump", token=fb_oauth, n=20, reactions=TRUE, api="v2.9") 
## 20 posts

What information is available for each of these posts?

page[1,]
##                               id likes_count      from_id       from_name
## 2 153080620724_10160456890500725       30190 153080620724 Donald J. Trump
##                                                                                                                                                          message
## 2 Join me for a few minutes, live - at H&K Equipment in Coraopolis, Pennsylvania! TAX CUTS, TAX CUTS, TAX CUTS! Together, WE are all MAKING AMERICA GREAT AGAIN!
##               created_time  type
## 2 2018-01-18T20:11:27+0000 video
##                                                             link
## 2 https://www.facebook.com/DonaldTrump/videos/10160456890500725/
##                                                story comments_count
## 2 Donald J. Trump was live — at H&K Equipment, Inc..          10895
##   shares_count love_count haha_count wow_count sad_count angry_count
## 2         3578         NA         NA        NA        NA          NA
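To list all the available fields at once, we can also print the column names:

names(page)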

Which post got the most likes, the most comments, and the most shares?

page[which.max(page$likes_count),]
##                                id likes_count      from_id       from_name
## 10 153080620724_10160465024010725      120389 153080620724 Donald J. Trump
##                                                           message
## 10 AMERICA FIRST! 🇺🇸
##                created_time  type
## 10 2018-01-20T14:38:18+0000 photo
##                                                                                                        link
## 10 https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10160465019575725/?type=3
##    story comments_count shares_count love_count haha_count wow_count
## 10  <NA>          11343        12721         NA         NA        NA
##    sad_count angry_count
## 10        NA          NA
page[which.max(page$comments_count),]
##                                id likes_count      from_id       from_name
## 15 153080620724_10160467189520725       68921 153080620724 Donald J. Trump
##                                                                                                                                        message
## 15 Chuck Schumer and the Democrats continue to put the interests of illegal immigrants over those of Americans!\n\n#MAGA #MakeAmericaSafeAgain
##                created_time  type
## 15 2018-01-20T21:51:14+0000 video
##                                                              link story
## 15 https://www.facebook.com/DonaldTrump/videos/10160467189520725/  <NA>
##    comments_count shares_count love_count haha_count wow_count sad_count
## 15          18281        28905         NA         NA        NA        NA
##    angry_count
## 15          NA
page[which.max(page$shares_count),]
##                                id likes_count      from_id       from_name
## 15 153080620724_10160467189520725       68921 153080620724 Donald J. Trump
##                                                                                                                                        message
## 15 Chuck Schumer and the Democrats continue to put the interests of illegal immigrants over those of Americans!\n\n#MAGA #MakeAmericaSafeAgain
##                created_time  type
## 15 2018-01-20T21:51:14+0000 video
##                                                              link story
## 15 https://www.facebook.com/DonaldTrump/videos/10160467189520725/  <NA>
##    comments_count shares_count love_count haha_count wow_count sad_count
## 15          18281        28905         NA         NA        NA        NA
##    angry_count
## 15          NA

What about other reactions (love, haha, wow, sad, angry)? Note that for these posts the reaction counts were returned as NA, so which.max() yields a zero-length index and each query below returns an empty data frame:

page[which.max(page$love_count),]
##  [1] id             likes_count    from_id        from_name     
##  [5] message        created_time   type           link          
##  [9] story          comments_count shares_count   love_count    
## [13] haha_count     wow_count      sad_count      angry_count   
## <0 rows> (or 0-length row.names)
page[which.max(page$haha_count),]
##  [1] id             likes_count    from_id        from_name     
##  [5] message        created_time   type           link          
##  [9] story          comments_count shares_count   love_count    
## [13] haha_count     wow_count      sad_count      angry_count   
## <0 rows> (or 0-length row.names)
page[which.max(page$wow_count),]
##  [1] id             likes_count    from_id        from_name     
##  [5] message        created_time   type           link          
##  [9] story          comments_count shares_count   love_count    
## [13] haha_count     wow_count      sad_count      angry_count   
## <0 rows> (or 0-length row.names)
page[which.max(page$sad_count),]
##  [1] id             likes_count    from_id        from_name     
##  [5] message        created_time   type           link          
##  [9] story          comments_count shares_count   love_count    
## [13] haha_count     wow_count      sad_count      angry_count   
## <0 rows> (or 0-length row.names)
page[which.max(page$angry_count),]
##  [1] id             likes_count    from_id        from_name     
##  [5] message        created_time   type           link          
##  [9] story          comments_count shares_count   love_count    
## [13] haha_count     wow_count      sad_count      angry_count   
## <0 rows> (or 0-length row.names)
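One way to guard against this, as a sketch: check whether a reaction column is entirely missing before indexing. The helper below (top_by is a hypothetical name, not part of Rfacebook) returns NA in that case:

# hypothetical helper: top post by a column, or NA if the column is all missing
top_by <- function(page, column){
    if (all(is.na(page[[column]]))) return(NA)
    page[which.max(page[[column]]), ]
}
top_by(page, "love_count")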

Let’s do another example, looking at the Facebook page of Political Analysis:

page <- getPage("104544669596569", token=fb_oauth, n=100, reactions=TRUE, api="v2.9") 
## 25 posts 50 posts 75 posts 100 posts
# most popular posts
page[which.max(page$likes_count),]
##                                  id likes_count         from_id
## 25 104544669596569_1603451839705837          56 104544669596569
##             from_name
## 25 Political Analysis
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                              message
## 25 We just published our final virtual issue as editors of Political Analysis.  This virtual issue collects eight papers, from two different areas of methodology, where the editors believe that there has been important progress made since 2010:  measurement and causation.  \n\nThese eight papers are available free access online until early 2018.  Make sure to give them a read, they are excellent examples of the important work begin published in Political Analysis.
##                created_time type
## 25 2017-10-23T15:22:38+0000 link
##                                                                                                                        link
## 25 https://www.cambridge.org/core/journals/political-analysis/special-collections/greatest-hits-2-measurement-and-causation
##    story comments_count shares_count love_count haha_count wow_count
## 25  <NA>              1           20          0          0         0
##    sad_count angry_count
## 25         0           0
page[which.max(page$comments_count),]
##                                 id likes_count         from_id
## 69 104544669596569_388577707859929           1 104544669596569
##             from_name
## 69 Political Analysis
##                                                                                                                 message
## 69 Nominations for the Political Methodology Emerging Scholar Award?  Send to the committee chair jackman@stanford.edu.
##                created_time   type link story comments_count shares_count
## 69 2012-05-23T19:10:31+0000 status <NA>  <NA>              5            0
##    love_count haha_count wow_count sad_count angry_count
## 69          0          0         0         0           0
page[which.max(page$shares_count),]
##                                  id likes_count         from_id
## 20 104544669596569_1422423774475312          43 104544669596569
##             from_name
## 20 Political Analysis
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     message
## 20 We hit an important milestone.  There are now 300 replication studies in the journal's Dataverse.  It's a pretty remarkable accomplishment, and since we've been requiring that authors provide replication materials prior to publication, we have had universal compliance.  While there's still things to improve in our efforts to improve research transparency and replication, it's clear that our current practices work for authors and readers alike.  \n\nWhether you are looking for code and data for a class, or materials that you can use in to learn about a particular methodology for your own research, there's a great deal of useful code and data now in the journal's Dataverse.
##                created_time type
## 20 2017-04-23T00:36:57+0000 link
##                                           link story comments_count
## 20 https://dataverse.harvard.edu/dataverse/pan  <NA>              3
##    shares_count love_count haha_count wow_count sad_count angry_count
## 20           26          1          0         0         0           0

We can also subset by date. For example, imagine we want to get all the posts from early November 2012 on Barack Obama’s Facebook page:

page <- getPage("barackobama", token=fb_oauth, n=100,
    since='2012/11/01', until='2012/11/10')
## 25 posts 29 posts
page[which.max(page$likes_count),]
##      from_id    from_name          message             created_time  type
## 4 6815841748 Barack Obama Four more years. 2012-11-07T04:15:08+0000 photo
##                                                                                                   link
## 4 https://www.facebook.com/barackobama/photos/a.53081056748.66806.6815841748/10151255420886749/?type=3
##                             id story likes_count comments_count
## 4 6815841748_10151255420886749  <NA>     4819658         218575
##   shares_count
## 4       658102

And if we need to, we can also extract the comments on each post.

post_id <- page$id[which.max(page$likes_count)]
post <- getPost(post_id, token=fb_oauth, n.comments=1000, likes=FALSE)

This is how you can view those comments:

comments <- post$comments
head(comments)
##             from_id       from_name   message             created_time
## 1   509226872540260  Jesse Talafili   OBAMA ! 2012-11-07T04:15:16+0000
## 2   485613484893917 Zain Ahmed Turk      yayy 2012-11-07T04:15:17+0000
## 3      675870897427   Gary D Ploski        <3 2012-11-07T04:15:17+0000
## 4   802034289809838     David Furka       YES 2012-11-07T04:15:18+0000
## 5 10201918108506766      Pinky Keys        :X 2012-11-07T04:15:18+0000
## 6 10102278537299904     Zac Bowling Hell yes! 2012-11-07T04:15:19+0000
##   likes_count comments_count                         id
## 1          18              0 10151255420886749_11954305
## 2           3              0 10151255420886749_11954306
## 3           2              0 10151255420886749_11954307
## 4           5              0 10151255420886749_11954309
## 5           1              0 10151255420886749_11954311
## 6           9              0 10151255420886749_11954315

Also, note that users can like comments! Which comment got the most likes?

comments[which.max(comments$likes_count),]
##           from_id      from_name message             created_time
## 1 509226872540260 Jesse Talafili OBAMA ! 2012-11-07T04:15:16+0000
##   likes_count comments_count                         id
## 1          18              0 10151255420886749_11954305

This is how you get nested comments:

page <- getPage("barackobama", token=fb_oauth, n=1)
## 1 posts
post <- getPost(page$id, token=fb_oauth, comments=TRUE, n=100, likes=FALSE)
comment <- getCommentReplies(post$comments$id[1],
                             token=fb_oauth, n=500, likes=TRUE)
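The object returned by getCommentReplies is a list; a quick sketch of how you might inspect it (assuming the first comment actually received replies):

str(comment, max.level=1)     # should contain the original comment plus its replies
head(comment$replies$message) # text of the nested replies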

If we want to scrape an entire page that contains many posts, it is a good idea to embed the function within a loop and collect the data in chunks (here, three months at a time), since the API can occasionally return an error.

# list of dates to sample
dates <- seq(as.Date("2011/01/01"), as.Date("2017/08/01"), by="3 months")
n <- length(dates)-1
df <- list()
# loop over months
for (i in 1:n){
    message(as.character(dates[i]))
    # wrap the call in tryCatch so a single failed request doesn't abort the loop
    df[[i]] <- tryCatch(
        getPage("GameOfThrones", token=fb_oauth, n=1000, since=dates[i],
                until=dates[i+1], verbose=FALSE),
        error = function(e) NULL)
    Sys.sleep(0.5)
}
df <- do.call(rbind, df)
write.csv(df, file="../data/gameofthrones.csv", row.names=FALSE)

And we can then look at the popularity over time:

library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (LSE)
## ## www.tweetscores.com
## ##
## 
## Attaching package: 'tweetscores'
## The following objects are masked from 'package:Rfacebook':
## 
##     getFriends, getUsers
library(magrittr) # provides the %>% pipe used below
library(stringr)
library(reshape2)
df <- read.csv("../data/gameofthrones.csv", stringsAsFactors=FALSE)
# parse date into month
df$month <- df$created_time %>% str_sub(1, 7) %>% paste0("-01") %>% as.Date()
# computing average by month
metrics <- aggregate(cbind(likes_count, comments_count, shares_count) ~ month,
          data=df, FUN=mean)
# reshaping into long format
metrics <- melt(metrics, id.vars="month")
# visualize evolution in metric
library(ggplot2)
library(scales)
ggplot(metrics, aes(x = month, y = value, group = variable)) + 
  geom_line(aes(color = variable)) + 
    scale_x_date(date_breaks = "years", labels = date_format("%Y")) + 
  scale_y_log10("Average count per post", 
    breaks = c(10, 100, 1000, 10000, 100000, 200000), labels=scales::comma) + 
  theme_bw() + theme(axis.title.x = element_blank())

Just as with public Facebook pages, data from public groups can be easily downloaded, using the getGroup function.

group <- getGroup("150048245063649", token=fb_oauth, n=50)
## 25 posts 50 posts
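The group data frame should have the same structure as the page data frames above, so a sketch like this would find, for example, the most-commented post in the group:

group[which.max(group$comments_count), c("from_name", "message", "comments_count")]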

Now let’s turn to our last challenge of the day…