Connecting and authenticating to the database

To illustrate how to query a database using SQL, we will use an example Google BigQuery project I created with two different tables: (1) congress, which contains information about the Facebook Pages of all Members of the U.S. Congress as of late 2017, and (2) posts, which contains information about the Facebook posts shared by these Pages during 2017, as collected from the public Pages API.

The code below shows how we can authenticate and connect to this database:

# setting display options
options(tibble.width = Inf)
# loading packages; you may need to install them first
library(DBI)
library(bigrquery)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# loading authentication token
# (code assumes it's in your working directory)
bq_auth(path="usc-poir613-key.json")
# connecting to database
db <- dbConnect(
  drv=bigquery(),
  project='usc-barbera',
  dataset='facebook'
)

And this is how we find which tables are in the database, and their fields:

dbListTables(db)
## [1] "congress" "posts"
dbListFields(db, name = 'congress')
## [1] "bioid"       "screen_name" "name"        "gender"      "type"       
## [6] "party"
dbListFields(db, name = 'posts')
##  [1] "screen_name"    "id"             "from_name"      "date"          
##  [5] "datetime"       "message"        "type"           "link"          
##  [9] "domain"         "likes_count"    "comments_count" "shares_count"  
## [13] "love_count"     "haha_count"     "wow_count"      "sad_count"     
## [17] "angry_count"

It’s generally good practice to close the connection after you’re done:

dbDisconnect(db)

Querying a SQL database

Now that we know how to connect to the database, let’s see how we can query a table. First we connect using dbConnect and then we type our query with dbGetQuery. Generally, we will write the query clauses in capital letters and the column names in lowercase.

We’ll save data returned from dbGetQuery into a dataframe, to compare the R and SQL code

db <- dbConnect(
  drv=bigquery(),
  project='usc-barbera',
  dataset='facebook',
  quiet=TRUE
)
test_congress <- dbGetQuery(db, 'SELECT * FROM congress')
str(test_congress)
## tibble [519 × 6] (S3: tbl_df/tbl/data.frame)
##  $ bioid      : chr [1:519] "A000370" "B001230" "B001245" "B001270" ...
##  $ screen_name: chr [1:519] "CongresswomanAdams" "senatortammybaldwin" "madeleine.bordallo" "RepKarenBass" ...
##  $ name       : chr [1:519] "Alma S. Adams" "Tammy Baldwin" "Madeleine Z. Bordallo" "Karen Bass" ...
##  $ gender     : chr [1:519] "F" "F" "F" "F" ...
##  $ type       : chr [1:519] "rep" "sen" "rep" "rep" ...
##  $ party      : chr [1:519] "Democrat" "Democrat" "Democrat" "Democrat" ...
test_posts <- dbGetQuery(db, 'SELECT * FROM posts LIMIT 100')
str(test_posts)
## tibble [100 × 17] (S3: tbl_df/tbl/data.frame)
##  $ screen_name   : chr [1:100] "106631626049851" "115356957005" "115356957005" "118514606128" ...
##  $ id            : chr [1:100] "106631626049851_1275484255831243" "115356957005_10154703789472006" "115356957005_10154916550987006" "118514606128_10154157152401129" ...
##  $ from_name     : chr [1:100] "Ted Poe" "Congressman Hank Johnson" "Congressman Hank Johnson" "Sam Graves" ...
##  $ date          : Date[1:100], format: "2017-01-05" "2017-06-20" ...
##  $ datetime      : POSIXct[1:100], format: "2017-01-05 15:38:22" "2017-06-20 15:57:44" ...
##  $ message       : chr [1:100] "\"Being sworn in is always an honor and a privilege, but this year is even more special for me. I am happy to a"| __truncated__ "Want to fully understand President Trump’s conflicts of interest? Want to see how they could impact the choices"| __truncated__ "NA" "Everyone who lives around Mound City knows how important Squaw Creek is to people here. But, in the last days o"| __truncated__ ...
##  $ type          : chr [1:100] "link" "link" "link" "link" ...
##  $ link          : chr [1:100] "http://poe.house.gov/press-releases?ID=443ADE64-03E2-461A-BA56-FFB1489E9402" "https://goo.gl/7CR9Mg" "https://www-projectcasting-com.cdn.ampproject.org/c/www.projectcasting.com/news/georgia-filming/amp" "http://www.maryvilledailyforum.com/news/article_70511708-d273-11e6-ba6f-6730f071a26a.html" ...
##  $ domain        : chr [1:100] "poe.house.gov" "goo.gl" "-projectcasting-com.cdn.ampproject.org" "maryvilledailyforum.com" ...
##  $ likes_count   : int [1:100] 1125 51 167 598 122 109 115 112 119 623 ...
##  $ comments_count: int [1:100] 166 23 3 135 53 24 26 33 51 87 ...
##  $ shares_count  : int [1:100] 170 22 109 218 269 37 39 44 167 200 ...
##  $ love_count    : int [1:100] 244 0 23 7 1 1 2 5 1 81 ...
##  $ haha_count    : int [1:100] 0 0 1 0 0 0 1 1 1 1 ...
##  $ wow_count     : int [1:100] 10 10 10 16 30 12 13 18 18 12 ...
##  $ sad_count     : int [1:100] 0 1 0 6 4 7 0 2 42 1 ...
##  $ angry_count   : int [1:100] 0 9 0 65 126 33 0 11 334 8 ...

RMarkdown actually allows us to write the SQL query directly as a chunk of code:

SELECT * FROM congress LIMIT 5
5 records
bioid screen_name name gender type party
A000370 CongresswomanAdams Alma S. Adams F rep Democrat
B001230 senatortammybaldwin Tammy Baldwin F sen Democrat
B001245 madeleine.bordallo Madeleine Z. Bordallo F rep Democrat
B001270 RepKarenBass Karen Bass F rep Democrat
B001278 congresswomanbonamici Suzanne Bonamici F rep Democrat

This is the list of query clauses that we will use to work with the database table, and the dplyr equivalents:

Let’s start with some examples of SELECT. R equivalents using dplyr are offered below:

# querying just one column
dbGetQuery(db, "SELECT name FROM congress LIMIT 10")
## # A tibble: 10 × 1
##    name                 
##    <chr>                
##  1 Alma S. Adams        
##  2 Tammy Baldwin        
##  3 Madeleine Z. Bordallo
##  4 Karen Bass           
##  5 Suzanne Bonamici     
##  6 Joyce Beatty         
##  7 Julia Brownley       
##  8 Cheri Bustos         
##  9 Nanette Diaz Barragán
## 10 Lisa Blunt Rochester
# multiple columns
dbGetQuery(db, "SELECT name, party FROM congress LIMIT 10")
## # A tibble: 10 × 2
##    name                  party   
##    <chr>                 <chr>   
##  1 Alma S. Adams         Democrat
##  2 Tammy Baldwin         Democrat
##  3 Madeleine Z. Bordallo Democrat
##  4 Karen Bass            Democrat
##  5 Suzanne Bonamici      Democrat
##  6 Joyce Beatty          Democrat
##  7 Julia Brownley        Democrat
##  8 Cheri Bustos          Democrat
##  9 Nanette Diaz Barragán Democrat
## 10 Lisa Blunt Rochester  Democrat
## R: 
test_congress %>%
        select(name, party) %>%
        slice(1:10)
## # A tibble: 10 × 2
##    name                  party   
##    <chr>                 <chr>   
##  1 Alma S. Adams         Democrat
##  2 Tammy Baldwin         Democrat
##  3 Madeleine Z. Bordallo Democrat
##  4 Karen Bass            Democrat
##  5 Suzanne Bonamici      Democrat
##  6 Joyce Beatty          Democrat
##  7 Julia Brownley        Democrat
##  8 Cheri Bustos          Democrat
##  9 Nanette Diaz Barragán Democrat
## 10 Lisa Blunt Rochester  Democrat

SQL can perform transformations on the data directly in SELECT:

# adding functions
dbGetQuery(db, "SELECT from_name, likes_count/comments_count, UPPER(type) FROM posts LIMIT 10")
## # A tibble: 10 × 3
##    from_name                         f0_ f1_  
##    <chr>                           <dbl> <chr>
##  1 Ted Poe                          6.78 LINK 
##  2 Congressman Hank Johnson         2.22 LINK 
##  3 Congressman Hank Johnson        55.7  LINK 
##  4 Sam Graves                       4.43 LINK 
##  5 Walter Jones                     2.30 LINK 
##  6 Walter Jones                     4.54 LINK 
##  7 Walter Jones                     4.42 LINK 
##  8 Walter Jones                     3.39 LINK 
##  9 Congressman John Yarmuth (KY-3)  2.33 LINK 
## 10 Congressman John Yarmuth (KY-3)  7.16 LINK
# adding aliases to the new columns
dbGetQuery(db, "SELECT from_name, likes_count/comments_count AS like_ratio FROM posts LIMIT 10")
## # A tibble: 10 × 2
##    from_name                       like_ratio
##    <chr>                                <dbl>
##  1 Ted Poe                               6.78
##  2 Congressman Hank Johnson              2.22
##  3 Congressman Hank Johnson             55.7 
##  4 Sam Graves                            4.43
##  5 Walter Jones                          2.30
##  6 Walter Jones                          4.54
##  7 Walter Jones                          4.42
##  8 Walter Jones                          3.39
##  9 Congressman John Yarmuth (KY-3)       2.33
## 10 Congressman John Yarmuth (KY-3)       7.16
# it's often a good idea to write a query in multiple lines so that it's easier to read it:
dbGetQuery(db, "SELECT from_name,
           likes_count/comments_count AS like_ratio
           FROM posts
           LIMIT 10")
## # A tibble: 10 × 2
##    from_name                       like_ratio
##    <chr>                                <dbl>
##  1 Ted Poe                               6.78
##  2 Congressman Hank Johnson              2.22
##  3 Congressman Hank Johnson             55.7 
##  4 Sam Graves                            4.43
##  5 Walter Jones                          2.30
##  6 Walter Jones                          4.54
##  7 Walter Jones                          4.42
##  8 Walter Jones                          3.39
##  9 Congressman John Yarmuth (KY-3)       2.33
## 10 Congressman John Yarmuth (KY-3)       7.16
## R: test_posts %>%
#      select(from_name, likes_count, comments_count, type) %>%
#      mutate(like_ratio = likes_count/comments_count,
#             type = toupper(type) ) %>%
#      slice(1:10)

# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT from_name
           FROM posts
           LIMIT 10")
## # A tibble: 10 × 1
##    from_name                      
##    <chr>                          
##  1 Ted Poe                        
##  2 Congressman Hank Johnson       
##  3 Sam Graves                     
##  4 Walter Jones                   
##  5 Congressman John Yarmuth (KY-3)
##  6 Congressman Joe Kennedy III    
##  7 Congressman Joaquin Castro     
##  8 Louie Gohmert                  
##  9 Congressman Ken Calvert        
## 10 Nydia Velazquez
## R: test_posts %>%
#      select(from_name)%>%
#      distinct() %>%
#      slice(1:10)

Filtering data using WHERE

Now some examples of how we can use WHERE to subset the database based on values of specific columns:

# selecting based on values of a column
dbGetQuery(db, "SELECT name, party
           FROM congress
           WHERE party = 'Republican'
           LIMIT 10")
## # A tibble: 10 × 2
##    name                 party     
##    <chr>                <chr>     
##  1 Marsha Blackburn     Republican
##  2 Diane Black          Republican
##  3 Susan W. Brooks      Republican
##  4 Susan M. Collins     Republican
##  5 Shelley Moore Capito Republican
##  6 Barbara Comstock     Republican
##  7 Liz Cheney           Republican
##  8 Joni Ernst           Republican
##  9 Virginia Foxx        Republican
## 10 Deb Fischer          Republican
## R: test_congress %>%
#      select(name, party)%>%
#      filter(party== 'Republican') %>%
#      slice(1:10)

# AND operator
dbGetQuery(db, "SELECT from_name, type, likes_count
           FROM posts
           WHERE type != 'photo'
              AND likes_count > 500
           LIMIT 10")
## # A tibble: 10 × 3
##    from_name                       type  likes_count
##    <chr>                           <chr>       <int>
##  1 Ted Poe                         link         1125
##  2 Sam Graves                      link          598
##  3 Congressman John Yarmuth (KY-3) link          623
##  4 Congressman John Yarmuth (KY-3) link         1173
##  5 Congressman John Yarmuth (KY-3) link         1081
##  6 Congressman John Yarmuth (KY-3) link          966
##  7 Congressman John Yarmuth (KY-3) link          882
##  8 Congressman John Yarmuth (KY-3) link          871
##  9 Congressman Joe Kennedy III     link         2683
## 10 Congressman Joe Kennedy III     link         2920
## R: test_posts %>%
#      select(from_name, type, likes_count)%>%
#      filter(type != 'photo' &
#             likes_count > 500) %>%
#      slice(1:10)

# OR operator
dbGetQuery(db, "SELECT from_name, type, comments_count
           FROM posts
           WHERE comments_count>100 
            AND (type = 'photo' OR type = 'video')
           LIMIT 10")
## # A tibble: 10 × 3
##    from_name                       type  comments_count
##    <chr>                           <chr>          <int>
##  1 Congressman John Yarmuth (KY-3) photo            134
##  2 Congressman John Yarmuth (KY-3) photo            152
##  3 Congressman John Yarmuth (KY-3) photo           1150
##  4 Congressman John Yarmuth (KY-3) photo            255
##  5 Congressman Joe Kennedy III     photo            160
##  6 Congressman Joe Kennedy III     photo            300
##  7 Congressman Joe Kennedy III     photo            119
##  8 Congressman Joe Kennedy III     photo            191
##  9 Congressman Joe Kennedy III     photo            203
## 10 Congressman Joe Kennedy III     photo            379
# membership, IN
dbGetQuery(db, "SELECT from_name, type, comments_count
           FROM posts
           WHERE type IN ('video', 'event')
           LIMIT 10")
## # A tibble: 10 × 3
##    from_name                            type  comments_count
##    <chr>                                <chr>          <int>
##  1 House Democratic Leader Nancy Pelosi event            437
##  2 U.S. Senator Kamala Harris           event            284
##  3 Ted Poe                              video             10
##  4 Gregorio Kilili Camacho Sablan       video             59
##  5 Congressman John Yarmuth (KY-3)      video           1347
##  6 Congressman John Yarmuth (KY-3)      video            607
##  7 Congressman John Yarmuth (KY-3)      video            748
##  8 Congressman John Yarmuth (KY-3)      video            268
##  9 Congressman John Yarmuth (KY-3)      video            127
## 10 Congressman John Yarmuth (KY-3)      video             63
## R: test_posts %>%
#      select(from_name, type, comments_count)%>%
#      filter(type %in% c('video', 'event')) %>%
#      slice(1:10)

Computing aggregated metrics with GROUP BY

Very useful for us – GROUP BY allows us to aggregate results based on values of a specific variables. We will come back to this later.

dbGetQuery(db,
  "SELECT type, COUNT(*) AS type_count
  FROM posts
  GROUP BY type")
## # A tibble: 7 × 2
##   type   type_count
##   <chr>       <int>
## 1 link        60847
## 2 event        1333
## 3 photo       67067
## 4 video       29639
## 5 status      14636
## 6 note           23
## 7 music          21
## R: test_posts %>%
#       group_by(type)%>%
#       summarize(type_count= n())

Sorting output tables with ORDER BY

And we can then order the results using ORDER BY. The default is alphabetical order for the first column, but we can change that:

# sort by type_count
dbGetQuery(db,
  "SELECT type, COUNT(type) AS type_count
  FROM posts
  GROUP BY type
  ORDER BY type_count")
## # A tibble: 7 × 2
##   type   type_count
##   <chr>       <int>
## 1 music          21
## 2 note           23
## 3 event        1333
## 4 status      14636
## 5 video       29639
## 6 link        60847
## 7 photo       67067
## R: test_posts %>%
#       group_by(type)%>%
#       summarize(type_count= n())%>%
#       arrange(type_count)

# now in descending order
dbGetQuery(db,
  "SELECT type, COUNT(type) AS type_count
  FROM posts
  GROUP BY type
  ORDER BY type_count DESC")
## # A tibble: 7 × 2
##   type   type_count
##   <chr>       <int>
## 1 photo       67067
## 2 link        60847
## 3 video       29639
## 4 status      14636
## 5 event        1333
## 6 note           23
## 7 music          21
# which was the most liked post?
dbGetQuery(db,
  "SELECT from_name, message, likes_count, datetime
  FROM posts
  ORDER BY likes_count DESC
  LIMIT 1")
## # A tibble: 1 × 4
##   from_name                    
##   <chr>                        
## 1 U.S. Senator Elizabeth Warren
##   message                                                                       
##   <chr>                                                                         
## 1 Tonight we fight for American values at airports all across this country. I'm…
##   likes_count datetime           
##         <int> <dttm>             
## 1      421064 2017-01-29 01:59:12
## R: test_posts %>%
#       select(from_name, message, likes_count, datetime) %>%
#       arrange(desc(likes_count)) %>%
#       slice(1)

# You can also specify the column number instead of the name
dbGetQuery(db,
  "SELECT from_name, message, likes_count, datetime
  FROM posts
  ORDER BY 3 DESC
  LIMIT 1")
## # A tibble: 1 × 4
##   from_name                    
##   <chr>                        
## 1 U.S. Senator Elizabeth Warren
##   message                                                                       
##   <chr>                                                                         
## 1 Tonight we fight for American values at airports all across this country. I'm…
##   likes_count datetime           
##         <int> <dttm>             
## 1      421064 2017-01-29 01:59:12
# What was the photo posted by a Member of Congress that received the most "angry" reactions?
dbGetQuery(db, 
           "SELECT from_name, message, angry_count, type
           FROM posts
           WHERE type = 'photo'
           ORDER BY angry_count DESC
           LIMIT 1")
## # A tibble: 1 × 4
##   from_name  
##   <chr>      
## 1 John Cornyn
##   message                                                                       
##   <chr>                                                                         
## 1 This week, I’ll be voting to confirm Betsy DeVos as Secretary of Education. P…
##   angry_count type 
##         <int> <chr>
## 1       30552 photo