Querying multiple SQL tables

Now’s where we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN. We just select columns from different tables and then specify what key should be used to match them using ON.

The example belows shows an inner join, but there are other types (left vs right outer joins, cross joins…).

options(tibble.width = Inf)
library(DBI)
library(bigrquery)
bq_auth(path="usc-poir613-key.json")
db <- dbConnect(
  drv=bigquery(),
  project='usc-barbera',
  dataset='facebook',
  quiet=TRUE
)
# taking samples to compare with R
test_congress <- dbGetQuery(db, 'SELECT * FROM congress')
test_posts <- dbGetQuery(db, 'SELECT * FROM posts LIMIT 1000')

dbGetQuery(db, 
  "SELECT p.id, p.likes_count, c.party, p.date
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  LIMIT 10")
## # A tibble: 10 × 4
##    id                               likes_count party      date      
##    <chr>                                  <int> <chr>      <date>    
##  1 106631626049851_1275484255831243        1125 Republican 2017-01-05
##  2 115356957005_10154703789472006            51 Democrat   2017-06-20
##  3 115356957005_10154916550987006           167 Democrat   2017-08-30
##  4 118514606128_10154157152401129           598 Republican 2017-01-04
##  5 15083070102_10158053631470103            122 Republican 2017-01-19
##  6 15083070102_10158076655530103            109 Republican 2017-01-24
##  7 15083070102_10158952404760103            115 Republican 2017-07-12
##  8 15083070102_10159708444985103            112 Republican 2017-12-28
##  9 214258646163_10155941688936164           119 Democrat   2017-02-03
## 10 214258646163_10156013088956164           623 Democrat   2017-02-22

Once we know how to merge tables, we can use columns from both tables. For example, let’s say we want to find the most shared post by a Democrat. Here’s how we would do it:

dbGetQuery(db, 
  "SELECT p.from_name, p.message, p.shares_count, c.party 
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  WHERE c.party = 'Democrat'
  ORDER BY p.shares_count DESC
  LIMIT 1")
## # A tibble: 1 × 4
##   from_name              
##   <chr>                  
## 1 Congressman Mark Takano
##   message                                                                       
##   <chr>                                                                         
## 1 "This remarkable line of questioning from Congresswoman Suzan DelBene demonst…
##   shares_count party   
##          <int> <chr>   
## 1       341039 Democrat
## R:
#test_posts %>%
#  inner_join(test_congress, by='screen_name') %>% 
#  filter(party=='Democrat') %>% 
#  arrange(-shares_count) %>% 
#  select(from_name, message, shares_count, party) %>% 
#  slice(1)

Grouping and aggregating

SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of queries:

# use COUNT with * to return the total number of rows
dbGetQuery(db, 'SELECT COUNT(*) AS row_count FROM posts')
## # A tibble: 1 × 1
##   row_count
##       <int>
## 1    173566
dbGetQuery(db, 'SELECT COUNT(*) AS row_count FROM congress')
## # A tibble: 1 × 1
##   row_count
##       <int>
## 1       519

Combining everything we’ve learned, we can for example count the total number of posts by party, or by party and gender.

dbGetQuery(db, 
  "SELECT c.party, COUNT(*) AS total_posts
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  GROUP BY c.party")
## # A tibble: 3 × 2
##   party       total_posts
##   <chr>             <int>
## 1 Democrat          84226
## 2 Republican        88447
## 3 Independent         893
## R:
#test_posts %>%
#  inner_join(test_congress, by='screen_name') %>% 
#  group_by(party) %>% 
#  summarise(total_posts=n())

dbGetQuery(db, 
  "SELECT c.gender, COUNT(*) AS total_posts
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  GROUP BY c.gender")
## # A tibble: 2 × 2
##   gender total_posts
##   <chr>        <int>
## 1 M           133785
## 2 F            39781
## R:
#test_posts %>%
#  inner_join(test_congress, by='screen_name') %>% 
#  group_by(gender) %>% 
#  summarise(total_posts=n())

Here’s another example. Let’s say we want to look at what domains Republicans and Democrats share more frequently on their Facebook pages. How would we do this?

dbGetQuery(db, 
  "SELECT 
    ANY_VALUE(c.party) AS party,
    p.domain,
    COUNT(*) AS domain_count
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  WHERE c.party = 'Democrat'
    AND p.domain != 'NA'
  GROUP BY domain
  ORDER BY domain_count DESC
  LIMIT 10")
## # A tibble: 10 × 3
##    party    domain             domain_count
##    <chr>    <chr>                     <int>
##  1 Democrat nytimes.com                2130
##  2 Democrat bit.ly                     2033
##  3 Democrat washingtonpost.com         1922
##  4 Democrat thehill.com                 886
##  5 Democrat cnn.com                     635
##  6 Democrat politico.com                583
##  7 Democrat t.co                        445
##  8 Democrat huffingtonpost.com          375
##  9 Democrat vox.com                     350
## 10 Democrat medium.com                  331
### R:
#test_posts %>%
#  filter(!is.na(domain)) %>% 
#  inner_join(test_congress, by='screen_name') %>% 
#  filter(!is.na(domain) & party == 'Democrat') %>% 
#  group_by(domain) %>% 
#  summarise(domain_count = n()) %>% 
#  arrange(-domain_count) %>% 
#  slice(10)
  
dbGetQuery(db, 
  "SELECT
    ANY_VALUE(c.party) AS party,
    p.domain,
    COUNT(*) AS domain_count
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  WHERE c.party = 'Republican'
    AND p.domain != 'NA'
  GROUP BY domain
  ORDER BY domain_count DESC
  LIMIT 10")
## # A tibble: 10 × 3
##    party      domain                 domain_count
##    <chr>      <chr>                         <int>
##  1 Republican bit.ly                         1628
##  2 Republican foxnews.com                     707
##  3 Republican thehill.com                     662
##  4 Republican washingtonexaminer.com          464
##  5 Republican washingtonpost.com              424
##  6 Republican t.co                            405
##  7 Republican twitter.com                     375
##  8 Republican goo.gl                          293
##  9 Republican wsj.com                         248
## 10 Republican fairandsimple.gop               243

There are other functions besides just COUNT. We can also compute averages, the maximum and minimum of a column, the sum…

# average number of posts by party
dbGetQuery(db, 
  "SELECT
    c.party,
    AVG(p.likes_count) AS avg_likes_count,
    COUNT(*) AS post_count
  FROM posts p
  JOIN congress c
    ON c.screen_name = p.screen_name
  GROUP BY c.party")
## # A tibble: 3 × 3
##   party       avg_likes_count post_count
##   <chr>                 <dbl>      <int>
## 1 Republican             171.      88447
## 2 Democrat               401.      84226
## 3 Independent          17207.        893
# do women receive more comments than men on average?
dbGetQuery(db,
           "SELECT c.gender AS gender,
              AVG(p.comments_count) AS avg_comment_count 
           FROM posts p
           JOIN congress c
            ON p.screen_name = c.screen_name
           GROUP BY c.gender")
## # A tibble: 2 × 2
##   gender avg_comment_count
##   <chr>              <dbl>
## 1 M                   76.4
## 2 F                   75.8
# what is the average number of reactions by party and reaction type?
dbGetQuery(db, 
           "SELECT 
              c.party, 
              AVG(p.love_count) AS avg_love_count, 
              AVG(p.haha_count) AS avg_haha_count, 
              AVG(p.wow_count) AS avg_wow_count,
              AVG(p.sad_count)  AS avg_sad_count,
              AVG(p.angry_count) AS avg_angry_count
           FROM posts p
           JOIN congress c
            ON p.screen_name = c.screen_name
           GROUP BY c.party")
## # A tibble: 3 × 6
##   party       avg_love_count avg_haha_count avg_wow_count avg_sad_count
##   <chr>                <dbl>          <dbl>         <dbl>         <dbl>
## 1 Republican            14.3           2.76          1.32          4.40
## 2 Democrat              45.1           3.02          5.01         18.5 
## 3 Independent         2307.          168.          209.          706.  
##   avg_angry_count
##             <dbl>
## 1            21.9
## 2            45.4
## 3          1700.

CTEs

For complex queries, sometimes it’s necessary to write Common Table Expressions, which are tables that exist only in the context of the query.

For example, assume that we want to compute the average ratio of comments to likes for legislators of each party. We need to break this into two steps: (1) compute legislator-level ratios; then (2) aggregate at the party level. This is how we’d do it:

dbGetQuery(db, 
  "WITH ratios AS (
    SELECT
      screen_name,
      SUM(comments_count)/SUM(likes_count) AS ratio
    FROM posts
    GROUP BY screen_name
  )
  SELECT
    c.party, AVG(r.ratio) AS avg_ratio
  FROM congress c
  JOIN ratios r
    ON c.screen_name = r.screen_name
  GROUP BY c.party
  ")
## # A tibble: 3 × 2
##   party       avg_ratio
##   <chr>           <dbl>
## 1 Republican      0.569
## 2 Democrat        0.210
## 3 Independent     0.116
# equivalent to: (but much harder to read)
dbGetQuery(db, 
  "SELECT
    c.party, AVG(r.ratio) AS avg_ratio
  FROM congress c
  JOIN (
    SELECT
      screen_name,
      SUM(comments_count)/SUM(likes_count) AS ratio
    FROM posts
    GROUP BY screen_name
  ) r
    ON c.screen_name = r.screen_name
  GROUP BY c.party
  ")
## # A tibble: 3 × 2
##   party       avg_ratio
##   <chr>           <dbl>
## 1 Democrat        0.210
## 2 Republican      0.569
## 3 Independent     0.116