Now’s where we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN. We just select columns from different tables and then specify what key should be used to match them using ON.
The example belows shows an inner join, but there are other types (left vs right outer joins, cross joins…).
options(tibble.width = Inf)
library(DBI)
library(bigrquery)
bq_auth(path="usc-poir613-key.json")
db <- dbConnect(
drv=bigquery(),
project='usc-barbera',
dataset='facebook',
quiet=TRUE
)
# taking samples to compare with R
test_congress <- dbGetQuery(db, 'SELECT * FROM congress')
test_posts <- dbGetQuery(db, 'SELECT * FROM posts LIMIT 1000')
dbGetQuery(db,
"SELECT p.id, p.likes_count, c.party, p.date
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
LIMIT 10")
## # A tibble: 10 × 4
## id likes_count party date
## <chr> <int> <chr> <date>
## 1 106631626049851_1275484255831243 1125 Republican 2017-01-05
## 2 115356957005_10154703789472006 51 Democrat 2017-06-20
## 3 115356957005_10154916550987006 167 Democrat 2017-08-30
## 4 118514606128_10154157152401129 598 Republican 2017-01-04
## 5 15083070102_10158053631470103 122 Republican 2017-01-19
## 6 15083070102_10158076655530103 109 Republican 2017-01-24
## 7 15083070102_10158952404760103 115 Republican 2017-07-12
## 8 15083070102_10159708444985103 112 Republican 2017-12-28
## 9 214258646163_10155941688936164 119 Democrat 2017-02-03
## 10 214258646163_10156013088956164 623 Democrat 2017-02-22
Once we know how to merge tables, we can use columns from both tables. For example, let’s say we want to find the most shared post by a Democrat. Here’s how we would do it:
dbGetQuery(db,
"SELECT p.from_name, p.message, p.shares_count, c.party
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
WHERE c.party = 'Democrat'
ORDER BY p.shares_count DESC
LIMIT 1")
## # A tibble: 1 × 4
## from_name
## <chr>
## 1 Congressman Mark Takano
## message
## <chr>
## 1 "This remarkable line of questioning from Congresswoman Suzan DelBene demonst…
## shares_count party
## <int> <chr>
## 1 341039 Democrat
## R:
#test_posts %>%
# inner_join(test_congress, by='screen_name') %>%
# filter(party=='Democrat') %>%
# arrange(-shares_count) %>%
# select(from_name, message, shares_count, party) %>%
# slice(1)
SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of queries:
# use COUNT with * to return the total number of rows
dbGetQuery(db, 'SELECT COUNT(*) AS row_count FROM posts')
## # A tibble: 1 × 1
## row_count
## <int>
## 1 173566
dbGetQuery(db, 'SELECT COUNT(*) AS row_count FROM congress')
## # A tibble: 1 × 1
## row_count
## <int>
## 1 519
Combining everything we’ve learned, we can for example count the total number of posts by party, or by party and gender.
dbGetQuery(db,
"SELECT c.party, COUNT(*) AS total_posts
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
GROUP BY c.party")
## # A tibble: 3 × 2
## party total_posts
## <chr> <int>
## 1 Democrat 84226
## 2 Republican 88447
## 3 Independent 893
## R:
#test_posts %>%
# inner_join(test_congress, by='screen_name') %>%
# group_by(party) %>%
# summarise(total_posts=n())
dbGetQuery(db,
"SELECT c.gender, COUNT(*) AS total_posts
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
GROUP BY c.gender")
## # A tibble: 2 × 2
## gender total_posts
## <chr> <int>
## 1 M 133785
## 2 F 39781
## R:
#test_posts %>%
# inner_join(test_congress, by='screen_name') %>%
# group_by(gender) %>%
# summarise(total_posts=n())
Here’s another example. Let’s say we want to look at what domains Republicans and Democrats share more frequently on their Facebook pages. How would we do this?
dbGetQuery(db,
"SELECT
ANY_VALUE(c.party) AS party,
p.domain,
COUNT(*) AS domain_count
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
WHERE c.party = 'Democrat'
AND p.domain != 'NA'
GROUP BY domain
ORDER BY domain_count DESC
LIMIT 10")
## # A tibble: 10 × 3
## party domain domain_count
## <chr> <chr> <int>
## 1 Democrat nytimes.com 2130
## 2 Democrat bit.ly 2033
## 3 Democrat washingtonpost.com 1922
## 4 Democrat thehill.com 886
## 5 Democrat cnn.com 635
## 6 Democrat politico.com 583
## 7 Democrat t.co 445
## 8 Democrat huffingtonpost.com 375
## 9 Democrat vox.com 350
## 10 Democrat medium.com 331
### R:
#test_posts %>%
# filter(!is.na(domain)) %>%
# inner_join(test_congress, by='screen_name') %>%
# filter(!is.na(domain) & party == 'Democrat') %>%
# group_by(domain) %>%
# summarise(domain_count = n()) %>%
# arrange(-domain_count) %>%
# slice(10)
dbGetQuery(db,
"SELECT
ANY_VALUE(c.party) AS party,
p.domain,
COUNT(*) AS domain_count
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
WHERE c.party = 'Republican'
AND p.domain != 'NA'
GROUP BY domain
ORDER BY domain_count DESC
LIMIT 10")
## # A tibble: 10 × 3
## party domain domain_count
## <chr> <chr> <int>
## 1 Republican bit.ly 1628
## 2 Republican foxnews.com 707
## 3 Republican thehill.com 662
## 4 Republican washingtonexaminer.com 464
## 5 Republican washingtonpost.com 424
## 6 Republican t.co 405
## 7 Republican twitter.com 375
## 8 Republican goo.gl 293
## 9 Republican wsj.com 248
## 10 Republican fairandsimple.gop 243
There are other functions besides just COUNT. We can also compute averages, the maximum and minimum of a column, the sum…
# average number of posts by party
dbGetQuery(db,
"SELECT
c.party,
AVG(p.likes_count) AS avg_likes_count,
COUNT(*) AS post_count
FROM posts p
JOIN congress c
ON c.screen_name = p.screen_name
GROUP BY c.party")
## # A tibble: 3 × 3
## party avg_likes_count post_count
## <chr> <dbl> <int>
## 1 Republican 171. 88447
## 2 Democrat 401. 84226
## 3 Independent 17207. 893
# do women receive more comments than men on average?
dbGetQuery(db,
"SELECT c.gender AS gender,
AVG(p.comments_count) AS avg_comment_count
FROM posts p
JOIN congress c
ON p.screen_name = c.screen_name
GROUP BY c.gender")
## # A tibble: 2 × 2
## gender avg_comment_count
## <chr> <dbl>
## 1 M 76.4
## 2 F 75.8
# what is the average number of reactions by party and reaction type?
dbGetQuery(db,
"SELECT
c.party,
AVG(p.love_count) AS avg_love_count,
AVG(p.haha_count) AS avg_haha_count,
AVG(p.wow_count) AS avg_wow_count,
AVG(p.sad_count) AS avg_sad_count,
AVG(p.angry_count) AS avg_angry_count
FROM posts p
JOIN congress c
ON p.screen_name = c.screen_name
GROUP BY c.party")
## # A tibble: 3 × 6
## party avg_love_count avg_haha_count avg_wow_count avg_sad_count
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Republican 14.3 2.76 1.32 4.40
## 2 Democrat 45.1 3.02 5.01 18.5
## 3 Independent 2307. 168. 209. 706.
## avg_angry_count
## <dbl>
## 1 21.9
## 2 45.4
## 3 1700.
For complex queries, sometimes it’s necessary to write Common Table Expressions, which are tables that exist only in the context of the query.
For example, assume that we want to compute the average ratio of comments to likes for legislators of each party. We need to break this into two steps: (1) compute legislator-level ratios; then (2) aggregate at the party level. This is how we’d do it:
dbGetQuery(db,
"WITH ratios AS (
SELECT
screen_name,
SUM(comments_count)/SUM(likes_count) AS ratio
FROM posts
GROUP BY screen_name
)
SELECT
c.party, AVG(r.ratio) AS avg_ratio
FROM congress c
JOIN ratios r
ON c.screen_name = r.screen_name
GROUP BY c.party
")
## # A tibble: 3 × 2
## party avg_ratio
## <chr> <dbl>
## 1 Republican 0.569
## 2 Democrat 0.210
## 3 Independent 0.116
# equivalent to: (but much harder to read)
dbGetQuery(db,
"SELECT
c.party, AVG(r.ratio) AS avg_ratio
FROM congress c
JOIN (
SELECT
screen_name,
SUM(comments_count)/SUM(likes_count) AS ratio
FROM posts
GROUP BY screen_name
) r
ON c.screen_name = r.screen_name
GROUP BY c.party
")
## # A tibble: 3 × 2
## party avg_ratio
## <chr> <dbl>
## 1 Democrat 0.210
## 2 Republican 0.569
## 3 Independent 0.116