Now’s where we get to see the magic of SQL. It makes it really easy to combine multiple SQL tables using JOIN. We just select columns from different tables and then specify what key should be used to match them using ON.
The example belows shows an inner join, but there are other types (left vs right outer joins, cross joins…) that we will not cover in this workshop.
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "facebook-db.sqlite")
dbGetQuery(db,
"SELECT posts.likes_count, congress.party, posts.date
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
LIMIT 10")
## likes_count party date
## 1 201 Republican 2017-01-03
## 2 220 Republican 2017-01-04
## 3 116 Republican 2017-01-10
## 4 235 Republican 2017-01-10
## 5 571 Republican 2017-01-10
## 6 122 Republican 2017-01-13
## 7 40 Republican 2017-01-17
## 8 43 Republican 2017-01-17
## 9 46 Republican 2017-01-18
## 10 101 Republican 2017-01-20
Once we know how to merge tables, we can use columns from both tables. For example, let’s say we want to find the most shared post by a Democrat, or by a Republican. Here’s how we would do it:
dbGetQuery(db,
"SELECT posts.from_name, posts.message, posts.shares_count, congress.party
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
WHERE party = 'Democrat'
ORDER BY shares_count DESC
LIMIT 1")
## from_name
## 1 Congressman Mark Takano
## message
## 1 This remarkable line of questioning from Congresswoman Suzan DelBene demonstrates just a few of the ways that the GOP tax plan treats corporations better than people. Under the Republican plan, corporations are still allowed to deduct state and local taxes. Workers are not. Corporations are still allowed to deduct business expenses. Teachers are not. Corporations are still allowed to deduct more than $10,000 in property taxes. Homeowners are not. Corporations are still allowed to deduct moving expenses. Families are not. And this is on top of a $1.5 trillion corporate tax cut. Let's be clear, this is not a "middle-class tax cut." Working families get the crumbs and the super-wealthy get everything else.
## shares_count party
## 1 341039 Democrat
dbGetQuery(db,
"SELECT posts.from_name, posts.message, posts.shares_count, congress.party
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
WHERE party = 'Republican'
ORDER BY shares_count DESC
LIMIT 1")
## from_name
## 1 John McCain
## message
## 1 Our government has a responsibility to defend our borders, but we must do so in a way that makes us safer and upholds all that is decent and exceptional about our nation. It is clear from the confusion at our airports across the nation that President Trump’s executive order was not properly vetted. We are particularly concerned by reports that this order went into effect with little to no consultation with the Departments of State, Defense, Justice, and Homeland Security. Such a hasty process risks harmful results. We should not stop green-card holders from returning to the country they call home. We should not stop those who have served as interpreters for our military and diplomats from seeking refuge in the country they risked their lives to help. And we should not turn our backs on those refugees who have been shown through extensive vetting to pose no demonstrable threat to our nation, and who have suffered unspeakable horrors, most of them women and children. Ultimately, we fear this executive order will become a self-inflicted wound in the fight against terrorism. At this very moment, American troops are fighting side-by-side with our Iraqi partners to defeat ISIL. But this executive order bans Iraqi pilots from coming to military bases in Arizona to fight our common enemies. Our most important allies in the fight against ISIL are the vast majority of Muslims who reject its apocalyptic ideology of hatred. This executive order sends a signal, intended or not, that America does not want Muslims coming into our country. That is why we fear this executive order may do more to help terrorist recruitment than improve our security.
## shares_count party
## 1 100376 Republican
SQL makes it really easy to compute aggregate statistics of one column based on values of a different column, with a series of built-in functions (COUNT, AVG, MAX, MIN…). Let’s now see a few examples of this type of queries:
# use COUNT with * to return the total number of rows
dbGetQuery(db, 'SELECT COUNT(*) FROM posts')
## COUNT(*)
## 1 175411
dbGetQuery(db, 'SELECT COUNT(*) FROM congress')
## COUNT(*)
## 1 518
Combining everything we’ve learned, we can for example count the total number of posts by party, or by party and gender.
dbGetQuery(db,
"SELECT congress.party, COUNT(*) AS total_posts
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
GROUP BY congress.party")
## party total_posts
## 1 Democrat 85129
## 2 Independent 893
## 3 Republican 89389
dbGetQuery(db,
"SELECT congress.party, congress.gender, COUNT(*) AS total_posts
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
GROUP BY congress.party, congress.gender")
## party gender total_posts
## 1 Democrat F 30454
## 2 Democrat M 54675
## 3 Independent M 893
## 4 Republican F 9893
## 5 Republican M 79496
Here’s another example. Let’s say we want to look at what domains Republicans and Democrats share more frequently on their Facebook pages. How would we do this?
dbGetQuery(db,
"SELECT congress.party, domain, COUNT(*) AS domain_count
FROM posts JOIN Congress
ON congress.screen_name = posts.screen_name
WHERE congress.party = 'Democrat'
GROUP BY domain
ORDER BY domain_count DESC
LIMIT 20")
## party domain domain_count
## 1 Democrat <NA> 53929
## 2 Democrat nytimes.com 2159
## 3 Democrat bit.ly 2043
## 4 Democrat washingtonpost.com 1945
## 5 Democrat thehill.com 899
## 6 Democrat cnn.com 647
## 7 Democrat politico.com 594
## 8 Democrat t.co 445
## 9 Democrat huffingtonpost.com 382
## 10 Democrat vox.com 364
## 11 Democrat medium.com 340
## 12 Democrat latimes.com 324
## 13 Democrat npr.org 308
## 14 Democrat goo.gl 295
## 15 Democrat nbcnews.com 278
## 16 Democrat twitter.com 276
## 17 Democrat c-span.org 268
## 18 Democrat apo.st 261
## 19 Democrat usatoday.com 260
## 20 Democrat nyti.ms 256
dbGetQuery(db,
"SELECT congress.party, domain, COUNT(*) AS domain_count
FROM posts JOIN Congress
ON congress.screen_name = posts.screen_name
WHERE congress.party = 'Republican'
GROUP BY domain
ORDER BY domain_count DESC
LIMIT 20")
## party domain domain_count
## 1 Republican <NA> 59195
## 2 Republican bit.ly 1723
## 3 Republican foxnews.com 708
## 4 Republican thehill.com 664
## 5 Republican washingtonexaminer.com 464
## 6 Republican washingtonpost.com 426
## 7 Republican t.co 405
## 8 Republican twitter.com 376
## 9 Republican goo.gl 293
## 10 Republican wsj.com 249
## 11 Republican fairandsimple.gop 243
## 12 Republican usatoday.com 240
## 13 Republican medium.com 205
## 14 Republican cnn.com 199
## 15 Republican c-span.org 195
## 16 Republican facebook.com 185
## 17 Republican iqconnect.lmhostediq.com 180
## 18 Republican politico.com 180
## 19 Republican cotton.senate.gov 163
## 20 Republican washingtontimes.com 147
There are other functions besides just COUNT. We can also compute averages, the maximum and minimum of a column, the sum… We can also functions like SUBSTR to extract parts of a character string.
# average number of posts by party
dbGetQuery(db,
"SELECT congress.party, AVG(posts.likes_count), COUNT(*)
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
GROUP BY congress.party")
## party AVG(posts.likes_count) COUNT(*)
## 1 Democrat 404.5807 85129
## 2 Independent 17207.3303 893
## 3 Republican 171.2773 89389
# we can also compute it in a different way...
dbGetQuery(db,
"SELECT congress.party, SUM(posts.likes_count)/COUNT(*) AS average
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
GROUP BY congress.party")
## party average
## 1 Democrat 404
## 2 Independent 17207
## 3 Republican 171
# most popular post by party
dbGetQuery(db,
"SELECT from_name, message, congress.party, MAX(posts.likes_count), COUNT(*)
FROM posts JOIN congress
ON congress.screen_name = posts.screen_name
GROUP BY congress.party")
## from_name
## 1 U.S. Senator Elizabeth Warren
## 2 U.S. Senator Bernie Sanders
## 3 John McCain
## message
## 1 Tonight we fight for American values at airports all across this country. I'm at Logan Airport tonight to say: our diversity makes our country strong. We will not allow a Muslim ban in the United States of America. #NoBanNoWall #nomuslimban
## 2 Today is going to be a tough day for millions of Americans, including myself. But we cannot throw up our hands in despair. We have to fight back as effectively and as vigorously as we can. We are not giving up.
## 3 Our government has a responsibility to defend our borders, but we must do so in a way that makes us safer and upholds all that is decent and exceptional about our nation. It is clear from the confusion at our airports across the nation that President Trump’s executive order was not properly vetted. We are particularly concerned by reports that this order went into effect with little to no consultation with the Departments of State, Defense, Justice, and Homeland Security. Such a hasty process risks harmful results. We should not stop green-card holders from returning to the country they call home. We should not stop those who have served as interpreters for our military and diplomats from seeking refuge in the country they risked their lives to help. And we should not turn our backs on those refugees who have been shown through extensive vetting to pose no demonstrable threat to our nation, and who have suffered unspeakable horrors, most of them women and children. Ultimately, we fear this executive order will become a self-inflicted wound in the fight against terrorism. At this very moment, American troops are fighting side-by-side with our Iraqi partners to defeat ISIL. But this executive order bans Iraqi pilots from coming to military bases in Arizona to fight our common enemies. Our most important allies in the fight against ISIL are the vast majority of Muslims who reject its apocalyptic ideology of hatred. This executive order sends a signal, intended or not, that America does not want Muslims coming into our country. That is why we fear this executive order may do more to help terrorist recruitment than improve our security.
## party MAX(posts.likes_count) COUNT(*)
## 1 Democrat 421064 85129
## 2 Independent 335572 893
## 3 Republican 288231 89389
# number of posts by day of the month
dbGetQuery(db,
"SELECT SUBSTR(date, 9, 10) AS day_of_month, COUNT(*) as post_count
FROM posts
GROUP BY day_of_month")
## day_of_month post_count
## 1 01 6198
## 2 02 5491
## 3 03 5493
## 4 04 5702
## 5 05 5306
## 6 06 6495
## 7 07 6331
## 8 08 5651
## 9 09 5137
## 10 10 5511
## 11 11 5806
## 12 12 6061
## 13 13 6722
## 14 14 6282
## 15 15 5876
## 16 16 5933
## 17 17 5684
## 18 18 5732
## 19 19 5508
## 20 20 6123
## 21 21 5747
## 22 22 5328
## 23 23 5192
## 24 24 5754
## 25 25 5431
## 26 26 5430
## 27 27 6014
## 28 28 6307
## 29 29 5091
## 30 30 4527
## 31 31 3548