To illustrate how to query a database using SQL, we will use an example Google BigQuery project I created with two different tables: (1) congress
, which contains information about the Facebook Pages of all Members of the U.S. Congress as of late 2017, and (2) posts
, which contains information about the Facebook posts shared by these Pages during 2017, as collected from the public Pages API.
The code below shows how we can authenticate and connect to this database:
# setting display options
options(tibble.width = Inf)
# loading packages; you may need to install them first
library(DBI)
library(bigrquery)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# loading authentication token
# (code assumes it's in your working directory)
bq_auth(path="usc-poir613-key.json")
# connecting to database
db <- dbConnect(
drv=bigquery(),
project='usc-barbera',
dataset='facebook'
)
And this is how we find which tables are in the database, and their fields:
dbListTables(db)
## [1] "congress" "posts"
dbListFields(db, name = 'congress')
## [1] "bioid" "screen_name" "name" "gender" "type"
## [6] "party"
dbListFields(db, name = 'posts')
## [1] "screen_name" "id" "from_name" "date"
## [5] "datetime" "message" "type" "link"
## [9] "domain" "likes_count" "comments_count" "shares_count"
## [13] "love_count" "haha_count" "wow_count" "sad_count"
## [17] "angry_count"
It’s generally good practice to close the connection after you’re done:
dbDisconnect(db)
Now that we know how to connect to the database, let’s see how we can query a table. First we connect using dbConnect
and then we type our query with dbGetQuery
. Generally, we will write the query clauses in capital letters and the column names in lowercase.
We’ll save data returned from dbGetQuery into a dataframe, to compare the R and SQL code
db <- dbConnect(
drv=bigquery(),
project='usc-barbera',
dataset='facebook',
quiet=TRUE
)
test_congress <- dbGetQuery(db, 'SELECT * FROM congress')
str(test_congress)
## tibble [519 × 6] (S3: tbl_df/tbl/data.frame)
## $ bioid : chr [1:519] "A000370" "B001230" "B001245" "B001270" ...
## $ screen_name: chr [1:519] "CongresswomanAdams" "senatortammybaldwin" "madeleine.bordallo" "RepKarenBass" ...
## $ name : chr [1:519] "Alma S. Adams" "Tammy Baldwin" "Madeleine Z. Bordallo" "Karen Bass" ...
## $ gender : chr [1:519] "F" "F" "F" "F" ...
## $ type : chr [1:519] "rep" "sen" "rep" "rep" ...
## $ party : chr [1:519] "Democrat" "Democrat" "Democrat" "Democrat" ...
test_posts <- dbGetQuery(db, 'SELECT * FROM posts LIMIT 100')
str(test_posts)
## tibble [100 × 17] (S3: tbl_df/tbl/data.frame)
## $ screen_name : chr [1:100] "106631626049851" "115356957005" "115356957005" "118514606128" ...
## $ id : chr [1:100] "106631626049851_1275484255831243" "115356957005_10154703789472006" "115356957005_10154916550987006" "118514606128_10154157152401129" ...
## $ from_name : chr [1:100] "Ted Poe" "Congressman Hank Johnson" "Congressman Hank Johnson" "Sam Graves" ...
## $ date : Date[1:100], format: "2017-01-05" "2017-06-20" ...
## $ datetime : POSIXct[1:100], format: "2017-01-05 15:38:22" "2017-06-20 15:57:44" ...
## $ message : chr [1:100] "\"Being sworn in is always an honor and a privilege, but this year is even more special for me. I am happy to a"| __truncated__ "Want to fully understand President Trump’s conflicts of interest? Want to see how they could impact the choices"| __truncated__ "NA" "Everyone who lives around Mound City knows how important Squaw Creek is to people here. But, in the last days o"| __truncated__ ...
## $ type : chr [1:100] "link" "link" "link" "link" ...
## $ link : chr [1:100] "http://poe.house.gov/press-releases?ID=443ADE64-03E2-461A-BA56-FFB1489E9402" "https://goo.gl/7CR9Mg" "https://www-projectcasting-com.cdn.ampproject.org/c/www.projectcasting.com/news/georgia-filming/amp" "http://www.maryvilledailyforum.com/news/article_70511708-d273-11e6-ba6f-6730f071a26a.html" ...
## $ domain : chr [1:100] "poe.house.gov" "goo.gl" "-projectcasting-com.cdn.ampproject.org" "maryvilledailyforum.com" ...
## $ likes_count : int [1:100] 1125 51 167 598 122 109 115 112 119 623 ...
## $ comments_count: int [1:100] 166 23 3 135 53 24 26 33 51 87 ...
## $ shares_count : int [1:100] 170 22 109 218 269 37 39 44 167 200 ...
## $ love_count : int [1:100] 244 0 23 7 1 1 2 5 1 81 ...
## $ haha_count : int [1:100] 0 0 1 0 0 0 1 1 1 1 ...
## $ wow_count : int [1:100] 10 10 10 16 30 12 13 18 18 12 ...
## $ sad_count : int [1:100] 0 1 0 6 4 7 0 2 42 1 ...
## $ angry_count : int [1:100] 0 9 0 65 126 33 0 11 334 8 ...
RMarkdown actually allows us to write the SQL query directly as a chunk of code:
SELECT * FROM congress LIMIT 5
bioid | screen_name | name | gender | type | party |
---|---|---|---|---|---|
A000370 | CongresswomanAdams | Alma S. Adams | F | rep | Democrat |
B001230 | senatortammybaldwin | Tammy Baldwin | F | sen | Democrat |
B001245 | madeleine.bordallo | Madeleine Z. Bordallo | F | rep | Democrat |
B001270 | RepKarenBass | Karen Bass | F | rep | Democrat |
B001278 | congresswomanbonamici | Suzanne Bonamici | F | rep | Democrat |
This is the list of query clauses that we will use to work with the database table, and the dplyr
equivalents:
SELECT: determines which columns to include in the query’s result set. dplyr: select
FROM: identifies the tables from which to draw data and how tables should be joined. dplyr: dataframe name
LIMIT: in combination with SELECT, limits the number of records returned. dplyr: slice_head
WHERE: filter out unwanted data. dplyr: filter
or where
argument within select
GROUP BY: groups rows together by common column values dplyr: group_by
ORDER BY: sorts the rows in the final result set by one or more columns. dplyr: arrange
Let’s start with some examples of SELECT. R equivalents using dplyr
are offered below:
# querying just one column
dbGetQuery(db, "SELECT name FROM congress LIMIT 10")
## # A tibble: 10 × 1
## name
## <chr>
## 1 Alma S. Adams
## 2 Tammy Baldwin
## 3 Madeleine Z. Bordallo
## 4 Karen Bass
## 5 Suzanne Bonamici
## 6 Joyce Beatty
## 7 Julia Brownley
## 8 Cheri Bustos
## 9 Nanette Diaz Barragán
## 10 Lisa Blunt Rochester
# multiple columns
dbGetQuery(db, "SELECT name, party FROM congress LIMIT 10")
## # A tibble: 10 × 2
## name party
## <chr> <chr>
## 1 Alma S. Adams Democrat
## 2 Tammy Baldwin Democrat
## 3 Madeleine Z. Bordallo Democrat
## 4 Karen Bass Democrat
## 5 Suzanne Bonamici Democrat
## 6 Joyce Beatty Democrat
## 7 Julia Brownley Democrat
## 8 Cheri Bustos Democrat
## 9 Nanette Diaz Barragán Democrat
## 10 Lisa Blunt Rochester Democrat
## R:
test_congress %>%
select(name, party) %>%
slice(1:10)
## # A tibble: 10 × 2
## name party
## <chr> <chr>
## 1 Alma S. Adams Democrat
## 2 Tammy Baldwin Democrat
## 3 Madeleine Z. Bordallo Democrat
## 4 Karen Bass Democrat
## 5 Suzanne Bonamici Democrat
## 6 Joyce Beatty Democrat
## 7 Julia Brownley Democrat
## 8 Cheri Bustos Democrat
## 9 Nanette Diaz Barragán Democrat
## 10 Lisa Blunt Rochester Democrat
SQL can perform transformations on the data directly in SELECT:
# adding functions
dbGetQuery(db, "SELECT from_name, likes_count/comments_count, UPPER(type) FROM posts LIMIT 10")
## # A tibble: 10 × 3
## from_name f0_ f1_
## <chr> <dbl> <chr>
## 1 Ted Poe 6.78 LINK
## 2 Congressman Hank Johnson 2.22 LINK
## 3 Congressman Hank Johnson 55.7 LINK
## 4 Sam Graves 4.43 LINK
## 5 Walter Jones 2.30 LINK
## 6 Walter Jones 4.54 LINK
## 7 Walter Jones 4.42 LINK
## 8 Walter Jones 3.39 LINK
## 9 Congressman John Yarmuth (KY-3) 2.33 LINK
## 10 Congressman John Yarmuth (KY-3) 7.16 LINK
# adding aliases to the new columns
dbGetQuery(db, "SELECT from_name, likes_count/comments_count AS like_ratio FROM posts LIMIT 10")
## # A tibble: 10 × 2
## from_name like_ratio
## <chr> <dbl>
## 1 Ted Poe 6.78
## 2 Congressman Hank Johnson 2.22
## 3 Congressman Hank Johnson 55.7
## 4 Sam Graves 4.43
## 5 Walter Jones 2.30
## 6 Walter Jones 4.54
## 7 Walter Jones 4.42
## 8 Walter Jones 3.39
## 9 Congressman John Yarmuth (KY-3) 2.33
## 10 Congressman John Yarmuth (KY-3) 7.16
# it's often a good idea to write a query in multiple lines so that it's easier to read it:
dbGetQuery(db, "SELECT from_name,
likes_count/comments_count AS like_ratio
FROM posts
LIMIT 10")
## # A tibble: 10 × 2
## from_name like_ratio
## <chr> <dbl>
## 1 Ted Poe 6.78
## 2 Congressman Hank Johnson 2.22
## 3 Congressman Hank Johnson 55.7
## 4 Sam Graves 4.43
## 5 Walter Jones 2.30
## 6 Walter Jones 4.54
## 7 Walter Jones 4.42
## 8 Walter Jones 3.39
## 9 Congressman John Yarmuth (KY-3) 2.33
## 10 Congressman John Yarmuth (KY-3) 7.16
## R: test_posts %>%
# select(from_name, likes_count, comments_count, type) %>%
# mutate(like_ratio = likes_count/comments_count,
# type = toupper(type) ) %>%
# slice(1:10)
# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT from_name
FROM posts
LIMIT 10")
## # A tibble: 10 × 1
## from_name
## <chr>
## 1 Ted Poe
## 2 Congressman Hank Johnson
## 3 Sam Graves
## 4 Walter Jones
## 5 Congressman John Yarmuth (KY-3)
## 6 Congressman Joe Kennedy III
## 7 Congressman Joaquin Castro
## 8 Louie Gohmert
## 9 Congressman Ken Calvert
## 10 Nydia Velazquez
## R: test_posts %>%
# select(from_name)%>%
# distinct() %>%
# slice(1:10)
Now some examples of how we can use WHERE to subset the database based on values of specific columns:
# selecting based on values of a column
dbGetQuery(db, "SELECT name, party
FROM congress
WHERE party = 'Republican'
LIMIT 10")
## # A tibble: 10 × 2
## name party
## <chr> <chr>
## 1 Marsha Blackburn Republican
## 2 Diane Black Republican
## 3 Susan W. Brooks Republican
## 4 Susan M. Collins Republican
## 5 Shelley Moore Capito Republican
## 6 Barbara Comstock Republican
## 7 Liz Cheney Republican
## 8 Joni Ernst Republican
## 9 Virginia Foxx Republican
## 10 Deb Fischer Republican
## R: test_congress %>%
# select(name, party)%>%
# filter(party== 'Republican') %>%
# slice(1:10)
# AND operator
dbGetQuery(db, "SELECT from_name, type, likes_count
FROM posts
WHERE type != 'photo'
AND likes_count > 500
LIMIT 10")
## # A tibble: 10 × 3
## from_name type likes_count
## <chr> <chr> <int>
## 1 Ted Poe link 1125
## 2 Sam Graves link 598
## 3 Congressman John Yarmuth (KY-3) link 623
## 4 Congressman John Yarmuth (KY-3) link 1173
## 5 Congressman John Yarmuth (KY-3) link 1081
## 6 Congressman John Yarmuth (KY-3) link 966
## 7 Congressman John Yarmuth (KY-3) link 882
## 8 Congressman John Yarmuth (KY-3) link 871
## 9 Congressman Joe Kennedy III link 2683
## 10 Congressman Joe Kennedy III link 2920
## R: test_posts %>%
# select(from_name, type, likes_count)%>%
# filter(type != 'photo' &
# likes_count > 500) %>%
# slice(1:10)
# OR operator
dbGetQuery(db, "SELECT from_name, type, comments_count
FROM posts
WHERE comments_count>100
AND (type = 'photo' OR type = 'video')
LIMIT 10")
## # A tibble: 10 × 3
## from_name type comments_count
## <chr> <chr> <int>
## 1 Congressman John Yarmuth (KY-3) photo 134
## 2 Congressman John Yarmuth (KY-3) photo 152
## 3 Congressman John Yarmuth (KY-3) photo 1150
## 4 Congressman John Yarmuth (KY-3) photo 255
## 5 Congressman Joe Kennedy III photo 160
## 6 Congressman Joe Kennedy III photo 300
## 7 Congressman Joe Kennedy III photo 119
## 8 Congressman Joe Kennedy III photo 191
## 9 Congressman Joe Kennedy III photo 203
## 10 Congressman Joe Kennedy III photo 379
# membership, IN
dbGetQuery(db, "SELECT from_name, type, comments_count
FROM posts
WHERE type IN ('video', 'event')
LIMIT 10")
## # A tibble: 10 × 3
## from_name type comments_count
## <chr> <chr> <int>
## 1 House Democratic Leader Nancy Pelosi event 437
## 2 U.S. Senator Kamala Harris event 284
## 3 Ted Poe video 10
## 4 Gregorio Kilili Camacho Sablan video 59
## 5 Congressman John Yarmuth (KY-3) video 1347
## 6 Congressman John Yarmuth (KY-3) video 607
## 7 Congressman John Yarmuth (KY-3) video 748
## 8 Congressman John Yarmuth (KY-3) video 268
## 9 Congressman John Yarmuth (KY-3) video 127
## 10 Congressman John Yarmuth (KY-3) video 63
## R: test_posts %>%
# select(from_name, type, comments_count)%>%
# filter(type %in% c('video', 'event')) %>%
# slice(1:10)
Very useful for us – GROUP BY allows us to aggregate results based on values of a specific variables. We will come back to this later.
dbGetQuery(db,
"SELECT type, COUNT(*) AS type_count
FROM posts
GROUP BY type")
## # A tibble: 7 × 2
## type type_count
## <chr> <int>
## 1 link 60847
## 2 event 1333
## 3 photo 67067
## 4 video 29639
## 5 status 14636
## 6 note 23
## 7 music 21
## R: test_posts %>%
# group_by(type)%>%
# summarize(type_count= n())
And we can then order the results using ORDER BY. The default is alphabetical order for the first column, but we can change that:
# sort by type_count
dbGetQuery(db,
"SELECT type, COUNT(type) AS type_count
FROM posts
GROUP BY type
ORDER BY type_count")
## # A tibble: 7 × 2
## type type_count
## <chr> <int>
## 1 music 21
## 2 note 23
## 3 event 1333
## 4 status 14636
## 5 video 29639
## 6 link 60847
## 7 photo 67067
## R: test_posts %>%
# group_by(type)%>%
# summarize(type_count= n())%>%
# arrange(type_count)
# now in descending order
dbGetQuery(db,
"SELECT type, COUNT(type) AS type_count
FROM posts
GROUP BY type
ORDER BY type_count DESC")
## # A tibble: 7 × 2
## type type_count
## <chr> <int>
## 1 photo 67067
## 2 link 60847
## 3 video 29639
## 4 status 14636
## 5 event 1333
## 6 note 23
## 7 music 21
# which was the most liked post?
dbGetQuery(db,
"SELECT from_name, message, likes_count, datetime
FROM posts
ORDER BY likes_count DESC
LIMIT 1")
## # A tibble: 1 × 4
## from_name
## <chr>
## 1 U.S. Senator Elizabeth Warren
## message
## <chr>
## 1 Tonight we fight for American values at airports all across this country. I'm…
## likes_count datetime
## <int> <dttm>
## 1 421064 2017-01-29 01:59:12
## R: test_posts %>%
# select(from_name, message, likes_count, datetime) %>%
# arrange(desc(likes_count)) %>%
# slice(1)
# You can also specify the column number instead of the name
dbGetQuery(db,
"SELECT from_name, message, likes_count, datetime
FROM posts
ORDER BY 3 DESC
LIMIT 1")
## # A tibble: 1 × 4
## from_name
## <chr>
## 1 U.S. Senator Elizabeth Warren
## message
## <chr>
## 1 Tonight we fight for American values at airports all across this country. I'm…
## likes_count datetime
## <int> <dttm>
## 1 421064 2017-01-29 01:59:12
# What was the photo posted by a Member of Congress that received the most "angry" reactions?
dbGetQuery(db,
"SELECT from_name, message, angry_count, type
FROM posts
WHERE type = 'photo'
ORDER BY angry_count DESC
LIMIT 1")
## # A tibble: 1 × 4
## from_name
## <chr>
## 1 John Cornyn
## message
## <chr>
## 1 This week, I’ll be voting to confirm Betsy DeVos as Secretary of Education. P…
## angry_count type
## <int> <chr>
## 1 30552 photo