For this first part of the class, we’ll be working with a SQLite database, which is self-contained in a file within our hard drive, without any need to set up a server. The dataset we will work with is all Facebook posts by Members of the U.S. Congress in 2017.
We’ll start by connecting to the database and adding the first table, with information at the member of congress level.
library(DBI)
# create database: this will create a file in our hard drive
db <- dbConnect(RSQLite::SQLite(), "facebook-db.sqlite")
# reading the first file
congress <- read.csv("~/data/congress-facebook-2017.csv",
stringsAsFactors=F)
# adding first table: user-level data
dbWriteTable(db, "congress", congress)
# testing that it works with a simple query
dbListFields(db, "congress")
dbGetQuery(db, 'SELECT * FROM congress LIMIT 5')
Note that we have used dbWriteTable
to add the data frame to our database, and then check that it all worked with dbListFields
and dbGetQuery
. As we will see in a second, this second function is what we will use to query the database, and it will always return a table.
We now switch to the second table. However, the files are too big to open them all in memory. Instead, we will open them one by one, and then append them to the table. Let’s see how that works…
fls <- list.files("~/data/posts", full.names=TRUE)
for (f in fls){
message(f)
# read file into memory
fb <- read.csv(f, stringsAsFactors=F)
# adding to table in SQL database
dbWriteTable(db, "posts", fb, append=TRUE)
}
# testing that it works
dbListFields(db, "posts")
dbGetQuery(db, 'SELECT * FROM posts LIMIT 5')
# what if we make a mistake and want to remove the table?
# dbRemoveTable(db, "posts")
# and we close the connection for now
dbDisconnect(db)
Now that we have our tables in the database, let’s see how we can query them. First we connect using dbConnect
and then we type our query with dbGetQuery
. Generally, we will write the query clauses in capital letters and the column names in lowercase.
db <- dbConnect(RSQLite::SQLite(), "facebook-db.sqlite")
test <- dbGetQuery(db, 'SELECT * FROM congress LIMIT 5')
str(test)
## 'data.frame': 5 obs. of 6 variables:
## $ bioid : chr "A000055" "A000360" "A000367" "A000369" ...
## $ screen_name: chr "RobertAderholt" "senatorlamaralexander" "repjustinamash" "MarkAmodeiNV2" ...
## $ name : chr "Robert B. Aderholt" "Lamar Alexander" "Justin Amash" "Mark E. Amodei" ...
## $ gender : chr "M" "M" "M" "M" ...
## $ type : chr "rep" "sen" "rep" "rep" ...
## $ party : chr "Republican" "Republican" "Republican" "Republican" ...
RMarkdown actually allows us to write the SQL query directly as a chunk of code:
SELECT * FROM congress LIMIT 5
bioid | screen_name | name | gender | type | party |
---|---|---|---|---|---|
A000055 | RobertAderholt | Robert B. Aderholt | M | rep | Republican |
A000360 | senatorlamaralexander | Lamar Alexander | M | sen | Republican |
A000367 | repjustinamash | Justin Amash | M | rep | Republican |
A000369 | MarkAmodeiNV2 | Mark E. Amodei | M | rep | Republican |
A000370 | CongresswomanAdams | Alma S. Adams | F | rep | Democrat |
This is the list of query clauses that we will use to work with the database table:
SELECT: determines which columns to include in the query’s result set
FROM: identifies the tables from which to draw data and how tables should be joined
LIMIT: in combination with SELECT, limits the number of records returned
WHERE: filter out unwanted data
GROUP BY: groups rows together by common column values
ORDER BY: sorts the rows in the final result set by one or more columns
Let’s start with some examples of SELECT:
# querying just one column
dbGetQuery(db, "SELECT name FROM congress LIMIT 10")
## name
## 1 Robert B. Aderholt
## 2 Lamar Alexander
## 3 Justin Amash
## 4 Mark E. Amodei
## 5 Alma S. Adams
## 6 Pete Aguilar
## 7 Rick W. Allen
## 8 Ralph Lee Abraham
## 9 Jodey C. Arrington
## 10 Joe Barton
# multiple columns
dbGetQuery(db, "SELECT name, party FROM congress LIMIT 10")
## name party
## 1 Robert B. Aderholt Republican
## 2 Lamar Alexander Republican
## 3 Justin Amash Republican
## 4 Mark E. Amodei Republican
## 5 Alma S. Adams Democrat
## 6 Pete Aguilar Democrat
## 7 Rick W. Allen Republican
## 8 Ralph Lee Abraham Republican
## 9 Jodey C. Arrington Republican
## 10 Joe Barton Republican
# adding expressions
dbGetQuery(db, "SELECT from_name, likes_count/comments_count, UPPER(type) FROM posts LIMIT 10")
## from_name likes_count/comments_count UPPER(type)
## 1 Ted Poe 13 VIDEO
## 2 Ted Poe 35 PHOTO
## 3 Ted Poe 1 LINK
## 4 Ted Poe 4 LINK
## 5 Ted Poe 21 PHOTO
## 6 Ted Poe 17 VIDEO
## 7 Ted Poe 14 PHOTO
## 8 Ted Poe 5 LINK
## 9 Ted Poe 11 VIDEO
## 10 Ted Poe 9 LINK
# adding aliases to the new columns
dbGetQuery(db, "SELECT from_name, likes_count/comments_count AS like_ratio FROM posts LIMIT 10")
## from_name like_ratio
## 1 Ted Poe 13
## 2 Ted Poe 35
## 3 Ted Poe 1
## 4 Ted Poe 4
## 5 Ted Poe 21
## 6 Ted Poe 17
## 7 Ted Poe 14
## 8 Ted Poe 5
## 9 Ted Poe 11
## 10 Ted Poe 9
# it's often a good idea to write a query in multiple lines so that it's easier to read it:
dbGetQuery(db, "SELECT from_name,
likes_count/comments_count AS like_ratio
FROM posts
LIMIT 10")
## from_name like_ratio
## 1 Ted Poe 13
## 2 Ted Poe 35
## 3 Ted Poe 1
## 4 Ted Poe 4
## 5 Ted Poe 21
## 6 Ted Poe 17
## 7 Ted Poe 14
## 8 Ted Poe 5
## 9 Ted Poe 11
## 10 Ted Poe 9
# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT from_name
FROM posts
LIMIT 10")
## from_name
## 1 Ted Poe
## 2 Congressman Wm. Lacy Clay
## 3 Congressman David Scott
## 4 Congressman Hank Johnson
## 5 Rep. Steve Stivers
## 6 Sam Graves
## 7 Walter Jones
## 8 U.S. Congressman Henry Cuellar (TX-28)
## 9 Gregorio Kilili Camacho Sablan
## 10 Representative Mike Kelly
Now some examples of how we can use WHERE to subset the database based on values of specific columns:
# selecting based on values of a column
dbGetQuery(db, "SELECT name, party
FROM congress
WHERE party = 'Republican'
LIMIT 10")
## name party
## 1 Robert B. Aderholt Republican
## 2 Lamar Alexander Republican
## 3 Justin Amash Republican
## 4 Mark E. Amodei Republican
## 5 Rick W. Allen Republican
## 6 Ralph Lee Abraham Republican
## 7 Jodey C. Arrington Republican
## 8 Joe Barton Republican
## 9 Roy Blunt Republican
## 10 Kevin Brady Republican
# greater than, working with dates
dbGetQuery(db, "SELECT from_name, type, date
FROM posts
WHERE date > '2017-01-01'
LIMIT 10")
## from_name type date
## 1 Ted Poe video 2017-01-31
## 2 Ted Poe photo 2017-01-31
## 3 Ted Poe link 2017-01-27
## 4 Ted Poe link 2017-01-27
## 5 Ted Poe photo 2017-01-26
## 6 Ted Poe video 2017-01-26
## 7 Ted Poe photo 2017-01-26
## 8 Ted Poe link 2017-01-25
## 9 Ted Poe video 2017-01-24
## 10 Ted Poe link 2017-01-24
# AND operator
dbGetQuery(db, "SELECT from_name, type, date, likes_count
FROM posts
WHERE date > '2017-06-01' AND type != 'photo'
AND likes_count > 500
LIMIT 10")
## from_name type date likes_count
## 1 Ted Poe status 2017-11-07 1101
## 2 Rep. Steve Stivers status 2017-08-15 1019
## 3 Sam Graves link 2017-09-25 1105
## 4 Walter Jones status 2017-08-22 526
## 5 Walter Jones link 2017-09-14 617
## 6 Gregorio Kilili Camacho Sablan link 2017-06-30 767
## 7 Representative Mike Kelly link 2017-09-24 1659
## 8 Congressman John Yarmuth (KY-3) video 2017-06-22 729
## 9 Congressman John Yarmuth (KY-3) video 2017-06-10 509
## 10 Congressman John Yarmuth (KY-3) status 2017-06-08 1246
# OR operator
dbGetQuery(db, "SELECT from_name, type, date, comments_count
FROM posts
WHERE comments_count>100 AND (type = 'photo' OR type = 'video')
LIMIT 10")
## from_name type date comments_count
## 1 Ted Poe video 2017-10-03 123
## 2 Ted Poe photo 2017-11-24 176
## 3 Ted Poe photo 2017-12-19 164
## 4 Congressman Wm. Lacy Clay video 2017-01-12 163
## 5 Congressman Wm. Lacy Clay photo 2017-01-04 317
## 6 Congressman David Scott video 2017-11-16 108
## 7 Congressman David Scott photo 2017-12-19 158
## 8 Rep. Steve Stivers photo 2017-05-03 132
## 9 Rep. Steve Stivers photo 2017-12-19 121
## 10 Sam Graves photo 2017-02-07 112
# membership, IN
dbGetQuery(db, "SELECT from_name, type, date, comments_count
FROM posts
WHERE type IN ('video', 'event')
LIMIT 10")
## from_name type date comments_count
## 1 Ted Poe video 2017-01-31 19
## 2 Ted Poe video 2017-01-26 14
## 3 Ted Poe video 2017-01-24 23
## 4 Ted Poe video 2017-01-06 7
## 5 Ted Poe video 2017-02-24 27
## 6 Ted Poe video 2017-02-23 47
## 7 Ted Poe video 2017-02-15 22
## 8 Ted Poe video 2017-02-06 44
## 9 Ted Poe video 2017-02-02 44
## 10 Ted Poe video 2017-03-15 27
# matching conditions:
# _ matches exactly one character
# % matches any number of characters
dbGetQuery(db, "SELECT from_name, type, date, comments_count
FROM posts
WHERE date LIKE '2017-01-__'
LIMIT 10")
## from_name type date comments_count
## 1 Ted Poe video 2017-01-31 19
## 2 Ted Poe photo 2017-01-31 2
## 3 Ted Poe link 2017-01-27 21
## 4 Ted Poe link 2017-01-27 5
## 5 Ted Poe photo 2017-01-26 12
## 6 Ted Poe video 2017-01-26 14
## 7 Ted Poe photo 2017-01-26 14
## 8 Ted Poe link 2017-01-25 9
## 9 Ted Poe video 2017-01-24 23
## 10 Ted Poe link 2017-01-24 22
dbGetQuery(db, "SELECT from_name, type, date, comments_count
FROM posts
WHERE date LIKE '2017-03%'
LIMIT 10")
## from_name type date comments_count
## 1 Ted Poe link 2017-03-31 29
## 2 Ted Poe link 2017-03-31 18
## 3 Ted Poe photo 2017-03-29 39
## 4 Ted Poe link 2017-03-17 142
## 5 Ted Poe video 2017-03-15 27
## 6 Ted Poe link 2017-03-10 34
## 7 Ted Poe link 2017-03-08 315
## 8 Ted Poe photo 2017-03-07 7
## 9 Ted Poe link 2017-03-03 27
## 10 Ted Poe photo 2017-03-02 12
# SQLite does not have regular expressions, but we can get creative...
dbGetQuery(db, "SELECT from_name, message, date
FROM posts
WHERE message LIKE '%london%'
LIMIT 1")
## from_name
## 1 Rep. Steve Stivers
## message
## 1 Inspiring story from London, OH in The Columbus Dispatch. It’s amazing to see how a community comes together to support one another, and I am excited for all the children who will be able to enjoy Noah’s Playground. http://bit.ly/2lN7aIp
## date
## 1 2017-03-02
Very useful for us – GROUP BY allows us to aggregate results based on values of a specific variables. We will come back to this later, so don’t worry about it for now.
dbGetQuery(db,
"SELECT from_name, COUNT(type) AS type_count
FROM posts
GROUP BY type")
## from_name type_count
## 1 US Rep Rick Nolan 1334
## 2 Representative Zoe Lofgren 61512
## 3 Senator Tammy Duckworth 21
## 4 Senator Brian Schatz 23
## 5 Representative Zoe Lofgren 67798
## 6 Representative Zoe Lofgren 14795
## 7 Representative Zoe Lofgren 29928
And we can then order the results using ORDER BY. The default is alphabetical order for the first column, but we can change that:
# sort by type_count
dbGetQuery(db,
"SELECT type, COUNT(type) AS type_count
FROM posts
GROUP BY type
ORDER BY type_count")
## type type_count
## 1 music 21
## 2 note 23
## 3 event 1334
## 4 status 14795
## 5 video 29928
## 6 link 61512
## 7 photo 67798
# now in descending orders
dbGetQuery(db,
"SELECT type, COUNT(type) AS type_count
FROM posts
GROUP BY type
ORDER BY type_count DESC")
## type type_count
## 1 photo 67798
## 2 link 61512
## 3 video 29928
## 4 status 14795
## 5 event 1334
## 6 note 23
## 7 music 21
# which was the most popular post?
dbGetQuery(db,
"SELECT from_name, message, likes_count, datetime
FROM posts
ORDER BY likes_count DESC
LIMIT 1")
## from_name
## 1 U.S. Senator Elizabeth Warren
## message
## 1 Tonight we fight for American values at airports all across this country. I'm at Logan Airport tonight to say: our diversity makes our country strong. We will not allow a Muslim ban in the United States of America. #NoBanNoWall #nomuslimban
## likes_count datetime
## 1 421064 2017-01-29 01:59:12
# You can also specify the column number instead of the name
dbGetQuery(db,
"SELECT from_name, message, likes_count, datetime
FROM posts
ORDER BY 3 DESC
LIMIT 1")
## from_name
## 1 U.S. Senator Elizabeth Warren
## message
## 1 Tonight we fight for American values at airports all across this country. I'm at Logan Airport tonight to say: our diversity makes our country strong. We will not allow a Muslim ban in the United States of America. #NoBanNoWall #nomuslimban
## likes_count datetime
## 1 421064 2017-01-29 01:59:12
# what was the post with the highest comment to like ratio? We subset only posts with 1000 likes or more to avoid outliers.
dbGetQuery(db,
"SELECT from_name, message, likes_count, comments_count, date,
comments_count/likes_count AS comment_like_ratio
FROM posts
WHERE likes_count > 1000
ORDER BY comment_like_ratio DESC
LIMIT 5")
## from_name
## 1 U.S. Senator Susan Collins
## 2 Senator Cory Gardner
## 3 Senator Cory Gardner
## 4 John Cornyn
## 5 Senator Roy Blunt
## message
## 1 After securing significant changes in the bill, as well as commitments to pass legislation to help lower health insurance premiums, I will cast my vote in support of the Senate tax reform bill. As revised, this bill will provide much-needed tax relief and simplification for lower- and middle-income families, while spurring the creation of good jobs and greater economic growth. Click below to read my full statement:
## 2 The teachers in our lives play an outsized role as we grow into adults and throughout our adult life - and for good reason. They taught me that we must work together in common cause for a greater school, community, and nation. I look forward to the great work ahead of us not as parents divided, but as people committed, in spite of these differences, to building a stronger education for our children and grandchildren.
## 3 Today marks a historic day for our country. It has been over 30 years since Congress has passed major tax reform and I’m excited to bring tax relief to millions of Coloradans. Read more here: https://www.gardner.senate.gov/newsroom/press-releases/gardner-statement-on-passage-of-historic-tax-relief-legislation
## 4 This week, I’ll be voting to confirm Betsy DeVos as Secretary of Education. Power over education should be handed back down to the states, so that parents and teachers can choose how best to accomplish our universal goal of making sure every child has a good education. Ms. DeVos will lead this effort, so that parents, teachers and our local school districts will have a greater role in these important decisions.
## 5 I believe Betsy DeVos understands that decisions about education need to be made much closer to where kids are. I look forward to working with her to find ways to get those decisions back to local school boards, and moms and dads. As Chairman of the Appropriations Subcommittee for the Department of Education, I will be working hard to ensure that public education continues to be appreciated as the cornerstone of opportunity in America.
## likes_count comments_count date comment_like_ratio
## 1 1070 14974 2017-12-01 13
## 2 1135 14298 2017-02-07 12
## 3 1499 15271 2017-12-02 10
## 4 3762 36849 2017-02-06 9
## 5 1595 15167 2017-02-01 9