Creating an SQL database

For this first part of the class, we’ll be working with a SQLite database, which is self-contained in a file within our hard drive, without any need to set up a server. The dataset we will work with is all Facebook posts by Members of the U.S. Congress in 2017, as collected from the public Pages API earlier this year.

We’ll start by connecting to the database and adding the first table, with information at the member of congress level.

library(DBI)
# create database: this will create a file in our hard drive
db <- dbConnect(RSQLite::SQLite(), "facebook-db.sqlite")

# reading the first file
congress <- read.csv("~/data/congress-facebook-2017.csv",
    stringsAsFactors=F)

# adding first table: user-level data
dbWriteTable(db, "congress", congress)

# testing that it works with a simple query
dbListFields(db, "congress")
dbGetQuery(db, 'SELECT * FROM congress LIMIT 5')

Note that we have used dbWriteTable to add the data frame to our database, and then check that it all worked with dbListFields and dbGetQuery. As we will see in a second, this second function is what we will use to query the database, and it will always return a table.

We now switch to the second table. However, the files are too big to open them all in memory. Instead, we will open them one by one, and then append them to the table. Let’s see how that works…

fls <- list.files("~/data/posts", full.names=TRUE)

for (f in fls){
  
  message(f)

  # read file into memory
  fb <- read.csv(f, stringsAsFactors=F)
  
  # adding to table in SQL database
  dbWriteTable(db, "posts", fb, append=TRUE)
  
}

# testing that it works
dbListFields(db, "posts")
dbGetQuery(db, 'SELECT * FROM posts LIMIT 5')

# what if we make a mistake and want to remove the table?
# dbRemoveTable(db, "posts")

# and we close the connection for now
dbDisconnect(db)

Querying an SQL database

Now that we have our tables in the database, let’s see how we can query them. First we connect using dbConnect and then we type our query with dbGetQuery. Generally, we will write the query clauses in capital letters and the column names in lowercase.

db <- dbConnect(RSQLite::SQLite(), "facebook-db.sqlite")
test <- dbGetQuery(db, 'SELECT * FROM congress LIMIT 5')
str(test)
## 'data.frame':    5 obs. of  6 variables:
##  $ bioid      : chr  "A000055" "A000360" "A000367" "A000369" ...
##  $ screen_name: chr  "RobertAderholt" "senatorlamaralexander" "repjustinamash" "MarkAmodeiNV2" ...
##  $ name       : chr  "Robert B. Aderholt" "Lamar Alexander" "Justin Amash" "Mark E. Amodei" ...
##  $ gender     : chr  "M" "M" "M" "M" ...
##  $ type       : chr  "rep" "sen" "rep" "rep" ...
##  $ party      : chr  "Republican" "Republican" "Republican" "Republican" ...

RMarkdown actually allows us to write the SQL query directly as a chunk of code:

SELECT * FROM congress LIMIT 5
5 records
bioid screen_name name gender type party
A000055 RobertAderholt Robert B. Aderholt M rep Republican
A000360 senatorlamaralexander Lamar Alexander M sen Republican
A000367 repjustinamash Justin Amash M rep Republican
A000369 MarkAmodeiNV2 Mark E. Amodei M rep Republican
A000370 CongresswomanAdams Alma S. Adams F rep Democrat

This is the list of query clauses that we will use to work with the database table:

Let’s start with some examples of SELECT:

# querying just one column
dbGetQuery(db, "SELECT name FROM congress LIMIT 10")
##                  name
## 1  Robert B. Aderholt
## 2     Lamar Alexander
## 3        Justin Amash
## 4      Mark E. Amodei
## 5       Alma S. Adams
## 6        Pete Aguilar
## 7       Rick W. Allen
## 8   Ralph Lee Abraham
## 9  Jodey C. Arrington
## 10         Joe Barton
# multiple columns
dbGetQuery(db, "SELECT name, party FROM congress LIMIT 10")
##                  name      party
## 1  Robert B. Aderholt Republican
## 2     Lamar Alexander Republican
## 3        Justin Amash Republican
## 4      Mark E. Amodei Republican
## 5       Alma S. Adams   Democrat
## 6        Pete Aguilar   Democrat
## 7       Rick W. Allen Republican
## 8   Ralph Lee Abraham Republican
## 9  Jodey C. Arrington Republican
## 10         Joe Barton Republican
# adding expressions
dbGetQuery(db, "SELECT from_name, likes_count/comments_count, UPPER(type) FROM posts LIMIT 10")
##    from_name likes_count/comments_count UPPER(type)
## 1    Ted Poe                          8        LINK
## 2    Ted Poe                         11        LINK
## 3    Ted Poe                         10       PHOTO
## 4    Ted Poe                          6        LINK
## 5    Ted Poe                          3        LINK
## 6    Ted Poe                         11       VIDEO
## 7    Ted Poe                         17       PHOTO
## 8    Ted Poe                          4        LINK
## 9    Ted Poe                         10       PHOTO
## 10   Ted Poe                         11       PHOTO
# adding aliases to the new columns
dbGetQuery(db, "SELECT from_name, likes_count/comments_count AS like_ratio FROM posts LIMIT 10")
##    from_name like_ratio
## 1    Ted Poe          8
## 2    Ted Poe         11
## 3    Ted Poe         10
## 4    Ted Poe          6
## 5    Ted Poe          3
## 6    Ted Poe         11
## 7    Ted Poe         17
## 8    Ted Poe          4
## 9    Ted Poe         10
## 10   Ted Poe         11
# it's often a good idea to write a query in multiple lines so that it's easier to read it:
dbGetQuery(db, "SELECT from_name, 
           likes_count/comments_count AS like_ratio 
           FROM posts 
           LIMIT 10")
##    from_name like_ratio
## 1    Ted Poe          8
## 2    Ted Poe         11
## 3    Ted Poe         10
## 4    Ted Poe          6
## 5    Ted Poe          3
## 6    Ted Poe         11
## 7    Ted Poe         17
## 8    Ted Poe          4
## 9    Ted Poe         10
## 10   Ted Poe         11
# selecting unique values from a column
dbGetQuery(db, "SELECT DISTINCT from_name 
           FROM posts 
           LIMIT 10")
##                                 from_name
## 1                                 Ted Poe
## 2               Congressman Wm. Lacy Clay
## 3                 Congressman David Scott
## 4                Congressman Hank Johnson
## 5                      Rep. Steve Stivers
## 6                              Sam Graves
## 7                            Walter Jones
## 8  U.S. Congressman Henry Cuellar (TX-28)
## 9          Gregorio Kilili Camacho Sablan
## 10              Representative Mike Kelly

Now some examples of how we can use WHERE to subset the database based on values of specific columns:

# selecting based on values of a column
dbGetQuery(db, "SELECT name, party 
           FROM congress
           WHERE party = 'Republican'
           LIMIT 10")
##                  name      party
## 1  Robert B. Aderholt Republican
## 2     Lamar Alexander Republican
## 3        Justin Amash Republican
## 4      Mark E. Amodei Republican
## 5       Rick W. Allen Republican
## 6   Ralph Lee Abraham Republican
## 7  Jodey C. Arrington Republican
## 8          Joe Barton Republican
## 9           Roy Blunt Republican
## 10        Kevin Brady Republican
# greater than, working with dates
dbGetQuery(db, "SELECT from_name, type, date 
           FROM posts
           WHERE date > '2017-01-01'
           LIMIT 10")
##    from_name  type       date
## 1    Ted Poe  link 2017-01-03
## 2    Ted Poe  link 2017-01-03
## 3    Ted Poe photo 2017-01-05
## 4    Ted Poe  link 2017-01-05
## 5    Ted Poe  link 2017-01-06
## 6    Ted Poe video 2017-01-06
## 7    Ted Poe photo 2017-01-09
## 8    Ted Poe  link 2017-01-13
## 9    Ted Poe photo 2017-01-18
## 10   Ted Poe photo 2017-01-18
# AND operator
dbGetQuery(db, "SELECT from_name, type, date, likes_count 
           FROM posts
           WHERE date > '2017-06-01' AND type != 'photo' 
              AND likes_count > 500
           LIMIT 10")
##                          from_name   type       date likes_count
## 1                          Ted Poe status 2017-11-07        1101
## 2               Rep. Steve Stivers status 2017-08-15        1019
## 3                       Sam Graves   link 2017-09-25        1105
## 4                     Walter Jones status 2017-08-22         526
## 5                     Walter Jones   link 2017-09-14         617
## 6   Gregorio Kilili Camacho Sablan   link 2017-06-30         767
## 7        Representative Mike Kelly   link 2017-09-24        1659
## 8  Congressman John Yarmuth (KY-3) status 2017-06-08        1246
## 9  Congressman John Yarmuth (KY-3)  video 2017-06-10         509
## 10 Congressman John Yarmuth (KY-3)  video 2017-06-22         729
# OR operator
dbGetQuery(db, "SELECT from_name, type, date, comments_count 
           FROM posts
           WHERE comments_count>100 AND (type = 'photo' OR type = 'video')
           LIMIT 10")
##                    from_name  type       date comments_count
## 1                    Ted Poe video 2017-10-03            123
## 2                    Ted Poe photo 2017-11-24            176
## 3                    Ted Poe photo 2017-12-19            164
## 4  Congressman Wm. Lacy Clay photo 2017-01-04            317
## 5  Congressman Wm. Lacy Clay video 2017-01-12            163
## 6    Congressman David Scott video 2017-11-16            108
## 7    Congressman David Scott photo 2017-12-19            158
## 8         Rep. Steve Stivers photo 2017-05-03            132
## 9         Rep. Steve Stivers photo 2017-12-19            121
## 10                Sam Graves photo 2017-02-07            112
# membership, IN
dbGetQuery(db, "SELECT from_name, type, date, comments_count 
           FROM posts
           WHERE type IN ('video', 'event')
           LIMIT 10")
##    from_name  type       date comments_count
## 1    Ted Poe video 2017-01-06              7
## 2    Ted Poe video 2017-01-24             23
## 3    Ted Poe video 2017-01-26             14
## 4    Ted Poe video 2017-01-31             19
## 5    Ted Poe video 2017-02-02             44
## 6    Ted Poe video 2017-02-06             44
## 7    Ted Poe video 2017-02-15             22
## 8    Ted Poe video 2017-02-23             47
## 9    Ted Poe video 2017-02-24             27
## 10   Ted Poe video 2017-03-01             23
# matching conditions:
# _ matches exactly one character
# % matches any number of characters
dbGetQuery(db, "SELECT from_name, type, date, comments_count 
           FROM posts
           WHERE date LIKE '2017-01-__'
           LIMIT 10")
##    from_name  type       date comments_count
## 1    Ted Poe  link 2017-01-03             15
## 2    Ted Poe  link 2017-01-03             16
## 3    Ted Poe photo 2017-01-05             13
## 4    Ted Poe  link 2017-01-05            166
## 5    Ted Poe  link 2017-01-06              8
## 6    Ted Poe video 2017-01-06              7
## 7    Ted Poe photo 2017-01-09             13
## 8    Ted Poe  link 2017-01-13             10
## 9    Ted Poe photo 2017-01-18             10
## 10   Ted Poe photo 2017-01-18              8
dbGetQuery(db, "SELECT from_name, type, date, comments_count 
           FROM posts
           WHERE date LIKE '2017-03%'
           LIMIT 10")
##    from_name  type       date comments_count
## 1    Ted Poe photo 2017-03-01              1
## 2    Ted Poe video 2017-03-01             23
## 3    Ted Poe photo 2017-03-02             12
## 4    Ted Poe  link 2017-03-03             27
## 5    Ted Poe photo 2017-03-07              7
## 6    Ted Poe  link 2017-03-08            315
## 7    Ted Poe  link 2017-03-10             34
## 8    Ted Poe video 2017-03-15             27
## 9    Ted Poe  link 2017-03-17            142
## 10   Ted Poe photo 2017-03-29             39
# SQLite does not have regular expressions, but we can get creative...
dbGetQuery(db, "SELECT from_name, message, date
           FROM posts
           WHERE message LIKE '%hungary%'
           LIMIT 1")
##     from_name
## 1 Albio Sires
##                                                                                                                        message
## 1 This morning, I met with Dr. Laszlo Szabo, Ambassador of Hungary to discuss bilateral relations with an important NATO ally.
##         date
## 1 2017-09-28

Very useful for us – GROUP BY allows us to aggregate results based on values of a specific variables. We will come back to this later, so don’t worry about it for now.

dbGetQuery(db, 
  "SELECT from_name, COUNT(*) AS post_count
  FROM posts
  GROUP BY from_name
  LIMIT 10")
##                from_name post_count
## 1            Albio Sires        446
## 2             Ann Wagner        190
## 3          Anthony Brown        360
## 4          Ben Ray Lujan        131
## 5            Bill Foster        236
## 6           Bill Johnson        299
## 7             Billy Long         93
## 8     Blaine Luetkemeyer        193
## 9              Bob Latta        394
## 10 Bonnie Watson Coleman        490

And we can then order the results using ORDER BY. The default is alphabetical order for the first column, but we can change that:

# sort by type_count
dbGetQuery(db, 
  "SELECT type, COUNT(type) AS type_count
  FROM posts
  GROUP BY type
  ORDER BY type_count")
##     type type_count
## 1  music         21
## 2   note         23
## 3  event       1334
## 4 status      14795
## 5  video      29928
## 6   link      61512
## 7  photo      67798
# now in descending orders
dbGetQuery(db, 
  "SELECT type, COUNT(type) AS type_count
  FROM posts
  GROUP BY type
  ORDER BY type_count DESC")
##     type type_count
## 1  photo      67798
## 2   link      61512
## 3  video      29928
## 4 status      14795
## 5  event       1334
## 6   note         23
## 7  music         21
# which was the most popular post?
dbGetQuery(db, 
  "SELECT from_name, message, likes_count, datetime
  FROM posts
  ORDER BY likes_count DESC
  LIMIT 1")
##                       from_name
## 1 U.S. Senator Elizabeth Warren
##                                                                                                                                                                                                                                            message
## 1 Tonight we fight for American values at airports all across this country. I'm at Logan Airport tonight to say: our diversity makes our country strong. We will not allow a Muslim ban in the United States of America. #NoBanNoWall #nomuslimban
##   likes_count            datetime
## 1      421064 2017-01-29 01:59:12
# You can also specify the column number instead of the name
dbGetQuery(db, 
  "SELECT from_name, message, likes_count, datetime
  FROM posts
  ORDER BY 3 DESC
  LIMIT 1")
##                       from_name
## 1 U.S. Senator Elizabeth Warren
##                                                                                                                                                                                                                                            message
## 1 Tonight we fight for American values at airports all across this country. I'm at Logan Airport tonight to say: our diversity makes our country strong. We will not allow a Muslim ban in the United States of America. #NoBanNoWall #nomuslimban
##   likes_count            datetime
## 1      421064 2017-01-29 01:59:12
# what was the post with the highest comment to like ratio? We subset only posts with 1000 likes or more to avoid outliers.
dbGetQuery(db,
  "SELECT from_name, message, likes_count, comments_count, date,   
      comments_count/likes_count AS comment_like_ratio
  FROM posts
  WHERE likes_count > 1000
  ORDER BY comment_like_ratio DESC
  LIMIT 5")
##                    from_name
## 1 U.S. Senator Susan Collins
## 2       Senator Cory Gardner
## 3       Senator Cory Gardner
## 4                John Cornyn
## 5          Senator Roy Blunt
##                                                                                                                                                                                                                                                                                                                                                                                                                                                   message
## 1                     After securing significant changes in the bill, as well as commitments to pass legislation to help lower health insurance premiums, I will cast my vote in support of the Senate tax reform bill. As revised, this bill will provide much-needed tax relief and simplification for lower- and middle-income families, while spurring the creation of good jobs and greater economic growth.  Click below to read my full statement:
## 2                    The teachers in our lives play an outsized role as we grow into adults and throughout our adult life - and for good reason. They taught me that we must work together in common cause for a greater school, community, and nation. I look forward to the great work ahead of us not as parents divided, but as people committed, in spite of these differences, to building a stronger education for our children and grandchildren.
## 3                                                                                                                                  Today marks a historic day for our country. It has been over 30 years since Congress has passed major tax reform and I’m excited to bring tax relief to millions of Coloradans. Read more here: https://www.gardner.senate.gov/newsroom/press-releases/gardner-statement-on-passage-of-historic-tax-relief-legislation
## 4                          This week, I’ll be voting to confirm Betsy DeVos as Secretary of Education. Power over education should be handed back down to the states, so that parents and teachers can choose how best to accomplish our universal goal of making sure every child has a good education. Ms. DeVos will lead this effort, so that parents, teachers and our local school districts will have a greater role in these important decisions.
## 5 I believe Betsy DeVos understands that decisions about education need to be made much closer to where kids are. I look forward to working with her to find ways to get those decisions back to local school boards, and moms and dads. As Chairman of the Appropriations Subcommittee for the Department of Education, I will be working hard to ensure that public education continues to be appreciated as the cornerstone of opportunity in America.
##   likes_count comments_count       date comment_like_ratio
## 1        1070          14974 2017-12-01                 13
## 2        1135          14298 2017-02-07                 12
## 3        1499          15271 2017-12-02                 10
## 4        3762          36849 2017-02-06                  9
## 5        1595          15167 2017-02-01                  9