To illustrate the potential of SQL queries with large-scale databases, we’ll now work with a database hosted in BigQuery that I have used for some of my research projects.

Overview of the database

The database contains information related to a random sample of 200,000 Twitter users in the US, labeled with user-level attributes predicted from the voter file and other data sources. It consists of two tables: one at the tweet level and one at the user level.

We have used different versions of this database, for example, to study the prevalence of hateful and intolerant speech on Twitter (in our 2021 QJPS article) and to study segregation in information diffusion (in our 2015 Psychological Science article).

# setting display options
options(tibble.width = Inf)
# loading packages
library(DBI)
library(bigrquery)
library(dplyr)
library(ggplot2)
# loading authentication token
# (code assumes it's in your working directory)
bq_auth(path="usc-poir613-key.json")
# connecting to database
db <- dbConnect(
  drv=bigquery(),
  project='usc-barbera',
  dataset='twitter_panel'
)
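
As a quick check that the connection works, we can list the tables available in this dataset with DBI’s dbListTables(); the two table names it returns are the ones used throughout the rest of this section:

# sanity check: list the tables in the twitter_panel dataset
dbListTables(db)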

The tweet-level table (tweets_20170721) contains 388 million tweets and five variables: id_str (tweet ID), user_id_str (user ID), datetime (time of the tweet in BigQuery datetime format), date (date of the tweet in string format), and text (text of the tweet).

# count rows
dbGetQuery(db, 'SELECT COUNT(*) AS tweet_count FROM tweets_20170721')
## # A tibble: 1 × 1
##   tweet_count
##         <int>
## 1   388766024
# get names of columns
dbListFields(db, 'tweets_20170721')
## [1] "id_str"      "user_id_str" "datetime"    "date"        "text"
# show first 2 rows
dbGetQuery(db, 'SELECT * FROM tweets_20170721 LIMIT 2')
## # A tibble: 2 × 5
##   id_str             user_id_str datetime            date      
##   <chr>              <chr>       <dttm>              <chr>     
## 1 550441451294044162 42625966    2015-01-01 00:00:43 2015-01-01
## 2 550441809923817472 73674769    2015-01-01 00:02:08 2015-01-01
##   text                                   
##   <chr>                                  
## 1 I just wanna hold my niece 😩
## 2 http://t.co/2nkqfejmsS

The user-level table (users_subset_20160718) has information about around 200,000 users. For each of them, the following variables are available: user_id_str (user ID), male (TRUE if the user is predicted to be male), predicted age (with values 1 = 18-25, 2 = 26-40, 3 = 40+), predicted party (with values 1 = None/Others, 2 = Democrat, 3 = Republican), predicted race (with values 1 = Black, 2 = Hispanic, 3 = Asian, 4 = White), predicted turnout in 2012, the user’s name and profile description, followers_count, statuses_count, friends_count, the location field in the user’s profile, and lang (account language).

More information about how this dataset was collected is available in this working paper.

# count rows
dbGetQuery(db, 'SELECT COUNT(*) AS user_count FROM users_subset_20160718')
# get names of columns
dbListFields(db, 'users_subset_20160718')
# show first 2 rows
dbGetQuery(db, 'SELECT * FROM users_subset_20160718 LIMIT 2')

Descriptive statistics

Now let’s run a few sample queries. First, let’s count the number of tweets by month. There are different ways we could do this, but one very useful function is DATE_TRUNC, which truncates any timestamp to the desired level of granularity:

df <- dbGetQuery(db, 
  'SELECT
    DATE_TRUNC(datetime, MONTH) AS month,
    COUNT(*) AS tweet_count
  FROM twitter_panel.tweets_20170721
  GROUP BY month
  ORDER BY month')
ggplot(df, aes(x=month, y=tweet_count)) +
  geom_point() + geom_line() +
  theme_bw()
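
The same approach works at other granularities (DATE_TRUNC also accepts DAY, WEEK, YEAR, and so on), and EXTRACT pulls out individual date components. As a sketch, assuming standard SQL’s EXTRACT is available on this table, we could count tweets by day of the week (in BigQuery, DAYOFWEEK runs from 1 = Sunday to 7 = Saturday):

# count tweets by day of the week instead of by month
df <- dbGetQuery(db,
  'SELECT
    EXTRACT(DAYOFWEEK FROM datetime) AS day_of_week,
    COUNT(*) AS tweet_count
  FROM twitter_panel.tweets_20170721
  GROUP BY day_of_week
  ORDER BY day_of_week')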

Now, let’s switch to the user-level table and look at the distribution of age by party.

df <- dbGetQuery(db, 
  'SELECT
    age,
    party,
    COUNT(*) AS user_count
  FROM twitter_panel.users_subset_20160718
  GROUP BY age, party
  ORDER BY age, party')
df <- df %>%
  mutate(party = factor(party, labels=c("Others", "Dem", "Rep")),
         age = factor(age, labels=c("18-25", "26-40", "40+"))) %>%
  group_by(age) %>%
  mutate(user_prop = user_count / sum(user_count)) %>%
  ungroup()

ggplot(df, aes(x=age, y=user_prop, color=party, fill=party)) +
  geom_col() + facet_wrap(~party) +
  scale_fill_manual(values=c("darkgreen", "blue", "red")) +
  scale_color_manual(values=c("darkgreen", "blue", "red")) +
  scale_y_continuous("% of users", 
                     labels=scales::percent_format(accuracy=1)) +
  theme_bw()
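
We could also push the proportion calculation into SQL itself with a window (analytic) function, so the normalization happens within each age group before the data reach R. A minimal sketch, assuming BigQuery’s analytic functions over aggregates:

# proportion of users in each party, computed within each age group
df <- dbGetQuery(db,
  'SELECT
    age,
    party,
    COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY age) AS user_prop
  FROM twitter_panel.users_subset_20160718
  GROUP BY age, party
  ORDER BY age, party')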

Searching within text columns

We can also use the LIKE operator to search for tweets based on their text. String comparisons in BigQuery are case-sensitive, which is why we first convert the text to lowercase:

dbGetQuery(db, 
  "SELECT COUNT(*) AS tweet_count
  FROM twitter_panel.tweets_20170721
  WHERE lower(text) LIKE '%obama%'")
## # A tibble: 1 × 1
##   tweet_count
##         <int>
## 1      967995
dbGetQuery(db, 
  "SELECT COUNT(*) AS tweet_count
  FROM twitter_panel.tweets_20170721
  WHERE lower(text) LIKE '%trump%'")
## # A tibble: 1 × 1
##   tweet_count
##         <int>
## 1     4067171

Let’s look at another example with the user-level table. As a validation of the party predictions, let’s look at mentions of key political words in users’ profiles across parties. Note that instead of LIKE we can use regular expressions directly, which gives us a bit more flexibility.

dbGetQuery(db, 
  "SELECT party, COUNT(*) AS user_count
  FROM twitter_panel.users_subset_20160718
  WHERE REGEXP_CONTAINS(
    LOWER(description), 
    'conservative|republican' )
  GROUP BY party
  ORDER BY party")
## # A tibble: 3 × 2
##   party user_count
##   <int>      <int>
## 1     1         85
## 2     2         49
## 3     3        139

But there is one potential issue here: the number of users varies across groups, so what we actually need to compute is the proportion of users in each party who mention these keywords in their profile. How can we do this? One option is to do it manually after collecting the data:

(df <- dbGetQuery(db, 
  "SELECT party, 
      COUNTIF(
        REGEXP_CONTAINS(LOWER(description),
        'conservative|republican')) AS user_count, 
    COUNT(DISTINCT user_id_str) AS total_users
  FROM twitter_panel.users_subset_20160718
  GROUP BY party
  ORDER BY party"))
## # A tibble: 3 × 3
##   party user_count total_users
##   <int>      <int>       <int>
## 1     1         85       79165
## 2     2         49       73920
## 3     3        139       48652
(df$user_count / df$total_users) * 100
## [1] 0.10737068 0.06628788 0.28570254

But we can also do it in one step using a CTE (common table expression):

dbGetQuery(db,
  "WITH counts AS (
    SELECT 
      party, 
      COUNTIF(
        REGEXP_CONTAINS(LOWER(description),
        'conservative|republican')) AS user_count, 
      COUNT(DISTINCT user_id_str) AS total_users
    FROM twitter_panel.users_subset_20160718
    GROUP BY party
  ) SELECT
    party,
    user_count,
    total_users,
    100 * (user_count / total_users) AS proportion
  FROM counts
  ORDER BY party")
## # A tibble: 3 × 4
##   party user_count total_users proportion
##   <int>      <int>       <int>      <dbl>
## 1     1         85       79165     0.107 
## 2     2         49       73920     0.0663
## 3     3        139       48652     0.286
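
Alternatively, since a proportion is just the mean of a 0/1 indicator, we can skip the CTE entirely. A minimal sketch using AVG over an IF indicator; note this assumes each user appears only once in the table, so that rows and distinct users coincide:

# proportion as the mean of a 0/1 indicator (assumes one row per user)
dbGetQuery(db,
  "SELECT
    party,
    100 * AVG(IF(REGEXP_CONTAINS(LOWER(description),
      'conservative|republican'), 1, 0)) AS proportion
  FROM twitter_panel.users_subset_20160718
  GROUP BY party
  ORDER BY party")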

We can also compute counts at the date level to obtain a plot showing the evolution over time in the number of tweets that mention a given word, e.g. happy.

df <- dbGetQuery(db,
  "SELECT date, COUNT(*) AS happy_count
  FROM twitter_panel.tweets_20170721
  WHERE lower(text) LIKE '%happy%'
  GROUP BY date")

df <- df[order(df$date),]
plot(as.Date(df$date), df$happy_count, type="l")
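
Since ggplot2 is already loaded, the same plot can be drawn with it as well; a minimal equivalent:

# same time series with ggplot2 (date column is a string, so convert it)
ggplot(df, aes(x=as.Date(date), y=happy_count)) +
  geom_line() +
  theme_bw()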

Which are the happiest days of the year? Here are two ways of finding out: sorting the data frame in R, or sorting directly in SQL with ORDER BY and LIMIT.

head(df[order(df$happy_count, decreasing = TRUE),])
## # A tibble: 6 × 2
##   date       happy_count
##   <chr>            <int>
## 1 2016-01-01       37044
## 2 2015-01-01       27600
## 3 2015-11-26       25590
## 4 2017-01-01       24376
## 5 2016-02-14       22021
## 6 2016-05-08       20952
dbGetQuery(db,
  "SELECT date, COUNT(*) AS happy_count
  FROM twitter_panel.tweets_20170721
  WHERE lower(text) LIKE '%happy%'
  GROUP BY date
  ORDER BY happy_count DESC
  LIMIT 10")
## # A tibble: 10 × 2
##    date       happy_count
##    <chr>            <int>
##  1 2016-01-01       37044
##  2 2015-01-01       27600
##  3 2015-11-26       25590
##  4 2017-01-01       24376
##  5 2016-02-14       22021
##  6 2016-05-08       20952
##  7 2015-12-31       19279
##  8 2016-07-04       18973
##  9 2016-03-27       18951
## 10 2016-11-24       18496

How did the popularity of the #BlackLivesMatter hashtag evolve over time?

df <- dbGetQuery(db,
  "SELECT date, COUNT(*) AS blm_count
  FROM twitter_panel.tweets_20170721
  WHERE lower(text) LIKE '%blacklivesmatter%'
  GROUP BY date")

df <- df[order(df$date),]
plot(as.Date(df$date), df$blm_count, type="l")
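
The daily series is fairly noisy. One way to smooth it is a trailing 7-day moving average; a sketch using stats::filter, which needs no extra packages (the stats:: prefix matters because dplyr masks filter). This assumes every date appears in the query result, i.e. there are no zero-count gaps in the series:

# trailing 7-day moving average (the first 6 values will be NA)
df$blm_smooth <- as.numeric(stats::filter(df$blm_count, rep(1/7, 7), sides=1))
plot(as.Date(df$date), df$blm_smooth, type="l")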

JOINs at scale

And in combination with JOIN, it becomes easy to merge tables and, for example, count the number of tweets mentioning a specific keyword broken down by user-level characteristics:

dbGetQuery(db,
  "SELECT users.party, 
    COUNT(*) AS tweet_count
  FROM twitter_panel.users_subset_20160718 AS users
  JOIN twitter_panel.tweets_20170721 AS tweets
  ON tweets.user_id_str = users.user_id_str
  WHERE lower(text) LIKE '%immigra%'
  GROUP BY users.party
  ORDER BY users.party")
## # A tibble: 3 × 2
##   party tweet_count
##   <int>       <int>
## 1     1       28273
## 2     2       34074
## 3     3       24264
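
As with the profile keywords above, these raw counts are hard to compare because parties differ in size and tweet volume. A sketch combining the JOIN with a CTE to compute the share of each party’s tweets that mention the keyword (note that this query scans the full tweet table):

# share of each party's tweets mentioning the keyword
dbGetQuery(db,
  "WITH counts AS (
    SELECT
      users.party,
      COUNTIF(LOWER(text) LIKE '%immigra%') AS immig_count,
      COUNT(*) AS total_tweets
    FROM twitter_panel.users_subset_20160718 AS users
    JOIN twitter_panel.tweets_20170721 AS tweets
    ON tweets.user_id_str = users.user_id_str
    GROUP BY users.party
  ) SELECT
    party,
    immig_count,
    total_tweets,
    100 * (immig_count / total_tweets) AS proportion
  FROM counts
  ORDER BY party")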