To illustrate the potential of SQL queries with large-scale databases, we’ll now work with a database hosted in BigQuery that I have used for some of my research projects.
The database contains information related to a random sample of 200,000 users in the US, labeled with user-level information predicted from the voter file or other data sources. The database contains two tables, one at the tweet level and another one at the user level.
We have used different versions of this database to, for example, study the prevalence of hateful and intolerant speech on Twitter (in our 2021 QJPS article) or to study segregation in information diffusion (in our 2015 Psychological Science article).
# setting display options
options(tibble.width = Inf)
# loading packages
library(DBI)
library(bigrquery)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# loading authentication token
# (code assumes it's in your working directory)
bq_auth(path="usc-poir613-key.json")
# connecting to database
db <- dbConnect(
drv=bigquery(),
project='usc-barbera',
dataset='twitter_panel'
)
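Before running any queries, it can help to check that the connection actually works and see which tables the dataset exposes. A minimal sketch, assuming the authentication above succeeded:

```r
# list the tables available in this BigQuery dataset
dbListTables(db)

# run a trivial query to confirm the connection is live
dbGetQuery(db, 'SELECT 1 AS test')
```

Both functions come from the DBI package, so they work the same way with any DBI-compliant backend, not just bigrquery.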
The tweet-level table (tweets_20170721) contains 388 million tweets and five variables: id_str (tweet ID), user_id_str (user ID), datetime (time of the tweet in BigQuery datetime format), date (date of the tweet in string format), and text (text of the tweet).
# count rows
dbGetQuery(db, 'SELECT COUNT(*) AS tweet_count FROM tweets_20170721')
## # A tibble: 1 × 1
## tweet_count
## <int>
## 1 388766024
# get names of columns
dbListFields(db, 'tweets_20170721')
## [1] "id_str" "user_id_str" "datetime" "date" "text"
# show first 2 rows
dbGetQuery(db, 'SELECT * FROM tweets_20170721 LIMIT 2')
## # A tibble: 2 × 5
## id_str user_id_str datetime date
## <chr> <chr> <dttm> <chr>
## 1 550441451294044162 42625966 2015-01-01 00:00:43 2015-01-01
## 2 550441809923817472 73674769 2015-01-01 00:02:08 2015-01-01
## text
## <chr>
## 1 I just wanna hold my niece <U+0001F629>
## 2 http://t.co/2nkqfejmsS
The user-level table has information about around 200,000 users. For each of them, the following variables are available: user_id_str (user ID), male (TRUE if the user is predicted to be male), predicted age (with values 1 = 18-25, 2 = 26-40, 3 = 40+), predicted party (with values 1 = None/Others, 2 = Democrat, 3 = Republican), predicted race (with values 1 = Black, 2 = Hispanic, 3 = Asian, 4 = White), predicted turnout in 2012, name of user, description of user, followers_count, statuses_count, friends_count, the location field in the user's profile, and lang (account language).
More information about how this dataset was collected is available in this working paper.
# count rows
dbGetQuery(db, 'SELECT COUNT(*) AS user_count FROM users_subset_20160718')
# get names of columns
dbListFields(db, 'users_subset_20160718')
# show first 2 rows
dbGetQuery(db, 'SELECT * FROM users_subset_20160718 LIMIT 2')
Now let’s run a few sample queries. First, let’s count the number of tweets by month. There are different ways we could do this, but a very useful function is DATE_TRUNC, which truncates any timestamp to the desired level of granularity:
df <- dbGetQuery(db,
'SELECT
DATE_TRUNC(datetime, MONTH) AS month,
COUNT(*) AS tweet_count
FROM twitter_panel.tweets_20170721
GROUP BY month
ORDER BY month')
ggplot(df, aes(x=month, y=tweet_count)) +
geom_point() + geom_line() +
theme_bw()
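MONTH is only one of the granularities DATE_TRUNC accepts; BigQuery also supports parts such as DAY, WEEK, QUARTER, and YEAR. As a sketch, the same aggregation at the weekly level would look like this (assuming the same table and connection as above):

```r
# same query as before, but truncating timestamps to the week level
df_week <- dbGetQuery(db,
  'SELECT
      DATE_TRUNC(datetime, WEEK) AS week,
      COUNT(*) AS tweet_count
  FROM twitter_panel.tweets_20170721
  GROUP BY week
  ORDER BY week')
```

A finer granularity gives a noisier but more detailed time series; which one to use depends on the question.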
Now, let’s switch to the user-level table and look at the distribution of age by party.
df <- dbGetQuery(db,
'SELECT
age,
party,
COUNT(*) AS user_count
FROM twitter_panel.users_subset_20160718
GROUP BY age, party
ORDER BY age, party')
df <- df %>%
mutate(party = factor(party, labels=c("Others", "Dem", "Rep")),
age = factor(age, labels=c("18-25", "26-40", "40+"))) %>%
group_by(age) %>%
mutate(user_prop = user_count / sum(user_count)) %>%
ungroup()
ggplot(df, aes(x=age, y=user_prop, color=party, fill=party)) +
geom_col() + facet_wrap(~party) +
scale_fill_manual(values=c("darkgreen", "blue", "red")) +
scale_color_manual(values=c("darkgreen", "blue", "red")) +
scale_y_continuous("% of users",
labels=scales::percent_format(accuracy=1)) +
theme_bw()
We can also use LIKE to search for tweets based on their text.
dbGetQuery(db,
"SELECT COUNT(*) AS tweet_count
FROM twitter_panel.tweets_20170721
WHERE lower(text) LIKE '%obama%'")
## # A tibble: 1 × 1
## tweet_count
## <int>
## 1 967995
dbGetQuery(db,
"SELECT COUNT(*) AS tweet_count
FROM twitter_panel.tweets_20170721
WHERE lower(text) LIKE '%trump%'")
## # A tibble: 1 × 1
## tweet_count
## <int>
## 1 4067171
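Rather than issuing one query per keyword, both counts could be computed in a single scan of the table with COUNTIF, which counts only the rows where a condition is true. A sketch:

```r
# count mentions of both keywords in one pass over the table
dbGetQuery(db,
  "SELECT
      COUNTIF(LOWER(text) LIKE '%obama%') AS obama_count,
      COUNTIF(LOWER(text) LIKE '%trump%') AS trump_count
  FROM twitter_panel.tweets_20170721")
```

Since BigQuery bills by the amount of data scanned, combining conditions into one query like this is also cheaper than running two separate ones.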
Let’s look at another example with the user-level table. As a validation of the party predictions, let’s look at mentions of key political words in users’ profiles across parties. Note that instead of LIKE we can use regular expressions directly, which give us a bit more flexibility.
dbGetQuery(db,
"SELECT party, COUNT(*) AS user_count
FROM twitter_panel.users_subset_20160718
WHERE REGEXP_CONTAINS(
LOWER(description),
'conservative|republican' )
GROUP BY party
ORDER BY party")
## # A tibble: 3 × 2
## party user_count
## <int> <int>
## 1 1 85
## 2 2 49
## 3 3 139
One potential issue here: the number of users across groups may also vary, so what we actually need to compute is the proportion of users within each party that mention these keywords in their profile. How can we do this? One option is to do it manually after collecting the data:
(df <- dbGetQuery(db,
"SELECT party,
COUNTIF(
REGEXP_CONTAINS(LOWER(description),
'conservative|republican')) AS user_count,
COUNT(DISTINCT(user_id_str)) AS total_users
FROM twitter_panel.users_subset_20160718
GROUP BY party
ORDER BY party"))
## # A tibble: 3 × 3
## party user_count total_users
## <int> <int> <int>
## 1 1 85 79165
## 2 2 49 73920
## 3 3 139 48652
(df$user_count / df$total_users) * 100
## [1] 0.10737068 0.06628788 0.28570254
But we can also do it in one step using a CTE:
dbGetQuery(db,
"WITH counts AS (
SELECT
party,
COUNTIF(
REGEXP_CONTAINS(LOWER(description),
'conservative|republican')) AS user_count,
COUNT(DISTINCT(user_id_str)) AS total_users
FROM twitter_panel.users_subset_20160718
GROUP BY party
) SELECT
party,
user_count,
total_users,
100 * (user_count / total_users) AS proportion
FROM counts
ORDER BY party")
## # A tibble: 3 × 4
## party user_count total_users proportion
## <int> <int> <int> <dbl>
## 1 1 85 79165 0.107
## 2 2 49 73920 0.0663
## 3 3 139 48652 0.286
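The CTE is not strictly required here: since BigQuery lets us combine aggregate functions inside arithmetic expressions, the proportion can also be computed directly in a single SELECT. A sketch of that one-step version:

```r
# compute the proportion inside the aggregation itself, no CTE needed
dbGetQuery(db,
  "SELECT
      party,
      100 * COUNTIF(REGEXP_CONTAINS(LOWER(description),
                                    'conservative|republican'))
          / COUNT(DISTINCT user_id_str) AS proportion
  FROM twitter_panel.users_subset_20160718
  GROUP BY party
  ORDER BY party")
```

The CTE version is still worth knowing: it keeps the intermediate counts visible, which makes longer queries easier to debug.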
We can also compute counts at the date level to obtain a plot showing the evolution over time in the number of tweets that mention, for example, the word happy:
df <- dbGetQuery(db,
"SELECT date, COUNT(*) AS happy_count
FROM twitter_panel.tweets_20170721
WHERE lower(text) LIKE '%happy%'
GROUP BY date")
df <- df[order(df$date),]
plot(as.Date(df$date), df$happy_count, type="l")
Which are the happiest days of the year? Here are two ways of finding out:
head(df[order(df$happy_count, decreasing = TRUE),])
## # A tibble: 6 × 2
## date happy_count
## <chr> <int>
## 1 2016-01-01 37044
## 2 2015-01-01 27600
## 3 2015-11-26 25590
## 4 2017-01-01 24376
## 5 2016-02-14 22021
## 6 2016-05-08 20952
dbGetQuery(db,
"SELECT date, COUNT(*) AS happy_count
FROM twitter_panel.tweets_20170721
WHERE lower(text) LIKE '%happy%'
GROUP BY date
ORDER BY happy_count DESC
LIMIT 10")
## # A tibble: 10 × 2
## date happy_count
## <chr> <int>
## 1 2016-01-01 37044
## 2 2015-01-01 27600
## 3 2015-11-26 25590
## 4 2017-01-01 24376
## 5 2016-02-14 22021
## 6 2016-05-08 20952
## 7 2015-12-31 19279
## 8 2016-07-04 18973
## 9 2016-03-27 18951
## 10 2016-11-24 18496
How did the popularity of the #BlackLivesMatter hashtag evolve over time?
df <- dbGetQuery(db,
"SELECT date, COUNT(*) AS blm_count
FROM twitter_panel.tweets_20170721
WHERE lower(text) LIKE '%blacklivesmatter%'
GROUP BY date")
df <- df[order(df$date),]
plot(as.Date(df$date), df$blm_count, type="l")
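Daily counts like these can be quite noisy. Once the data frame is in R, a simple rolling average smooths the series; a sketch assuming df from the query above (the 7-day window is an arbitrary choice):

```r
# 7-day centered moving average using base R's stats::filter()
df$blm_smooth <- as.numeric(
  stats::filter(df$blm_count, rep(1 / 7, 7), sides = 2))

# the smoothed series; edge days are NA and are simply not drawn
plot(as.Date(df$date), df$blm_smooth, type = "l")
```

A weekly window also removes day-of-week seasonality (e.g. systematically lower weekend volume), which is often what obscures the trend.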
And in combination with JOIN, it becomes easy to merge tables to, for example, count the number of tweets mentioning a specific keyword broken down by user-level characteristics:
dbGetQuery(db,
"SELECT users.party,
COUNT(*) AS tweet_count
FROM twitter_panel.users_subset_20160718 AS users
JOIN twitter_panel.tweets_20170721 AS tweets
ON tweets.user_id_str = users.user_id_str
WHERE lower(text) LIKE '%immigra%'
GROUP BY users.party
ORDER BY users.party")
## # A tibble: 3 × 2
## party tweet_count
## <int> <int>
## 1 1 28273
## 2 2 34074
## 3 3 24264
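As with the profile keywords, these raw counts can be misleading if total tweet volume differs across parties. A sketch that normalizes by each party's total number of tweets, reusing the same JOIN inside a CTE:

```r
# share of each party's tweets that mention the keyword
dbGetQuery(db,
  "WITH merged AS (
      SELECT users.party,
             LOWER(tweets.text) LIKE '%immigra%' AS mentions
      FROM twitter_panel.users_subset_20160718 AS users
      JOIN twitter_panel.tweets_20170721 AS tweets
      ON tweets.user_id_str = users.user_id_str
  ) SELECT
      party,
      COUNTIF(mentions) AS tweet_count,
      100 * COUNTIF(mentions) / COUNT(*) AS pct_of_party_tweets
  FROM merged
  GROUP BY party
  ORDER BY party")
```

This is the same normalization logic we applied to the profile descriptions earlier, just pushed through a join first.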