Workshop

Querying large-scale online datasets: SQL and Google BigQuery

London School of Economics, December 6th 2018

The volume and heterogeneity of the new datasets available in the digital age present unprecedented opportunities for social scientists, but also new methodological challenges. Computing a simple average for a variable across groups can take minutes when a researcher is working with government records, large-scale survey studies or social media datasets with millions of rows. The goal of this workshop is to learn how to overcome the challenges associated with massive-scale online databases. We will learn the basics of SQL, a language designed to query relational databases that is currently employed by most tech companies, and how to use it from R with the DBI package. Of all the available options for storing online databases, we will focus on BigQuery, which relies on Google’s infrastructure to efficiently store and query databases at scale. We will learn how to process, upload, and query databases of up to a billion rows in a matter of seconds, and how to export the results of our queries.
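To give a flavour of what this looks like in practice, here is a minimal sketch of a grouped average written in SQL and run from R through DBI. It is not taken from the workshop materials: the in-memory SQLite database (via the RSQLite package) is only a stand-in for BigQuery, and the `responses` table and its columns are hypothetical toy data.

```r
# Illustrative sketch only: an in-memory SQLite database stands in for BigQuery.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data: one row per survey respondent.
responses <- data.frame(
  country = c("UK", "UK", "ES", "ES", "ES"),
  age     = c(30, 40, 20, 25, 30)
)
dbWriteTable(con, "responses", responses)

# A grouped average expressed in SQL and sent to the database from R.
result <- dbGetQuery(con, "
  SELECT country, AVG(age) AS mean_age
  FROM responses
  GROUP BY country
  ORDER BY country
")
print(result)

dbDisconnect(con)
```

The same `dbGetQuery()` pattern should carry over to other DBI backends. Connecting to BigQuery itself would require a different driver (for instance the one provided by the bigrquery package) together with project credentials, which this sketch does not cover.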

The workshop follows a “learning-by-doing” approach, with short guided coding sessions followed by data challenges that will prompt participants to practice what they just learned.

Instructor

Pablo Barberá P.Barbera@lse.ac.uk @p_barbera

Schedule

Session 1: December 6, 2018, 10:00–12:00
Session 2: December 6, 2018, 14:00–16:00

Prerequisites

Familiarity with the R statistical programming language is desirable but not required. Students are expected to bring a laptop to class and follow along with the coding sections of each session.

Software

This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.

Installing R or RStudio prior to the workshop is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.

License and credit

Science should be open, and this course builds on other openly licensed material, so unless otherwise noted, all materials for this class are licensed under a Creative Commons Attribution 4.0 International License.

The layout for this website was designed by Jeffrey Arnold (thanks!).

The source for the materials of this course is on GitHub at pablobarbera/SQL-workshop

Feedback

If you have any feedback on the course or find any typos or errors on this website, go to issues, click on the “New Issue” button to create a new issue, and add your suggestion or describe the problem.