Social Media Research

EITM Europe Summer Institute

Collegio Carlo Alberto, Torino, July 5-8 2018

Citizens across the globe spend an increasing proportion of their daily lives on social media websites such as Twitter and Facebook. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data for studying long-standing questions about political and social behavior. At the same time, the volume and heterogeneity of social media data present unprecedented methodological challenges. The goal of this course is for participants to gain the skills necessary to automate the process of downloading, cleaning, and analyzing social media data using the R programming language for statistical computing.

We will follow a “learning-by-doing” approach, with short guided coding sessions followed by data challenges that will prompt participants to practice what they just learned. Given the applied nature of the course, there will be no required readings, but students are expected to complete and submit the data challenges before the beginning of the second and third sessions.


Pablo Barberá @p_barbera


July 5, 2018
Session 1: Social media research: opportunities and challenges. 10:30–13:00
Session 2: Scraping the web. 14:00–16:30

July 6, 2018
Session 3: Collecting data from social media. 9:30–12:00
Session 4: Topic discovery in social media datasets. 13:00–15:30

July 7, 2018
Session 5: Querying large-scale datasets using SQL. 9:30–12:00
Session 6: Big Data analysis using Google BigQuery. 13:00–15:30


The workshop assumes familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Students are expected to bring a laptop to class and follow along with the coding portion of each session.


This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.

Installing R or RStudio prior to the workshop is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.

License and credit

Science should be open, and this course builds on other openly licensed material. Unless otherwise noted, all materials for this class are licensed under a Creative Commons Attribution 4.0 International License.

The layout for this website was designed by Jeffrey Arnold (thanks!).

The source for the materials of this course is on GitHub at pablobarbera/social-media-upf.


If you have any feedback on the course or find any typos or errors on this website, go to issues, click on the “New Issue” button to create a new issue, and add your suggestion or describe the problem.