University of Southern California, Fall 2022
Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data for studying long-standing questions about political and social behavior. At the same time, the volume and heterogeneity of digital data present unprecedented methodological challenges. The goal of this course is to introduce students to the new computational social science methods and tools required to explore and harness the potential of digital trace data using the R programming language.
The course will follow a “learning-by-doing” approach and will place emphasis on gaining experience in analyzing data using R. Students are expected to do the required readings and coding exercises for each week. The lectures will build upon the content of the readings with a series of data challenges that will introduce new statistical and programming concepts, which will then be applied to the analysis of data from published research papers or common tasks in computational social science. Most of the applications will be related to Political Science and International Relations questions, but the course should be of interest to social science students more generally.
Instructor | Pablo Barberá | pbarbera@usc.edu | @p_barbera |
Class meetings | Wednesdays | 5:00–7:50 pm | CPA 256 |
Office hours | Mondays | 4:00–5:00 pm | Zoom - sign up here |
Office hours | Fridays | 9:00–10:00 am | Zoom - sign up here |
Office hours are also available at other times by appointment.
The course assumes prior knowledge of probability theory, hypothesis testing, regression analysis at the level of POIR 611 or equivalent, research design at the level of POIR 610 or equivalent, and familiarity with the R programming language. The exercises and problem sets will assume that you know how to read datasets in R, work with vectors and data frames, write simple functions and loops, and run basic statistical analyses, such as a t-test or a linear regression. More advanced knowledge of maximum likelihood, causal inference, and Bayesian statistics is helpful but not required.
Students are expected to bring a laptop to class and follow along with the coding section of each session. The USC Computing Center offers a Laptop Loaner Program for students who may not have access to one.
Each session will be divided into three blocks: an introduction to a computational social science tool or method, a discussion of recent research that uses that method, and a lab component that will demonstrate how to use it with R.
The first block, led by the instructor, will be based on a set of readings from textbooks and introductory manuals. This part will assume that all students have done the readings and thus we will move quickly and focus on the most complex aspects, how these methods are embedded in the broader computational social science literature, and strengths and weaknesses that perhaps were not obvious from the readings.
The second block will feature a critical discussion of recent publications or working papers that apply these methods. Two students will prepare a two-page referee report (due the night before the class) and a 12-minute presentation during which they will discuss their assessment of the paper. All students in the class are expected to at least have skimmed the papers and be ready to ask questions or engage in the discussion.
The third block of each class will be applied. The instructor will provide code and data that illustrate the application of a specific computational social science method, followed by a “coding challenge” where students will need to apply what they have learned to a new dataset. The code and output (pdf or html file) of the coding challenge need to be submitted via GitHub Classroom before the beginning of the following class.
Students are expected to attend every session, and do the assigned readings before each session. You should come to class with questions and ready to engage in a discussion about that week’s topic.
Each student is expected to write TWO referee reports and present them to the rest of the class. Each report should be 800-1000 words long and provide a critical evaluation of the strengths and weaknesses of the paper, with a particular emphasis on its methodological aspects. For some suggestions on how to write a good report, see for example this Special Issue on Peer Review in The Political Methodologist. The presentation in class should summarize the main points in the referee report, but also go beyond its content, for example by suggesting follow-up studies or other methods that could have been applied. The use of slides is recommended but not required. Each presentation should last no longer than 12 minutes, not including questions.
Note: Each referee report is due via email by 8pm the day before the class in which the paper will be discussed.
At the end of each class, the instructor will share a short problem set based on the content of that class. There will be some time during class to start working on this exercise and to address any questions, but most of it should be completed after class. Each of these coding challenges will be due before the beginning of the following class, when the solutions (or a sketch of the solution) will be posted. They will not be graded, but submitting at least FIVE of them is compulsory. A complete submission should be made via GitHub and include both the R script with the code written by the student (with comments that interpret the results and answer the questions) and a pdf/html file showing the output of compiling the code.
Late submissions will not be accepted.
Students are required to submit an original research paper that employs any of the methods introduced in the course. The goal of this exercise is to demonstrate that you have the ability to conduct research in computational social science. This research paper can be an individual or group project (up to 3 people). For more information and deadlines for each component of the project, go to the Project page.
This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R. Make sure you install R and RStudio before the first day of class.
Science should be open, and this course builds on other openly licensed material, so unless otherwise noted, all materials for this class are licensed under a Creative Commons Attribution 4.0 International License.
More specifically, some of the content of this course is based on materials kindly shared by Dan Cervone, Alex Hanna, Ken Benoit, Paul Nulty, Kevin Munger, Arthur Spirling, Katya Ognyanova, Rochelle Terman, Justin Grimmer, Karsten Donnay, Zoltan Fazekas, and Jonathan Kastellec, and it is probably inspired by many others who have publicly released their teaching materials. I have tried to give credit to the original author of each script, but if I forgot anyone, please feel free to reach out.
The layout for this website was designed by Jeffrey Arnold (thanks!).
The source for the materials of this course is on GitHub at pablobarbera/POIR613
If you have any feedback on the course or find any typos or errors on this website, go to issues, click on the “New Issue” button to create a new issue, and add your suggestion or describe the problem.