Big Data Analysis in the Social Sciences

ECPR Summer School in Methods and Techniques

Central European University, Budapest, August 6-10 2018

Massive-scale datasets from web sources and social media, newly digitized text sources, and large longitudinal survey studies present exciting opportunities for the study of social and political behaviour, but at the same time their size and heterogeneity present significant challenges. This course will introduce participants to new computational methods and tools required to explore and analyse Big Data in the social sciences using the R programming language. It will be structured around techniques to deal with the 3 V’s of Big Data: volume, variety, and veracity. First, we will cover the basics of parallel programming, database management, and cloud computing to analyse large-scale datasets. Second, we will learn how to scale human tasks through the use of machine learning methods. Finally, we will discuss how to automatically discover insights from large text and network datasets and validate the output of this analysis. The course will follow a “learning-by-doing” approach, with short theoretical sessions followed by “data challenges” where participants will need to apply new methods.

Instructors

Pablo Barberá (Instructor) P.Barbera@lse.ac.uk @p_barbera
Tom Paskhalis (Teaching Assistant) t.g.paskhalis@lse.ac.uk @tpaskhalis

Schedule

Monday August 6, 2018
Session 1, 14:00–15:30: Good coding practices in R
Session 2, 16:00–17:30: Parallel computing

Tuesday August 7, 2018
Session 1, 14:00–15:30: SQL for data manipulation
Session 2, 16:00–17:30: Large-scale data processing in the cloud

Wednesday August 8, 2018
Session 1, 14:00–15:30: Community detection in networks
Session 2, 16:00–17:30: Latent space network models

Thursday August 9, 2018
Session 1, 14:00–15:30: Supervised machine learning
Session 2, 16:00–17:30: Large-scale text classification

Friday August 10, 2018
Session 1, 14:00–15:30: Exploratory analysis of textual datasets
Session 2, 16:00–17:30: Topic models

Prerequisites

The course will assume intermediate familiarity with the R statistical programming language. Participants should know how to read datasets in R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Students are expected to bring a laptop to class and follow along with the coding section of each session.

Software

This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.

Installing R or RStudio prior to the workshop is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.

License and credit

Science should be open, and this course builds on other openly licensed material, so unless otherwise noted, all materials for this class are licensed under a Creative Commons Attribution 4.0 International License.

The layout for this website was designed by Jeffrey Arnold (thanks!).

The source for the materials of this course is on GitHub at pablobarbera/ECPR-SC105

Feedback

If you have any feedback on the course or find any typos or errors on this website, go to the issues page of the course's GitHub repository, click on the “New Issue” button to create a new issue, and add your suggestion or describe the problem.