Massive-scale datasets from web sources and social media, newly digitized text sources, and large longitudinal survey studies present exciting opportunities for the study of social and political behaviour, but at the same time their size and heterogeneity present significant challenges. This course will introduce participants to the new computational methods and tools required to explore and analyse Big Data in the social sciences using the R programming language. It will be structured around techniques to deal with the three V's of Big Data: volume, variety, and veracity. First, we will cover the basics of parallel programming and cloud computing to analyse large-scale datasets. Second, we will learn how to scale human tasks through the use of machine learning methods. Finally, we will discuss how to automatically discover insights from large text and network datasets and validate the output of this analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" in which participants apply the new methods.
Additional information and the schedule are available at the ECPR Summer School website.
There are two ways you can follow the course and run the code contained in this GitHub repository. The recommended method is to connect to the provided RStudio server where all the R packages have already been installed, and all the R code is available. To access the server, visit bigdata.pablobarbera.com and log in with the information provided during class.
If you're using your own laptop, you can download the course materials by clicking on each file in this repository, download the entire repository as a zip file, or "clone" it using the buttons found to the right side of your browser window as you view this repository. If you do not have a git client installed on your system, you will need to get one here first.
You can also subscribe to the repository if you have a GitHub account, which will send you updates each time new changes are pushed to the repository.
The course will begin with a discussion of the concept of “Big Data” and the research opportunities and challenges of the use of massive-scale datasets in the social sciences. The first session will also provide a foundation of R coding skills upon which we will rely during the rest of the course. Here, we will go over existing packages to efficiently analyze large-scale datasets in R, how to parallelize for loops, and how to read and write large files.
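The building blocks named above — parallelising loops and reading and writing large files — can be sketched in a few lines of R. This is a hedged illustration using the base `parallel` package and `data.table` (real packages, but the toy dataset is invented here; it is not course material):

```r
# Fast file I/O with data.table and a parallelised loop with 'parallel'.
library(parallel)
library(data.table)

# Simulate a "large" dataset (toy example).
dt <- data.table(id = 1:100000, x = rnorm(100000))

# fwrite()/fread() are much faster than write.csv()/read.csv().
tmp <- tempfile(fileext = ".csv")
fwrite(dt, tmp)
dt2 <- fread(tmp)

# mclapply() is a drop-in parallel replacement for lapply().
# (Forking is not available on Windows; use parLapply() with a cluster there.)
group_means <- mclapply(split(dt2$x, dt2$id %% 4), mean, mc.cores = 2)
```

The same pattern — split the data, apply a function to each chunk in parallel, combine the results — underlies most of the large-scale computations in this session.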
The second session will focus on the most common application of Big Data in the social sciences: large-scale text classification. After a quick overview of the basics of machine learning, we will discuss specific details of the implementation of supervised learning algorithms in massive-scale datasets, and in particular recently developed methods in computer science such as stochastic gradient descent, xgboost, and ensemble classifiers. Our emphasis will lie on the practical aspects: we will study these methods in the context of an application of sentiment analysis to newspaper articles, and will go through the entire research process — from the creation of a training dataset labeled by humans using crowd-sourcing platforms, through intermediate steps such as cleaning and preprocessing the corpus of documents, to the application and validation of the machine learning algorithm.
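As a small taste of the supervised learning workflow, the sketch below fits one of the methods named above — xgboost — to a tiny invented document-term matrix. The data and labels are hypothetical placeholders, not the newspaper corpus used in class:

```r
# Hedged sketch: binary text classification with the xgboost package.
library(xgboost)

# Toy document-term matrix: rows = documents, columns = term counts.
x <- matrix(c(2, 0, 1, 0,
              0, 3, 0, 1,
              1, 0, 2, 0,
              0, 2, 0, 2),
            nrow = 4, byrow = TRUE)
y <- c(1, 0, 1, 0)  # invented labels, e.g. 1 = positive sentiment

fit <- xgboost(data = x, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)
pred <- predict(fit, x)  # in-sample probabilities; always validate out of sample
```

In class, the validation step would rely on held-out labeled documents rather than in-sample predictions.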
Exploratory data analysis can be a powerful tool for social scientists when they are interested in analyzing a new dataset. The third session will cover the existing tools for large-scale discovery in “Big Data” using R, applied to textual datasets. We will start with different techniques, such as collocation analysis, TF-IDF feature weighting and word2vec, which will allow us to identify salient themes and ideas across documents. Then, we will move to topic models, which allow researchers to automatically identify latent classes of documents in a corpus, with an application to the classification of Facebook posts by politicians into relevant political issues. This session will also cover other dimensionality reduction techniques that are commonly used in the social sciences to visualize large-scale datasets.
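One of the techniques mentioned above, TF-IDF feature weighting, can be demonstrated in a few lines with the quanteda package. The four short "documents" below are invented for illustration only:

```r
# Hedged sketch: TF-IDF weighting of a tiny corpus with quanteda.
library(quanteda)

docs <- c("economy jobs growth",
          "immigration border policy",
          "economy taxes budget",
          "health care policy")

dfmat <- dfm(tokens(docs))   # document-feature matrix of raw counts
tfidf <- dfm_tfidf(dfmat)    # down-weight terms that appear in many documents

topfeatures(tfidf, 5)        # most distinctive features across the corpus
```

Terms that appear in only one document (such as "jobs") receive higher weights than terms shared across documents (such as "policy"), which is what makes TF-IDF useful for surfacing salient themes.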
In the fourth session we will turn our attention to social networks, and in particular the detection of communities of individuals with shared interests or political preferences. The running example in this part of the course will be the classification of Twitter users along a latent ideological dimension based on the structure of the networks in which they are embedded. A common theme to this session and the previous one will be the emphasis on validation: once an unsupervised model has been estimated, how can we measure the quality of the results? We will discuss basic concepts of measurement theory, and best practices in the validation of the results from unsupervised statistical models.
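Community detection of the kind described above can be sketched with the igraph package. Here the network is simulated — two densely connected groups with a few ties between them — rather than real Twitter data, and the Louvain algorithm stands in for whichever method is used in class:

```r
# Hedged sketch: community detection on a simulated two-group network.
library(igraph)

set.seed(123)
# Two "islands" of 10 nodes each, dense within and sparse between groups.
g <- sample_islands(islands.n = 2, islands.size = 10,
                    islands.pin = 0.8, n.inter = 2)

comm <- cluster_louvain(g)   # modularity-based community detection
membership(comm)             # recovered group assignment for each node
```

Because the true group structure is known in this simulation, comparing `membership(comm)` to it is a simple instance of the validation theme of this session.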
The course will conclude with an introduction to cloud computing and database management for social scientists. Most available online resources and courses on these topics assume students are proficient in UNIX or have a background in programming. Here, however, we will start from scratch and focus on the coding skills required to conduct statistical analyses with data hosted in the “cloud”, while at the same time helping participants become familiar with programming concepts that can facilitate future collaborations with computer scientists. We will cover the most important commands in UNIX – the language required to interact with High-Performance Clusters (HPC), for example, which are now available in most universities – and test our skills in an online virtual machine hosted on Amazon Elastic Compute Cloud (EC2). In the second half of this session, we will learn the basics of SQL, and run our own queries in a dataset with over a billion geolocated tweets hosted in Google BigQuery.
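Running queries against Google BigQuery requires credentials set up in class, so the sketch below illustrates the same kind of SQL query locally, using the DBI and RSQLite packages. The table and column names are invented for illustration:

```r
# Hedged sketch: basic SQL from R with DBI + RSQLite (an in-memory database
# standing in for the BigQuery tweet dataset used in class).
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "tweets",
             data.frame(country  = c("US", "US", "ES"),
                        retweets = c(5, 3, 7)))

# Aggregate tweets by country, as one might with the geolocated dataset.
res <- dbGetQuery(con,
  "SELECT country, COUNT(*) AS n, SUM(retweets) AS total_rt
   FROM tweets GROUP BY country")

dbDisconnect(con)
res
```

The same `SELECT ... GROUP BY` syntax carries over almost unchanged to BigQuery, which is what makes a local database a convenient place to practice.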