Automated Collection of Web and Social Data

ECPR Summer School in Methods and Techniques

Central European University, Budapest, July 30–August 3, 2018

A vast and growing wealth of data is freely available on the web: election results, legislative speeches, social media posts, newspaper articles, and press releases, among many other examples. Although this data is easily accessible, in most cases it comes in an unstructured format, which makes it challenging to analyze. The goal of this course is for participants to gain the skills necessary to automate the process of downloading, cleaning, and reshaping web and social data using the R programming language for statistical computing. We will cover the most common scenarios: scraping data spread across multiple pages or hidden behind web forms, interacting with APIs and RSS feeds such as those provided by most media outlets, collecting data from Twitter, extracting text and table data from PDF files, and manipulating datasets into a format ready for analysis.

We will follow a “learning-by-doing” approach, with short guided coding sessions followed by data challenges that will prompt participants to practice what they just learned.

Instructor and teaching assistants

Pablo Barberá (Instructor)	P.Barbera@lse.ac.uk	@p_barbera
Tom Paskhalis (Teaching Assistant)	t.g.paskhalis@lse.ac.uk	@tpaskhalis
Alberto Stefanelli (Teaching Assistant)	alberto.stefanelli.main@gmail.com	@sergsagara

Schedule

Monday July 30, 2018	Session 1	Basics of web scraping.	14:00–15:30
	Session 2	Scraping web data in table format.	16:00–17:30
Tuesday July 31, 2018	Session 1	Scraping web data in unstructured format.	14:00–15:30
	Session 2	Scraping data behind web forms with Selenium. Regular expressions. Basics of text analysis.	16:00–17:30
Wednesday August 1, 2018	Session 1	Extracting media text from newspaper articles using RSS feeds.	14:00–15:30
	Session 2	Interacting with web APIs.	16:00–17:30
Thursday August 2, 2018	Session 1	Collecting data from Twitter’s Streaming API.	14:00–15:30
	Session 2	Collecting data from Twitter’s REST API.	16:00–17:30
Friday August 3, 2018	Session 1	Extracting data from PDF files.	14:00–15:30
	Session 2	Dealing with encoding issues. Exception handling.	16:00–17:30

Prerequisites

The course will assume intermediate familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Students are expected to bring a laptop to class and follow along with the coding section of each session.

Software

This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.

Installing R or RStudio prior to the workshop is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.

License and credit

Science should be open, and this course builds on other openly licensed material, so unless otherwise noted, all materials for this class are licensed under a Creative Commons Attribution 4.0 International License.

The layout for this website was designed by Jeffrey Arnold (thanks!).

The source for the materials of this course is on GitHub at pablobarbera/ECPR-SC104

Feedback

If you have any feedback on the course or find any typos or errors on this website, go to the issues page, click on the “New Issue” button to create a new issue, and add your suggestion or describe the problem.