This R Markdown script describes some of the building blocks we will need to scrape and clean data from the web in this course. If you’re having trouble following the guided coding part of the class, I recommend you run the following code and make sure you understand each of the concepts introduced here.

Data types

R has many data types, but the most common ones we’ll use are:

  1. numeric: 1.1, 3, 317, Inf
  2. logical: TRUE or FALSE
  3. character: "this is a character", "hello world!"
  4. factor: Democrat, Republican, Socialist, …

A small trick regarding logical values is that they correspond to 1 and 0. This will come in handy for counting the number of TRUE values in a vector.

x <- c(TRUE, TRUE, FALSE)
x * 2
## [1] 2 2 0
sum(x)
## [1] 2
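
The same trick applies to any comparison, since a comparison returns a logical vector. The following is a minimal sketch; the vector scores and the threshold of 70 are made up for illustration.

```r
# hypothetical scores, for illustration only
scores <- c(80, 75, 91, 67, 56)
passed <- scores >= 70        # logical vector: TRUE TRUE TRUE FALSE FALSE
n_passed <- sum(passed)       # counts the TRUE values
n_passed
## [1] 3
share_passed <- mean(passed)  # proportion of TRUE values
share_passed
## [1] 0.6
```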

There are a few special values: NA, which denotes a missing value, and NaN, which means Not a number. The values Inf and -Inf are considered numeric. NULL denotes a value that is undefined.

0 / 0 # NaN
## [1] NaN
1 / 0 # Inf
## [1] Inf
x <- c(1, NA, 0)
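
Missing values propagate through most computations, so it helps to know how to detect and skip them. A short sketch, using a vector named y (an illustrative name) so that x is left untouched:

```r
y <- c(1, NA, 0)        # same values as x, under a different name
is.na(y)                # which entries are missing?
## [1] FALSE  TRUE FALSE
mean(y)                 # NA propagates into the result
## [1] NA
mean(y, na.rm = TRUE)   # drop missing values before averaging
## [1] 0.5
is.null(NULL)           # NULL is tested with is.null, not is.na
## [1] TRUE
length(NULL)            # NULL has length zero
## [1] 0
```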

Probably one of the most useful functions in R is str. It displays the internal structure of an object.

str(x)
##  num [1:3] 1 NA 0

Of course you can always print the object in the console:

print(x)
## [1]  1 NA  0

Note that print here is a function: it takes a series of arguments (in this case, the object x) and returns a value (here, x itself, returned invisibly).

This is equivalent to just typing the name of the object in the console. (Behind the scenes, R calls the default function for printing this object, which in this case is just print.)

x
## [1]  1 NA  0
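
One way to check what print gives back is to capture its return value. A small sketch with an illustrative vector v:

```r
v <- c(1, NA, 0)
returned <- print(v)     # prints v and also hands it back
## [1]  1 NA  0
identical(returned, v)   # the returned value is the object itself
## [1] TRUE
```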

Data structures

Building off of the data types we’ve learned, data structures combine multiple values into a single object. Some common data structures in R include:

  1. vectors: sequences of values of a certain type
  2. data frames: tables of vectors, all of the same length
  3. lists: collections of objects of different types

Vectors

We’ve already seen vectors created by combining multiple values with the c command:

student_names <- c("Bill", "Jane", "Sarah", "Fred", "Paul")
math_scores <- c(80, 75, 91, 67, 56)
verbal_scores <- c(72, 90, 99, 60, 68)

There are shortcuts for creating vectors with certain structures, for instance:

nums1 <- 1:100
nums2 <- seq(-10, 100, by=5) # -10, -5, 0, ..., 100
nums3 <- seq(-10, 100, length.out=467) # 467 equally spaced numbers between -10 and 100

Notice that we used seq to generate both nums2 and nums3 (nums1 was created with the : operator). The different behavior is controlled by which arguments (e.g. by, length.out) are supplied to the function seq.
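
To see how the by and length.out arguments change the result, here is a short sketch on smaller sequences. The names evens, grid_points and idx are illustrative.

```r
evens <- seq(2, 20, by = 2)               # fix the step size
length(evens)
## [1] 10
grid_points <- seq(0, 1, length.out = 5)  # fix the number of points
grid_points
## [1] 0.00 0.25 0.50 0.75 1.00
idx <- seq_len(3)                         # shortcut for 1, 2, 3
idx
## [1] 1 2 3
```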

With vectors we can carry out some of the most fundamental tasks in data analysis, such as descriptive statistics

mean(math_scores)
## [1] 73.8
min(math_scores - verbal_scores)
## [1] -15
summary(verbal_scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    60.0    68.0    72.0    77.8    90.0    99.0

and plots.

plot(x=math_scores, y=verbal_scores)
text(x=math_scores, y=verbal_scores, labels=student_names)

It’s easy to pull out specific entries in a vector using []. For example,

math_scores[3]
## [1] 91
math_scores[1:3]
## [1] 80 75 91
math_scores[-(4:5)]
## [1] 80 75 91
math_scores[which(verbal_scores >= 90)]
## [1] 75 91
math_scores[3] <- 92
math_scores
## [1] 80 75 92 67 56
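
A logical vector can also be used directly inside [], without which. A minimal sketch on a separate vector named scores (an illustrative name), so that math_scores is not modified:

```r
scores <- c(80, 75, 92, 67, 56)
high <- scores[scores >= 75]   # keep entries where the condition is TRUE
high
## [1] 80 75 92
which.max(scores)              # position of the largest value
## [1] 3
scores[scores < 60] <- NA      # replace the selected entries in place
scores
## [1] 80 75 92 67 NA
```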

Data frames

Data frames allow us to combine many vectors of the same length into a single object.

students <- data.frame(student_names, math_scores, verbal_scores)
students
##   student_names math_scores verbal_scores
## 1          Bill          80            72
## 2          Jane          75            90
## 3         Sarah          92            99
## 4          Fred          67            60
## 5          Paul          56            68
summary(students)
##  student_names  math_scores verbal_scores 
##  Bill :1       Min.   :56   Min.   :60.0  
##  Fred :1       1st Qu.:67   1st Qu.:68.0  
##  Jane :1       Median :75   Median :72.0  
##  Paul :1       Mean   :74   Mean   :77.8  
##  Sarah:1       3rd Qu.:80   3rd Qu.:90.0  
##                Max.   :92   Max.   :99.0

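Notice that student_names holds a different type of data (text) than math_scores (numeric), yet a data frame combines their values into a single object. (In the summary above, student_names appears as a factor because data.frame converted character vectors to factors by default in R versions before 4.0.0.) We can also add new variables to an existing data frame: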

students$final_scores <- 0
students$final_scores <- (students$math_scores + students$verbal_scores)/2

age <- c(18, 19, 20, 21, 22)
students2 <- data.frame(student_names, age)
# merge different data frames
students3 <- merge(students, students2)

students3
##   student_names math_scores verbal_scores final_scores age
## 1          Bill          80            72         76.0  18
## 2          Fred          67            60         63.5  21
## 3          Jane          75            90         82.5  19
## 4          Paul          56            68         62.0  22
## 5         Sarah          92            99         95.5  20
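
By default, merge joins on the columns the two data frames have in common and keeps only rows that match in both. The sketch below makes the key explicit with by and shows the all argument; the data frames left and right are made up for illustration.

```r
# two small illustrative tables sharing an id column
left  <- data.frame(id = c(1, 2, 3), score = c(80, 75, 92))
right <- data.frame(id = c(3, 1, 4), age = c(20, 18, 25))
inner <- merge(left, right, by = "id")              # ids present in both
outer <- merge(left, right, by = "id", all = TRUE)  # keep every id
nrow(inner)
## [1] 2
nrow(outer)
## [1] 4
outer$age[outer$id == 2]   # id 2 has no match in right, so age is missing
## [1] NA
```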

Lists

Lists are an even more flexible way of combining multiple objects into a single object. As you will see throughout the course, we will use lists to store the output of our scraping steps. Using lists, we can combine together vectors of different lengths:

list1 <- list(some_numbers = 1:10, some_letters = c("a", "b", "c"))
list1
## $some_numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $some_letters
## [1] "a" "b" "c"

or even vectors and data frames, or multiple data frames:

schools <- list(school_name = "UPF", students = students, 
                    faculty = data.frame(name = c("Kelly Jones", "Matt Smith"), 
                                         age = c(41, 55)))
schools
## $school_name
## [1] "UPF"
## 
## $students
##   student_names math_scores verbal_scores final_scores
## 1          Bill          80            72         76.0
## 2          Jane          75            90         82.5
## 3         Sarah          92            99         95.5
## 4          Fred          67            60         63.5
## 5          Paul          56            68         62.0
## 
## $faculty
##          name age
## 1 Kelly Jones  41
## 2  Matt Smith  55

You can access a list component in several different ways:

schools[[1]]
## [1] "UPF"
schools[['faculty']]
##          name age
## 1 Kelly Jones  41
## 2  Matt Smith  55
schools$students
##   student_names math_scores verbal_scores final_scores
## 1          Bill          80            72         76.0
## 2          Jane          75            90         82.5
## 3         Sarah          92            99         95.5
## 4          Fred          67            60         63.5
## 5          Paul          56            68         62.0
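
A common source of confusion is the difference between single and double brackets. The following sketch, on a small illustrative list named lst, shows that [ returns a shorter list while [[ returns the element itself:

```r
lst <- list(a = 1:3, b = "text")
class(lst["a"])     # single bracket: still a list, of length 1
## [1] "list"
class(lst[["a"]])   # double bracket: the element itself
## [1] "integer"
names(lst)
## [1] "a" "b"
lst$c <- TRUE       # assigning to a new name adds an element
length(lst)
## [1] 3
```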

A very common scenario is that we have a list of data frames and want to bind them together into one:

results <- list()
# let's say here you're scraping 3 websites
results[[1]] <- data.frame(domain="google", url="www.google.com",
                           stringsAsFactors=FALSE)
results[[2]] <- data.frame(domain="facebook", url="www.facebook.com",
                           stringsAsFactors=FALSE)
results[[3]] <- data.frame(domain="twitter", url="www.twitter.com",
                           stringsAsFactors=FALSE)
# and now we want to combine all 3 data frames
results <- do.call(rbind, results)
results
##     domain              url
## 1   google   www.google.com
## 2 facebook www.facebook.com
## 3  twitter  www.twitter.com
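
The same pattern can be written more compactly by building the list with lapply. This is only a sketch: make_row is a hypothetical stand-in for whatever function does the actual scraping.

```r
domains <- c("google", "facebook", "twitter")
# make_row is a hypothetical helper that returns a one-row data frame
make_row <- function(d) {
  data.frame(domain = d, url = paste0("www.", d, ".com"),
             stringsAsFactors = FALSE)
}
pieces <- lapply(domains, make_row)   # list of 3 one-row data frames
combined <- do.call(rbind, pieces)    # stack them row by row
dim(combined)
## [1] 3 2
```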

Making functions

Being designed for statistics and data analysis, R has powerful built-in functions for data manipulation. However, you can dramatically extend R’s functionality by writing your own functions.

R functions are objects just like the vectors and data frames we’ve worked with, so we create them using an assignment.

times_2 <- function(x) x * 2
times_2(6)
## [1] 12
times_2(1:5)
## [1]  2  4  6  8 10

For functions that span several statements, we need to wrap the body in curly braces {}. We can also pass multiple objects to a function, and return more complex objects, such as a vector or list.

two_numbers <- function(x, y) {
  my_sum <- x + y
  my_product <- x * y
  my_ratio <- x / y
  return(c(my_sum, my_product, my_ratio))
}
two_numbers(4, 11.93)
## [1] 15.9300000 47.7200000  0.3352892
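
Returning a named list instead of a vector makes it easier to pick out individual results. The variant below is a sketch, not part of the original material; the name two_numbers_list and the default value for y are assumptions made for illustration.

```r
# hypothetical variant of two_numbers that returns a named list
two_numbers_list <- function(x, y = 1) {
  list(sum = x + y, product = x * y, ratio = x / y)
}
out <- two_numbers_list(4, 2)
out$ratio
## [1] 2
two_numbers_list(5)$sum   # y falls back to its default value, 1
## [1] 6
```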

Loops

We use loops whenever we need to run the same function (or chunk of code) across different units. For example, we may use a loop whenever we have multiple Twitter accounts and we want to run sentiment analysis for tweets posted by each of them.

“For” loops are probably the most common type of loop and are easily implemented in R:

for (i in 1:10){
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Note the structure:

for (i in VECTOR){ do something with i }

In each iteration, i takes a different value of the VECTOR; “i” can be anything!

for (number in 1:10){
    print(number)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

A nice feature of loops is that each iteration can use values computed in previous iterations. For instance, we can get the first 40 terms in the Fibonacci sequence using a for loop.

fib <- c(0, 1, rep(NA, 38)) # initialize fib sequence
for(i in 3:40) {
  fib[i] <- fib[i-1] + fib[i-2]
}

Note that here we first created a vector of the right length (filled with NA after the first two terms) to store the output of each iteration. A simpler example:

values <- rep(NA, 10)
for (i in 1:10){
    values[i] <- i
}
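
When looping over the positions of an existing vector, seq_along is a safer way to build the index than 1:length(...). A minimal sketch with illustrative names inputs and squares:

```r
inputs <- c(3, 1, 4, 1, 5)
squares <- numeric(length(inputs))   # pre-allocate the output vector
for (i in seq_along(inputs)) {
  squares[i] <- inputs[i]^2
}
squares
## [1]  9  1 16  1 25
length(seq_along(NULL))   # an empty input gives zero iterations
## [1] 0
length(1:0)               # whereas 1:0 counts down and gives two
## [1] 2
```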

A structure that we will use often in this workshop is a loop that stores some data in different elements within a list. This will be very useful when the output from each iteration is a data frame. For example:

# create empty list
grades <- list()
# loop over 5 students
for (i in 1:5){
  # create data frame with grade/info for this student
  student <- data.frame(id = i, 
                        initial = sample(LETTERS, 1), 
                        grade = runif(n=1, min=0, max=100),
                        stringsAsFactors=FALSE)
  grades[[i]] <- student
}
# now we have a list...
class(grades)
## [1] "list"
# but we can turn it into a data frame
grades <- do.call(rbind, grades)
grades
##   id initial    grade
## 1  1       Y 32.57420
## 2  2       X 60.06433
## 3  3       E 66.02223
## 4  4       M 43.05902
## 5  5       Q 51.92582
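
Because sample and runif draw random values, the initials and grades shown above will differ every time the loop is run. Fixing the random seed makes the results reproducible. A short sketch; the seed value 123 is arbitrary.

```r
set.seed(123)        # arbitrary seed, chosen for illustration
first <- runif(3)
set.seed(123)        # resetting the seed replays the same draws
second <- runif(3)
identical(first, second)
## [1] TRUE
```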

If statements

Depending on whether a condition is true or false, we might want to execute different chunks of code.

compare_xy <- function(x, y) {
  if (x < y) {
    print("y is greater than x")
  } else if (x > y) {
    print("x is greater than y")
  } else {
    print("x and y are equal")
  }
}
compare_xy(3, 4)
## [1] "y is greater than x"
compare_xy(4, 3)
## [1] "x is greater than y"
compare_xy(1, 1)
## [1] "x and y are equal"
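
An if statement expects a single TRUE or FALSE. To test a condition over a whole vector, one option is to collapse it first with any or all. A minimal sketch with illustrative names v and msg:

```r
v <- c(-1, 2, 3)
if (any(v < 0)) {          # TRUE if at least one element is negative
  msg <- "at least one negative value"
} else {
  msg <- "no negative values"
}
msg
## [1] "at least one negative value"
all(v > 0)                 # TRUE only if every element is positive
## [1] FALSE
```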

A slightly different type of if statement is the ifelse function, which applies the test to every element of a vector:

numbers <- c(-2, -1, 0, 1, 2)
# converting them to absolute numbers
abs_numbers <- ifelse(numbers>0, numbers, -numbers)
abs_numbers
## [1] 2 1 0 1 2
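
Calls to ifelse can be nested to handle more than two cases. A small sketch with made-up values; the names vals and sign_labels are illustrative.

```r
vals <- c(-2, 0, 3)
# the inner ifelse handles the cases where vals > 0 is FALSE
sign_labels <- ifelse(vals > 0, "positive",
                      ifelse(vals < 0, "negative", "zero"))
sign_labels
## [1] "negative" "zero"     "positive"
ifelse(NA > 0, "positive", "other")   # a missing test gives a missing result
## [1] NA
```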