This Rmarkdown script provides a description of some of the building blocks we will need to scrape and clean data from the web in this course. If you're having trouble following the guided coding part of the class, I recommend you run the following code and make sure you understand each of the concept introduced here.
### Data types
R has many data types, but the most common ones we'll use are:
1. numeric: `1.1`, `3`, `317`, `Inf`...
2. logical: `TRUE` or `FALSE`
3. character: `this is a character`, `hello world!`...
4. factor: `Democrat`, `Republican`, `Socialist`, ...
A small trick regarding logical values is that they correspond to `1` and `0`. This will come in hand to count the number of `TRUE` values in a vector.
```{r}
x <- c(TRUE, TRUE, FALSE)
x * 2
sum(x)
```
There are a few special values: `NA`, which denotes a missing value, and `NaN`, which means Not a number. The values `Inf` and `-Inf` are considered numeric. `NULL` denotes a value that is undefined.
```{r}
0 / 0 # NaN
1 / 0 # Inf
x <- c(1, NA, 0)
```
Probably one of the most useful functions in R is `str`. It displays the internal structure of an object.
```{r}
str(x)
```
Of course you can always print the object in the console:
```{r}
print(x)
```
Note that `print` here is a function: it takes a series of arguments (in this case, the object `x`) and returns a value (`50`).
This is equivalent to just typing the name of the object in the console. (What's going on behind the scenes is that R is calling the default function to print this object; which in this case is just `print`).
```{r}
x
```
### Data structures
Building off of the data types we've learned, *data structures* combine multiple values into a single object. Some common data structures in `R` include:
1. vectors: sequence of values of a certain type
2. data frame: a table of vectors, all of the same length
3. list: collection of objects of different types
#### Vectors
We've already seen vectors created by **c**ombining multiple values with the `c` command:
```{r}
student_names <- c("Bill", "Jane", "Sarah", "Fred", "Paul")
math_scores <- c(80, 75, 91, 67, 56)
verbal_scores <- c(72, 90, 99, 60, 68)
```
There are shortcuts for creating vectors with certain structures, for instance:
```{r}
nums1 <- 1:100
nums2 <- seq(-10, 100, by=5) # -10, -5, 0, ..., 100
nums3 <- seq(-10, 100, length.out=467) # 467 equally spaced numbers between -10 and 100
```
Notice that we used `seq` to generate both `nums1` and `nums2`. The different behavior is controlled by which arguments (e.g. `by`, `length.out`) are supplied to the function `seq`.
With vectors we can carry out some of the most fundamental tasks in data analysis, such as descriptive statistics
```{r}
mean(math_scores)
min(math_scores - verbal_scores)
summary(verbal_scores)
```
and plots.
```{r}
plot(x=math_scores, y=verbal_scores)
text(x=math_scores, y=verbal_scores, labels=student_names)
```
It's easy to pull out specific entries in a vector using `[]`. For example,
```{r}
math_scores[3]
math_scores[1:3]
math_scores[-(4:5)]
math_scores[which(verbal_scores >= 90)]
math_scores[3] <- 92
math_scores
```
#### Data frames
Data frames allow us to combine many vectors of the same length into a single object.
```{r}
students <- data.frame(student_names, math_scores, verbal_scores)
students
summary(students)
```
Notice that `student_names` is a different class (character) than `math_scores` (numeric), yet a data frame combines their values into a single object. We can also create data frames that include new variables:
```{r}
students$final_scores <- 0
students$final_scores <- (students$math_scores + students$verbal_scores)/2
age <- c(18, 19, 20, 21, 22)
students2 <- data.frame(student_names, age)
# merge different data frames
students3 <- merge(students, students2)
students3
```
#### Lists
Lists are an even more flexible way of combining multiple objects into a single object. As you will see throughout the course, we will use lists to store the output of our scraping steps. Using lists, we can combine together vectors of different lengths:
```{r}
list1 <- list(some_numbers = 1:10, some_letters = c("a", "b", "c"))
list1
```
or even vectors and data frames, or multiple data frames:
```{r}
schools <- list(school_name = "UPF", students = students,
faculty = data.frame(name = c("Kelly Jones", "Matt Smith"),
age = c(41, 55)))
schools
```
You can access a list component in several different ways:
```{r}
schools[[1]]
schools[['faculty']]
schools$students
```
A very frequent case scenario is when we have a list of data frames, and we want to bind them together:
```{r}
results <- list()
# let's say here you're scraping 3 websites
results[[1]] <- data.frame(domain="google", url="www.google.com",
stringsAsFactors=FALSE)
results[[2]] <- data.frame(domain="facebook", url="www.facebook.com",
stringsAsFactors=FALSE)
results[[3]] <- data.frame(domain="twitter", url="www.twitter.com",
stringsAsFactors=FALSE)
# and now we want to combine all 3 data frames
results <- do.call(rbind, results)
results
```
### Making functions
Being designed for statistics and data analysis, `R` has powerful built-in functions for data manipulation. However, you can dramatically extend `R`'s functionality by writing your own functions.
`R` functions are objects just like the vectors and data frames we've worked with, so we create them using an assignment.
```{r}
times_2 <- function(x) x * 2
times_2(6)
times_2(1:5)
```
For longer functions, it's necessary to use curly braces `{}`. We can also input multiple objects into a function, and return more complex objects, such as a vector or list.
```{r}
two_numbers <- function(x, y) {
my_sum <- x + y
my_product <- x * y
my_ratio <- x / y
return(c(my_sum, my_product, my_ratio))
}
two_numbers(4, 11.93)
```
### Loops
We use loops whenever we need to run the same function (or chunk of code) across different units. For example, we may use a loop whenever we have multiple Twitter accounts and we want to run sentiment analysis for tweets posted by each of them.
"For" loops are probably the most common type of loop and are easily implemented in R
```{r}
for (i in 1:10){
print(i)
}
```
Note the structure:
```{r, eval=FALSE}
for (i in VECTOR){ do something with i }
```
In each iteration, i takes a different value of the VECTOR; "i" can be anything!
```{r}
for (number in 1:10){
print(number)
}
```
The nice feature of loops is that it can use values from the previous iteration. For instance, we can get the first 40 terms in the Fibonacci sequence using a for loop.
```{r}
fib <- c(0, 1, rep(NA, 38)) # initialize fib sequence
for(i in 3:40) {
fib[i] <- fib[i-1] + fib[i-2]
}
```
Note that here we created an empty vector to store the output of each iteration. A simpler example:
```{r}
values <- rep(NA, 10)
for (i in 1:10){
values[i] <- i
}
```
A structure that we will use often in this workshop is a loop that stores some data in different elements within a list. This will be very useful when the output from each iteration is a data frame. For example:
```{r}
# create empty list
grades <- list()
# loop over 5 students
for (i in 1:5){
# create data frame with grade/info for this student
student <- data.frame(id = i,
initial = sample(LETTERS, 1),
grade = runif(n=1, min=0, max=100),
stringsAsFactors=F)
grades[[i]] <- student
}
# now we have a list...
class(grades)
# but we can turn it into a data frame
grades <- do.call(rbind, grades)
grades
```
### If statements
Depending on whether a condition is true or false, we might want to execute different chunks of code.
```{r}
compare_xy <- function(x, y) {
if (x < y) {
print("y is greater than x")
} else if (x > y) {
print("x is greater than y")
} else {
print("x and y are equal")
}
}
compare_xy(3, 4)
compare_xy(4, 3)
compare_xy(1, 1)
```
A slightly different type of if statement is the `ifelse` function:
```{r}
numbers <- c(-2, -1, 0, 1, 2)
# converting them to absolute numbers
abs_numbers <- ifelse(numbers>0, numbers, -numbers)
abs_numbers
```