This R Markdown script provides a description of some of the building blocks for webscraping that we have just covered. If you’re having trouble following the guided coding part of the class, I recommend you run the following code and make sure you understand each of the concepts introduced here.

R has many data types, but the most common ones we’ll use are:

- numeric: `1.1`, `3`, `317`, `Inf`, …
- logical: `TRUE` or `FALSE`
- character: `this is a character`, `hello world!`, …
- factor: `Democrat`, `Republican`, `Socialist`, …

A small trick regarding logical values is that they correspond to `1` and `0`: in arithmetic, `TRUE` counts as `1` and `FALSE` as `0`. This will come in handy to count the number of `TRUE` values in a vector.

```
x <- c(TRUE, TRUE, FALSE)
x * 2
```

`## [1] 2 2 0`

`sum(x)`

`## [1] 2`
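The same trick works on the result of a comparison, which is itself a logical vector. The sketch below uses a made-up `scores` vector (an assumption for illustration, not part of the class data) to count and average the `TRUE` values.

```r
# Hypothetical scores, used only for illustration
scores <- c(80, 75, 91, 67, 56)

above_70 <- scores > 70  # logical vector: TRUE TRUE TRUE FALSE FALSE
sum(above_70)            # number of scores above 70: 3
mean(above_70)           # proportion of scores above 70: 0.6
```

Taking the `mean` of a logical vector gives the share of `TRUE` values, which is often a quick way to compute a proportion.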

There are a few special values: `NA`, which denotes a missing value, and `NaN`, which means “not a number”. The values `Inf` and `-Inf` are considered numeric. `NULL` denotes a value that is undefined.

`0 / 0 # NaN`

`## [1] NaN`

`1 / 0 # Inf`

`## [1] Inf`

`x <- c(1, NA, 0)`
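Missing values propagate through most computations, so it helps to know how to detect and skip them. The following is a minimal sketch using a separate vector `vals` (a name chosen here so that `x` is left untouched).

```r
vals <- c(1, NA, 0)       # same values as x, stored under a different name

is.na(vals)               # FALSE TRUE FALSE
sum(is.na(vals))          # number of missing values: 1
mean(vals)                # NA, because one value is missing
mean(vals, na.rm = TRUE)  # 0.5, missing values are dropped first

is.null(NULL)             # TRUE
length(NULL)              # 0
```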

Probably one of the most useful functions in R is `str`. It displays the internal structure of an object.

`str(x)`

`## num [1:3] 1 NA 0`

Of course you can always print the object in the console:

`print(x)`

`## [1] 1 NA 0`

Note that `print` here is a function: it takes a series of arguments (in this case, the object `x`) and returns a value (here `x` itself, returned invisibly after it is displayed).

This is equivalent to just typing the name of the object in the console. (What’s going on behind the scenes is that R is calling the default function to print this object, which in this case is just `print`.)

`x`

`## [1] 1 NA 0`
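As a small check of what `print` hands back, the sketch below stores its return value in a variable. The name `y` is only for illustration.

```r
x <- c(1, NA, 0)

y <- print(x)    # displays x and returns it invisibly
identical(x, y)  # TRUE: the returned value is the object that was printed
```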

Building off of the data types we’ve learned, *data structures* combine multiple values into a single object. Some common data structures in `R` include:

- vector: a sequence of values of a single type
- data frame: a table of vectors, all of the same length
- list: a collection of objects that may be of different types

We’ve already seen vectors created by **c**ombining multiple values with the `c` command:

```
student_names <- c("Bill", "Jane", "Sarah", "Fred", "Paul")
math_scores <- c(80, 75, 91, 67, 56)
verbal_scores <- c(72, 90, 99, 60, 68)
```

There are shortcuts for creating vectors with certain structures, for instance:

```
nums1 <- 1:100
nums2 <- seq(-10, 100, by=5) # -10, -5, 0, ..., 100
nums3 <- seq(-10, 100, length.out=467) # 467 equally spaced numbers between -10 and 100
```

Notice that we used `seq` to generate both `nums2` and `nums3` (`nums1` uses the `:` operator instead). The different behavior is controlled by which arguments (e.g. `by`, `length.out`) are supplied to the function `seq`.
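To see how these arguments change the result, it can help to check the lengths of the generated vectors and to try a few related base R helpers. This is a small sketch; the particular inputs are arbitrary.

```r
length(seq(-10, 100, by = 5))            # 23 values: -10, -5, ..., 100
length(seq(-10, 100, length.out = 467))  # 467 values, as requested

seq(0, 1, by = 0.25)                     # 0.00 0.25 0.50 0.75 1.00
seq_len(5)                               # 1 2 3 4 5
rep(c("a", "b"), times = 2)              # "a" "b" "a" "b"
```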

With vectors we can carry out some of the most fundamental tasks in data analysis, such as descriptive statistics

`mean(math_scores)`

`## [1] 73.8`

`min(math_scores - verbal_scores)`

`## [1] -15`

`summary(verbal_scores)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 68.0 72.0 77.8 90.0 99.0
```

and plots.

```
plot(x=math_scores, y=verbal_scores)
text(x=math_scores, y=verbal_scores, labels=student_names)
```

It’s easy to pull out specific entries in a vector using `[]`. For example,

`math_scores[3]`

`## [1] 91`

`math_scores[1:3]`

`## [1] 80 75 91`

`math_scores[-(4:5)]`

`## [1] 80 75 91`

`math_scores[which(verbal_scores >= 90)]`

`## [1] 75 91`

```
math_scores[3] <- 92
math_scores
```

`## [1] 80 75 92 67 56`
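Vectors can also be indexed with a logical condition directly, or by name once names are attached. The sketch below works on a copy called `scores` (a name introduced here for illustration) so the original vectors stay unchanged.

```r
# Copy of the math scores after the update above, with names attached
scores <- c(80, 75, 92, 67, 56)
names(scores) <- c("Bill", "Jane", "Sarah", "Fred", "Paul")

scores[scores > 70]        # keeps Bill (80), Jane (75) and Sarah (92)
scores["Sarah"]            # 92, selected by name
scores[c("Bill", "Paul")]  # 80 and 56
which.max(scores)          # position of the highest score: Sarah, 3
```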

Data frames allow us to combine many vectors of the same length into a single object.

```
students <- data.frame(student_names, math_scores, verbal_scores)
students
```

```
## student_names math_scores verbal_scores
## 1 Bill 80 72
## 2 Jane 75 90
## 3 Sarah 92 99
## 4 Fred 67 60
## 5 Paul 56 68
```

`summary(students)`

```
## student_names math_scores verbal_scores
## Bill :1 Min. :56 Min. :60.0
## Fred :1 1st Qu.:67 1st Qu.:68.0
## Jane :1 Median :75 Median :72.0
## Paul :1 Mean :74 Mean :77.8
## Sarah:1 3rd Qu.:80 3rd Qu.:90.0
## Max. :92 Max. :99.0
```

Notice that `student_names` is a different class (character) than `math_scores` (numeric), yet a data frame combines their values into a single object. We can also create data frames that include new variables:

```
students$final_scores <- 0
students$final_scores <- (students$math_scores + students$verbal_scores)/2
age <- c(18, 19, 20, 21, 22)
students2 <- data.frame(student_names, age)
# merge different data frames
students3 <- merge(students, students2)
students3
```

```
## student_names math_scores verbal_scores final_scores age
## 1 Bill 80 72 76.0 18
## 2 Fred 67 60 63.5 21
## 3 Jane 75 90 82.5 19
## 4 Paul 56 68 62.0 22
## 5 Sarah 92 99 95.5 20
```
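Rows and columns of a data frame can be selected with `[rows, columns]`, much like vector indexing. Here is a minimal sketch on a separate data frame, `scores_df` (a hypothetical name, rebuilt here so the block runs on its own).

```r
# Stand-alone copy of the student data, for illustration only
scores_df <- data.frame(
  student_names = c("Bill", "Jane", "Sarah", "Fred", "Paul"),
  math_scores   = c(80, 75, 92, 67, 56),
  verbal_scores = c(72, 90, 99, 60, 68),
  stringsAsFactors = FALSE
)

nrow(scores_df)                                   # 5 rows
ncol(scores_df)                                   # 3 columns
scores_df[scores_df$math_scores > 70, ]           # rows for Bill, Jane, Sarah
scores_df[, c("student_names", "verbal_scores")]  # two columns, all rows
scores_df$math_scores[2]                          # 75

# Student names sorted from highest to lowest math score
scores_df[order(-scores_df$math_scores), "student_names"]
```

Leaving the row or column position empty, as in `scores_df[, cols]`, keeps everything along that dimension.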

Lists are an even more flexible way of combining multiple objects into a single object. As you will see throughout the course, we will use lists to store the output of our scraping steps. Using lists, we can combine together vectors of different lengths:

```
list1 <- list(some_numbers = 1:10, some_letters = c("a", "b", "c"))
list1
```

```
## $some_numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $some_letters
## [1] "a" "b" "c"
```

or even vectors and data frames, or multiple data frames:

```
schools <- list(school_name = "CEU", students = students,
                faculty = data.frame(name = c("Kelly Jones", "Matt Smith"),
                                     age = c(41, 55)))
schools
```

```
## $school_name
## [1] "CEU"
##
## $students
## student_names math_scores verbal_scores final_scores
## 1 Bill 80 72 76.0
## 2 Jane 75 90 82.5
## 3 Sarah 92 99 95.5
## 4 Fred 67 60 63.5
## 5 Paul 56 68 62.0
##
## $faculty
## name age
## 1 Kelly Jones 41
## 2 Matt Smith 55
```

You can access a list component in several different ways:

`schools[[1]]`

`## [1] "CEU"`

`schools[['faculty']]`

```
## name age
## 1 Kelly Jones 41
## 2 Matt Smith 55
```

`schools$students`

```
## student_names math_scores verbal_scores final_scores
## 1 Bill 80 72 76.0
## 2 Jane 75 90 82.5
## 3 Sarah 92 99 95.5
## 4 Fred 67 60 63.5
## 5 Paul 56 68 62.0
```
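One detail that often trips people up is the difference between `[ ]` and `[[ ]]` on a list: single brackets return a smaller list, while double brackets return the component itself. The sketch below uses a small made-up list, `demo`, whose contents are purely illustrative.

```r
# Hypothetical list, for illustration only
demo <- list(school_name = "CEU", years = c(2016, 2017))

class(demo["school_name"])    # "list": still a list, with one component
class(demo[["school_name"]])  # "character": the component itself
demo$years[2]                 # 2017

demo$n_students <- 5          # add a new component by assignment
names(demo)                   # "school_name" "years" "n_students"
```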

A very common scenario is having a list of data frames that we want to bind together:

```
results <- list()
# let's say here you're scraping 3 websites
results[[1]] <- data.frame(domain="google", url="www.google.com",
                           stringsAsFactors=FALSE)
results[[2]] <- data.frame(domain="facebook", url="www.facebook.com",
                           stringsAsFactors=FALSE)
results[[3]] <- data.frame(domain="twitter", url="www.twitter.com",
                           stringsAsFactors=FALSE)
# and now we want to combine all 3 data frames
results <- do.call(rbind, results)
results
```

```
## domain url
## 1 google www.google.com
## 2 facebook www.facebook.com
## 3 twitter www.twitter.com
```
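In practice the list is usually filled inside a loop rather than by hand (for loops are described further down). The following is a sketch of that pattern under the same three-website assumption; the names `domains`, `pages` and `combined` are introduced here for illustration.

```r
domains <- c("google", "facebook", "twitter")

# Pre-allocate a list with one slot per website
pages <- vector("list", length(domains))

for (i in seq_along(domains)) {
  # In a real scraper, this is where the page would be downloaded and parsed
  pages[[i]] <- data.frame(domain = domains[i],
                           url = paste0("www.", domains[i], ".com"),
                           stringsAsFactors = FALSE)
}

# Stack the data frames row by row
combined <- do.call(rbind, pages)
nrow(combined)   # 3
combined$url[2]  # "www.facebook.com"
```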

Being designed for statistics and data analysis, `R` has powerful built-in functions for data manipulation. However, you can dramatically extend `R`’s functionality by writing your own functions.

`R` functions are objects just like the vectors and data frames we’ve worked with, so we create them using an assignment.

```
times_2 <- function(x) x * 2
times_2(6)
```

`## [1] 12`

`times_2(1:5)`

`## [1] 2 4 6 8 10`

For longer functions, it’s necessary to use curly braces `{}`. We can also input multiple objects into a function, and return more complex objects, such as a vector or list.

```
two_numbers <- function(x, y) {
  my_sum <- x + y
  my_product <- x * y
  my_ratio <- x / y
  return(c(my_sum, my_product, my_ratio))
}
two_numbers(4, 11.93)
```

`## [1] 15.9300000 47.7200000 0.3352892`
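Returning a named list instead of a plain vector makes each result easier to pick out afterwards. Below is one possible variant; the function name `two_numbers_list` and the default value `y = 1` are assumptions made for this sketch.

```r
# Variant of the function above that returns a named list;
# y defaults to 1 when it is not supplied
two_numbers_list <- function(x, y = 1) {
  list(sum = x + y, product = x * y, ratio = x / y)
}

out <- two_numbers_list(4, 2)
out$sum                       # 6
out$ratio                     # 2
two_numbers_list(10)$product  # 10, using the default y = 1
```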

“For” loops are probably the most common type of loop and are easily implemented in R:

```
for (i in 1:10){
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
```

Note the structure:

`for (i in VECTOR){ do something with i }`

In each iteration, `i` takes the next value in VECTOR. The name `i` is arbitrary; any valid variable name works:

```
for (number in 1:10){
  print(number)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
```

A nice feature of loops is that each iteration can use values computed in previous iterations. For instance, we can get the first 40 terms in the Fibonacci sequence using a for loop.

```
fib <- c(0, 1, rep(NA, 38)) # initialize fib sequence
for(i in 3:40) {
  fib[i] <- fib[i-1] + fib[i-2]
}
```
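To check that the loop did what we expect, we can look at the first few terms and at the last one. The block repeats the loop so that it runs on its own.

```r
fib <- c(0, 1, rep(NA, 38))  # first two terms, then 38 placeholders
for (i in 3:40) {
  fib[i] <- fib[i - 1] + fib[i - 2]
}

head(fib, 10)    # 0 1 1 2 3 5 8 13 21 34
fib[40]          # 63245986
anyNA(fib)       # FALSE: every placeholder has been filled in
```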

Note that here we created the vector in advance, with `NA` placeholders, to store the output of each iteration. A simpler example:

```
values <- rep(NA, 10)
for (i in 1:10){
  values[i] <- i
}
```
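The same pattern applies when looping over something other than numbers: `seq_along` gives the positions of a vector, which can be used both to read the input and to fill the output. A sketch with a made-up `words` vector follows.

```r
words <- c("web", "scraping", "in", "R")  # hypothetical input

n_chars <- rep(NA, length(words))         # placeholder for the results
for (i in seq_along(words)) {
  n_chars[i] <- nchar(words[i])           # number of characters in each word
}
n_chars                                   # 3 8 2 1

# Many base R functions are vectorized, so the loop is optional here
nchar(words)                              # 3 8 2 1
```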

Depending on whether a condition is true or false, we might want to execute different chunks of code.

```
compare_xy <- function(x, y) {
  if (x < y) {
    print("y is greater than x")
  } else if (x > y) {
    print("x is greater than y")
  } else {
    print("x and y are equal")
  }
}
compare_xy(3, 4)
```

`## [1] "y is greater than x"`

`compare_xy(4, 3)`

`## [1] "x is greater than y"`

`compare_xy(1, 1)`

`## [1] "x and y are equal"`
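An `if` statement tests a single condition, so a function like `compare_xy` is meant for one pair of values at a time. To apply the same logic to whole vectors, base R offers `ifelse`. A sketch with made-up vectors `a` and `b`:

```r
a <- c(3, 4, 1)
b <- c(4, 3, 1)

# Element-by-element comparison; nested ifelse handles the three cases
comparison <- ifelse(a < b, "b is greater",
                     ifelse(a > b, "a is greater", "equal"))
comparison  # "b is greater" "a is greater" "equal"
```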