The foreach package improves the way in which we run loops in R, and provides a construct to run loops in parallel.
The basic structure of loops with the package is:
# Without parallelization --> %do%
output <- foreach(i = 'some object to iterate over', 'options') %do% {some r code}
# With parallelization --> %dopar%
output <- foreach(i = 'some object to iterate over', 'options') %dopar% {some r code}
As a first example, we can use foreach just like a for loop without parallelization
library(foreach)
result <- foreach(x = c(4,9,16)) %do% sqrt(x)
result
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
Note that, unlike a regular for loop, foreach returns an object (by default a list) that contains the results compiled across all iterations.
We can change the object returned by specifying the function used to combine results across iterations with the .combine option:
result <- foreach(x = c(4,9,16), .combine = 'c') %do% sqrt(x)
class(result)
## [1] "numeric"
Other options for .combine are: cbind, rbind, +, *:
# cbind...
result <- foreach(x = c(4,9,16), .combine = 'cbind') %do% c(sqrt(x), log(x), x^2)
class(result)
## [1] "matrix" "array"
result
## result.1 result.2 result.3
## [1,] 2.000000 3.000000 4.000000
## [2,] 1.386294 2.197225 2.772589
## [3,] 16.000000 81.000000 256.000000
# rbind
result <- foreach(x = c(4,9,16), .combine = 'rbind') %do% c(sqrt(x), log(x), x^2)
class(result)
## [1] "matrix" "array"
result
## [,1] [,2] [,3]
## result.1 2 1.386294 16
## result.2 3 2.197225 81
## result.3 4 2.772589 256
# sum
result <- foreach(x = c(4,9,16), .combine = '+') %do% sqrt(x)
class(result)
## [1] "numeric"
result
## [1] 9
Before we can parallelize our code, we need to declare a “cluster” – that is, we need to tell R that we have multiple cores – so that R knows how to execute the code. These are the steps involved in this process:
doParallel package to extend the functionality of foreach.library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
myCluster <- makeCluster(3, # number of cores to use
type = "PSOCK") # type of cluster
First, we choose the number of cores we want to use. You can check how many your computer has by running detectCores(). One good rule of thumb is to always leave one core unused for other tasks.
detectCores()
## [1] 8
We can choose between two types of clusters:
registerDoParallel(myCluster)
If you’re running this locally, you can check your Monitor App to see that new instances of R were launched in your computer.
%do% to %dopar%output <- foreach(i = 'some object to iterate over', 'options') %dopar% {some r code}
For example:
result <- foreach(x = c(4,9,16), .combine = 'c') %dopar% sqrt(x)
stopCluster(myCluster)
Let’s run some tests to see the improvement in performance. We’ll be using bootstrapping to compute the confidence intervals for a regression coefficient.
d <- read.csv("../data/incivility.csv", stringsAsFactors=FALSE)
nsims <- 500
# without parallelization
system.time({
r <- foreach(1:nsims, .combine='c') %do% {
smp <- sample(1:nrow(d), replace=TRUE)
reg <- lm(log(comment_likes_count+1) ~
attacks, data=d[smp,])
coef(reg)[2]
}})
## user system elapsed
## 1.348 0.085 1.438
quantile(r, probs=c(.025, 0.975))
## 2.5% 97.5%
## 0.1527369 0.2609330
# with parallelization
myCluster <- makeCluster(3, type = "FORK") # why "FORK"?
registerDoParallel(myCluster)
system.time({
r <- foreach(1:nsims, .combine='c') %dopar% {
smp <- sample(1:nrow(d), replace=TRUE)
reg <- lm(log(comment_likes_count+1) ~ attacks, data=d[smp,])
coef(reg)[2]
}})
## user system elapsed
## 0.141 0.041 0.740
stopCluster(myCluster)
quantile(r, probs=c(.025, 0.975))
## 2.5% 97.5%
## 0.1538211 0.2614707
Why isn’t the total running time 1/ncores the original running time? Remember there is some overhead added whenever we split our computation across different cores.