Common R
Performing operations on vectors
In general, there are three methods for performing the same operation (such as a mathematical operation) on every element of a vector \(\boldsymbol{x}\). The simplest is a loop, which iterates once for each element of \(\boldsymbol{x}\) and performs the operation one element at a time. Vectorisation refers to applying the same operation to every element of the vector at once, whereas an apply
function applies any function across a vector in a single line of code.
Comparison between vectorisation, loops and apply
We can test the efficiency of vectorisation compared with using a loop or an apply
function. We will construct three pieces of code that do the same thing: apply the function \(f(x) = \sin(x)\) to every element of a vector consisting of the natural numbers up to 100,000. We start by creating the vector:
x = seq(1,100000)
Now we can create three functions: one that uses a for loop, one that uses apply
, and one that is vectorised.
loop_function = function(x){
y = numeric(length(x))
for(i in 1:length(x)){
y[i] = sin(x[i])
}
return(y)
}
apply_function = function(x){
y = apply(as.matrix(x), 1, sin)
return(y)
}
vector_function = function(x){
y = sin(x)
return(y)
}
Now that the functions are constructed, we can use the inbuilt R function system.time
to calculate how long it takes each function to run.
system.time(loop_function(x))
## user system elapsed
## 0.058 0.000 0.080
system.time(apply_function(x))
## user system elapsed
## 0.312 0.000 0.312
system.time(vector_function(x))
## user system elapsed
## 0.006 0.000 0.005
Naturally, none of these computations takes very long to perform, as computing the sine is not complex. However, you can still see the ranking of the three approaches by speed. In general, vectorisation will be more efficient than loops or apply
functions, and loops are faster than using apply
.
There are cases where it is not possible to vectorise a task over an array. In those cases, it is necessary to construct a function to pass to apply
, or to perform the operations within a loop. In general, loops are faster and more flexible, as they allow you to do more in each iteration than a single function call could. One situation where you might prefer apply
is to make a simple process neater: if you are doing something relatively straightforward, apply
will save space and make the code more readable compared with a loop.
It is common practice to vectorise your code whenever you can, as vectorised code is significantly faster than either loops or apply
functions.
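As a sketch of the kind of task that cannot be vectorised directly, consider a recurrence in which each element depends on the one before it (the function name here is illustrative):

```r
# Each y[i] depends on y[i-1], so the body of the loop cannot be
# replaced by a single elementwise call in the way sin(x) can.
recursive_sum = function(x){
  y = numeric(length(x))
  y[1] = x[1]
  for(i in 2:length(x)){
    y[i] = y[i-1] + x[i]
  }
  return(y)
}
recursive_sum(c(1, 2, 3))
## [1] 1 3 6
```

(For this particular recurrence R already provides the vectorised cumsum, but for a general recurrence the loop is unavoidable.)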
Other functions
There are different variants of the apply
function depending on how your data are structured and what form of output you want.
apply(X, MARGIN, FUN, ...)
is the basic apply function. MARGIN
refers to the dimension that is retained when the function is applied: MARGIN = 1 applies the function to each row, and MARGIN = 2 to each column. For example, apply(x, 1, sum)
will sum across the columns within each row, returning one value per row.
lapply(X, FUN, ...)
is an apply function that returns a list as its output, each element in the list corresponding to the given function applied to each value in X
.
sapply(X, FUN, ...)
is a wrapper of lapply
that simplifies the output so that it is not in list form.
mapply(FUN, ...)
is a multivariate version of sapply
: with mapply
it is possible to pass more than one input to the function, and it returns a value for each set of inputs.
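A small sketch of these variants on toy inputs (the matrix and functions here are illustrative):

```r
m = matrix(1:6, nrow = 2)       # 2 rows, 3 columns, filled by column
col_sums = apply(m, 2, sum)     # column sums: 3 7 11
row_sums = apply(m, 1, sum)     # row sums: 9 12
squares_list = lapply(1:3, function(a) a^2)    # a list holding 1, 4, 9
squares_vec = sapply(1:3, function(a) a^2)     # simplified to the vector 1 4 9
sums = mapply(function(a, b) a + b, 1:3, 4:6)  # elementwise pairs: 5 7 9
```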
Map, Reduce and Filter
Map
maps a function to a vector. This is similar to lapply
. For example:
x = seq(1, 3, by=1)
f = function(a) a+5
M = Map(f,x)
M
## [[1]]
## [1] 6
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] 8
This has added 5 to every element in x
, and returned a list of outputs, one for each element. In fact, Map
performs the same operation as mapply
with SIMPLIFY = FALSE, which we can see in the function itself:
Map
## function (f, ...)
## {
## f <- match.fun(f)
## mapply(FUN = f, ..., SIMPLIFY = FALSE)
## }
## <bytecode: 0x55bfca7455a8>
## <environment: namespace:base>
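We can confirm this equivalence directly, reusing the f and x defined above:

```r
x = seq(1, 3, by = 1)
f = function(a) a + 5
# Map is simply mapply with SIMPLIFY = FALSE, so the two calls agree exactly.
identical(Map(f, x), mapply(FUN = f, x, SIMPLIFY = FALSE))
## [1] TRUE
```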
Reduce
performs a given function on pairs of elements in a vector. The procedure is iterated from left to right, and a single value is returned. This can be done from right to left by adding the argument right=TRUE
. As an example, consider division:
f = function(x, y) x/y
x = seq(1, 3, by=1)
Reduce(f, x)
## [1] 0.1666667
Reduce(f, x, right=TRUE)
## [1] 1.5
In the first case, Reduce
divided 1 by 2, then divided that result by 3. In the second case the fold ran from right to left: 2 was first divided by 3, and then 1 was divided by that result, giving \(1/(2/3) = 1.5\).
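Passing accumulate = TRUE (a standard argument of Reduce) returns every intermediate result, which makes the folding order visible:

```r
f = function(x, y) x/y
x = seq(1, 3, by = 1)
# Left fold: 1, then 1/2, then (1/2)/3.
Reduce(f, x, accumulate = TRUE)
## [1] 1.0000000 0.5000000 0.1666667
# Right fold: element i holds the fold of x[i..n]: 1/(2/3), 2/3, 3.
Reduce(f, x, right = TRUE, accumulate = TRUE)
## [1] 1.5000000 0.6666667 3.0000000
```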
Filter
will ‘filter’ an array into values that satisfy the condition. For example
x = seq(1,5)
condition = function(x) x > 3
Filter(condition,x)
## [1] 4 5
Filter
is similar to indexing an array using TRUE
/FALSE
values, but instead of supplying a logical array, you supply a function. We can inspect the interior of the function:
Filter
## function (f, x)
## {
## ind <- as.logical(unlist(lapply(x, f)))
## x[which(ind)]
## }
## <bytecode: 0x55bfc77914d8>
## <environment: namespace:base>
So in fact, Filter
simply uses lapply
to compute the TRUE
/FALSE
values, and then extracts the matching elements of the input x
with a simple subsetting.
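The equivalence with logical subsetting is easy to check directly:

```r
x = seq(1, 5)
condition = function(x) x > 3
# Filter with a predicate function gives the same result as subsetting
# with the logical vector that the predicate produces.
identical(Filter(condition, x), x[condition(x)])
## [1] TRUE
```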
Parallel Computing
By using the parallel
package, you can make use of all processing cores on your computer. Naturally, if you only have a single core processor, this is irrelevant, but most computers in the modern day have 2, 4, 8 or more cores. Parallel computing will allow R to run up to this many proccesses at the same time. A lot of important tasks in R can be sped up with parallel computing, for example MCMC. In MCMC, using \(n\) cores can allow you to also run \(n\) chains at once, with (in theory) no slowdown.
Supercomputers generally have an extremely large number of cores, so being able to run code in parallel is important in computationally expensive programming jobs.
There are some disadvantages to this: namely, splitting a process across four different cores can also require four times as much memory. If your memory is not sufficient for the number of cores you are using, this will cause a significant slowdown.
Using mclapply
or foreach
There are two main methods for parallelising a set of commands (or a function) in R. The first is mclapply
, a parallel version of lapply
, and the second is a parallel foreach
loop. To illustrate how these work, consider the example of an \(ARMA(1,1)\) model, which has an equation of the form
\[
x_t = \epsilon_t + \alpha\epsilon_{t-1} + \beta x_{t-1}
\]
\[
\epsilon_t \sim N(0, 1)
\]
A function that generates an \(ARMA(1,1)\) process can be written as:
arma11 = function(alpha=0.5, beta=1, initx, N=1000){
x = eps = numeric(N)
x[1] = initx
eps[1] = rnorm(1,0,1)
eps[2] = rnorm(1,0,1)
x[2] = eps[2] + alpha*eps[1] + beta*x[1]
for(i in 3:N){
eps[i] = rnorm(1,0,1)
x[i] = eps[i] + alpha*eps[i-1] + beta*x[i-1]
}
return(x)
}
This will generate a vector of \(x\) values for each time step \(t=1,\dots,N\). We can plot a simulated \(ARMA\) time series of length \(N=1000\):
plot(1:1000, arma11(initx=0.5,N=1000),type="l", xlab="Time (t)", ylab=expression(x), main="Time Series Example")
Now that the functions are set up for testing, we can configure the computer to work in parallel. This involves loading the required packages and detecting the number of cores available.
library(parallel)
library(MASS)
no.cores = detectCores()
no.cores
## [1] 8
So we have this many cores to work with (on my laptop, there are 8). The no.cores
variable will be passed into the parallel computing functions. We can now use mclapply
to simulate this \(ARMA\) model a large number of times, and calculate the difference between the first and last values as a statistic. By putting the arma11
function in a wrapper, we can pass it through to mclapply
.
arma_wrapper = function(x) {
y = arma11(alpha=1, beta=1, initx=x, N=1000)
return(head(y,1) - tail(y,1))
}
xvals = rep(0.5,1000)
MCLoutput = unlist(mclapply(xvals, arma_wrapper, mc.cores = no.cores))
So now MCLoutput
is a vector of length 1000 containing the differences between the first and last values of time series generated from an \(ARMA(1,1)\) model with \(\alpha=1\) and \(\beta=1\).
head(MCLoutput)
## [1] -5.373153 -3.740134 -88.112201 53.205845 -4.410669 21.850910
mean(MCLoutput)
## [1] -0.2121396
Since this process iterates at every time step, and is run 1000 times on top of that, it makes a good benchmark for parallel computing. We can also construct a foreach
loop that will carry out the same task. The foreach
function is supplied by the foreach
library. It is similar to a for
loop, but each iteration does not depend on the previous ones. Instead, foreach
runs the contents of the loop in parallel a specified number of times.
library(foreach)
library(doParallel)
## Loading required package: iterators
registerDoParallel(no.cores)
FEoutput = foreach(i=1:1000) %dopar% {
y = arma11(initx=0.5,N=1000)
head(y,1) - tail(y,1)
}
The foreach
loop set up here performs the same process as the arma_wrapper
function earlier: at each iteration it simulates an \(ARMA(1,1)\) process with \(N=1000\) time steps, for 1000 iterations in total.
head(unlist(FEoutput))
## [1] -22.303263 -7.588862 20.798046 41.559121 14.506028 29.758766
mean(unlist(FEoutput))
## [1] -0.8435657
Now that all of the parallel methods are set up, we can time them and compare them against a version that does not use parallelism at all.
system.time(mclapply(xvals, arma_wrapper, mc.cores = no.cores))
## user system elapsed
## 5.749 0.835 2.056
system.time(foreach(i=1:1000) %dopar% {
y = arma11(initx=0.5,N=1000)
head(y,1) - tail(y,1)
})
## user system elapsed
## 6.023 0.882 1.847
system.time(for(i in 1:1000){
arma11(initx=0.5,N=1000)
})
## user system elapsed
## 4.795 0.007 4.802
The elapsed times show that both parallel methods significantly sped up computation compared with the non-parallel loop, roughly halving the wall-clock time; in this run, foreach
was marginally the fastest. Note the contrast with the serial case, where apply
was slower than a simple loop.
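One caveat: mclapply parallelises by forking the R process, which is not available on Windows (there, setting mc.cores greater than 1 is an error). A portable sketch uses a socket cluster with parLapply instead; the toy squaring function here stands in for a wrapper like arma_wrapper, which would need to be sent to the workers with clusterExport first.

```r
library(parallel)
cl = makeCluster(2)   # socket cluster: works on all platforms, including Windows
# For a real job, export any functions/data the workers need, e.g.
# clusterExport(cl, c("arma11", "arma_wrapper"))
out = unlist(parLapply(cl, 1:4, function(i) i^2))
stopCluster(cl)
out   # out is c(1, 4, 9, 16)
```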