# Common R

## Performing operations on vectors

In general, there are three methods that can be used to perform the same operation (such as a mathematical operation) on every element in a vector \(\boldsymbol{x}\). A simple way of doing this is with a **loop**, which iterates once for each element of \(\boldsymbol{x}\) and performs the operation one element at a time. **Vectorisation** refers to applying the same operation to every element of a vector at once, whereas the `apply` function applies any function across a vector in a single line of code.

### Comparison between vectorisation, loops and `apply`

We can test the efficiency of vectorisation as opposed to using a loop or an `apply` function. We will construct three pieces of code that do the same thing: apply the function \(f(x) = \sin(x)\) to every element of a vector consisting of the natural numbers up to 100,000. We start by creating the vector:

`x = seq(1,100000)`

Now we can create three functions: one that uses a for loop, one that uses `apply`, and one that works by vectorising.

```
loop_function = function(x){
  y = numeric(length(x))
  for(i in seq_along(x)){
    y[i] = sin(x[i])
  }
  return(y)
}

apply_function = function(x){
  y = apply(as.matrix(x), 1, sin)
  return(y)
}

vector_function = function(x){
  y = sin(x)
  return(y)
}
```

Now that the functions are constructed, we can use the inbuilt R function `system.time` to measure how long each function takes to run.

`system.time(loop_function(x))`

```
## user system elapsed
## 0.058 0.000 0.080
```

`system.time(apply_function(x))`

```
## user system elapsed
## 0.312 0.000 0.312
```

`system.time(vector_function(x))`

```
## user system elapsed
## 0.006 0.000 0.005
```

Naturally, none of these computations takes very long, as taking the sine isn’t very complex. However, you can still see the order in which the functions rank. In general, vectorisation will be more efficient than loops or `apply` functions, and loops are faster than `apply`.

There are cases where it will not be possible to use vectorisation to carry out a task on an array. In these cases, it is necessary to construct a function to pass to `apply`, or to perform the operations within a loop. In general, loops are faster and more flexible, as they allow you to do more in each iteration than a single function call. `apply` is useful for making a simple process neater in the code: if you are doing something relatively straightforward, using `apply` instead of a loop saves space and makes the code more readable.
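As a concrete illustration of a task that cannot be vectorised, consider a recursion in which each value depends on the previous one. The sequence below is a made-up example, not one from the notes above:

```r
# Each x[i] depends on x[i-1], so all elements cannot be computed at once;
# a loop (or a fold such as Reduce) is required.
recursive_sequence = function(n, x1 = 1){
  x = numeric(n)
  x[1] = x1
  for(i in 2:n){
    x[i] = x[i-1]/2 + 1
  }
  return(x)
}
recursive_sequence(4)   # 1.000 1.500 1.750 1.875
```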

It is common practice to vectorise your code whenever you can, as it comes with a significant speed increase over both loops and `apply` functions.

### Other functions

There are different variants of the `apply` function depending on how your data are structured and what form of output you want.

- `apply(X, MARGIN, FUN, ...)` is the basic apply function. `MARGIN` refers to which dimension is kept when performing the function. For example, `apply(x, 1, sum)` sums across the columns within each row, so the result has one value per row.
- `lapply(X, FUN, ...)` returns a list as its output, each element of the list corresponding to the given function applied to the corresponding element of `X`.
- `sapply(X, FUN, ...)` is a wrapper around `lapply` that simplifies the output so that it is not in list form.
- `mapply(FUN, ...)` is a multivariate version of `sapply`: it accepts more than one input to the function, and returns a value for each set of inputs.

#### Map, Reduce and Filter

`Map` maps a function over a vector, similar to `lapply`. For example:

```
x = seq(1, 3, by=1)
f = function(a) a+5
M = Map(f,x)
M
```

```
## [[1]]
## [1] 6
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] 8
```

This has added 5 to every element of `x`, returning a list with one output per element. In fact, `Map` performs the same operation as `mapply` does, which we can see in the function itself:

`Map`

```
## function (f, ...)
## {
## f <- match.fun(f)
## mapply(FUN = f, ..., SIMPLIFY = FALSE)
## }
## <bytecode: 0x55bfca7455a8>
## <environment: namespace:base>
```
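Since `Map` is built on `mapply`, it also accepts several parallel inputs; unlike `mapply`, it never simplifies, so the result is always a list. A small illustrative example:

```r
# Add corresponding elements of two vectors; the output stays a list.
Map(function(a, b) a + b, 1:3, c(10, 20, 30))   # list(11, 22, 33)
```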

`Reduce` performs a given function on pairs of elements in a vector. The procedure iterates from left to right and returns a single value; it can instead run from right to left by adding the argument `right=TRUE`. As an example, consider division:

```
f = function(x, y) x/y
x = seq(1, 3, by=1)
Reduce(f, x)
```

`## [1] 0.1666667`

`Reduce(f, x, right=TRUE)`

`## [1] 1.5`

In the first case, `Reduce` divided 1 by 2, then divided that result by 3, giving \(1/6\). In the second case the fold ran from the right: first 2 was divided by 3, then 1 was divided by that result, giving \(1/(2/3) = 1.5\).
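Two further arguments of `Reduce` are worth knowing: `accumulate = TRUE` keeps every intermediate result, and `init` supplies a starting value for the fold. With the same division function:

```r
f = function(x, y) x/y
Reduce(f, 1:3, accumulate = TRUE)   # 1.0000000 0.5000000 0.1666667
Reduce(f, 1:3, init = 100)          # ((100/1)/2)/3 = 16.66667
```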

`Filter` will ‘filter’ an array down to the values that satisfy a given condition. For example:

```
x = seq(1,5)
condition = function(x) x > 3
Filter(condition,x)
```

`## [1] 4 5`

`Filter` is similar to indexing an array using `TRUE`/`FALSE` values, but instead of indexing with a logical array, it indexes using a function. We can inspect the interior of the function:

`Filter`

```
## function (f, x)
## {
## ind <- as.logical(unlist(lapply(x, f)))
## x[which(ind)]
## }
## <bytecode: 0x55bfc77914d8>
## <environment: namespace:base>
```

So in fact, `Filter` simply uses `lapply` to compute the `TRUE`/`FALSE` values, then indexes the input `x` with a simple subsetting.
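To confirm the equivalence with logical subsetting, we can compare the two approaches directly:

```r
x = seq(1, 5)
condition = function(x) x > 3
identical(Filter(condition, x), x[x > 3])   # TRUE: both give 4 5
```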

## Parallel Computing

By using the `parallel` package, you can make use of all the processing cores on your computer. Naturally, if you only have a single-core processor this is irrelevant, but most modern computers have 2, 4, 8 or more cores, and parallel computing allows R to run up to that many processes at the same time. Many important tasks in R can be sped up with parallel computing, for example MCMC: using \(n\) cores allows you to run \(n\) chains at once, with (in theory) no slowdown.

Supercomputers generally have an extremely large number of cores, so being able to run code in parallel is important in computationally expensive programming jobs.

There are some disadvantages: splitting a process across four cores also requires roughly four times as much memory. If your memory isn’t sufficient for the number of cores you are using, this will cause a significant slowdown.

### Using `mclapply` or `foreach`

There are two main methods to parallelise a set of commands (or a function) in R. The first is `mclapply`, a parallel version of `lapply`; the second is a `foreach` loop. To illustrate how these work, consider the example of an \(ARMA(1,1)\) model, which has an equation of the form
\[
x_t = \epsilon_t + \alpha\epsilon_{t-1} + \beta x_{t-1}
\]
\[
\epsilon_t \sim N(0, 1)
\]
A function that generates an \(ARMA(1,1)\) process can be written as:

```
arma11 = function(alpha=0.5, beta=1, initx, N=1000){
  x = eps = numeric(N)
  x[1] = initx
  eps[1] = rnorm(1, 0, 1)
  eps[2] = rnorm(1, 0, 1)
  x[2] = eps[2] + alpha*eps[1] + beta*x[1]
  for(i in 3:N){
    eps[i] = rnorm(1, 0, 1)
    x[i] = eps[i] + alpha*eps[i-1] + beta*x[i-1]
  }
  return(x)
}
```

This generates a vector of \(x\) values for the timesteps \(t=1,\dots,N\). We can plot the result by running a simulated \(ARMA\) time series of length \(N=1000\).

`plot(1:1000, arma11(initx=0.5,N=1000),type="l", xlab="Time (t)", ylab=expression(x), main="Time Series Example")`

Now that the functions are set up, we can configure the computer to work in parallel. This involves loading the required packages and detecting the number of cores available.

```
library(parallel)
library(MASS)
no.cores = detectCores()
no.cores
```

`## [1] 8`

So we have this many cores to work with (on my laptop, there are 8). The `no.cores` variable will be passed to the parallel computing functions. We can now use `mclapply` to simulate this \(ARMA\) model a large number of times, and calculate the difference between the first and last values as a statistic. By putting the `arma11` function in a wrapper, we can pass it to `mclapply`.

```
arma_wrapper = function(x) {
  y = arma11(alpha=1, beta=1, initx=x, N=1000)
  return(head(y, 1) - tail(y, 1))
}
xvals = rep(0.5, 1000)
MCLoutput = unlist(mclapply(xvals, arma_wrapper, mc.cores = no.cores))
```

So now `MCLoutput` is a vector of length 1000 containing the differences between the first and last values of time series generated from an \(ARMA(1,1)\) model with \(\alpha=1\) and \(\beta=1\).

`head(MCLoutput)`

`## [1] -5.373153 -3.740134 -88.112201 53.205845 -4.410669 21.850910`

`mean(MCLoutput)`

`## [1] -0.2121396`

Since this process iterates at every time step, and is run 1000 times on top of that, it makes a good benchmark for testing the efficiency of parallel computing. We can also construct a `foreach` loop that will carry out the same task. The `foreach` function is supplied by the `foreach` library. It is similar to a `for` loop, but no iteration may depend on the result of a previous one; instead, `foreach` runs the contents of the loop in parallel a specified number of times.

```
library(foreach)
library(doParallel)
```

`## Loading required package: iterators`

```
registerDoParallel(no.cores)
FEoutput = foreach(i=1:1000) %dopar% {
  y = arma11(initx=0.5, N=1000)
  head(y, 1) - tail(y, 1)
}
```

The `foreach` loop set up above performs the same process as the `arma_wrapper` function did earlier: at each iteration it simulates an \(ARMA(1,1)\) process with \(N=1000\) time steps, and it does this 1000 times.

`head(unlist(FEoutput))`

`## [1] -22.303263 -7.588862 20.798046 41.559121 14.506028 29.758766`

`mean(unlist(FEoutput))`

`## [1] -0.8435657`

Now that all of the parallel methods are set up, we can time them and compare them with a version that does not use parallel computing at all.

`system.time(mclapply(xvals, arma_wrapper, mc.cores = no.cores))`

```
## user system elapsed
## 5.749 0.835 2.056
```

```
system.time(foreach(i=1:1000) %dopar% {
  y = arma11(initx=0.5, N=1000)
  head(y, 1) - tail(y, 1)
})
```

```
## user system elapsed
## 6.023 0.882 1.847
```

```
system.time(for(i in 1:1000){
  arma11(initx=0.5, N=1000)
})
```

```
## user system elapsed
## 4.795 0.007 4.802
```

Going by elapsed time, both parallel methods significantly sped up computation, roughly halving the run time relative to the non-parallel version. In this run the `foreach` loop was in fact marginally faster than `mclapply`. Note that this contrasts with the serial case, where `apply` was slower than a simple loop.
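One housekeeping detail worth adding: after `registerDoParallel(no.cores)`, the `doParallel` package provides `stopImplicitCluster()` to release the workers once you are finished. A minimal sketch, where the squaring task is just a placeholder:

```r
library(foreach)
library(doParallel)

registerDoParallel(2)                             # register two workers
res = foreach(i = 1:4, .combine = c) %dopar% i^2  # .combine = c returns a vector
stopImplicitCluster()                             # release the workers
res   # 1 4 9 16
```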