# Performance and Debugging

## Introduction

**Debugging** is an important part of any programming process. It is unlikely that an individual will write their code correctly on the first write up, and it is generally accepted that the code will only become usable after a few debugging iterations. **Perfomance** is the measure of how efficient your code is with respect to speed, so that you can do the same operation in as quick a time as possible.

In this section of the portfolio, the debugging process and the performance aspects will be explained by way of an example. This example will be a large function that is deliberately inefficiently and incorrectly written, and will be fixed and improved using different methods.

## Example: Least Squares with Feature Transform

Take for example least squares regression with the choice of three basis functions; linear, quadratic, or trigonometric. This function can take one output vector \(y\), and one input vector \(x\), and estimate parameters \(w_{LS}\) from the solution \[ \boldsymbol{w}_{LS} = \left(\boldsymbol{\phi}(\boldsymbol{X})\boldsymbol{\phi}(\boldsymbol{X})^T +\lambda \boldsymbol{I}\right)^{-1}\boldsymbol{\phi}(\boldsymbol{X})\boldsymbol{y}^T, \] where \(\boldsymbol{X}\) is the model matrix, and \(\phi\) is a feature transform function. The first version of this function looks like

```
LS.feature.transform.fit <- function(y, x, ft, b){
if(ft == "Polynomial") basis_function = function(x) poly(x, b, raw=TRUE)
if(ft == "Linear") basis_function = function(x) x
if(ft == "Trigonometric") {
basis_function = function(x){
basis = c()
for(i in 1:b){
basis = cbind(basis, sin(i*x), cos(i*x))
}
return(basis)
}
}
Phi = matrix(NA, length(x), length(basis_function(x))+1)
for(i in 1:length(x)){
Phi[i,] = c(1, basis_function(x[i]))
}
wLS = solve(t(Phi) %*% Phi) %*% t(Phi) %*% y
return(wLS)
}
```

This function takes the argument `ft`

, meaning feature transform. The first thing that the function does is try to recognise which feature transform is being input, by a series of three `if`

functions. Then `basis_function`

is assigned to a function corresponding to the correct feature transform. After this, `Phi`

is set up as a matrix with the correct dimensions, that is the length of the data and the dimension of the feature space (including a column of 1’s).

Let’s test this function. Good practice is to create a series of testing functions or scripts that will test as many aspects of the function as possible. This code chunk below will test the function for each basis function.

```
library(lasso2)
data(Prostate)
LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol,"Linear",1)
LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol,"Polynomial",3)
LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol,"Trigonometric",1)
```

### Debugging

Let’s try the function for a linear (identity) feature transform. When this line is run, the following error is returned.

`LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol,"Linear",1)`

```
## Error in t(Phi) %*% Phi :
## requires numeric/complex matrix/vector arguments
```

This error message is not informative enough to go back and fix our function immediately. We can use the `traceback`

function to get more information.

`traceback()`

```
## 2: solve(t(Phi) %*% Phi) at #20
## 1: LS.feature.transform.fit(Prostate$lpsa, Prostate$lcavol, "Linear",
1)
```

This is not as useful here, because it does not say much more than the previous error message. We do now know that the error is on line `#20`

, but the actual problem could be before that, presumably in the definition of `Phi`

. We can also use another debugging function, called `browser()`

, which allows you to open an interactive debugging environment. From here we can interactively view all elements and find where the problem is. *Note that RStudio also allows an interactive debugging environment.*

Let’s take a look at the elements in the function after the definition of `Phi`

. We can first look at the first few rows and columns of `Phi`

. The following code was run after running all other parts of the function up to the definition of `Phi`

:

`Phi[1:5,1:5]`

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 -0.5798185 1 -0.5798185 1
## [2,] 1 -0.9942523 1 -0.9942523 1
## [3,] 1 -0.5108256 1 -0.5108256 1
## [4,] 1 -1.2039728 1 -1.2039728 1
## [5,] 1 0.7514161 1 0.7514161 1
```

The dimension of `Phi`

is wrong, as the columns are being repeated! It should only have two columns, a column of 1’s regarding to the intercept, and a column of each \(\phi(x)\). This error must come from where the dimension is defined, in the line `Phi = matrix(NA, length(x), length(basis_function(x))+1)`

. Looking at `Phi`

in this case, we see we do not want to use the length of the basis function, but instead the number of columns that it contains. So instead we can change `length(basis_function(x))+1`

to `dim(as.matrix(basis_function(x)))[2]+1`

. Now we run the test again.

`LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol,"Linear",1)`

```
## [,1]
## [1,] 1.5072975
## [2,] 0.7193204
```

Now the function works! We can see if the parameter estimates are correct by comparing to the output from a linear model with `lm`

, since we have the known result that \(w_{LS} = w_{MLE}\).

`lm(lpsa ~ lcavol, data = Prostate)$coef`

```
## (Intercept) lcavol
## 1.5072975 0.7193204
```

And ensuring that our function works for the other two feature transforms.

`LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol, "Polynomial", 3)`

```
## [,1]
## [1,] 1.66387139
## [2,] 0.69613468
## [3,] -0.18630511
## [4,] 0.06164228
```

`lm(lpsa ~ poly(lcavol, 3, raw=TRUE), data = Prostate)$coef`

```
## (Intercept) poly(lcavol, 3, raw = TRUE)1
## 1.66387139 0.69613468
## poly(lcavol, 3, raw = TRUE)2 poly(lcavol, 3, raw = TRUE)3
## -0.18630511 0.06164228
```

`LS.feature.transform.fit(Prostate$lpsa,Prostate$lcavol, "Trigonometric", 2)`

```
## [,1]
## [1,] 2.4198088
## [2,] 0.3801217
## [3,] -1.2342905
## [4,] 0.4748111
## [5,] 0.3759252
```

`lm(lpsa ~ sin(lcavol) + cos(lcavol) + sin(2*lcavol) + cos(2*lcavol), data = Prostate)$coef`

```
## (Intercept) sin(lcavol) cos(lcavol) sin(2 * lcavol) cos(2 * lcavol)
## 2.4198088 0.3801217 -1.2342905 0.4748111 0.3759252
```

There are no errors, and our parameter estimates match those from `lm`

, so this debugging has been a success.

### Performance

So far we have only tested this for a length \(n=97\) dataset, so performance has been relatively fast. We can use the function `microbenchmark`

from the `microbenchmark`

package to test how quickly our function runs, and gives a summary of statistics on how quickly the function runs. Before we do that, we want to have a larger data set to see more obvious difference in how our computation times are. We can do this with some simulated data.

```
x.test = runif(10000, 1, 5)
y.test = rexp(10000,rate=1.5*x.test-1)
library(microbenchmark)
# microbenchmark(LS.feature.transform.fit(y.test,x.test,"Polynomial",5))
```

This range of speeds is not bad, but could be improved. To see where we can improve the performance of our code, we can do something called *profiling*. A statistical profiler can determine where the code is spending most of its time being run, by using operating system interrupts. An implementation of profiling in R is provided by the `profvis`

package, and the `profvis`

function. We start by running this on our function.

```
library(profvis)
x = x.test; y = y.test; ft = "Polynomial"; b = 5
profvis({
if(ft == "Polynomial") basis_function = function(x) poly(x, b, raw=TRUE)
if(ft == "Linear") basis_function = function(x) x
if(ft == "Trigonometric") {
basis_function = function(x){
basis = c()
for(i in 1:b){
basis = cbind(basis, sin(i*x), cos(i*x))
}
return(basis)
}
}
Phi = matrix(NA, length(x), dim(as.matrix(basis_function(x)))[2]+1)
for(i in 1:length(x)){
Phi[i,] = c(1, basis_function(x[i]))
}
wLS = solve(t(Phi) %*% Phi) %*% t(Phi) %*% y
})
```

Looking at this graph, we can see that most of the time and memory is spent in the `basis_function`

. This is likely due to the basis function being run at every iteration in the loop. One way of improving this would be to assign elements of `Phi`

in terms of columns instead of rows. This is because R uses *column-major storage*, meaning that when a matrix is stored in memory, it is being assigned in chunks that correspond to columns, not rows. Therefore defining a matrix column-wise needs fewer operations than defining a matrix row-wise.

The performance can be greatly increased by *vectorising* so that the basis function need only be applied once instead of many times (as this is where the slowdown was). This eliminates the loop as well, another source of inefficiency. We make the following changes when defining the matrix \(\boldsymbol{\phi}(\boldsymbol{x})\).

`Phi = as.matrix(cbind(1, basis_function(x)))`

Here we can see that the dimensions do not need to be set up in advance. Now if we benchmark and profile the function again, we get

```
## Unit: milliseconds
## expr min lq
## LS.feature.transform.fit(y.test, x.test, "Polynomial", 5) 7.071979 15.42255
## mean median uq max neval
## 23.93168 20.15049 25.85308 171.5225 100
```

Here the benchmarks are significantly faster, and whilst the function seems to get stuck in the same place, the times that it gets stuck there is order of magnitudes smaller than it was previously.

## Closing Thoughts

In this portfolio section, we explained and showed an extended example of *debugging*, *profiling* and *optimising performance*. There are better ways of implementing all of these things than what was demonstrated here.

For *debugging*, we simply printed out the code where we thought the errors were, but a more rigorous debugging process would have involved an interactive debugger, which was discussed but not implemented. Using the `debug`

function in R would allow RStudio to go through each line of the function and show results at each point. RStudio also allows breakpoints in functions, so that when the function is run, it will stop at a breakpoint instead of having to go through every line.

For *optimising performance*, R is generally inefficient for user written functions and scripts. Code written would be a lot more efficient and faster if it was written in a language such as C. Many core R routines and packages are written in C, greatly improving their efficiency. This can be achieved with the `RCpp`

package, which provides an accessible way of writing efficient R code in C++.