Local Polynomial Regression with Rcpp
Introduction
Rcpp sugar and Armadillo are two components of the Rcpp ecosystem that serve different purposes:
- Sugar provides an array of vectorised functions in Rcpp that mirror inbuilt base R functions, such as cbind, sum, sin, sample and many more. These make it straightforward to write simple, readable code in Rcpp (a toy example is sketched just after this list).
- Armadillo is a C++ linear algebra library, made available through the RcppArmadillo package. It allows specification of matrices, cubes (3D arrays), vectors and other linear algebra objects and operations.
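As a toy illustration of sugar (not part of the portfolio code; the function name sugar_demo is just a placeholder), the snippet below uses the sum and sin sugar functions directly inside C++:
library(Rcpp)
sourceCpp(code='
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sugar_demo(NumericVector x) {
  // sum() and sin() are Rcpp sugar functions mirroring their base R counterparts
  return sum(sin(x));
}')
sugar_demo(c(0, pi/2, pi))  # matches sum(sin(c(0, pi/2, pi))) in R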
This portfolio will explain the use of key functions and features of Armadillo and sugar by implementing local polynomial regression on a data set of solar electricity production in Sydney. We can load this data into the R environment first.
load("solarAU.RData")
head(solarAU)
## prod toy tod logprod
## 8832 0.019 0.000000e+00 0 -3.540459
## 8833 0.032 5.708088e-05 1 -3.170086
## 8834 0.020 1.141618e-04 2 -3.506558
## 8835 0.038 1.712427e-04 3 -3.036554
## 8836 0.036 2.283235e-04 4 -3.079114
## 8837 0.012 2.854044e-04 5 -3.816713
The column prod is a production measure, and logprod is the log of the production measure, which will be the response variable. The covariates are toy, the time of year expressed as a proportion between 0 and 1, and tod, the time of day from 0 to 47, measured in half-hour intervals. A simple polynomial regression model is of the form
\[
\mathbb{E}(y|\boldsymbol{x}) = \beta_0 + \beta_1\text{tod} + \beta_2\text{tod}^2 + \beta_3\text{toy} + \beta_4 \text{toy}^2 = \tilde{\boldsymbol{x}}^T\boldsymbol{\beta}.
\]
Fitting a linear regression model
We can use Armadillo to fit a linear regression model by solving the least squares optimisation problem \[ \hat{\boldsymbol{\beta}} := \operatorname{argmin}_{\boldsymbol{\beta}}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\|^2, \] which has solution \[ \boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}. \] Rather than forming \(\boldsymbol{X}^T\boldsymbol{X}\) directly, this can be implemented in C++ using the thin QR decomposition \(\boldsymbol{X} = \boldsymbol{Q}\boldsymbol{R}\), so that \(\hat{\boldsymbol{\beta}}\) solves the triangular system \(\boldsymbol{R}\hat{\boldsymbol{\beta}} = \boldsymbol{Q}^T\boldsymbol{y}\), which is cheaper and more numerically stable.
library(Rcpp)
sourceCpp(code='
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// Least squares estimate via the thin QR decomposition of X
// [[Rcpp::export(name="lm_cpp")]]
arma::vec lm_cpp_I(arma::vec& y, arma::mat& X)
{
  arma::mat Q, R;
  arma::qr_econ(Q, R, X);                     // X = QR, with R square and upper triangular
  arma::vec beta = solve(R, (trans(Q) * y));  // solve the triangular system R beta = Q^T y
  return beta;
}')
ls = function(formula, data){
  y = data[, all.vars(formula)[1]]   # response variable named on the left of the formula
  x = model.matrix(formula, data)    # design matrix built from the right-hand side
  lm_cpp(y, x)
}
Note that sourceCpp is called here with its code argument, so that instead of reading code from a file in the working directory, the source is supplied directly as a string. Running sourceCpp compiles the code and adds the lm_cpp function to the R environment. An R wrapper is then set up which takes a formula and a data argument, builds the response vector and model matrix, and runs the Rcpp function. This makes it directly comparable to lm. First, though, we can check that both functions return the same parameter estimates.
ls(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU)
## [,1]
## [1,] -6.26275685
## [2,] 0.86440391
## [3,] -0.01757599
## [4,] -5.91806924
## [5,] 6.14298863
lm(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU)$coef
## (Intercept) tod I(tod^2) toy I(toy^2)
## -6.26275685 0.86440391 -0.01757599 -5.91806924 6.14298863
But which is faster?
library(microbenchmark)
microbenchmark(
R_lm = lm(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU),
C_lm = ls(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU),
times = 500
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## R_lm 3.531021 4.047208 5.079068 4.205777 5.222140 57.54048 500
## C_lm 2.522135 2.927327 3.615145 3.026044 3.654492 37.26725 500
So the C++ code is approximately twice as fast with this method, and would be faster still without the R wrapper function. However, the R function lm performs a number of checks and computes a number of statistics that the C++ code does not, which explains part of the performance gap.
For a fairer comparison, we can set up the model matrices in advance and perform the same operations in R and in Rcpp.
R_ls_solve = function(y, X){
  QR = qr(X)                                    # QR decomposition of X
  beta = solve(qr.R(QR), (t(qr.Q(QR)) %*% y))   # solve R beta = Q^T y
  return(beta)
}
y = solarAU$logprod
X = with(solarAU, cbind(1, tod, tod^2, toy, toy^2))
microbenchmark(R_solve = R_ls_solve(y,X),
C_solve = lm_cpp(y, X),
times = 500)
## Unit: microseconds
## expr min lq mean median uq max neval
## R_solve 1910.686 2189.3570 5022.8070 2464.270 3779.8105 79083.490 500
## C_solve 449.401 518.7625 690.3906 584.342 687.4495 8261.433 500
So this comes with a speed-up of roughly ten times, illustrating how much more computationally efficient the C++ code is than the equivalent R code. However, there is a (non-computational) problem with the model, which a plot of the residuals from the linear fit reveals.
beta = ls(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU)
res = y - X %*% beta
library(ggplot2)
library(gridExtra)
predplot = ggplot(solarAU, mapping = aes(x=toy, y=tod, z= X%*%beta)) +
stat_summary_hex() + xlab("Time of Year") + ylab("Time of Day") +
theme(legend.title = element_blank())
resplot = ggplot(solarAU, mapping=aes(x=toy, y = tod, z = res)) +
stat_summary_hex() + xlab("Time of Year") + ylab("Time of Day") +
theme(legend.title = element_blank())
grid.arrange(predplot, resplot, ncol=2)
There is a clear pattern in the residuals! This means that there is structure in the data that the model does not capture. We need to extend the model to account for this.
Local Regression
We can improve the model fit by adopting a local regression approach, that is, making the parameter estimates depend on the covariates \(\boldsymbol{x}\), i.e. \(\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}(\boldsymbol{x})\). This is achieved by solving, for each fixed \(\boldsymbol{x}_0\), the weighted least squares problem \[ \hat{\boldsymbol{\beta}}(\boldsymbol{x}_0) = \operatorname{argmin}_{\boldsymbol{\beta}} \sum^n_{i=1}\kappa_{\boldsymbol{H}}(\boldsymbol{x}_0 - \boldsymbol{x}_i)(y_i-\tilde{\boldsymbol{x}}_i^T\boldsymbol{\beta})^2, \] where \(\kappa_{\boldsymbol{H}}\) is a density kernel with positive definite bandwidth matrix \(\boldsymbol{H}\). Writing \(w_i = \kappa_{\boldsymbol{H}}(\boldsymbol{x}_0 - \boldsymbol{x}_i)\), this is an ordinary least squares problem in the rescaled data \(\sqrt{w_i}\,y_i\) and \(\sqrt{w_i}\,\tilde{\boldsymbol{x}}_i\), so the QR solver from the previous section can be reused. Fitting this model involves re-fitting the linear model once for every prediction point, which is expensive for large data sets and is exactly where computationally efficient C++ code pays off. We can now write the local regression functions in Rcpp.
// Fit the local model at a single point x0 and return the prediction X0 * beta(x0)
vec local_lm_I(vec& y, rowvec x0, rowvec X0, mat& x, mat& X, mat& H)
{
  mat Hstar = chol(H, "lower");   // lower Cholesky factor of the bandwidth matrix
  vec w = dmvnInt(x, x0, Hstar);  // Gaussian kernel weights for every observation
  vec sw = sqrt(w);
  vec ys = y % sw;                // rescaled response
  mat Xs = X.each_col() % sw;     // rescale each column of the design matrix
  vec beta = lm_cpp_I(ys, Xs);    // weighted least squares via the QR solver
  return X0 * beta;
}

// Loop over all prediction points, fitting the local model at each one
// [[Rcpp::export(name="local_lm_fit")]]
vec local_lm_fit_I(vec& y, mat x0, mat X0, mat& x, mat& X, mat& H)
{
  int n0 = x0.n_rows;
  vec out(n0);
  for(int ii = 0; ii < n0; ii++)
  {
    rowvec x00 = x0.row(ii);
    rowvec X00 = X0.row(ii);
    out(ii) = as_scalar(local_lm_I(y, x00, X00, x, X, H));
    if(ii % 50 == 0) { R_CheckUserInterrupt(); }   // allow user interrupts during long loops
  }
  return out;
}
These two functions, together with lm_cpp from earlier, combine to implement local regression. local_lm_fit_I loops over all rows of x0 and X0 and fits a linear regression for each fixed \(\boldsymbol{x}_0\), using weights from a Gaussian kernel implemented as a multivariate normal density (given by dmvnInt, defined separately in the same file). local_lm_I simply calculates the weights and pre-multiplies their square roots into \(\boldsymbol{y}\) and \(\boldsymbol{X}\) before calling the fitting function. We can source these functions and run them on a subset of the data set.
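The helper dmvnInt is defined in lm_cpp.cpp rather than shown above. Purely for intuition, a stand-in for such a helper could be sketched as follows (this is an assumption about its behaviour, not the actual definition; it supposes the file includes RcppArmadillo.h and uses namespace arma, and that the third argument is the lower Cholesky factor of the bandwidth matrix):
// Sketch only: a plausible stand-in for a Gaussian-weight helper like dmvnInt.
// Returns the multivariate normal density of each row of x, with mean x0 and
// covariance H = L * L.t(), where L is the lower Cholesky factor of H.
vec dmvn_sketch(const mat& x, const rowvec& x0, const mat& L)
{
  unsigned int n = x.n_rows, d = x.n_cols;
  vec dg = L.diag();
  double cnst = -0.5 * d * std::log(2.0 * arma::datum::pi) - sum(log(dg));  // log normalising constant
  vec out(n);
  for(unsigned int i = 0; i < n; i++)
  {
    // Solve L z = (x_i - x0)^T, so dot(z, z) is the squared Mahalanobis distance
    vec z = solve(trimatl(L), (x.row(i) - x0).t());
    out(i) = std::exp(cnst - 0.5 * dot(z, z));
  }
  return out;
}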
sourceCpp('lm_cpp.cpp')
nsub = 2000
sub = sample(1:nrow(solarAU), nsub)        # rows at which to make local predictions
y = solarAU$logprod
x = as.matrix(solarAU[c("tod", "toy")])    # raw covariates used to compute the kernel weights
X = model.matrix(~tod+toy+I(tod^2)+I(toy^2), data=solarAU)   # polynomial design matrix
x0 = x[sub, ]
X0 = X[sub, ]
H = diag(c(1, 0.01))                       # bandwidths for tod and toy
cpp_pred_local = local_lm_fit(y, x0, X0, x, X, H)
We can plot both the predictions and the residuals at these points.
predPlot = ggplot(mapping=aes(x=x0[,2], y = x0[,1], z = cpp_pred_local)) + stat_summary_hex() +
xlab("Time of Year") + ylab("Time of Day") + theme(legend.title = element_blank())
resPlot = ggplot(mapping=aes(x=x0[,2], y = x0[,1], z = y[sub] - cpp_pred_local)) + stat_summary_hex() +
xlab("Time of Year") + ylab("Time of Day") + theme(legend.title = element_blank())
grid.arrange(predPlot, resPlot, ncol=2)
These look a lot better! There is no longer a systematic pattern in the residuals, which suggests that the model is not missing anything important. Now we can check this C++ code against a basic R implementation.
library(mvtnorm)
lmLocal <- function(y, x0, X0, x, X, H){
  w <- dmvnorm(x, x0, H)               # Gaussian kernel weights
  fit <- lm(y ~ -1 + X, weights = w)   # weighted linear regression, no extra intercept
  return( t(X0) %*% coef(fit) )        # prediction at x0
}
R_pred_local <- sapply(1:nsub, function(ii){
lmLocal(y = y, x0 = x0[ii, ], X0 = X0[ii, ], x = x, X = X, H = diag(c(1, 0.1)^2))
})
all.equal(R_pred_local, as.vector(cpp_pred_local))
## [1] TRUE
All of these predictions are equal in R and Rcpp, so how do the run times compare? These functions take a long time to run, so we will only time each of them once.
system.time(sapply(1:nsub, function(ii){
lmLocal(y = y, x0 = x0[ii, ], X0 = X0[ii, ], x = x, X = X, H = diag(c(1, 0.1)^2))
}))
## user system elapsed
## 23.527 0.076 23.606
system.time(
local_lm_fit(y, x0, X0, x, X, H)
)
## user system elapsed
## 872.113 0.309 145.901
So the Rcpp code has come with an approximate ten-times speed-up again. However, this model could still be improved: the bandwidth matrix \(\boldsymbol{H}\) was chosen arbitrarily, and a better choice could be found by cross-validation. Whilst a full implementation is not shown here, a similar approach to the model fitting and comparisons above could be taken; a rough sketch of one possible bandwidth search is given below.
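As an untested sketch only (the grid of bandwidths and all object names below are illustrative, and it assumes y, x, X and local_lm_fit are defined as above), a simple hold-out search over candidate bandwidth matrices could reuse local_lm_fit like this:
# Sketch only: score each candidate H by its prediction error on held-out rows.
set.seed(1)
val   = sample(1:nrow(X), 1000)   # held-out rows used to score each bandwidth
train = setdiff(1:nrow(X), val)   # rows used to fit the local models
H_grid = list(diag(c(0.5, 0.005)), diag(c(1, 0.01)), diag(c(2, 0.02)))

val_mse = sapply(H_grid, function(H){
  pred = local_lm_fit(y[train], x[val, ], X[val, ], x[train, ], X[train, ], H)
  mean((y[val] - pred)^2)         # mean squared prediction error on the held-out rows
})
H_best = H_grid[[which.min(val_mse)]]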