Local Polynomial Regression with Rcpp
Introduction
Rcpp sugar and Armadillo are two components of the Rcpp ecosystem that serve different purposes:
- Sugar provides an array of vectorised functions in Rcpp that mirror inbuilt base R functions, such as cbind, sum, sin, sample and many more. These make it straightforward to write simple, readable code in Rcpp (a toy example is sketched just after this list).
- Armadillo is a C++ linear algebra library, made available through the RcppArmadillo package. It allows specification of matrices, cubes (3D arrays), vectors and other linear algebra objects and operations.
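As a toy illustration of sugar (not part of the portfolio code; the function name sugar_demo is just a placeholder), the snippet below uses the sum and sin sugar functions directly inside C++:
library(Rcpp)
sourceCpp(code='
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sugar_demo(NumericVector x) {
  // sum() and sin() are Rcpp sugar functions mirroring their base R counterparts
  return sum(sin(x));
}')
sugar_demo(c(0, pi/2, pi))  # matches sum(sin(c(0, pi/2, pi))) in R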
This portfolio will explain the use of key functions and features of Armadillo and sugar by implementing local polynomial regression on a data set of solar electricity production in Sydney. We can load this data into the R environment first.
load("solarAU.RData")
head(solarAU)
## prod toy tod logprod
## 8832 0.019 0.000000e+00 0 -3.540459
## 8833 0.032 5.708088e-05 1 -3.170086
## 8834 0.020 1.141618e-04 2 -3.506558
## 8835 0.038 1.712427e-04 3 -3.036554
## 8836 0.036 2.283235e-04 4 -3.079114
## 8837 0.012 2.854044e-04 5 -3.816713
The column prod is a production measure, and logprod is the log of the production measure, which will be the response variable. The covariates are toy, the time of year expressed as a proportion between 0 and 1, and tod, the time of day from 0 to 47, measured in half-hour intervals. A simple polynomial regression model is of the form
\[
\mathbb{E}(y|\boldsymbol{x}) = \beta_0 + \beta_1\text{tod} + \beta_2\text{tod}^2 + \beta_3\text{toy} + \beta_4 \text{toy}^2 = \tilde{\boldsymbol{x}}^T\boldsymbol{\beta}.
\]
Fitting a linear regression model
We can use Armadillo to fit a linear regression model by solving the least squares optimisation problem \[ \hat{\boldsymbol{\beta}} := \operatorname{argmin}_{\boldsymbol{\beta}}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\|^2, \] which has solution \[ \boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}. \] Rather than forming \(\boldsymbol{X}^T\boldsymbol{X}\) directly, this can be implemented in C++ using the thin QR decomposition \(\boldsymbol{X} = \boldsymbol{Q}\boldsymbol{R}\), so that \(\hat{\boldsymbol{\beta}}\) solves the triangular system \(\boldsymbol{R}\hat{\boldsymbol{\beta}} = \boldsymbol{Q}^T\boldsymbol{y}\), which is cheaper and more numerically stable.
library(Rcpp)
sourceCpp(code='
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// Least squares estimate via the thin QR decomposition of X
// [[Rcpp::export(name="lm_cpp")]]
arma::vec lm_cpp_I(arma::vec& y, arma::mat& X)
{
  arma::mat Q, R;
  arma::qr_econ(Q, R, X);                     // X = QR, with R square and upper triangular
  arma::vec beta = solve(R, (trans(Q) * y));  // solve the triangular system R beta = Q^T y
  return beta;
}')
ls = function(formula, data){
  y = data[, all.vars(formula)[1]]   # response variable named on the left of the formula
  x = model.matrix(formula, data)    # design matrix built from the right-hand side
  lm_cpp(y, x)
}
Note that sourceCpp is called here with its code argument, so that instead of reading code from a file in the working directory, the source is supplied directly as a string. Running sourceCpp compiles the code and adds the lm_cpp function to the R environment. An R wrapper is then set up which takes a formula and a data argument, builds the response vector and model matrix, and runs the Rcpp function. This makes it directly comparable to lm. First, though, we can check that both functions return the same parameter estimates.
ls(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU)
## [,1]
## [1,] -6.26275685
## [2,] 0.86440391
## [3,] -0.01757599
## [4,] -5.91806924
## [5,] 6.14298863
lm(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU)$coef
## (Intercept) tod I(tod^2) toy I(toy^2)
## -6.26275685 0.86440391 -0.01757599 -5.91806924 6.14298863
But which is faster?
library(microbenchmark)
microbenchmark(
R_lm = lm(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU),
C_lm = ls(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU),
times = 500
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## R_lm 3.531021 4.047208 5.079068 4.205777 5.222140 57.54048 500
## C_lm 2.522135 2.927327 3.615145 3.026044 3.654492 37.26725 500
So the C++ code is approximately twice as fast with this method, and would be faster still without the R wrapper function. However, the R function lm performs a number of checks and computes a number of statistics that the C++ code does not, which explains part of the performance gap.
For a fairer comparison, we can set up the model matrices in advance and perform the same operations in R and in Rcpp.
R_ls_solve = function(y, X){
  QR = qr(X)                                    # QR decomposition of X
  beta = solve(qr.R(QR), (t(qr.Q(QR)) %*% y))   # solve R beta = Q^T y
  return(beta)
}
y = solarAU$logprod
X = with(solarAU, cbind(1, tod, tod^2, toy, toy^2))
microbenchmark(R_solve = R_ls_solve(y,X),
C_solve = lm_cpp(y, X),
times = 500)
## Unit: microseconds
## expr min lq mean median uq max neval
## R_solve 1910.686 2189.3570 5022.8070 2464.270 3779.8105 79083.490 500
## C_solve 449.401 518.7625 690.3906 584.342 687.4495 8261.433 500
So this comes with a speed-up of roughly ten times, illustrating how much more computationally efficient the C++ code is than the equivalent R code. However, there is a (non-computational) problem with the model, which a plot of the residuals from the linear fit reveals.
beta = ls(logprod ~ tod + I(tod^2) + toy + I(toy^2), data = solarAU)
res = y - X %*% beta
library(ggplot2)
library(gridExtra)
predplot = ggplot(solarAU, mapping = aes(x=toy, y=tod, z= X%*%beta)) +
stat_summary_hex() + xlab("Time of Year") + ylab("Time of Day") +
theme(legend.title = element_blank())
resplot = ggplot(solarAU, mapping=aes(x=toy, y = tod, z = res)) +
stat_summary_hex() + xlab("Time of Year") + ylab("Time of Day") +
theme(legend.title = element_blank())
grid.arrange(predplot, resplot, ncol=2)
There is a clear pattern in the residuals! This means that there is structure in the data that the model does not capture. We need to extend the model to account for this.
Local Regression
We can improve the model fit by adopting a local regression approach, that is, making the parameter estimates depend on the covariates \(\boldsymbol{x}\), i.e. \(\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}(\boldsymbol{x})\). This is achieved by solving, for each fixed \(\boldsymbol{x}_0\), the weighted least squares problem \[ \hat{\boldsymbol{\beta}}(\boldsymbol{x}_0) = \operatorname{argmin}_{\boldsymbol{\beta}} \sum^n_{i=1}\kappa_{\boldsymbol{H}}(\boldsymbol{x}_0 - \boldsymbol{x}_i)(y_i-\tilde{\boldsymbol{x}}_i^T\boldsymbol{\beta})^2, \] where \(\kappa_{\boldsymbol{H}}\) is a density kernel with positive definite bandwidth matrix \(\boldsymbol{H}\). Writing \(w_i = \kappa_{\boldsymbol{H}}(\boldsymbol{x}_0 - \boldsymbol{x}_i)\), this is an ordinary least squares problem in the rescaled data \(\sqrt{w_i}\,y_i\) and \(\sqrt{w_i}\,\tilde{\boldsymbol{x}}_i\), so the QR solver from the previous section can be reused. Fitting this model involves re-fitting the linear model once for every prediction point, which is expensive for large data sets and is exactly where computationally efficient C++ code pays off. We can now write the local regression functions in Rcpp.
// Fit the local model at a single point x0 and return the prediction X0 * beta(x0)
vec local_lm_I(vec& y, rowvec x0, rowvec X0, mat& x, mat& X, mat& H)
{
  mat Hstar = chol(H, "lower");   // lower Cholesky factor of the bandwidth matrix
  vec w = dmvnInt(x, x0, Hstar);  // Gaussian kernel weights for every observation
  vec sw = sqrt(w);
  vec ys = y % sw;                // rescaled response
  mat Xs = X.each_col() % sw;     // rescale each column of the design matrix
  vec beta = lm_cpp_I(ys, Xs);    // weighted least squares via the QR solver
  return X0 * beta;
}

// Loop over all prediction points, fitting the local model at each one
// [[Rcpp::export(name="local_lm_fit")]]
vec local_lm_fit_I(vec& y, mat x0, mat X0, mat& x, mat& X, mat& H)
{
  int n0 = x0.n_rows;
  vec out(n0);
  for(int ii = 0; ii < n0; ii++)
  {
    rowvec x00 = x0.row(ii);
    rowvec X00 = X0.row(ii);
    out(ii) = as_scalar(local_lm_I(y, x00, X00, x, X, H));
    if(ii % 50 == 0) { R_CheckUserInterrupt(); }   // allow user interrupts during long loops
  }
  return out;
}
These two functions, together with lm_cpp from earlier, combine to implement local regression. local_lm_fit_I loops over all rows of x0 and X0 and fits a linear regression for each fixed \(\boldsymbol{x}_0\), using weights from a Gaussian kernel implemented as a multivariate normal density (given by dmvnInt, defined separately in the same file). local_lm_I simply calculates the weights and pre-multiplies their square roots into \(\boldsymbol{y}\) and \(\boldsymbol{X}\) before calling the fitting function. We can source these functions and run them on a subset of the data set.
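The helper dmvnInt is defined in lm_cpp.cpp rather than shown above. Purely for intuition, a stand-in for such a helper could be sketched as follows (this is an assumption about its behaviour, not the actual definition; it supposes the file includes RcppArmadillo.h and uses namespace arma, and that the third argument is the lower Cholesky factor of the bandwidth matrix):
// Sketch only: a plausible stand-in for a Gaussian-weight helper like dmvnInt.
// Returns the multivariate normal density of each row of x, with mean x0 and
// covariance H = L * L.t(), where L is the lower Cholesky factor of H.
vec dmvn_sketch(const mat& x, const rowvec& x0, const mat& L)
{
  unsigned int n = x.n_rows, d = x.n_cols;
  vec dg = L.diag();
  double cnst = -0.5 * d * std::log(2.0 * arma::datum::pi) - sum(log(dg));  // log normalising constant
  vec out(n);
  for(unsigned int i = 0; i < n; i++)
  {
    // Solve L z = (x_i - x0)^T, so dot(z, z) is the squared Mahalanobis distance
    vec z = solve(trimatl(L), (x.row(i) - x0).t());
    out(i) = std::exp(cnst - 0.5 * dot(z, z));
  }
  return out;
}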
sourceCpp('lm_cpp.cpp')
nsub = 2000
sub = sample(1:nrow(solarAU), nsub)        # rows at which to make local predictions
y = solarAU$logprod
x = as.matrix(solarAU[c("tod", "toy")])    # raw covariates used to compute the kernel weights
X = model.matrix(~tod+toy+I(tod^2)+I(toy^2), data=solarAU)   # polynomial design matrix
x0 = x[sub, ]
X0 = X[sub, ]
H = diag(c(1, 0.01))                       # bandwidths for tod and toy
cpp_pred_local = local_lm_fit(y, x0, X0, x, X, H)
We can plot both the predictions and the residuals at these points.
predPlot = ggplot(mapping=aes(x=x0[,2], y = x0[,1], z = cpp_pred_local)) + stat_summary_hex() +
xlab("Time of Year") + ylab("Time of Day") + theme(legend.title = element_blank())
resPlot = ggplot(mapping=aes(x=x0[,2], y = x0[,1], z = y[sub] - cpp_pred_local)) + stat_summary_hex() +
xlab("Time of Year") + ylab("Time of Day") + theme(legend.title = element_blank())
grid.arrange(predPlot, resPlot, ncol=2)
These look a lot better! There is no longer a systematic pattern in the residuals, which suggests that the model is not missing anything important. Now we can check this C++ code against a basic R implementation.
library(mvtnorm)
lmLocal <- function(y, x0, X0, x, X, H){
  w <- dmvnorm(x, x0, H)               # Gaussian kernel weights
  fit <- lm(y ~ -1 + X, weights = w)   # weighted linear regression, no extra intercept
  return( t(X0) %*% coef(fit) )        # prediction at x0
}
R_pred_local <- sapply(1:nsub, function(ii){
lmLocal(y = y, x0 = x0[ii, ], X0 = X0[ii, ], x = x, X = X, H = diag(c(1, 0.1)^2))
})
all.equal(R_pred_local, as.vector(cpp_pred_local))
## [1] TRUE
All of these predictions are equal in R and Rcpp, so how do the run times compare? These functions take a long time to run, so we will only time each of them once.
system.time(sapply(1:nsub, function(ii){
lmLocal(y = y, x0 = x0[ii, ], X0 = X0[ii, ], x = x, X = X, H = diag(c(1, 0.1)^2))
}))
## user system elapsed
## 23.527 0.076 23.606
system.time(
local_lm_fit(y, x0, X0, x, X, H)
)
## user system elapsed
## 872.113 0.309 145.901
So the Rcpp code has come with an approximate ten-times speed-up again. However, this model could still be improved: the bandwidth matrix \(\boldsymbol{H}\) was chosen arbitrarily, and a better choice could be found by cross-validation. Whilst a full implementation is not shown here, a similar approach to the model fitting and comparisons above could be taken; a rough sketch of one possible bandwidth search is given below.
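As an untested sketch only (the grid of bandwidths and all object names below are illustrative, and it assumes y, x, X and local_lm_fit are defined as above), a simple hold-out search over candidate bandwidth matrices could reuse local_lm_fit like this:
# Sketch only: score each candidate H by its prediction error on held-out rows.
set.seed(1)
val   = sample(1:nrow(X), 1000)   # held-out rows used to score each bandwidth
train = setdiff(1:nrow(X), val)   # rows used to fit the local models
H_grid = list(diag(c(0.5, 0.005)), diag(c(1, 0.01)), diag(c(2, 0.02)))

val_mse = sapply(H_grid, function(H){
  pred = local_lm_fit(y[train], x[val, ], X[val, ], x[train, ], X[train, ], H)
  mean((y[val] - pred)^2)         # mean squared prediction error on the held-out rows
})
H_best = H_grid[[which.min(val_mse)]]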