Optimisation
Numerical Optimisation
The general idea in optimisation is to find a minimum (or maximum) of some function. Generally, our problem has the form
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is the objective function (a maximisation problem can always be rewritten as minimising $-f$).
Optimising a complicated function
To demonstrate the different optimisation methods, and the speed and ability of each, consider optimising the Rastrigin function. This is a non-convex function that takes the form
$$f(x) = A n + \sum_{i=1}^{n} \left[ x_i^2 - A \cos(2\pi x_i) \right],$$
where $A$ is a constant and $n$ is the dimension of $x$; here we take $A = 5$ and $n = 2$. We can plot the function in R (here with the image function; an interactive version could be made with the plotly package) to inspect it.
A = 5  # amplitude constant in the Rastrigin function
n = 2  # dimension of the problem

# Rastrigin function evaluated at a vector x
f = function(x) A*n + sum(x^2 - A*cos(2*pi*x))

# evaluate f on a grid and plot it
x1 = seq(-10, 10, length.out = 100)
x2 = seq(-10, 10, length.out = 100)
xy = expand.grid(x1, x2)
z = apply(xy, 1, f)
dim(z) = c(length(x1), length(x2))
z.plot = list(x = x1, y = x2, z = z)
image(z.plot, xlab = "x", ylab = "y", main = "Rastrigin Function")
So we are interested in optimising the function f. We can see from inspection of the plot that there is a global minimum at $x^* = (0, 0)$, where the function evaluates to zero:

f(c(0,0))

## [1] 0
So we will evaluate the optimisation methods based on how close they get to this true solution. The rest of this portfolio explains the different optimisation methods and evaluates their performance in finding the global minimum of the Rastrigin function.
When the objective is differentiable, its gradient and Hessian can often be derived analytically. For the Rastrigin function with $n = 2$, the gradient is
$$\nabla f(x) = \begin{pmatrix} 2x_1 + 2\pi A \sin(2\pi x_1) \\ 2x_2 + 2\pi A \sin(2\pi x_2) \end{pmatrix},$$
and the Hessian is diagonal with entries $2 + 4\pi^2 A \cos(2\pi x_i)$. In R:
# analytical gradient of the Rastrigin function (n = 2)
grad_f = function(x) {
  c(2*x[1] + 2*pi*A*sin(2*pi*x[1]),
    2*x[2] + 2*pi*A*sin(2*pi*x[2]))
}

# analytical Hessian of the Rastrigin function (n = 2); it is diagonal
hess_f = function(x){
  H11 = 2 + 4*pi^2*A*cos(2*pi*x[1])
  H22 = 2 + 4*pi^2*A*cos(2*pi*x[2])
  return(matrix(c(H11, 0, 0, H22), 2, 2))
}
These analytical forms of the gradient and Hessian can be supplied to various optimisation algorithms to speed up convergence.
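It is worth sanity-checking analytical derivatives against finite differences before relying on them. The snippet below is a minimal sketch using a central-difference approximation; the helper num_grad and the test point are our own illustrative choices, not part of the original analysis.

# central-difference approximation to the gradient (illustrative helper)
num_grad = function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(i) {
    e = rep(0, length(x)); e[i] = h
    (f(x + e) - f(x - e)) / (2*h)
  })
}

x0 = c(0.3, -0.7)   # arbitrary test point
grad_f(x0)          # analytical gradient
num_grad(f, x0)     # numerical gradient; should agree to several decimal places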
Optimisation problems can be one-dimensional or multi-dimensional, where the dimension refers to the size of the parameter vector; in our case the dimension is $n = 2$.
Optimisation Methods
Gradient Descent Methods
Iterative gradient descent algorithms take the form
$$x_{k+1} = x_k - t \nabla f(x_k),$$
where $t > 0$ is the step size, and the iterations continue until the gradient is (approximately) zero. A basic implementation in R is:
gradient_method = function(f, x, gradient, eps = 1e-4, t = 0.1, maxiter = 1000){
  converged = TRUE
  iterations = 0
  # iterate until every component of the gradient is below the tolerance eps
  while(!all(abs(gradient(x)) < eps)){
    if(iterations > maxiter){
      cat("Not converged, stopping after", iterations, "iterations \n")
      converged = FALSE
      break
    }
    gradf = gradient(x)
    d = -gradf/abs(gradf)   # normalised descent direction (used if a line search is added)
    x = x - t*gradf         # fixed-step gradient descent update
    iterations = iterations + 1
  }
  if(converged){
    cat("Number of iterations:", iterations, "\n")
    cat("Converged!\n")
  }
  return(list(f = f(x), x = x))
}
This code keeps running the while loop until the tolerance condition is satisfied, that is, until every component of the gradient at x is smaller than eps in absolute value, or until maxiter iterations have been exceeded. Trying it on the Rastrigin function from two different starting points:
gradient_method(f, x = c(1, 1), grad_f)
## Not converged, stopping after 1001 iterations
## $f
## [1] 20.44268
##
## $x
## [1] -3.085353 -3.085353
gradient_method(f, x = c(.01, .01), grad_f)
## Not converged, stopping after 1001 iterations
## $f
## [1] 17.82949
##
## $x
## [1] -2.962366 -2.962366
Even when the initial guess of $(0.01, 0.01)$ is very close to the true solution, the method does not converge within the default 1000 iterations. One refinement would be to choose the step size by a backtracking line search, repeatedly shrinking t via while((f(x) - f(x + t*d)) < (-alpha*t * t(gradf) %*% d)) t = beta*t for constants $0 < \alpha, \beta < 1$ (a sketch is given below). Keeping the fixed step here, we instead need to specify a larger maximum number of iterations:
gradient_method(f, c(0.01,0.01), grad_f, maxiter = 10000)
## Number of iterations: 1255
## Converged!
## $f
## [1] 5.002503e-09
##
## $x
## [1] 5.008871e-06 5.008871e-06
Now we finally have convergence! However, this required an initial guess very close to the true solution; in more realistic cases where we do not know the true solution, this fixed-step method is likely to be inefficient and inaccurate. The Newton method is a more advanced development of the basic gradient descent method.
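Before moving on, here is a minimal sketch of the backtracking (Armijo) line search mentioned earlier, applied within the same gradient scheme. The function name gradient_method_ls and the constants alpha and beta are our own illustrative choices, and this variant was not used for the results above.

# gradient descent with a backtracking (Armijo) line search -- illustrative sketch
gradient_method_ls = function(f, x, gradient, eps = 1e-4, alpha = 0.3, beta = 0.8, maxiter = 1000){
  iterations = 0
  while(!all(abs(gradient(x)) < eps) && iterations <= maxiter){
    gradf = gradient(x)
    d = -gradf        # descent direction
    t = 1             # start from a unit step
    # shrink t until the sufficient-decrease (Armijo) condition holds
    while(f(x + t*d) > f(x) + alpha*t*sum(gradf*d)) t = beta*t
    x = x + t*d
    iterations = iterations + 1
  }
  list(f = f(x), x = x, iterations = iterations)
}

# usage: gradient_method_ls(f, c(1, 1), grad_f)

Adapting the step size at each iteration typically reduces the number of iterations needed, at the cost of extra function evaluations per iteration.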
Newton Methods
The Newton method seeks to solve the optimisation problem using evaluations of the Hessian and a local quadratic approximation of the function: each iteration updates $x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)$. The nlm function from base R uses a Newton-type method. It is an expensive algorithm to run, because each iteration involves inverting a matrix, the Hessian of $f$. Analytical derivatives can be supplied to nlm as attributes of the objective, as in the wrapper below; we use nlm to test the Newton method on the Rastrigin function.
# wrapper that attaches the analytical gradient and Hessian as attributes,
# which nlm can use in place of numerical derivatives
f_fornlm = function(x){
  out = f(x)
  attr(out, 'gradient') <- grad_f(x)
  attr(out, 'hessian') <- hess_f(x)
  return(out)
}
nlm(f, c(-4, 4), check.analyticals = TRUE)
## $minimum
## [1] 3.406342e-11
##
## $estimate
## [1] -4.135221e-07 -4.131223e-07
##
## $gradient
## [1] 1.724132e-05 1.732303e-05
##
## $code
## [1] 2
##
## $iterations
## [1] 3
So this converged to essentially the true solution in a surprisingly small number of iterations. The likely reason is that Newton's method builds a local quadratic approximation, and near the minimum the Rastrigin function is dominated by its quadratic term, so the approximation is very accurate.
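For intuition, the Newton update itself can be written in a few lines of R using the analytical grad_f and hess_f defined earlier. This is a bare-bones sketch of the iteration, not the exact algorithm implemented by nlm, which adds safeguards such as step-size control.

# plain Newton iteration: repeatedly solve H(x) dx = grad f(x) and set x <- x - dx
newton_method = function(grad, hess, x, eps = 1e-8, maxiter = 100){
  for(i in 1:maxiter){
    g = grad(x)
    if(all(abs(g) < eps)) break    # stop once the gradient is (numerically) zero
    x = x - solve(hess(x), g)      # Newton step, avoiding an explicit matrix inverse
  }
  list(x = x, iterations = i)
}

# usage: newton_method(grad_f, hess_f, c(-4, 4))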
BFGS
In complex cases, the Hessian cannot be supplied analytically. Even when it can be, in high dimensions the Hessian is a very large matrix, which makes it computationally expensive to invert at each iteration. The BFGS method approximates the Hessian matrix, increasing computability and efficiency. The BFGS method is the most common quasi-Newton method, and it is one of the methods that can be supplied to the optim function. It approximates the Hessian with a matrix $B_k$ that is updated at each iteration.

Initialise $x_0$ and $B_0$ (for example $B_0 = I$), then repeat for $k = 0, 1, 2, \dots$ (a code sketch of a single update follows the list):

- Obtain a direction $d_k$ through the solution of $B_k d_k = -\nabla f(x_k)$
- Obtain a stepsize $t_k$ by line search
- Set $s_k = t_k d_k$
- Update $x_{k+1} = x_k + s_k$
- Set $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
- Update the Hessian approximation $B_{k+1} = B_k + \dfrac{y_k y_k^\top}{y_k^\top s_k} - \dfrac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}$
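A minimal sketch of one such update in R is below. The function name bfgs_step and the line-search constants are illustrative choices, and this is not the implementation used by optim.

# one BFGS iteration: B is the current Hessian approximation
bfgs_step = function(f, grad, x, B, alpha = 1e-4, beta = 0.5){
  g = grad(x)
  d = -solve(B, g)                                          # solve B d = -grad f(x)
  t = 1
  while(f(x + t*d) > f(x) + alpha*t*sum(g*d)) t = beta*t    # backtracking line search
  s = t*d                                                   # step taken
  x_new = x + s
  y = grad(x_new) - g                                       # change in gradient
  B_new = B + (y %*% t(y))/sum(y*s) -
    (B %*% s %*% t(s) %*% B)/sum(s*(B %*% s))
  list(x = x_new, B = B_new)
}

# usage: bfgs_step(f, grad_f, x = c(1, 1), B = diag(2))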
BFGS is typically the fastest of these methods that still carries convergence guarantees (under suitable conditions), but it has its downsides. BFGS stores the dense $n \times n$ approximation matrices $B_k$, so memory use grows quadratically with the dimension; the limited-memory variant L-BFGS addresses this.
This is a very good but complicated method. Luckily, the optim function from the stats package in R has the ability to optimise with the BFGS method. Testing this on the Rastrigin function gives:
optim(c(1,1), f, method="BFGS", gr = grad_f)
## $par
## [1] 0.9899629 0.9899629
##
## $value
## [1] 1.979932
##
## $counts
## function gradient
## 19 3
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(c(.1,.1), f, method="BFGS", gr = grad_f)
## $par
## [1] 4.61081e-10 4.61081e-10
##
## $value
## [1] 0
##
## $counts
## function gradient
## 31 5
##
## $convergence
## [1] 0
##
## $message
## NULL
So the BFGS method actually didn't find the true solution for an initial value of $(1, 1)$, instead stopping at the nearby local minimum around $(0.99, 0.99)$. From the closer starting point $(0.1, 0.1)$ it did find the global minimum at $(0, 0)$, using far fewer function and gradient evaluations than gradient descent.
Non-Linear Least Squares Optimisation
The motivating example we have used throughout this section was concerned with optimising a two-dimensional function, where only two variables controlled the value of the objective $f$. In non-linear least squares problems, the objective has a particular structure: we fit a model $g(x; \theta)$ to observed data $(x_i, y_i)$ by minimising the sum of squared residuals $\sum_i \left(y_i - g(x_i; \theta)\right)^2$ over the parameters $\theta$.
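As a brief illustration, the nls function in the stats package fits such models by non-linear least squares (by default with a Gauss-Newton type algorithm). The sketch below uses simulated data from an exponential decay model; the model form, data, and parameter names are our own choices for illustration, not part of the original analysis.

set.seed(1)
x = seq(0, 5, length.out = 50)
y = 3*exp(-0.8*x) + rnorm(50, sd = 0.1)   # simulate data from y = a*exp(-b*x) + noise
dat = data.frame(x = x, y = y)

# fit the non-linear model by least squares, starting from rough initial guesses
fit = nls(y ~ a*exp(-b*x), data = dat, start = list(a = 1, b = 1))
coef(fit)   # estimates should be close to a = 3, b = 0.8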
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a stochastic approximation to the standard gradient descent method. Instead of calculating the gradient over an entire dataset (which can be extremely large), it calculates the gradient over a small subset (mini-batch) of the data, picked randomly or cycled through deterministically. When the objective is a sum over data points, $f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)$, the update takes the form
$$x_{k+1} = x_k - t_k \nabla f_{I_k}(x_k),$$
where $f_{I_k}$ is the average of the $f_i$ over the mini-batch $I_k$ chosen at iteration $k$, and $t_k$ is the step size (learning rate).
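A minimal sketch of this idea in R, applied to a least-squares objective for linear regression on simulated data, is below. The mini-batch size, learning rate, and the function name sgd are illustrative choices.

set.seed(2)
N = 10000
X = cbind(1, rnorm(N))             # design matrix with an intercept column
beta_true = c(2, -1)
y = X %*% beta_true + rnorm(N, sd = 0.5)

# SGD for the mean-squared-error objective (1/N) * sum_i (y_i - x_i' beta)^2
sgd = function(X, y, beta, t = 0.05, batch = 32, epochs = 20){
  N = nrow(X)
  for(e in 1:epochs){
    for(step in 1:floor(N/batch)){
      idx = sample(N, batch)                    # random mini-batch
      Xi = X[idx, , drop = FALSE]
      resid = as.vector(Xi %*% beta) - y[idx]
      grad = 2 * t(Xi) %*% resid / batch        # mini-batch gradient
      beta = beta - t * as.vector(grad)
    }
  }
  beta
}

sgd(X, y, beta = c(0, 0))   # should approach beta_true = c(2, -1)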