Section 5: High Performance Computing with Bluecrystal

Introduction

We’ve all been in the situation where we want to perform some extremely demanding computation, such as running MCMC or manipulating an extremely large data frame many times over, but our laptops just aren’t good enough. They don’t have enough cores, they have no dedicated GPU, and they only have a small amount of memory. Luckily for academics and PhD students, universities generally sympathise with us. A high performance computing (HPC) cluster is a collection of highly powerful computers, located somewhere remote, that can be accessed over the network and used to run your code from the terminal.

This portfolio will detail the use of Bluecrystal, the supercomputer available at the University of Bristol. These machines are accessed through the terminal, so a basic knowledge of navigating and manipulating file structures in bash is required. Section 4 details the fundamentals of bash if the reader is not already familiar.

Bluecrystal Phase 3

This tutorial will focus on Bluecrystal Phase 3, that is, the third generation of Bluecrystal machines. The cluster is made up of 312 compute nodes, which are where the processes are run. The basic nodes have the following specifications:

  • 64GB RAM
  • 300TB storage
  • 16 cores
  • InfiniBand high-speed network

There are also large memory nodes, which have 256GB of RAM, and GPU nodes, each of which has an NVIDIA Tesla K20, an extremely powerful GPU.

Connecting to the HPC Cluster

To access the HPC cluster, you need to log in via SSH (secure shell) from a University of Bristol connection (or a VPN). The ssh command in bash is your friend here. To log in (assuming you have an account), you run the following command:

ssh your_username@bluecrystalp3.bris.ac.uk

It will then ask for your password, which you enter at the prompt. You will then be on the log-in node. Note that you should not run any code on the log-in node, as it is only intended for connecting to the compute nodes and managing your files and jobs. If you run any large code on the log-in node, you will slow down the HPC cluster for everyone else.

File System

Within the log-in node, you will have your own personal file directory where you can store your files and your code. By default, after logging in you will be in this directory, so if you use the ls command you will see the contents of your personal directory immediately. You can write code here, through the terminal, or copy it into your directory from your own personal computer. You are free to make directories and files here, as the contents of your directory will be read by the compute nodes when you want to run a job.
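For example, you might set up a folder to hold a project's scripts and data like this (the directory name is just a placeholder):

mkdir my_project        # create a folder for this project's scripts and data
cd my_project           # move into it
ls                      # list its (currently empty) contents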

To copy a file from your own computer to your file space on the HPC cluster, you can use the scp command in bash. This command is run from your own computer (not from within the cluster) and looks like

scp path_to_my_file/file.txt your_username@bluecrystalp3.bris.ac.uk:path_inside_hpc/

Performing this operation would copy file.txt in folder path_to_my_file into the path_inside_hpc folder on your directory in the HPC cluster. If you want to do this the other way around, and copy something from your HPC cluster file system to your personal computer, just switch the order of the arguments to scp, but always do it from your own machine.
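For example, a sketch of the reverse copy, pulling a (hypothetical) results file from the cluster back to a folder on your own machine:

scp your_username@bluecrystalp3.bris.ac.uk:path_inside_hpc/results.txt path_on_my_computer/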

Running Jobs

To run a job, you must write a bash script that tells the compute node what to do. This script is interpreted on the compute node and run accordingly. Below is a general template for what such a job script might look like:

#!/bin/bash
#
#PBS -l nodes=2:ppn=1,walltime=24:00:00

# working directory
export WORK_DIR=$HOME
cd $WORK_DIR

# print to output
echo JOB ID: $PBS_JOBID
echo Working Directory `pwd`

# run something
/bin/hostname

The first line #!/bin/bash tells the system to interpret the script as bash, and the third line #PBS -l nodes=2:ppn=1,walltime=24:00:00 tells the HPC cluster what resources you want for the job: here, two nodes with one processor per node, for a maximum of 24 hours of walltime. You can change these arguments to suit your needs, e.g. increase the walltime if you think your code will run for more than 24 hours.
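For example, a job that needs a single node with all 16 cores for up to 72 hours might use a directive like the following (a sketch; choose resources to match your own code):

#PBS -l nodes=1:ppn=16,walltime=72:00:00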

You must save this bash script as something like run_R.sh and then, when logged into Bluecrystal, use the qsub command to submit the job to the queue, i.e.

qsub run_R.sh

which would add this job to the queue. Since there are many people who use the HPC cluster, your job may not start immediately, and you might have to wait. You may have to wait longer if your requested walltime is particularly high, as you will be waiting for enough nodes to become available.

Other Functions

As well as qsub, there are other commands that you can use to interact with the HPC cluster and your jobs. Some notable ones are

  • qstat gives a list of current jobs being run and those in the queue
  • qstat -u user_name gives a list of current jobs queued and running by user_name
  • qstat job_id gives information about the job job_id being run
  • qdel job_id deletes a job with a given job_id
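As a sketch of how these fit together, after submitting a job you might check on it and, if necessary, cancel it (the job ID here is a placeholder):

qstat -u your_username      # list your own queued and running jobs
qdel 123456                 # cancel the job with ID 123456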

Running Different Code

Running other programming languages on the HPC cluster can be a bit of a faff. To run code such as Python or R on the cluster, you must first load the module associated with that particular language. On the log-in node, you can run

module avail

to get a list of all available modules. There will be a lot. Choose one that you like; I am a personal fan of languages/R-3.6.2-gcc9.1.0. You can load a module with module load module_name, for example

module load languages/R-3.6.2-gcc9.1.0

which will allow you to run R and R scripts. To submit a job that runs an R script, you must add this module load line to the job script before the line that runs your code. To run an R script from bash, you use

Rscript script_name.R
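Putting this together, a complete job script that runs an R script might look something like the following sketch (my_script.R and the requested resources are placeholders to change for your own job):

#!/bin/bash
#
#PBS -l nodes=1:ppn=1,walltime=12:00:00

# working directory
export WORK_DIR=$HOME
cd $WORK_DIR

# print to output
echo JOB ID: $PBS_JOBID
echo Working Directory `pwd`

# load the R module
module load languages/R-3.6.2-gcc9.1.0

# run the R script
Rscript my_script.R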

Using R Packages

Since packages cannot be installed globally on the log-in node, you can install them locally instead. First, type the command R into bash to start an R session, and then run install.packages("package_name"). It will ask you whether you want the package to be installed locally, to which you say yes.
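As a sketch, the whole process from the log-in node looks something like this (ggplot2 is just an example package):

R                              # start an interactive R session on the log-in node
install.packages("ggplot2")    # run inside R; accept the prompt to install into a personal library
q()                            # quit R when you are done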

After this, all packages installed on your local file system on the log-in node will be accessible as normal when running job scripts.

Daniel Williams
CDT Student

I am a PhD student studying at the University of Bristol under the COMPASS CDT, and previously studied at the University of Exeter. My research currently concerns truncated density estimation and unnormalised models, but I am also interested in AI more generally, including all the learnings: Machine, Deep and Reinforcement (as well as some others!).