One useful application of CUmulus is the ability to integrate your VMs with CU Research Computing High Performance Computing (HPC) resources. HPC compute is typically time-limited (at CURC 24 hours for regular joband 7 days long job) due finite resources and user competition for those resources. One way to deal with this problem is to schedule your jobs over time (e.g. by using cronjobs) though this isn’t always practical for more complex workflows. Using authentication keys (in this case Java Web Tokens or JWTs) you can setup a connection from your CUmulus instance to CURC HPC and schedule jobs remotely to set up more complex workflow specific pipelines.
We have documented the process for an Ubuntu 20.04 instance on CUmulus connecting to the CURC Blanca cluster. Below is an outline with links to specific sections:
- Create your CUmulus instance
- Install SLURM on CUmlus Instance
- Configure SLURM in CUmulus
- Generate Java Web Token
- Submit Job
Follow the docs below for general HPC integration or follow our step-by-step tutorial.
Instructions for Ubuntu 20.04:¶
Part 1: Create your Instance¶
The first thing we will need to do is create a Ubuntu 20.04 CUmulus instance. Log in to the CUmlus portal and follow our docs to create a CUmulus instance with the following specifications:
- Image: Ubuntu 20.04
- Security groups: ssh-restricted
- Set up a Floating IP
Part 2: Install SLURM on CUmlus Instance¶
The second thing we’ll need to do is install SLURM on our CUmulus instance. To do so we will log into our instance using ssh, update our instance, then install SLURM.
Log in to your instance from a local machine by specifying your ssh key file (with the
-iflag) and the floating IP you set up in step 1:
Note: this command is entered from your local machine’s terminal
$ ssh -i ~/.ssh/<ssh_key> ubuntu@<Floating IP>
From your CUmulus instance update and install SLURM dependencies
Note: these commands require administative (sudo) privilage
$ sudo apt-get update $ sudo apt install -y libmysqlclient-dev libjwt-dev munge gcc make
Install SLURM. It looks like there’s a lot going on with this step, but all we’re doing is downloading SLURM from github to the
/optdirectory of your instance, configuring the compilation to include Java Web Tokens (JWT) functionaliy, and then compiling and installing SLURM. Note that it is VERY IMPORTANT that the SLURM version on your CUmulus instance matches the CURC HPC version otherwise it will not connect.
You can check the SLURM version on CURC HPC resources by loading your cluster specific SLURM module from a login node (either
module load slurm/blancaor
module load slurm/alpine) then checking the version of a SLURM command (e.g.
In this example we’re using the 20.02.4 SLURM version and specifying it with git clone branch flag (
$ cd /opt $ sudo git clone -b slurm-20-02-4-1 https://github.com/SchedMD/slurm.git $ cd slurm $ sudo ./configure --with-jwt --disable-dependency-tracking $ sudo make && sudo make install
Part 3. Configure SLURM in CUmulus¶
Now that we have SLURM installed we can start to configure our instance to make the proper connection to CURC HPC resources. In this step we’ll add/edit the
slurm.conf file, create a user and group for SLURM, create a user and group for you that match your user/group from CURC HPC.
slurm.conf. Here we make a slurm directory at
/etc/slurmand copy the
slurm.conffile straight from CURC HPC using the secure copy (
scp) tool into that new directory.
Note: We are copying the
slurm.conffile from Blanca in this example
$ sudo mkdir -p /etc/slurm $ cd /etc/slurm $ sudo scp <RC_username>@login.rc.colorado.edu:/curc/slurm/blanca/etc/slurm.conf .
to finish the copy from CURC HPC, type your CURC password and accept Duo push.
Edit the following two variables in slurm.conf using a text editor of your choice (i.e. vim, nano):
Create a SLURM user & group. Use the group and user add commands to create the SLURM user and group with the 515 IDs:
sudo groupadd -g 515 slurm sudo useradd -u 515 -g 515 slurm
Configure your user and group on the VM. First we will query our user/group info on CURC HPC. From a CURC login node run the following commands and note the outputs:
[user@login11 ~]$ id -u $USER <userid> [user@login11 ~]$ id -g $USER <groupid> [user@login11 ~]$ whoami <username> [user@login11 ~]$ id -g -n $USER <groupname>
We can now create a user and group for ourselves on our instance that match CURC HPC:
$ sudo groupadd -g <groupid> <groupname> $ sudo useradd -u <userid> -g <groupid> <username>
Part 4. Generate Java Web Token¶
Next we will genrate the Java Web Token on a CURC login node. Keep in mind that these tokens are generated with an expiration. The max time CURC systems will keep a JWT is #TODO. Note the JWT output in order submit jobs.
Note: In this example we are genrating this token on the Blanca cluster
$ module load slurm/blanca $ scontrol token lifespan=72000 #token with 2 hour duration $ SLURM_JWT=<jwt-token>
Part 5. Submit Job¶
Our CUmulus instance is now configured to submit jobs to CURC HPC resources. When we ssh to our instance we are (likely) logged in as the admin user. We need to log in to the user we created in the previous step to submit jobs so CURC HPC will recognize the incoming request.
sudo su -command to log into your user
$ sudo su - <username>
SLURM_CONFvariables in your environemt:
export SLURM_JWT=<jwt-token> export SLURM_CONF=/etc/slurm/slurm.conf
Submit a test job! We’re finally ready to test a job submission from CUmulus. Use the
sbatchcommand to submit a job from the command line (i.e. without a batch script). In the example job submission below we use the
--chdirflag to change into our home directory (so we can easily find the job output) and use the
--wrapcommand to run the simple
hostnamebash command on a CURC compute node.
$ sbatch --qos=<blanca-qos> --chdir="/home/<username>" --wrap="hostname" Submitted batch job 12451234
Note: you can use the
--chdirflag to direct the output to an directory you have write access to.
You have now successfully connected your CUmulus instance to CURC HPC resources!
The CU Research Computing group would like to acknowledge support of this project from the National Science Foundation (OAC-1925766).
Couldn’t find what you need? Provide feedback on these docs!