Blanca¶

CU Research Computing operates a shared “condo” compute cluster, named Blanca, which consists of nodes owned by individual research groups or departments. Condo partners get significantly prioritized access on nodes that they own and can run jobs on any nodes that are not currently in use by other partners.

An allocation of CPU time is not needed in order to run on Blanca.

If you would like to purchase a Blanca node, please visit the Research Computing website for more details.

Blanca Quick-Start¶

If your group is a Blanca partner, ask your PI or PoC to send an email to rc-help@colorado.edu requesting access for you to their high-priority queue.
From a login node, run “module load slurm/blanca” to access the Slurm job scheduler instance for Blanca.
Consult the Table and the Examples section below to learn how to direct your jobs to the appropriate compute nodes.
If needed, compile your application on the appropriate compute node type.

Job Scheduling¶

All jobs are run through a batch/queue system. Interactive jobs on compute nodes are allowed but these must be initiated through the scheduler. Each partner group has its own high-priority QoS (analogous to a queue) for jobs that will run on nodes that it has purchased. High-priority jobs move to the top of the queue and are thus guaranteed to start running within a few minutes, unless other high-priority jobs are already queued or running ahead of them. High-priority jobs can run for a maximum wall time of 7 days. All partners also have access to a low-priority preemptable QoS that can run on any Blanca nodes that are not already in use by their owners. Low-priority jobs have a maximum wall time of 24 hours.

Blanca uses a separate instance of the Slurm scheduling system from the other RC compute resources. You can use Blanca’s Slurm instance by loading a special module on a login node: “module load slurm/blanca”.

More details about how to use Slurm can be found here.

QoS¶

Slurm on Blanca uses “Quality of Service”, or QoS, to classify jobs for scheduling. A QoS in this case is analogous to a “queue” in other scheduling systems. Each partner group has its own high-priority QoS called blanca-<group identifier> and can also use the condo-wide low-priority QoS, which is called preemptable.

If you are a new Blanca user, ask your PI or Point of Contact person to request access for you to your group’s high-priority QoS; requests should be made via email to rc-help@colorado.edu. You are only allowed to use a high-priority QoS if you have been added as a member of it, and you can only use the low-priority preemptable QoS if you are also a member of a high-priority QoS. Your PI may also be able to point you to group-specific documentation regarding Blanca.

Node-QoS-Features¶

Since not all Blanca nodes are identical, you can include node features in your job requests to help the scheduler determine which nodes are appropriate for your jobs to run on when you are using the preemptable QoS.

To determine which nodes exist on the system, type scontrol show nodes to get a list.

Node Features¶

Blanca is a pervasively heterogeneous cluster. A variety of feature tags are applied to nodes deployed in Blanca to allow jobs to target specific CPU, GPU, network, and storage requirements.

Use the sinfo command to determine the features that are available on any node in the cluster.

sinfo --format="%N | %f"

The sinfo query may be further specified to look at the features available within a specific partition.

sinfo --format="%N | %f" --partition="blanca-curc"

Description of features¶

westmere-ex: Intel processor generation
sandybridge: Intel processor generation
ivybridge: Intel processor generation
haswell: Intel processor generation
broadwell: Intel processor generation
skylake: Intel processor generation
cascade: Intel processor generation
avx: AVX processor instruction set
avx2: AVX2 processor instruction set
fdr: InfiniBand network generation
edr: InfiniBand network generation
Quadro: NVIDIA GPU generation
Tesla: NVIDIA GPU generation
k2000: NVIDIA K2000 GPU
T4: NVIDIA T4 GPU
P100: NVIDIA P100 GPU
V100: NVIDIA V100 GPU
A100: NVIDIA A100 GPU
localraid: large, fast RAID disk storage in node
rhel7: RedHat Enterprise Linux version 7 operating system

Requesting GPUs in jobs¶

Using GPUs in jobs requires one to use the General Resource (”Gres”) functionality of Slurm to request the gpu(s). At a minimum, one would specify #SBATCH --gres=gpu in their job script to specify that they would like to use a single gpu of any type. One can also request multiple GPUs on nodes that have more than one, and a specific type of GPU (e.g., V100, A100) if desired. The available Blanca GPU resources and configurations can be viewed as follows on a login node with the slurm/blanca module loaded:

$ sinfo --Format NodeList:30,Partition,Gres |grep gpu |grep -v "blanca "
NODELIST                      PARTITION           GRES
bgpu-bortz1                   blanca-bortz        gpu:t4:1
bgpu-casa1                    blanca-casa         gpu:v100:1
bnode[0201-0236]              blanca-ccn          gpu:k2000:1
bgpu-curc[1-4]                blanca-curc-gpu     gpu:a100:3
bgpu-dhl1                     blanca-dhl          gpu:p100:2
bgpu-kann1                    blanca-kann         gpu:v100:4
bgpu-mktg1                    blanca-mktg         gpu:p100:1
bgpu-papp1                    blanca-papp         gpu:v100:1      

Examples of configurations one could request:

request a single gpu of any type

#SBATCH --gres=gpu

request multiple gpus of any type

#SBATCH --gres=gpu:3

request a single gpu of type NVIDIA V100

#SBATCH --gres=gpu:v100:1

request two gpus of type NVIDIA A100

#SBATCH --gres=gpu:a100:2

Notes:

Examples of full job scripts for GPUs are shown in the next section.
If you will be running preemptable GPU jobs in --qos=preemptable, requesting --gres=gpu will give you the greatest number of GPU options (i.e., your job will run on any Blanca node that has at least one GPU). However, you can request specific GPUs as well (e.g,. gpu:a100), but your jobs will only run on the subset of GPU nodes that has that type.

Examples¶

Here are examples of Slurm directives that can be used in your batch scripts in order to meet certain job requirements. Note that the “constraint” directive constrains a job to run only on nodes with the corresponding feature.

To run a 32-core job for 36 hours on a single blanca-ics node:

#SBATCH --qos=blanca-ics
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=36:00:00

To run a 56-core job across two blanca-sha nodes for seven days:

#SBATCH --qos=blanca-sha
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --time=7-00:00:00
#SBATCH --export=NONE

To run a 16-core job for 36 hours on a single blanca-curc-gpu node, using all three gpus:

#SBATCH --qos=blanca-curc-gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=36:00:00
#SBATCH --gres=gpu:3

To run an 8-core job in the low-priority QoS on any node that has broadwell processors and uses the RHEL 7 operating system:

#SBATCH --qos=preemptable
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=4:00:00
#SBATCH --export=NONE
#SBATCH --constraint="broadwell&rhel7"

To run an 8-core job in the low-priority QoS on any node that has either the AVX or AVX2 instruction set:

#SBATCH --qos=preemptable
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=4:00:00
#SBATCH --export=NONE
#SBATCH --constraint="avx|avx2"

To run an 8-core job in the low-priority QoS on any node that has at least 1 GPU:

#SBATCH --qos=preemptable
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=4:00:00
#SBATCH --export=NONE
#SBATCH --gres=gpu

To start a 2-hr interactive job on one core on a blanca-ceae node, run this at the command line:

sinteractive --qos=blanca-ceae --export=NONE --time=02:00:00

Note that the interactive job won’t start until the resources that it needs are available, so you may not get a command prompt on your assigned compute node immediately.

Important notes¶

To see what modules are available, start an interactive job on a compute node and use module avail or module spider on it.
/home, /projects, and /pl/active (PetaLibrary Active) are available on all Blanca nodes. Scratch I/O can be written to /rc_scratch, which should offer much better performance than /projects. Most Blanca nodes also have at least 400 GB of scratch space on a local disk, available to jobs as $SLURM_SCRATCH. For more info on the different RC storage spaces, please see our page on storage.
There are no dedicated Blanca compile nodes. To build software that will run on Blanca, start an interactive job on a node like the one on which you expect your jobs to run, and compile your software there. Do not compile on the login nodes!
Multi-node MPI jobs that do a lot of inter-process communication do not run well on most standard Blanca nodes. Nodes equipped with specialty fabrics like Blanca CCN or any node on Blanca HPC can run MPI application much more efficiently.

Blanca Preemptable QOS¶

(effective 2018-03-01) Each partner group has its own high-priority QoS (“blanca-”) for jobs that will run on nodes that it has contributed. High-priority jobs can run for up to 7 days. All partners also have access to a low-priority QoS (“preemptable”) that can run on any Blanca nodes that are not already in use by the partners who contributed them. Low-priority jobs will have a maximum time limit of 24 hours, and can be preempted at any time by high-priority jobs that request the same compute resources being used by the low-priority job. The preemption process will terminate the low-priority job with a grace period of up to 120-seconds. Preempted low-priority jobs will then be requeued by default. Additional details follow.

Usage¶

To specify the preemptable QoS in a job script:

#SBATCH --QoS=preemptable

To specify the preemptable QoS for an interactive job:

$ sinteractive --qos=preemptable <other_arguments>

Batch jobs that are preempted will automatically requeue if the exit code is non-zero. (It will be non-zero in most cases.) If you would prefer that jobs not requeue, specify:

#SBATCH --no-requeue

Interactive jobs will not requeue if preempted.

Best practices¶

Checkpointing: Given that preemptable jobs can request wall times up to 24 hours in duration, there is the possibility that users may lose results if they do not checkpoint. Checkpointing is the practice of incrementally saving computed results such that – if a job is preempted, killed, canceled or crashes – a given software package or model can continue from the most recent checkpoint in a subsequent job, rather than starting over from the beginning. For example, if a user implements hourly checkpointing and their 24 hour simulation job is preempted after 22.5 hours, they will be able to continue their simulation from the most recent checkpoint data that was written out at 22 hours, rather than starting over. Checkpointing is an application-dependent process, not something that can be automated on the system end; many popular software packages have checkpointing built in (e.g., ‘restart’ files). In summary, users of the preemptable QoS should implement checkpointing if at all possible to ensure they can pick up where they left off in the event their job is preempted.

Requeuing: Users running jobs that do not require requeuing if preempted should specify the --no-requeue flag noted above to avoid unnecessary use of compute resources.

Example Job Scripts¶

Run a 6-hour preemptable python job on 32 cores without specifying a partition (job will run on any available compute partitions on Blanca, regardless of features, so long as they have at least 16 cores each).

#!/bin/bash
#SBATCH --time=06:00:00
#SBATCH --qos=preemptable
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --output=test.%j.out

module purge
module load python

python myscript.py

Same as Example 1, but specify a specific partition (‘blanca-ccn’) (job will only run on blanca-ccn nodes)

#!/bin/bash
#SBATCH --time=06:00:00
#SBATCH --qos=preemptable
#SBATCH --partition=blanca-ccn
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --output=test.%j.out

module purge
module load python

python myscript.py

Same as Example 1, but specify desired node features, in this case the avx2 instruction set and Redhat V7 OS (job will run on any node meeting these feature requirements, and which has at least 16 cores per node).

#!/bin/bash
#SBATCH --time=06:00:00
#SBATCH --qos=preemptable
#SBATCH --constraint="avx2&rhel7"
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --output=test.%j.out

module purge
module load python

python myscript.py

Other considerations¶

Grace period upon preemption: When jobs are preempted, a 120 second grace period is available to enable users to save and exit their jobs should they have the ability to do so. The preempted job is immediately sent SIGCONT and SIGTERM signals by Slurm in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching the end of the 120 second grace period. Users wishing to do so can monitor the job for the SIGTERM signal and, when detected, take advantage of this 120 second grace period to save and exit their jobs.

The ‘blanca’ QoS: Note that as of 1 March, 2018, the “preemptable” qos replaces the previous low-priority QoS, “blanca”, which is no longer active.