Frequently Asked Questions

See our documentation homepage for information about our most common topics.

  1. I have a new phone. How do I move my Duo onto it?
  2. How do I acknowledge use of CURC Resources?
  3. How do I check how full my Summit directories are?
  4. When will my job start?
  5. How can I get system metrics?
  6. How much memory did my job use?
  7. How can I see my current FairShare priority?
  8. Why is my job pending with reason ‘ReqNodeNotAvail’?
  9. Why do I get the following ‘Invalid Partition’ error when I run my job?: sbatch: error: Batch job submission failed: Invalid partition name specified.
  10. Why do I get the following ‘Invalid Partition’ error when I run a Blanca job?
  11. How can I check what allocations I belong to?
  12. Why do I get the following ‘LMOD’ error when I try to load slurm/summit?: Lmod has detected the following error: The following module(s) are unknown: "slurm/summit"
  13. How do I install my own python library?
  14. Why does my PetaLibrary allocation report less storage than I requested?
  15. Why is my JupyterHub session pending with reason ‘QOSMaxSubmitJobPerUserLimit’?

I have a new phone. How do I move my Duo onto it?

You can add a new device to your Duo account by visiting https://duo.colorado.edu. After the CU authorization page, you will be directed to a Duo authentication page. Ignore the Duo Push prompt and instead click “Add a new device”:

_images/duo_new_device1.png

Duo will then try to verify your identity by sending a push notification. Cancel this push notification…

_images/duo_new_device2.png

…and click on “Enter a Passcode”, or “Call Me”.

  • If you select “Call Me”, then simply receive the call and press 1.
  • If you select “Enter a Passcode”, then click “Text me new codes” and you will be sent a list of one-time passwords. Type in any one of the codes and you will be authenticated.

Once you have verified your identity, follow the instructions provided by Duo to add your device.

If you cannot authenticate your account (e.g., you no longer have your old device), contact rc-help@colorado.edu for further assistance.

How do I check how full my Summit directories are?

You have three directories allocated to your username ($USER). These include /home/$USER (2 G), /projects/$USER (250 G) and /scratch/summit/$USER (10 T). To see how much space you’ve used in each, from a Summit ‘scompile’ node, type curc-quota as follows:

[user@shas0100 ~]$ curc-quota
------------------------------------------------------------------------
                                       Used         Avail    Quota Limit
------------------------------------------------------------------------
/home/janedoe                          1.7G          339M           2.0G
/projects/janedoe                       67G          184G           250G
/scratch/summit                         29G        10211G         10240G

You can also check the amount of space being used by any directory with the du -sh command or the directory’s contents with the du -h command:

[janedoe@shas0136 ~]$ du -h /scratch/summit/janedoe/WRF
698M	WRF/run
698M	WRF
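To see which subdirectories account for most of a directory's usage, the du output can be sorted by size. The following is a minimal sketch, assuming GNU du and sort are available; largest_dirs is a hypothetical helper name, not a CURC-provided tool.

```shell
# Print a directory and its immediate subdirectories, largest first.
# Assumes GNU du (--max-depth); largest_dirs is a hypothetical helper name.
largest_dirs() {
    # -k reports sizes in KiB so that sort -n can order them numerically;
    # the first line of output is the total for the directory itself.
    du -k --max-depth=1 "${1:-.}" | sort -rn
}
```

For example, largest_dirs /scratch/summit/$USER would list your scratch directory total followed by its subdirectories in descending order of size.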

When will my job start?

You can pull up information on your job’s start time using the squeue command:

squeue --user=your_rc-username --start

Note that Slurm’s estimated start time can be inaccurate. Slurm bases this estimate on the jobs that are currently running or queued in the system, so any job added later with a higher priority may delay your job.

For more information on the squeue command, take a look at our Useful Slurm Commands tutorial or visit the Slurm page on squeue.

You can also see system-level wait times and how they change over time by visiting the CURC metrics portal at https://xdmod.rc.colorado.edu.

How can I get metrics about CURC systems such as how busy they are, wait times, and account usage?

Visit the CURC metrics portal at https://xdmod.rc.colorado.edu.

How much memory did my job use?

You can check how much memory your job used by using the sacct command. Simply replace YYYY-MM-DD with the date you ran the job:

sacct --starttime=YYYY-MM-DD --jobs=your_job-id --format=User,JobName,JobId,MaxRSS

If you’d like to monitor memory usage on jobs that are currently running, use the sstat command:

sstat --jobs=your_job-id --format=JobID,MaxRSS

For more information on the sstat or sacct commands, take a look at our Useful Slurm Commands tutorial or visit the Slurm reference pages on sstat and sacct.
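MaxRSS is typically printed with a unit suffix (for example, a value ending in K), which can be awkward to compare against a memory request made in gigabytes. The following is a minimal sketch of a conversion helper, assuming the value ends in K, M, G, or T or is a plain byte count; maxrss_to_gib is a hypothetical name, not a Slurm or CURC tool.

```shell
# Convert a MaxRSS value such as "2097152K" into GiB.
# Assumes a K, M, G, or T suffix, or a plain byte count;
# maxrss_to_gib is a hypothetical helper name.
maxrss_to_gib() {
    echo "$1" | awk '{
        value = $1 + 0                    # numeric prefix; the suffix is ignored
        unit = substr($1, length($1))     # last character of the field
        if      (unit == "K") gib = value / 1024 / 1024
        else if (unit == "M") gib = value / 1024
        else if (unit == "G") gib = value
        else if (unit == "T") gib = value * 1024
        else                  gib = value / 1024 / 1024 / 1024
        printf "%.2f\n", gib
    }'
}
```

You would pass it the MaxRSS value copied from the sacct or sstat output, for example maxrss_to_gib 2097152K.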

How can I see my current FairShare priority?

There are a couple of ways you can check your FairShare priority:

  1. Using the levelfs tool in the slurmtools module. levelfs shows the current fair share priority of a specified user.

    You can use this tool by first loading in the slurmtools module (available from login nodes):

    $ module load slurmtools
    

    Tip: slurmtools is packed with lots of great features and tools like suacct, suuser, jobstats, seff, etc.

    Then using levelfs on your username:

    $ levelfs $USER
    
    • A value of 1 indicates average priority compared to other users in an account.
    • A value of < 1 indicates lower than average priority (longer than average queue waits)
    • A value of > 1 indicates higher than average priority (shorter than average queue waits)

  2. Using the sshare command:

    sshare -U -l
    

    The sshare command will print out a table of information regarding your usage and priority on all allocations. The -U flag specifies the current user and the -l flag prints out more details in the table. The field we are looking for is LevelFS, which holds a number from 0 to infinity that describes the fair share of an association in relation to its siblings in an account. Over-serviced accounts will have a LevelFS between 0 and 1. Under-serviced accounts will have a LevelFS greater than 1. Accounts that haven’t run any jobs will have a LevelFS of infinity (inf).

    For more information on fair share and the sshare command, take a look at Slurm’s documentation on fair share or check out the Slurm reference page on sshare.
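The LevelFS ranges described above can be captured in a small helper. This is a minimal sketch for illustration only; describe_levelfs is a hypothetical name and is not part of slurmtools or Slurm.

```shell
# Classify a LevelFS value using the ranges described above.
# describe_levelfs is a hypothetical helper name, not part of slurmtools.
describe_levelfs() {
    case "$1" in
        inf) echo "no recorded usage" ;;
        *)   awk -v v="$1" 'BEGIN {
                 if (v + 0 < 1)      print "over-serviced"
                 else if (v + 0 > 1) print "under-serviced"
                 else                print "average"
             }' ;;
    esac
}
```

You would pass it the value from the LevelFS column of the sshare output, for example describe_levelfs 0.5.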

Why is my job pending with reason ‘ReqNodeNotAvail’?

The ‘ReqNodeNotAvail’ message usually means that your node has been reserved for maintenance during the period you have requested in your job script. This message often occurs in the days leading up to our regularly scheduled maintenance, which is performed on the first Wednesday of every month. For example, if you submit a job with a 72-hour wall clock request on the first Monday of the month, you will receive the ‘ReqNodeNotAvail’ message because the node is reserved for maintenance within that 72-hour window. You can confirm whether the requested node has a reservation by typing scontrol show reservation to list all active reservations.

If you receive this message, the following solutions are available:

  1. Run a shorter job that does not intersect the maintenance window.

    You can update your current job’s time limit so that it does not intersect the maintenance window using the scontrol command:

    $ scontrol update jobid=<jobid> time=<time>

  2. Wait until the maintenance window has finished; your job will then start automatically.
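For the first option, it helps to know how much time remains before the maintenance reservation begins. The following is a minimal sketch, assuming GNU date is available and that you have read the reservation start time from the scontrol show reservation output; minutes_until is a hypothetical helper name, and the timestamp in the usage note is only an example.

```shell
# Print the whole minutes between now and a given start time.
# Assumes GNU date (the -d option); minutes_until is a hypothetical helper.
# An optional second argument overrides "now", which is useful for testing.
minutes_until() {
    start_epoch=$(date -d "$1" +%s) || return 1
    now_epoch=$(date -d "${2:-now}" +%s) || return 1
    echo $(( (start_epoch - now_epoch) / 60 ))
}
```

For example, minutes_until "2024-01-03 06:00:00" prints the number of minutes left before that time. A job whose time limit is shorter than this, with some margin for queue wait, would not intersect the window.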

Why do I get an ‘Invalid Partition’ error when I try to run a job?

This error usually means you do not have an allocation that provides the service units (SUs) required to run a job. It can occur if you have no valid allocation, specify an invalid allocation, or specify an invalid partition. Think of SUs as “HPC currency”: you need an allocation of SUs to use the system. Allocations are free. New CU users should automatically be added to a ‘ucb-general’ allocation upon account creation, which provides a modest allocation of SUs for running small jobs and testing/benchmarking codes. However, if this allocation expires and you do not have a new one, you will see this error. ‘ucb-general’ allocations are intended for benchmarking and testing, and users are expected to move to a project allocation. To request a Project and apply for a Project Allocation, visit our allocation site.

Why do I get an ‘Invalid Partition’ error when I try to run a Blanca job?

If you are getting an ‘invalid partition’ error on a Blanca job that you know you have access to or have had access to before, you may be in the slurm/summit or slurm/alpine scheduler instance. From a login node, run module load slurm/blanca to access the Slurm job scheduler instance for Blanca, then try to resubmit your job.

How can I check what allocations I belong to?

You can check the allocations you belong to with the sacctmgr command. Simply type:

sacctmgr -p show associations user=$USER

…from a login or compile node. This will print out an assortment of information including the allocations and QoS available to you. For more information on sacctmgr, check out Slurm’s documentation.
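Because the -p flag produces pipe-delimited output with a header row, the allocation names can be pulled out with a short filter. The following is a minimal sketch, assuming the header contains a column named Account; list_accounts is a hypothetical helper name, not a Slurm or CURC tool.

```shell
# Print the unique values of the Account column from pipe-delimited
# "sacctmgr -p show associations" output read on standard input.
# list_accounts is a hypothetical helper name.
list_accounts() {
    awk -F'|' '
        # Header row: remember which field is named "Account".
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Account") col = i; next }
        # Data rows: print the account if the column was found and is non-empty.
        col && $col != "" { print $col }
    ' | sort -u
}
```

It would be used as sacctmgr -p show associations user=$USER | list_accounts.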

Why do I get an ‘LMOD’ error when I try to load Slurm?

The slurm/summit module environment cannot be loaded from compile or compute nodes. It should only be loaded from login nodes when switching between the Blanca and Summit environments. This error can be disregarded, as no harm is done.

How do I install my own python library?

Although Research Computing provides commonly used Python libraries as modules, you may need to install individual Python libraries for your research. This is best handled by using Research Computing’s Anaconda installation to set up a local Conda environment.

Find out more about using Python with Anaconda here.

Why does my allocation report less storage than I requested?

Every ZFS-based PetaLibrary allocation has snapshots enabled by default. ZFS snapshots are read-only representations of a ZFS filesystem at the time the snapshot is taken. Read more about ZFS Snapshots

PetaLibrary allocation sizes are set with quotas, and ZFS snapshot use does count against your quota. Removing a file from your filesystem will only return free space to your filesystem if no snapshots reference the file. Filesystem free space does not increase until a file on a filesystem and all snapshots referencing said file are removed. Because snapshots can cause confusion about how space is utilized within an allocation, the default snapshot schedule discards snapshots that are more than one week old.

If you would like to set a custom snapshot schedule for your allocation, please contact rc-help@colorado.edu. Note that the longer you retain snapshots, the longer it will take to free up space by deleting files from your allocation.

Why is my JupyterHub session pending with reason ‘QOSMaxSubmitJobPerUserLimit’?

JupyterHub on CURC runs as a Slurm compute job in the shas-interactive partition. The shas-interactive partition provides users with rapid start times but is limited to a single core/node. This means only one instance of JupyterHub (or any other job using the interactive partitions) can run at a time.

To spawn another JupyterHub job, you first need to close the current job.

You can do so either by shutting down your current JupyterHub server or by canceling your job manually.
