Best practices for running parallel executables on RC Resources

Many scientific software packages are programmed such that they can parallelize tasks across multiple cores on one node (shared memory parallelization) or across multiple cores on multiple nodes (distributed memory parallelization). Either way, running the executable programs that result from compiling parallel-capable software requires MPI ("Message Passing Interface") libraries, which coordinate the passing of information between the parallel tasks.

This documentation covers the best way to run parallel executables across multiple cores and nodes on Alpine and Blanca. Running a parallel executable on Alpine or Blanca requires loading both a compiler module and an MPI module, in addition to any other modules you need. The primary compiler/MPI module "combos" are:

  1. Intel compilers with Intel-MPI (IMPI)
  2. GNU (gcc) compilers with OpenMPI
  3. AMD Optimized (aocc) compilers with OpenMPI (Alpine only)

It is recommended that you always use the Intel/IMPI combo to compile and run your parallel software on Blanca, because Intel/IMPI-compiled codes typically run more efficiently on the Intel CPUs on these systems. The gcc/OpenMPI combo can be used as a fallback if the code will not compile with Intel/IMPI. We do not presently have a recommended compiler/MPI combo for Alpine, which has AMD CPUs.

This documentation assumes you have already compiled your parallel software into an executable program. Additional information on compiling (and programming) MPI-capable software in Fortran and C is provided in the RC documentation. To run your parallel executable, always load and use the same compiler/MPI module combo that you used to compile it.
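For reference, a minimal compile sketch is shown below. The source and output file names (mycode.c, mycode.exe) are placeholders, and the commands assume the standard Intel MPI and OpenMPI C compiler wrappers; substitute the wrapper appropriate to your language (e.g., mpiifort or mpif90 for Fortran).

# Sketch: compile a C source file into an MPI executable (file names are placeholders).
# With the Intel/IMPI modules loaded:
mpiicc -o mycode.exe mycode.c

# With the gcc/OpenMPI or aocc/OpenMPI modules loaded:
mpicc -o mycode.exe mycode.c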

Running parallel executables

Shared memory parallel codes (that run across multiple cores on a single node) can be run anywhere on Alpine or Blanca. Distributed memory parallel codes (that run across multiple cores and multiple nodes) can be run on any Alpine partition, as well as any Blanca-HPC partition (e.g., blanca-nso and blanca-topopt) and the blanca-ccn partition. If you have very large MPI jobs that would span multiple chassis (i.e., roughly 1000 cores or more), please contact us for help with optimizing message passing, as cabling limitations between some chassis may pose challenges. If you are uncertain whether distributed memory parallel jobs can be run in a given Blanca partition, you can use the scontrol command to query whether fdr or edr is an available feature on any node in the partition of interest. For example, to check a node in the blanca-ccn partition:

$ scontrol show node bnode0201 |grep AvailableFeatures
   AvailableFeatures=ivybridge,Quadro,K2000,avx,fdr,rhel7
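To check the features of every node in a partition at once, a node-oriented sinfo query along the following lines can also be used (a sketch; the feature strings reported will vary by partition):

$ sinfo -N -p blanca-ccn --format="%N %f"   # one line per node: node name and its available features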

With Intel/IMPI

Step 1: Load the intel and impi modules. In this example intel/17.4 and impi/17.3 are loaded, but note that other options are also available and can be viewed with the module avail command.

module load intel/17.4
module load impi/17.3

Step 2: Export the following two environment variables:

export I_MPI_FABRICS=shm:ofi
export I_MPI_PMI_LIBRARY=/lib64/libpmi.so

Step 3: Now use one of the following three commands (srun, mpirun, or mpiexec) to invoke your parallel executable. In this example the parallel executable is called myexecutable.exe (yours will have a different name), and we are parallelizing across 48 cores (-n 48):

srun -n 48 ./myexecutable.exe

or

mpirun -n 48 ./myexecutable.exe

or

mpiexec -n 48 ./myexecutable.exe

In practice, all three methods provide nearly identical performance, so choosing one is often a matter of preference. The Slurm documentation recommends using the srun command because it is best integrated with the Slurm Workload Manager used on Alpine and Blanca. Additional details on the use of srun, mpirun, and mpiexec with Intel-MPI can be found in the Slurm MPI and UPC User's Guide.
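For convenience, the three Intel/IMPI steps above can be combined into a single batch script, analogous to the gcc/OpenMPI example later in this document. This is only a sketch: the partition, module versions, and executable path are placeholders that you should adapt to your own job.

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --time=04:00:00
#SBATCH --partition=<your_partition>
#SBATCH --ntasks=48
#SBATCH --job-name=mpi-job
#SBATCH --output=mpi-job.%j.out

module purge
module load intel/17.4
module load impi/17.3

export I_MPI_FABRICS=shm:ofi
export I_MPI_PMI_LIBRARY=/lib64/libpmi.so

# Run a 48-core job across 2 nodes:
srun -n $SLURM_NTASKS /path/to/mycode.exe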

With gcc/OpenMPI or aocc/OpenMPI

Step 1: Load the gcc or aocc module, then the openmpi module. In this example gcc/6.1.0 and openmpi/2.0.1 are loaded, but note that other options are also available and can be viewed with the module avail command.

module load gcc/6.1.0
module load openmpi/2.0.1

Step 2: Now ensure that the environment of the parent process is exported to any child processes (required for OpenMPI only):

export SLURM_EXPORT_ENV=ALL

Step 3: Now use one of the following three commands (srun, mpirun, or mpiexec) to invoke your parallel executable. In this example the parallel executable is called myexecutable.exe (yours will have a different name), and we are parallelizing across 48 cores (-n 48):

srun -n 48 ./myexecutable.exe

or

mpirun -n 48 ./myexecutable.exe

or

mpiexec -n 48 ./myexecutable.exe

In practice, all three methods provide nearly identical performance, so choosing one is often a matter of preference. The Slurm documentation recommends using the srun command because it is best integrated with the Slurm Workload Manager used on Alpine and Blanca. Additional details on the use of srun, mpirun, and mpiexec with OpenMPI can be found in the Slurm MPI and UPC User's Guide.

Example job script for running a parallel executable:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --time=04:00:00
#SBATCH --partition=amilan
#SBATCH --ntasks=48
#SBATCH --job-name=mpi-job
#SBATCH --output=mpi-job.%j.out

module purge
module load gcc/10.3
module load openmpi/4.1.1

export SLURM_EXPORT_ENV=ALL

# Run a 48-core job across 2 nodes:
mpirun -n $SLURM_NTASKS /path/to/mycode.exe

# Note: $SLURM_NTASKS automatically takes the value of the number of tasks (--ntasks) you requested
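Once the script is saved to a file (for example mpi-job.sh, a name used here only for illustration), it can be submitted and monitored with standard Slurm commands:

$ sbatch mpi-job.sh
$ squeue -u $USER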

Notes

  • When invoking gcc/openmpi-compiled software via the srun command, make sure the code is compiled with OpenMPI version 2.X or greater (see the version-check example after this list).
  • Multiple MPI implementations may be available for a given compiler. Once you’ve loaded a given compiler, type module avail to see the list of MPI modules available for that particular compiler.
  • Tip: you can use -n $SLURM_NTASKS instead of -n 48 in the examples above, which prevents you from having to edit the run command each time you change the --ntasks value requested in your job script. The $SLURM_NTASKS variable automatically takes on the value of the number of tasks requested.
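To confirm which OpenMPI version a given module provides (relevant to the first note above), you can query the loaded MPI directly. This is just a quick sanity check, and the module versions shown are examples only:

$ module load gcc/10.3 openmpi/4.1.1
$ mpirun --version   # for OpenMPI builds this reports the Open MPI version number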