Best practices for running parallel executables on RC Resources

Many scientific software packages are programmed such that they can parallelize tasks across multiple cores on one node (shared memory parallelization) or across multiple cores on multiple nodes (distributed memory parallelization). Either way, running the executables produced by compiling parallel-capable software packages requires MPI (Message Passing Interface) libraries, which coordinate the passing of information between the parallel tasks.

This documentation covers the best way to run parallel executables on Summit and Blanca across multiple cores and nodes. Running a parallel executable on Summit or Blanca requires loading both a compiler module and an MPI module, in addition to any other modules you need. Here the focus is on the two primary compiler/MPI module “combos”:

  1. Intel compilers with Intel-MPI (IMPI)
  2. GNU (gcc) compilers with OpenMPI

It is recommended that you always use the Intel/IMPI combo to compile and run your parallel software on Summit and Blanca, because Intel/IMPI-compiled codes typically run more efficiently. The gcc/OpenMPI combo can be used as a fallback if the code will not compile with Intel/IMPI.

This documentation assumes you have already compiled your parallel software into an executable program. Additional information on compiling (and programming) MPI-capable software in Fortran and C is provided in the RC documentation. To run your parallel executable you should always load and use the same compiler/MPI module combo that you used to compile it.
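For example, if your executable was compiled with the Intel/IMPI combo described below, a quick way to confirm that the matching modules are loaded before you run it is the standard module list command (the versions shown are simply the ones used in the examples below):

module load intel/17.4
module load impi/17.3
module list    #verify the same compiler/MPI combo used at compile time is loaded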

Running parallel executables

Shared memory parallel codes (that run across multiple cores on a single node) can be run anywhere on Summit or Blanca. Distributed memory parallel codes (that run across multiple cores and multiple nodes) can be run on any Summit partition, as well as any Blanca-HPC partition (e.g., blanca-nso and blanca-topopt) and the blanca-ccn partition. If you are uncertain whether distributed memory parallel jobs can be run in a given Blanca partition, use the scontrol command to check whether fdr or edr appears as an available feature on a node in the partition of interest. For example, to check a node in the blanca-ccn partition:

$ scontrol show node bnode0201 |grep AvailableFeatures
   AvailableFeatures=ivybridge,Quadro,K2000,avx,fdr,rhel7
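If you do not already know the name of a node in the partition, one option (assuming the standard sinfo format flags) is to list every node in the partition along with its available features:

$ sinfo -N -p blanca-ccn -o "%N %f"

Here -N prints one line per node, %N is the node name, and %f is its list of available features.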

With Intel/IMPI

Step 1: Load the intel and impi modules. In this example intel/17.4 and impi/17.3 are loaded, but note that other options are also available and can be viewed with the module avail command.

module load intel/17.4
module load impi/17.3

Step 2: Export the following two environment variables. The first selects shared memory for communication within a node and the OFI fabric between nodes; the second points Intel MPI at Slurm's PMI library so that srun can launch the MPI tasks:

export I_MPI_FABRICS=shm:ofi
export I_MPI_PMI_LIBRARY=/lib64/libpmi.so

Step 3: Now use one of the following three commands (srun, mpirun, or mpiexec) to invoke your parallel executable. In this example the parallel executable is called myexecutable.exe (yours will have a different name), and we are parallelizing across 48 cores (-n 48):

srun -n 48 ./myexecutable.exe

or

mpirun -n 48 ./myexecutable.exe

or

mpiexec -n 48 ./myexecutable.exe

In practice, all three methods will provide nearly identical performance, so choosing one is often a matter of preference. Slurm recommends using the srun command because it is best integrated with the Slurm Workload Manager that is used on both Summit and Blanca. Additional details on the use of srun, mpirun and mpiexec with Intel-MPI can be found in the Slurm MPI and UPC User’s Guide.
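Putting the three steps together, here is a minimal sketch of the full Intel/IMPI sequence, run from within a job on Summit or Blanca (myexecutable.exe is a stand-in for your own executable):

module load intel/17.4
module load impi/17.3

export I_MPI_FABRICS=shm:ofi
export I_MPI_PMI_LIBRARY=/lib64/libpmi.so

srun -n 48 ./myexecutable.exe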

With gcc/OpenMPI

Step 1: Load the gcc and openmpi modules. In this example gcc/6.1.0 and openmpi/2.0.1 are loaded, but note that other options are also available and can be viewed with the module avail command.

module load gcc/6.1.0
module load openmpi/2.0.1

Step 2: Now use one of the following three commands (srun, mpirun, or mpiexec) to invoke your parallel executable. In this example the parallel executable is called myexecutable.exe (yours will have a different name), and we are parallelizing across 48 cores (-n 48):

srun -n 48 ./myexecutable.exe

or

mpirun -n 48 ./myexecutable.exe

or

mpiexec -n 48 ./myexecutable.exe

In practice, all three methods will provide nearly identical performance, so choosing one is often a matter of preference. Slurm recommends using the srun command because it is best integrated with the Slurm Workload Manager that is used on both Summit and Blanca. Additional details on the use of srun, mpirun and mpiexec with OpenMPI can be found in the Slurm MPI and UPC User’s Guide.

Example job script for running a parallel executable:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --time=04:00:00
#SBATCH --partition=shas
#SBATCH --ntasks=48
#SBATCH --job-name=mpi-job
#SBATCH --output=mpi-job.%j.out

module purge
module load intel/17.4
module load impi/17.3

#Set the Intel MPI environment variables from Step 2 above:
export I_MPI_FABRICS=shm:ofi
export I_MPI_PMI_LIBRARY=/lib64/libpmi.so

#Run a 48 core job across 2 nodes:
srun -n $SLURM_NTASKS /path/to/mycode.exe

#Note: $SLURM_NTASKS automatically takes on the value of the number of tasks requested via --ntasks
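The script above uses the Intel/IMPI combo. Below is a sketch of the same job using the gcc/OpenMPI combo instead; it simply swaps the module loads (the I_MPI_* exports are not needed for OpenMPI) and otherwise assumes the same partition and task counts as above:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --time=04:00:00
#SBATCH --partition=shas
#SBATCH --ntasks=48
#SBATCH --job-name=mpi-job-gcc
#SBATCH --output=mpi-job-gcc.%j.out

module purge
module load gcc/6.1.0
module load openmpi/2.0.1

#Run a 48 core job across 2 nodes:
srun -n $SLURM_NTASKS /path/to/mycode.exe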

Notes

  • Software compiled with the intel/impi modules on Summit presently works on Blanca, and vice versa.
  • Software compiled with the gcc/openmpi modules on Summit presently will not work on Blanca (and likely vice versa), due to differences between the two systems in the shared libraries available for OpenMPI-based parallelization. Therefore, compile gcc/openmpi software on the system you intend to run it on.
  • When invoking gcc/openmpi-compiled software via the srun command, make sure the code was compiled with OpenMPI version 2.X or greater.
  • Other compiler/MPI combos are also available in the RC module stack. For example, Portland Group (pgi) compilers are available with OpenMPI, and gcc/impi and intel/openmpi combos are also available. To explore the options, first choose and load a compiler module (e.g., intel/V.XX, gcc/V.XX or pgi/V.XX) and then type module avail to see the list of MPI modules available for that particular compiler (see the example after this list).
  • Tip: you can substitute -n $SLURM_NTASKS for -n 48 in the examples above, so that you do not have to edit the srun/mpirun/mpiexec command each time you change the --ntasks value requested in your job script. The $SLURM_NTASKS variable automatically takes on the value of the number of tasks requested.
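For example, to see which MPI modules are available for the gcc compiler (the gcc version here is simply the one used in the examples above; other compilers such as intel or pgi work the same way):

module purge
module load gcc/6.1.0
module avail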