Param Vikram-1000 Slurm tutorial

Tutorial Reference University of Innsbruck

1. Submitting jobs (sbatch)

The command sbatch is used to submit jobs to the batch-system using the following syntax:

sbatch [options] [job_script.slurm [ job_script_arguments ...]]

where job_script.slurm represents the (relative or absolute) path to a simple shell script containing the commands to be run on the cluster nodes. We recommend using the suffix .slurm to distinguish job scripts from scripts intended for other uses. If no file is specified, sbatch will read a script from standard input.

The first line of this script must start with #! followed by the path to an interpreter, for instance #!/bin/sh or #!/bin/bash (or any other available shell of your taste). Note, however, that we currently only support Bash (/bin/bash). Your script may use common Bash functionality such as I/O redirection using the < and > characters, loops, case constructs etc., but please keep it simple. If your setup uses a different shell or needs a complex script, simply call your script from within the batch job script.

Slurm will start your job on any nodes that have the necessary resources available or put your job in a waiting queue until requested resources become available. After submitting your job, you may continue to work or log out - job scheduling is completely independent of interactive work. The only way to stop a job after it has been submitted is to use the scancel command described below.

If you submit more than one job at the same time, you need to make sure that individual jobs (which may be executed simultaneously) do not interfere with each other by e.g. writing to the same files.
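One common way to keep simultaneously running jobs apart is to give each job its own output directory named after its job ID. The following is a minimal sketch under stated assumptions: the directory name results and the file name summary.txt are hypothetical, and the fallback to the shell's process ID is only there so that the snippet also runs outside of Slurm.

```shell
#!/bin/bash
# Sketch: keep the files of simultaneously running jobs apart.
# SLURM_JOB_ID is set by Slurm inside a job; the fallback "local_$$"
# (shell PID) is an assumption so the snippet also runs outside Slurm.
job_tag="${SLURM_JOB_ID:-local_$$}"

# "results" is a hypothetical directory name; adjust to your project.
run_dir="results/run_${job_tag}"
mkdir -p "$run_dir"

# Each job writes only below its own directory.
echo "output of job ${job_tag}" > "${run_dir}/summary.txt"
echo "wrote ${run_dir}/summary.txt"
```

Because the job ID is unique per job, two jobs started from the same script never write to the same path.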

The options tell the sbatch command how to behave: job name, use of main memory and run time, parallelization method, etc.

There are two ways of supplying these options to the sbatch command:

Method 1:

You may add the options directly to the sbatch command line, like:

sbatch --job-name=job_name --ntasks=number_of_tasks --cpus-per-task=number_of_cpus_per_task --mem-per-cpu=memory_per_cpu job_script.slurm [ argument ... ]

Method 2 (recommended):

Add the sbatch options to the beginning of your job_script, one option per line.

Note that the lines prefixed with #SBATCH are parsed by the sbatch command, but are treated as comments by the shell.

Taking above example, the contents of job_script.slurm would look like:

#!/bin/bash

#SBATCH --job-name=job_name
#SBATCH --ntasks=number_of_tasks
#SBATCH --cpus-per-task=number_of_cpus_per_task
#SBATCH --mem-per-cpu=memory_per_cpu

./your_commands
                    

If you give conflicting options both in the job file and on the sbatch command line, the command line options take precedence. So you can use options in the job script to supply defaults that may be overridden on the sbatch command line.

1.1 Overview of commonly used options to sbatch

For details, please look at the sbatch documentation.

1.1.1 Job Name, Input, Output, Working Directory, Environment

--job-name=name
Name of the job.
Default: File name of the job script.
The job name is used in the default output of squeue (see below) and may be used in the filename patterns of the input, output and error files.
Slurm will set the following environment variables:
$SLURM_JOB_NAME: Name of the job
$SLURM_JOB_ID: ID of the job.
--output=filename_pattern
Standard output (stdout) of the job script will be connected to the file specified by filename_pattern.
By default both standard output and standard error are directed to the same file. For normal jobs the default file name is slurm-%j.out, where "%j" is replaced by the job ID. For job arrays, the default file name is slurm-%A_%a.out, "%A" is replaced by the job ID and "%a" by the array index.
The working directory of the job script is the current working directory (where sbatch was called) unless the --chdir argument is given.
Filename patterns may use the following place-holders (for a full list see the documentation of sbatch):
  • %x   Job name.
  • %j   Job-ID.
  • %t   Task identifier (aka rank). This will create a separate file per task.
  • %N   Short hostname. This will create a separate file per node.
Example: -o %x_%j_%N.out
Please note: two jobs using the same output file name will clobber each other's output. Use the default or make sure that your filename_pattern includes %j.
--error=filename_pattern
Standard error (stderr) of the job script will be connected to the file specified by filename_pattern as described above.
Default: stderr will be connected to the same file as stdout.
--input=filename_pattern
Standard input of the job script will be connected to the file specified by filename_pattern. By default, "/dev/null" is connected to the script's standard input.
--chdir=directory
Execute the job in the specified working directory. Input/output file names are relative to this directory.
Default: current working directory of sbatch-command.
--export=NONE
Disable propagation of environment variables from current shell (default: --export=ALL, i.e. all variables are exported).
If you use this option and start parallel programs with srun, you should use srun --export=ALL to forward environment variables set in your batch script to the remote processes.
For details, see the sbatch documentation.
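To get a feeling for the place-holders listed under --output, the following sketch imitates their expansion with sed. The function expand_pattern is a hypothetical helper written for this illustration and is not part of Slurm; the job name, job ID and hostname are made-up values.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): imitate how the place-holders
# %x, %j and %N in a filename pattern are replaced.
expand_pattern() {
    # $1 = pattern, $2 = job name, $3 = job ID, $4 = short hostname
    printf '%s\n' "$1" | sed -e "s/%x/$2/g" -e "s/%j/$3/g" -e "s/%N/$4/g"
}

# Made-up values: job "simulation" with ID 4711 on node n042.
expand_pattern '%x_%j_%N.out' simulation 4711 n042
expand_pattern 'logs/%x.out' simulation 4711 n042
```

The second pattern contains no %j, so every job named simulation would end up with the same file name. This is the clobbering situation that the note under --output warns about.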

1.1.2 Notifications

--mail-user=email_address
Notifications will be sent to this email address.
Default is to send mails to the local user submitting the job.
--mail-type=[TYPE|ALL|NONE]
Send notifications for the specified type of events (default: NONE).
Possible values for TYPE are BEGIN, END, FAIL, REQUEUE, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50, ARRAY_TASKS. Multiple types may be specified by using a comma-separated list.
ALL is equivalent to BEGIN,END,FAIL,REQUEUE,STAGE_OUT.
TIME_LIMIT sends a notification when the time limit of the running job is reached. TIME_LIMIT_XX sends one when XX percent of the limit is reached.
When ARRAY_TASKS is specified, BEGIN, END and FAIL apply to each task in the job array (we strongly advise against using this!). Without it, these messages are sent for the array as a whole.

1.1.3 Time Limits

--time=time
Set a limit on the run time (wallclock time from start to termination) of the job. The default depends on the used partition.
When the time limit is reached, each task in each job step receives a TERM signal followed (after 30 seconds) by a KILL signal. So "trapping" the TERM signal and gracefully shutting down the script is possible.
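A minimal sketch of such a graceful shutdown, assuming the work can be split into short steps: the handler only sets a flag, and the loop checks the flag between steps. The three-step loop is a hypothetical stand-in for real work.

```shell
#!/bin/bash
#SBATCH --time=10

# Set a flag when TERM arrives instead of terminating immediately.
stop_requested=0
request_stop() {
    stop_requested=1
}
trap request_stop TERM

# Hypothetical workload: three short steps stand in for real work.
step=0
while [ "$stop_requested" -eq 0 ] && [ "$step" -lt 3 ]; do
    step=$((step + 1))
    echo "finished step $step"
done

if [ "$stop_requested" -eq 1 ]; then
    echo "TERM received after step $step, saving state and exiting"
else
    echo "all $step steps completed"
fi
```

Bash runs a trap handler only after the current foreground command has returned. If a single step runs for longer than the grace period, consider starting it in the background and waiting for it with the wait builtin, which is interrupted when a trapped signal arrives.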
Times may be specified as:
minutes,
days-hours, or
[[days-]hours:]minutes[:seconds].
If you know the run time of your job beforehand, it is a good idea to specify it with this option, as this helps the scheduler plan its resources and may result in an earlier start of your job.
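The time formats are easy to mix up: a value with one colon is read as minutes:seconds, not hours:minutes. The following sketch converts a time specification to seconds so you can check what a value means before using it. The function slurm_time_to_seconds is a hypothetical helper written for this illustration; it is not a Slurm command and does not validate its input.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a --time value
# to seconds, following the formats accepted by sbatch.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; h = 0; m = 0; s = 0
        spec = $0
        dash = index(spec, "-")
        if (dash > 0) {
            # days-hours[:minutes[:seconds]]
            days = substr(spec, 1, dash - 1)
            n = split(substr(spec, dash + 1), f, ":")
            h = f[1]
            if (n >= 2) m = f[2]
            if (n >= 3) s = f[3]
        } else {
            n = split(spec, f, ":")
            if (n == 1)      { m = f[1] }                    # minutes
            else if (n == 2) { m = f[1]; s = f[2] }          # minutes:seconds
            else             { h = f[1]; m = f[2]; s = f[3] } # hours:minutes:seconds
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 90        # 90 minutes
slurm_time_to_seconds 3:00      # 3 minutes, 0 seconds
slurm_time_to_seconds 3:00:00   # 3 hours
slurm_time_to_seconds 1-12      # 1 day and 12 hours
```

The four calls print 5400, 180, 10800 and 129600 respectively.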

1.1.4 Memory Allocation

--mem=size[K|M|G|T]
Specify the memory required per node. Slurm will set the environment variable $SLURM_MEM_PER_NODE to the memory allocated per node (unit: megabytes).
--mem-per-cpu=size[K|M|G|T]
Specify the memory required per CPU. Environment variable: $SLURM_MEM_PER_CPU
--mem-per-gpu=size[K|M|G|T]
Specify the memory required per GPU. Environment variable: $SLURM_MEM_PER_GPU

1.1.5 Nodes, Tasks, and CPUs

--ntasks=ntasks
Request CPU resources for a total number of ntasks tasks.
Without further options (see below) the tasks are placed on free resources on any node (nodes are "filled up").
For MPI jobs, tasks correspond to MPI ranks.
Slurm will set the environment variable $SLURM_NTASKS to the number ntasks that you requested.
--nodes=n[-m]
--ntasks-per-node=ntasks
Explicitly request at least n and up to m nodes with ntasks each. If only one number is given (and not a range) it is interpreted as exactly this number of nodes.
Environment variables:
$SLURM_JOB_NUM_NODES number of allocated nodes
$SLURM_NTASKS_PER_NODE number of tasks requested per node.

Please note: Unless you have a good reason to explicitly control placement of tasks, do not use these options, but let the system decide.
--cpus-per-task=ncpus
Tell Slurm that each task will require ncpus CPUs. Default is one CPU per task. This is the level at which multithreading (e.g. POSIX threads or OpenMP threads) is specified.
Slurm will set the environment variable $SLURM_CPUS_PER_TASK to the number ncpus that you requested.
MPI + OpenMP hybrid jobs are natively supported by simultaneously setting ntasks and ncpus to values greater than 1.
--hint=multithread
Enable hyperthreading. By default, only one CPU per core is used, the other is assigned to your job but not used. Use this option if you know that your programs can profit from hyperthreading.
Default: nomultithread.
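When combining --ntasks, --cpus-per-task and --mem-per-cpu, it helps to work out the totals you are actually requesting. The sketch below does this arithmetic; job_footprint is a hypothetical helper, memory is assumed to be given in MB, and the effect of hyperthreading on the number of allocated CPUs is ignored.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): totals requested by a job.
job_footprint() {
    # $1 = ntasks, $2 = cpus per task, $3 = memory per CPU in MB
    total_cpus=$(( $1 * $2 ))
    mem_per_task=$(( $2 * $3 ))
    total_mem=$(( total_cpus * $3 ))
    echo "cpus=$total_cpus mem_per_task=${mem_per_task}MB total_mem=${total_mem}MB"
}

# Example: hybrid job with 20 tasks, 4 CPUs each, 1024 MB per CPU.
job_footprint 20 4 1024
```

For the example values this prints cpus=80 mem_per_task=4096MB total_mem=81920MB.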

1.1.6 GPUs

--gpus=[type:]number
Request number GPUs, optionally of type type. GPU nodes have two GPUs installed on each node.
Environment variable: $SLURM_GPUS number of GPUs requested

1.1.7 Job Arrays

--array=m-n[:step][%maxrunning]
Trivial parallelisation using a job array. This will start independent instances of your job (so-called "array tasks") with task IDs ranging from m to n inclusive: n-m+1 instances if no step is given, otherwise one instance for every step-th ID starting at m. At run time, each task has the following environment variables set:
Variable                  Meaning
SLURM_ARRAY_TASK_COUNT    total number of tasks of your array
SLURM_ARRAY_TASK_ID       ID of the current task
SLURM_ARRAY_TASK_MAX      last ID
SLURM_ARRAY_TASK_MIN      first ID
SLURM_ARRAY_TASK_STEP     step (increment value) of the IDs of the array

Appending %maxrunning to the array specification allows you to specify a maximum number of simultaneously running tasks. E.g.
-a 0-9%4 will run ten tasks in total but only a maximum of four simultaneously.

Instead of a range of IDs you can also give a comma separated list of values.

The minimum task-ID is 0, the maximum is 75000.

If the number of your job instances is substantially higher than about 10 please do not use ARRAY_TASKS in --mail-type (see above).
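A typical use of a job array is to let every array task process its own input file, selected by the task ID. The following is a minimal sketch: the file naming scheme input_NNN.dat and the helper array_task_count are hypothetical, and the default task ID 0 is only there so that the snippet also runs outside of Slurm.

```shell
#!/bin/bash
#SBATCH --array=0-9%4

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task; the default 0
# is an assumption for running this snippet outside of a job.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical naming scheme: input_000.dat, input_001.dat, ...
input_file=$(printf 'input_%03d.dat' "$task_id")
output_file=$(printf 'result_%03d.out' "$task_id")
echo "array task ${task_id}: ${input_file} -> ${output_file}"
# ./your_commands "$input_file" > "$output_file"

# Hypothetical helper: number of tasks started by --array=m-n:step.
array_task_count() {
    # $1 = m, $2 = n, $3 = step
    echo $(( ($2 - $1) / $3 + 1 ))
}
array_task_count 0 9 1
array_task_count 0 9 2
```

The last two calls print 10 and 5: with a step of 2, only the IDs 0, 2, 4, 6 and 8 are used.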

1.1.8 Job script validation and start estimate

--test-only
The job script is validated but not submitted. Additionally an estimate is shown of when a job would be scheduled to run with the current settings given in the job script and on the command line.
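Before asking sbatch to validate a script, it can be worthwhile to let Bash check the shell syntax, since bash -n parses a script without executing it. The sketch below combines both steps. The demo job script written to a temporary file is hypothetical, and the sbatch call is skipped on machines where Slurm is not installed.

```shell
#!/bin/bash
# Write a small demo job script to a temporary file (hypothetical content).
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=1
#SBATCH --time=10
echo "running on $(hostname)"
EOF

# Step 1: Bash syntax check only; nothing is executed.
if bash -n "$job_script"; then
    syntax_ok=yes
else
    syntax_ok=no
fi
echo "bash syntax ok: $syntax_ok"

# Step 2: only where Slurm is installed, let sbatch validate the
# options and print a start estimate without submitting the job.
if [ "$syntax_ok" = yes ] && command -v sbatch >/dev/null 2>&1; then
    sbatch --test-only "$job_script" || echo "sbatch rejected the script"
fi

rm -f "$job_script"
```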

1.1.9 A note on maximum stack size

Our systems are configured in a way that the maximum allowed size of the stack of your programs is unlimited (unlike the default in most Linux systems where it is limited to 8 MB). Most programs will not need this but some will benefit from it.
There are edge cases where (Fortran?) programs will not work with an unlimited stack size. In that case, please limit the stack size in your job script before calling that program. With e.g.

ulimit -s 80000

you will set the limit to about 80 MB (80000 kB). This works because as a user you are allowed to lower the limit anytime.

1.2 Running Parallel Jobs

The setup of parallel jobs submitted by sbatch --ntasks=n_tasks consists of two phases:

  1. Starting the job script on the master node selected by the batch scheduler. This is the actual job.
  2. Deploying the individual n_tasks tasks across the worker nodes selected by the batch system. In Slurm terms, this phase is called a job step. There may be multiple job steps in a job.

For starting a job step, you can...

  • ... either use Slurm's srun command - this is now the preferred way -
  • ... or use the mpirun command if supplied by the MPI implementation that you are using.

In our OpenMPI installations, we enable the legacy launchers option, so the mpirun command will be available, and it will correctly honor your resource requests given in your sbatch command or job script. Behind the scenes, mpirun will call srun. With other MPI implementations, in particular when they have not been linked against our Slurm-integrated OpenMPI, mpirun may be available but can have bizarre effects, including the following:

  • all tasks run on the same CPU, causing a massive slow-down
  • all tasks have rank 0
  • no remote tasks can be started

For this reason, you should write new jobs using the srun command. The srun command largely accepts the same options as the sbatch command, but normally, it will take its resource information from environment variables set by sbatch.

Advantages of using the srun command:

  • Full integration into the Slurm run time environment
  • No need to load the correct MPI module if your binaries have been linked using the RPATH attribute
  • Works for some MPI implementations that have not been integrated with Slurm, e.g. Anaconda's MPICH.

When you use the srun command, you may want to set some environment variables or use the following options in your batch script:

1.2.1 Propagate Environment Variables

srun --export=ALL [...]
This will make sure that environment variables that you set in the batch script (e.g. by loading modules or activating conda environments) will be seen by the tasks spawned by srun. This is the default, unless you specified --export=NONE in your sbatch command.

1.2.2 Set CPUs used per task

export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}

or

srun --cpus-per-task=${SLURM_CPUS_PER_TASK:-1} [...]
If you gave the options --ntasks=n_tasks and --cpus-per-task=ncpus, both with values greater than 1, your job will be an MPI+OpenMP hybrid job with n_tasks individual MPI processes consisting of ncpus threads each. Slurm in its current version does not propagate $SLURM_CPUS_PER_TASK to the srun command, so you have to do this manually.

1.2.3 Ensuring Correct MPI ranks and Communication

srun --mpi=pmi2 [...]
This option makes sure that your tasks and their ranks are correctly started and mapped on the nodes selected by the batch system.

1.3 Job submission examples

1.3.1 Parallel MPI job

The contents of your job script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash

# Name of your job.
#SBATCH --job-name=name

# Send status information to this email address.
#SBATCH --mail-user=email_address

# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL

# Start an MPI job with 80 single threaded tasks
#SBATCH --ntasks=80
#SBATCH --cpus-per-task=1

# In this example we allocate resources for 80 MPI processes/tasks,
# placing exactly 10 tasks on each of 8 separate nodes like this:
## #SBATCH --ntasks-per-node=10
## #SBATCH --nodes=8
## #SBATCH --cpus-per-task=1
# do this only when you have good reason to explicitly control
# task placement

# Specify the amount of memory given to each MPI process
# in the job.
#SBATCH --mem-per-cpu=1G

module purge
module load openmpi/xx.yy.zz

mpirun ./your_mpi_executable [extra arguments]
                                    
Note: The Slurm integration of OpenMPI will automatically start as many tasks as requested for the job. The parameter -n $SLURM_NTASKS is no longer necessary.

1.3.2 Parallel OpenMP jobs

The contents of your job script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash

# Name of your job.
#SBATCH --job-name=name

# Send status information to this email address.
#SBATCH --mail-user=email_address

# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL

# Allocate one task on one node and six cpus for this task
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

# Allocate 12 Gigabytes for the whole node/task
#SBATCH --mem=12G

# it is no longer necessary to tell OpenMP how many software threads to start
####    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_openmp_executable
                                    

Important: We have configured Slurm to use control groups in order to limit the access of your job to memory and CPUs.
If your job uses shared memory parallelization other than OpenMP, you should check that the number of CPU-intensive software threads is consistent with the number of slots assigned to the job.
If you start more software threads than you requested in the --cpus-per-task directive, these will be restricted to run on the requested number of CPUs, so they will interfere with each other, possibly degrading the overall efficiency of your job.

1.3.3 Hybrid MPI + OpenMP Parallel Job

#!/bin/bash

# Name of your job.
#SBATCH --job-name=name

# Send status information to this email address.
#SBATCH --mail-user=email_address

# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL

# Start an MPI job with 20 tasks with 4 software threads each
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=4

# In this example we allocate resources for 20 hybrid MPI+OpenMP tasks,
# placing exactly 4 tasks on each of 5 separate nodes like this:
## #SBATCH --ntasks-per-node=4
## #SBATCH --nodes=5
## #SBATCH --cpus-per-task=4
# do this only when you have good reason to explicitly control
# task placement

# Specify the amount of memory given to each MPI process
# in the job.
#SBATCH --mem-per-cpu=1G

module purge
# module load [....]

# let Slurm take care of both levels of parallelism
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}
srun --export=ALL --mpi=pmi2 ./your_mpi_executable [extra arguments]
                    

1.4 Heterogeneous Jobs

For some MPI applications, resource requirements of individual tasks may differ from each other. A typical use case is a main - worker setup, where the main task, running in MPI rank 0, coordinates the worker tasks, which are running in ranks 1 to n, totaling n+1 tasks.

Slurm supports this type of setup, but as of April 2023, the original documentation is incomplete and misleading. Use the information given below in addition to Slurm's documentation.

To support heterogeneous jobs, Slurm allows you to specify multiple groups of resource requirements, to which individual tasks are assigned in ascending rank order. Resource allocation groups are separated by a colon (:) in the command line and by the directive #SBATCH hetjob in the batch script.

The following template assumes a memory intensive main task and 20 CPU intensive worker tasks. Please note that, contrary to the Slurm documentation mentioned above, the resource requirements passed to the srun command by default do not reflect the requirements specified in your batch script or sbatch command and thus must be repeated explicitly in the srun command. Please adjust the following template according to your own needs:

#!/bin/bash
#SBATCH some non-resource options as needed
#SBATCH --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1200
#SBATCH hetjob
#SBATCH --ntasks=20 --cpus-per-task=4 --mem-per-cpu=50 --hint=multithread
#SBATCH more options as needed

# the following must be consistent with the above requirements
resources=' --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1200 : --ntasks=20 --cpus-per-task=4 --mem-per-cpu=50 --hint=multithread'

srun [your options] $resources your-mpi-program [ your arguments ... ]
                    

This will run MPI rank 0 on a single CPU with 1200 MB RSS available. Ranks 1-20 will run on 4 CPUs each (hybrid MPI+OpenMP) and make use of hardware hyperthreads, effectively using both hyperthreads of each of the two cores assigned to each task. Memory available per task will be (50 MB times 4 CPUs) = 200 MB each.
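Since the #SBATCH lines are read by sbatch without any shell expansion, they cannot refer to shell variables, and the requirements have to be kept in sync with the srun command by hand. Inside the script, however, the colon-separated resource string can be assembled from one variable per group, which makes it easier to compare against the #SBATCH lines. A minimal sketch using the values from the template above (the variable names are hypothetical, and the srun command is only printed, not executed):

```shell
#!/bin/bash
# One variable per resource allocation group (hypothetical names);
# the values mirror the #SBATCH lines of the template.
main_group='--ntasks=1 --cpus-per-task=1 --mem-per-cpu=1200'
worker_group='--ntasks=20 --cpus-per-task=4 --mem-per-cpu=50 --hint=multithread'

# Groups are separated by a colon on the command line.
resources="$main_group : $worker_group"

# Memory available to each worker task: CPUs per task times MB per CPU.
worker_mem_per_task=$(( 4 * 50 ))

echo "srun $resources your-mpi-program"
echo "memory per worker task: ${worker_mem_per_task} MB"
```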

2. Interactive jobs (srun --pty)

The submission of interactive jobs is useful in situations where a job requires some sort of direct intervention.

This is usually the case for X-Windows applications or in situations in which further processing depends on your interpretation of immediate results. A typical example for both of these cases is a graphical debugging session.

Note: Interactive sessions are also particularly helpful for getting acquainted with the system or when building and testing new programs.

Interactive sessions might be sequential or parallel:

Sequential (one CPU on one node):
srun --pty bash

Parallel, shared memory (n CPUs on one node):
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=n --pty bash

Parallel, distributed memory (n CPUs on each of m nodes), either
srun --ntasks-per-node=n --nodes=m --cpus-per-task=1 --pty bash
or
srun --ntasks=x --pty bash
with x being n * m.

In a multi-node parallel (aka distributed memory) interactive session you can use srun (or, after loading an MPI module, mpiexec) to run programs on all nodes.

For using an X-Windows application, supply --x11 as a further option, e.g. like this:

srun --pty --x11 xterm -ls

Prepare your session as needed, e.g. by loading all necessary modules within the provided xterm and then start your program on the executing node.

Note: Make sure to end your interactive session (logging out or closing the xterm window) as soon as it is no longer needed!

3. Monitoring jobs (squeue, scontrol, srun and sacct)

To get information about running or waiting jobs use

squeue [options]
or
sq [options]

The command squeue displays a list of all running or waiting jobs of all users.

The locally implemented command sq displays more fields than squeue does by default.

The (in our opinion) most interesting additional field is START_TIME which for pending jobs shows the date and time when Slurm plans to run this job. It is always possible that a job will start earlier but not (much) later.

Slurm calculates this field only once per minute so it might not contain a meaningful value right after submitting a job.

squeue and sq display the jobs of all users which might not always be what you want. So we created another shortcut for you:
squ
which is a shorter way of typing sq -u $USER and thus lists only the jobs belonging to you.

Get more detailed information about a particular pending or running job by
scontrol show job jobid.

You can further inspect a running job by "connecting" to it with this command:

srun --jobid=jobid --overlap --pty bash

This will open an interactive shell as a job step under an already allocated job. I.e. you will be able to see how your job is "behaving". For distributed memory jobs you will get a shell at the first node used by your job.

To get information about past jobs use

sacct -X [options]

4. Altering jobs (scontrol update)

You can change the configuration of pending jobs with

scontrol update job jobid SETTING=VALUE [...]

To find out which settings are available, we recommend first running
scontrol show job jobid.

If, for example, you then want to change the run-time limit of your job to three hours, you would use
scontrol update job jobid TimeLimit=3:00:00

Some adaptations might require you to change more than one setting. If, e.g., your job is flexible with respect to the number of tasks and nodes used and you want to change those after having submitted the job, you would have to run
scontrol update job jobid NumTasks=xx NumCPUs=xx NumNodes=y

5. Deleting jobs (scancel)

To delete pending or running jobs you have to look up their numerical job identifiers aka job-ids. You can e.g. use squeue or squ and take the value(s) from the JOBID column.
Then execute

scancel job-id [...]

and the corresponding job will be removed from the wait queue or stopped if it's running.
You can specify more than one job identifier after scancel.
If you want to cancel all your pending and running jobs without being asked for confirmation, you may use
squ -h -o %i | xargs scancel
These options tell squeue to output only the JOBID column (-o %i) and to omit the column header (-h).
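Because such a pipeline cancels everything it is given, it can be reassuring to preview the resulting command first by putting echo in front of scancel. The sketch below shows how xargs assembles the command line; the two job IDs are made up and are fed in with printf in place of the real job list.

```shell
#!/bin/bash
# Two made-up job IDs stand in for the output of "squ -h -o %i".
# "echo scancel" prints the command instead of running it.
preview=$(printf '%s\n' 4711 4712 | xargs echo scancel)
echo "$preview"
```

This prints scancel 4711 4712, i.e. xargs appends all job IDs it reads to a single scancel call. Once the preview looks right, remove the echo to actually cancel the jobs.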