Job Scheduling and Queues
It is the responsibility of the job scheduler to determine when and where jobs will run. The rules that influence these decisions are defined by the job scheduling policy. The policy on Vikram-100 attempts to accommodate the various needs of its users and to ensure that all those who invest in the system receive a fair share of its compute resources; it applies equally to all users. Once a job starts running on a compute node, it will be interrupted only if it exceeds its resource requirements. Jobs are not preempted to make way for later jobs with higher priority.
There are several job queues on Vikram-100. In general, you should not specify a queue name when submitting a job unless your requirements cannot be satisfied by the default queue, or your job is of a specific type (e.g. GPU, serial, or SMP). By default, jobs are routed to an appropriate execution queue based on their resource requirements. The queue details for Vikram-100 are as follows:
| S. No | Queue Name | Maximum Cores (per user per job) | Walltime | Priority | Remarks |
|-------|------------|----------------------------------|----------|----------|---------|
| 1 | short | 2328 of 2328 | 15 mins | 1 | For compiling, debugging, short runs, etc. |
| 2 | defaultq | 512 of 2328 | 1 week | 2 | Default queue |
| 3 | medium | 320 of 2328 | 15 days | 3 | For medium jobs |
| 4 | long | 192 of 2328 | 30 days | 4 | For long-running jobs |
| 5 | gpu | 48 of 480 | 1 week | 0 | Only for GPU jobs |
| 6 | serial | 1 of 2328 | 45 days | 5 | Only one core per job |
| 7 | smp | 24 of 2328 | 30 days | 6 | Only for SMP jobs. Only one node per job. |
| 8 | garuda | 240 of 240 | ∞ | 7 | Only for Garuda users |
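As a concrete illustration of default routing, a minimal LSF job script might look like the following sketch. The job name, core count, walltime, and program name here are placeholders, not values taken from the cluster documentation:

```shell
#!/bin/bash
# Minimal LSF job script sketch. No -q option is given, so the
# scheduler routes the job into an execution queue (defaultq here)
# based on its resource requests.
#BSUB -J myjob            # job name (placeholder)
#BSUB -n 64               # core count (placeholder; within the 512-core defaultq limit)
#BSUB -oo output.%J       # stdout file; %J expands to the job ID
#BSUB -eo errors.%J       # stderr file
#BSUB -W 168:00           # walltime request in hours:minutes (1 week)

mpirun ./myprog           # placeholder program
```

Submit it with `bsub < jobfile`; the scheduler reads the `#BSUB` comments as submission options.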
short queue
An LSF queue called short has been created for test runs. Jobs in the short queue have a higher priority than any other jobs, but they are limited to 15 minutes of CPU time. You can submit a job to the short queue by specifying the queue name as an option via a #BSUB comment in the job script file:

#BSUB -q short

or on the bsub command line when you submit the job:

bsub -q short < jobfile
bsub -J jobname -q short -oo outfile.%J -eo errorfile.%J myprog
gpu queue
- The gpu queue has 20 GPU compute nodes and handles jobs that require access to Nvidia K40 cards for computation (e.g. CUDA or OpenACC programs). In this queue, users can run 5 jobs and queue 10 more. Jobs submitted to this queue do not count towards jobs submitted to any other queue. Submit your job to the gpu queue ONLY if it requires GPU access. Submitting a normal CPU job to the gpu queue is strictly prohibited.
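A GPU job script might look like the sketch below. Note that the exact GPU resource-request syntax depends on the LSF version installed: newer releases provide a `-gpu` submission option, while older ones rely on a site-defined resource string, so check the cluster's local documentation before relying on either. Everything other than `-q gpu` is a placeholder:

```shell
#!/bin/bash
# Sketch of a GPU job script (assumed options; verify locally).
#BSUB -q gpu              # required: this queue is only for GPU jobs
#BSUB -J cuda_job         # job name (placeholder)
#BSUB -n 12               # CPU cores (placeholder; within the 48-core gpu-queue limit)
#BSUB -oo gpu_out.%J
#BSUB -eo gpu_err.%J

./my_cuda_prog            # placeholder CUDA executable
```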
defaultq queue
- This is the default queue for the Vikram-100 HPC; you do not have to explicitly specify it when submitting a job. The queue primarily uses 77 CPU compute nodes; however, if GPU nodes are free and all CPU nodes are occupied, it will schedule jobs onto the GPU nodes, albeit with low priority. It has a walltime limit of 1 week. Users who require more walltime can submit to the medium and long queues.
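When a run needs more than the one-week defaultq walltime, the longer queue can be named explicitly. A sketch, with job size and program name as placeholders:

```shell
#!/bin/bash
# Sketch: explicitly target the long queue for a multi-week run.
#BSUB -q long             # long queue: up to 30 days walltime, 192 cores max
#BSUB -n 128              # placeholder core count, within the long-queue limit
#BSUB -oo long_out.%J
#BSUB -eo long_err.%J

mpirun ./long_running_prog   # placeholder program
```

For runs of up to 15 days, substitute `-q medium` (320-core limit) instead.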
serial queue
- This queue must be used if your program is serial (requiring only one CPU core to run). In this queue, users can run 36 jobs and queue 96 more, but each job can only access one CPU core. Jobs submitted to this queue do not count towards jobs submitted to any other queue.
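A serial job script is correspondingly simple; in this sketch only `-q serial` and `-n 1` are dictated by the queue policy, the rest are placeholders:

```shell
#!/bin/bash
# Sketch of a serial job: one core in the serial queue.
#BSUB -q serial
#BSUB -n 1                # serial queue allows only one core per job
#BSUB -oo serial_out.%J
#BSUB -eo serial_err.%J

./my_serial_prog          # placeholder single-core program
```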
smp queue
- This queue must be used if your program runs only on SMP systems and cannot span multiple nodes. In this queue, users can run 10 jobs. Jobs submitted to this queue do not count towards jobs submitted to any other queue.
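For an SMP (e.g. OpenMP) job, the standard LSF resource string `span[hosts=1]` keeps all requested cores on a single host, matching the one-node-per-job rule of this queue. A sketch with placeholder names:

```shell
#!/bin/bash
# Sketch of an SMP job confined to a single node.
#BSUB -q smp
#BSUB -n 24               # up to 24 cores, i.e. one full node
#BSUB -R "span[hosts=1]"  # keep all cores on one host
#BSUB -oo smp_out.%J
#BSUB -eo smp_err.%J

export OMP_NUM_THREADS=24 # placeholder: match thread count to cores
./my_openmp_prog          # placeholder threaded program
```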
Queue policy
- Users can run a total of 5 jobs at a time across all queues (except the ‘serial’ and ‘gpu’ queues) and queue 5 more.
- In the ‘gpu’ queue, users can run 5 jobs and queue 10 more. Jobs submitted to this queue do not count towards jobs submitted to any other queue.
- In the ‘serial’ queue, users can run 36 jobs and queue 96 more, but each job can only access one CPU core. Jobs submitted to this queue do not count towards jobs submitted to any other queue.
- In the ‘smp’ queue, users can run 10 jobs. Each job can only run on one node.
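To see where you stand against these limits, the standard LSF query commands can be used (output formats vary slightly between LSF versions):

```shell
# Standard LSF commands for checking job counts against queue limits.
bjobs -r              # list your running jobs
bjobs -p              # list your pending (queued) jobs
bjobs -u all -q gpu   # list all users' jobs in the gpu queue
bqueues               # show per-queue limits and current load
```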