Submitting Jobs

In order to run programs on the Vikram-100 cluster, you must submit a batch job via LSF. LSF, the Load Sharing Facility, is a subsystem for submitting, scheduling, executing, monitoring, and controlling a workload of batch jobs across compute servers in a cluster. When you log in to Vikram-100, you log in to the “master” node. The master node is used to prepare and submit jobs. User jobs are not allowed to run on the master node. You must submit your jobs to LSF via the bsub command.
LSF queues the jobs and schedules them for execution on one or more of the 97 “compute” nodes in the cluster that have enough available CPUs and memory to satisfy each job's requirements. If there are not enough CPUs, memory, or other resources immediately available to satisfy a job's requirements, the job remains pending in the queue until enough resources become available.

Jobs/CPUs limits

Currently on Vikram-100, each user may have jobs running on up to 120 CPUs at a time. Additionally, each user may have jobs requesting up to 600 CPUs waiting in the queue. These limits may be set higher or lower depending upon the workload on the Vikram-100 cluster. If they are changed, a notice will be displayed when you log in to Vikram-100 and in the 'NEWS' section of this web site. Once a user reaches the limit, that user cannot submit any more jobs until some of the existing jobs finish. Jobs requesting more than a total of 720 CPUs (120 runnable + 600 queueable) are put into the PEND state and stay there until killed by the user or an HPC administrator. The scheduling policies on Vikram-100 attempt to fully utilize system resources without over-committing them, while being fair to all HPC users.
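The arithmetic behind these limits can be sketched as a small pre-submission check. This is only an illustration: the function name check_cpu_request and its arguments are hypothetical, the limits are the values quoted above and may change, and LSF itself enforces the real policy.

```shell
#!/bin/bash
# Hypothetical helper: classify a CPU request against the per-user limits quoted above.
# The administrators may change these values; adjust them to match the current notice.
RUN_LIMIT=120     # CPUs a user may have running at once
QUEUE_LIMIT=600   # CPUs a user may additionally have waiting in the queue

# Usage: check_cpu_request <cpus_already_requested> <cpus_for_new_job>
check_cpu_request() {
    req_total=$(( $1 + $2 ))
    if [ "$req_total" -le "$RUN_LIMIT" ]; then
        echo "within running limit"
    elif [ "$req_total" -le $(( RUN_LIMIT + QUEUE_LIMIT )) ]; then
        echo "queue only"          # beyond 120 CPUs: the excess has to wait
    else
        echo "over limit"          # beyond 720 CPUs in total
    fi
}

check_cpu_request 96 24     # prints: within running limit
check_cpu_request 120 48    # prints: queue only
check_cpu_request 700 48    # prints: over limit
```

A result of "within running limit" only means that the request does not exceed the per-user cap; whether the job starts immediately still depends on the resources that are free at that moment.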

Before submitting jobs, check the available resources using the vikram-100-stat command. Pass free or cfree as an argument to vikram-100-stat to list the nodes that are partially free or completely free, respectively. You can then adjust your job requirements to match the currently available resources.
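A minimal sketch of that check is shown here. vikram-100-stat is the site command named above and exists only on the cluster; the wrapper name node_status is hypothetical, and the exact output format of vikram-100-stat is not described in this document, so the sketch simply passes it through.

```shell
#!/bin/bash
# Hypothetical wrapper around the site command vikram-100-stat.
# Accepts only the two arguments described above: free and cfree.
node_status() {
    case "$1" in
        free|cfree) ;;
        *) echo "usage: node_status free|cfree" >&2; return 2 ;;
    esac
    if command -v vikram-100-stat >/dev/null 2>&1; then
        vikram-100-stat "$1"
    else
        echo "vikram-100-stat not found; run this on the Vikram-100 master node" >&2
        return 1
    fi
}
```

For example, node_status free would list partially free nodes and node_status cfree completely free nodes.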

Submitting Batch Jobs using LSF

  1. Submit a long-running job from the bsub command line

    Example: submit a program/job that will use ONE CPU

    bsub -J jobname -oo outfile.%J -eo errorfile.%J myprog

    where bsub is the LSF job submission command. myprog is the job/program you want to run. It can be your own executable such as a.out, a software program, a system command such as gzip temp.dat, or an executable script file such as run.sh. jobname is the name that you give to the job to distinguish it from your other jobs. %J is a unique job ID number assigned by the Vikram-100 system. It is always a good idea to include %J in the output file names so that the file names are unique for different jobs. outfile and errorfile can be any file names you want to specify and may include file paths. The standard output and error messages of myprog will be stored in the files outfile.%J and errorfile.%J, respectively.


  2. Submit a long-running job via an LSF script file

    An alternative to the bsub command line is to use an LSF script file, as follows:

    bsub < jobfile

    where jobfile is a job script file. A sample LSF jobfile might contain lines like the following:

    #!/bin/bash
    #BSUB -J jobname
    #BSUB -oo outfile.%J
    #BSUB -eo errorfile.%J
    ./myprog

    You can also use #BSUB comments to specify any other bsub command options that you want to include, and multiple options can be specified on a single #BSUB comment line. An advantage of this method is that all the command options are saved in the job script file, so they are not forgotten.
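The script-file method can be sketched end to end as follows. The file name myjob.lsf and the job name testjob are placeholders, ./myprog stands for your own executable, and the two options combined on one #BSUB line illustrate the single-line form mentioned above. The submission step is guarded so that the sketch does nothing on a machine without LSF.

```shell
#!/bin/bash
# Sketch: create a job script and submit it with "bsub < jobfile".
# myjob.lsf, testjob and ./myprog are placeholder names.
jobfile=myjob.lsf

# The quoted 'EOF' keeps %J literal and prevents any shell expansion in the script body.
cat > "$jobfile" <<'EOF'
#!/bin/bash
#BSUB -J testjob
#BSUB -oo testjob.%J.out -eo testjob.%J.err
./myprog
EOF

# Submit only where bsub exists (for example on the cluster's master node).
if command -v bsub >/dev/null 2>&1; then
    bsub < "$jobfile"
else
    echo "bsub not found; $jobfile written but not submitted"
fi
```

Keeping the options in the file means the same job can be resubmitted later with bsub < myjob.lsf without retyping them.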

Running parallel Jobs on multiple CPUs

Programs can also be developed to make use of more than one CPU. There are two types of programs that can use multiple CPUs.
  1. MPI programs

    These programs are written and compiled for parallel computing using a parallel programming tool such as MPI. An advantage of MPI programs is that they can use multiple CPUs distributed across different compute nodes. In theory, therefore, an MPI job can use as many CPUs as the cluster has. A sample job script file to run an MPI job on Vikram-100 may look like this:

    #!/bin/bash
    # Set job parameters
    #BSUB -J jobname
    #BSUB -o jobname.o%J
    #BSUB -e jobname.e%J
    
    # Set number of CPUs
    #BSUB -n 48
    
    # Run MPI program
    mpirun -np 48 ./mpi_program

    In the above job script, mpi_program runs on 48 CPUs. Since a single node on Vikram-100 has only 24 CPUs, this job will automatically span at least 2 nodes.

  2. Multithreading programs

    These programs use another parallel programming method, multithreading with shared memory, for example through OpenMP. Because the memory is shared, all of the CPUs must be on the same node. In other words, this type of program can use at most the number of CPUs on a single compute node. To ensure that LSF allocates all of the CPUs for a multithreaded program on the same node, use the -R option of bsub to specify a resource requirement for a single host. A sample job script file to run a multithreaded job on Vikram-100 may look like this:

    #!/bin/bash
    # Set job parameters
    #BSUB -R "span[hosts=1]" 
    #BSUB -J jobname
    #BSUB -oo jobname.o%J
    #BSUB -eo jobname.e%J
    
    # Set number of CPUs
    #BSUB -n 4
    
    # Run multithreading program
    ./multithread_program

    Please note that the number of CPUs that you specify for a job is very important for job scheduling on Vikram-100. Job scheduling is based upon the load level of the system's compute nodes and the number of CPUs that are in use. The LSF scheduler cannot "look inside" of a job to determine how many CPUs it will actually use, so LSF must be told how many CPUs a job will use via the -n option for the bsub command when the job is submitted. Please be sure that the number of CPUs that you request is correct for your job. If the number of CPUs is not specified, LSF will assume that one CPU is needed. If you request more CPUs than your job will actually use, then your job may wait in the queue, even though enough CPUs are available for it. If you request fewer CPUs than your job actually uses, then your job may be canceled.
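One way to keep the requested and the used CPU counts in step for an OpenMP program is to derive the thread count from what LSF allocated instead of hard-coding it in two places. The sketch below assumes that LSF exports LSB_DJOB_NUMPROC (the number of allocated slots) inside the job, as standard LSF installations do, and that the program honours OMP_NUM_THREADS; both should be verified on Vikram-100. ./multithread_program is the placeholder name used in the sample scripts.

```shell
#!/bin/bash
#BSUB -J jobname
#BSUB -R "span[hosts=1]"
#BSUB -n 4
#BSUB -oo jobname.o%J
#BSUB -eo jobname.e%J

# Assumption: LSF sets LSB_DJOB_NUMPROC to the number of slots granted by -n.
# Fall back to 1 thread when the variable is unset (for example outside LSF).
export OMP_NUM_THREADS="${LSB_DJOB_NUMPROC:-1}"
echo "Running with $OMP_NUM_THREADS thread(s)"

# Placeholder program; skipped if it is not present in the current directory.
if [ -x ./multithread_program ]; then
    ./multithread_program
fi
```

With this pattern, changing the -n value is the only edit needed when the job should use a different number of CPUs.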

    To run a job on a specific node, give the node name in the job submission script. For example, to run your application on compute node 090, add a line like #BSUB -m "compute090".

    Please note that if you set multiple CPUs (#BSUB -n) for a program that is NOT multi-threaded, then LSF will run multiple instances of the SAME program thereby degrading performance for you and for other users. This may also lead to file corruption. Kindly set multiple CPUs only if you know for sure that the program is multi-threaded.
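The node arithmetic from the MPI example (48 CPUs on 24-CPU nodes gives 2 nodes) and the single-node restriction for multithreaded jobs can be sketched with two small helpers. The function names are hypothetical, and 24 CPUs per node is the figure quoted above; the node count is a minimum, since LSF may spread a job over more nodes when nodes are only partially free.

```shell
#!/bin/bash
# Hypothetical helpers; 24 CPUs per node is the figure quoted in this document.
CPUS_PER_NODE=24

# Minimum number of nodes a job with <cpus> slots must span (ceiling division).
nodes_needed() {
    echo $(( ($1 + CPUS_PER_NODE - 1) / CPUS_PER_NODE ))
}

# Succeeds if a shared-memory (span[hosts=1]) job with <cpus> slots fits on one node.
fits_on_one_node() {
    [ "$1" -le "$CPUS_PER_NODE" ]
}

nodes_needed 48    # prints 2
nodes_needed 25    # prints 2
fits_on_one_node 4  && echo "4 CPUs: a single node is enough"
fits_on_one_node 48 || echo "48 CPUs: too many for span[hosts=1]"
```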

Noteworthy BSUB parameters

The #BSUB options listed below can be added to your submission scripts as needed.

Command                         Description
#BSUB -o output.out             Write output to output.out, appending if the file exists (use %J to include the job ID)
#BSUB -oo output.out            As above, but overwrite output.out if it exists
#BSUB -e error.err              Write errors to error.err, appending if the file exists (use %J to include the job ID, e.g. "error-%J.err")
#BSUB -eo error.err             As above, but overwrite error.err if it exists
#BSUB -n 16                     Request a number of slots (16 in this case)
#BSUB -J Jobname                Job name
#BSUB -J Jobname[1-10]          Array job with 10 elements
#BSUB -R "rusage[mem=4000]"     Request a specific amount of memory for your job (4 GB in this case)
#BSUB -q cpu                    Submit your job to a specific queue (cpu in this example)
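
As an illustration of the array-job entry, each element of an array job can pick its own input from its index. The sketch assumes that LSF exports the element index as LSB_JOBINDEX and expands %I to that index in file names, which is standard LSF behaviour but worth confirming with man bsub. The input_N.dat naming scheme and ./myprog are hypothetical.

```shell
#!/bin/bash
#BSUB -J "Jobname[1-10]"
#BSUB -oo output.%J.%I.out
#BSUB -eo error.%J.%I.err

# Assumption: LSB_JOBINDEX holds this element's index (1 to 10); default to 1 outside LSF.
index="${LSB_JOBINDEX:-1}"
input="input_${index}.dat"     # hypothetical naming scheme
echo "Array element $index would process $input"

# Placeholder program; run only if both it and the input file exist.
if [ -x ./myprog ] && [ -f "$input" ]; then
    ./myprog "$input"
fi
```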

For a complete reference of bsub parameters, refer to the bsub man page (man bsub).