Handling Large Memory Jobs

When preparing to submit a job, keep the system's usable memory in mind and configure your job script to avoid oversubscribing the cores, which will degrade performance. Use the bjobs -w command to get the list of nodes on which your job is running. If, for example, compute005 is identified as one of the nodes your job is running on, then log in to that node and run top.
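For example (assuming a hypothetical job ID of 12345, and that compute005 shows up in the node list), the sequence of commands would look roughly like this:
# show the wide listing for the job, including its execution host(s)
bjobs -w 12345
# log in to one of the listed nodes and monitor your own processes
ssh compute005
top -u $USER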

Identify your process(es) in the listing of all the jobs on that node and look at the VIRT (virtual memory), RES (resident memory), %CPU and %MEM columns for your job(s). If your job is using a large percentage of the memory on the machine but only a small percentage of CPU time, and there is a large difference between the virtual and resident memory values, then your job is likely swapping because it does not have enough physical memory.
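If you prefer a one-shot snapshot rather than the interactive top display, a standard ps command along these lines reports the same figures (VSZ and RSS correspond to top's VIRT and RES columns):
# list your processes, largest resident memory first
ps -u $USER -o pid,vsz,rss,%cpu,%mem,comm --sort=-rss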

Look under the %MEM column to see how much memory you are using. If you are using more than 20%, you should specify the memory requirement for your job. This is done by including the following in the script file:
#BSUB -R "rusage[mem=nnnn]"
where nnnn is in MB, and is per node. If a job spans multiple nodes, each node must have nnnn MB of memory available. For example, if your job needs about 8 GB of memory, you can include the following in your job script file when submitting it to LSF:
#BSUB -R "rusage[mem=8000]"

If you are still facing memory-related issues, consider spreading memory-intensive jobs across more nodes and using fewer than the full number of cores on each node. This makes more memory available to run your processes.

You can do this by submitting your script with the following parameter:
#BSUB -R "span[ptile=12]"
For example, the following script will by default run your job across two nodes (since each node has 24 CPUs), with a total of 512 GB of memory available (since each node has 256 GB of RAM).
#!/bin/bash
#
#BSUB -n 48
#BSUB -e errors.%J
#BSUB -o output.%J
 
mpirun -np 48 ./program_name.exe
However, adding #BSUB -R "span[ptile=12]" to the script forces the job scheduler to span your job across 4 compute nodes, with each node running 12 MPI processes, for a total of 1024 GB of memory (4 × 256 GB). Do this ONLY if your job is experiencing memory-related problems.
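Putting this together, the complete script from above with the ptile directive added would read:
#!/bin/bash
#
#BSUB -n 48
#BSUB -R "span[ptile=12]"
#BSUB -e errors.%J
#BSUB -o output.%J

mpirun -np 48 ./program_name.exe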