Knowledge Base

Article ID: 1481

My parallel Jaguar calculation failed with an "out of memory" error. What is the problem?

First, check that the hard resource limits your job inherits, whether from the root process or from the queueing system's daemon process, are high enough for the calculation.
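
As a rough way to see which limits a job actually inherits, the sketch below prints the hard limits for a few memory-related resources. The article does not say which limit is at fault, so the choice of resources (virtual memory, data segment, stack, locked memory) is an assumption, and report_hard_limits is a hypothetical helper name. It is most informative when run from inside a batch job, since the queueing daemon may impose different limits than an interactive login.

```shell
# Hypothetical helper: print the hard limit for a few memory-related
# resources (v = virtual memory, d = data segment, s = stack, l = locked
# memory). Which resource matters for a given failure is an assumption.
report_hard_limits() {
    for flag in v d s l; do
        # ulimit -H reports the hard limit; the soft limit cannot exceed it.
        printf '%s: %s\n' "$flag" "$(ulimit -H -"$flag")"
    done
}

report_hard_limits
```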

You might also encounter this problem if you have an InfiniBand network. The mca_btl_openib module in the Open MPI library, which is used on InfiniBand networks, can prevent Jaguar from accessing all of the available memory on a compute node. The module reserves some memory within the 32-bit address space, and if Jaguar needs to expand the heap beyond that space, an out-of-memory failure occurs.

There are two workarounds. The first is to set the user limit on stack size so large that mca_btl_openib is forced to allocate its memory outside the 32-bit address space. To do this, place one of the following commands in your shell startup script on the compute host (the value is in kilobytes, so 2097152 corresponds to 2 GB):

ulimit -s 2097152 (for Bourne/Bash shell)
limit stacksize 2097152 (for C-shell)

(It has been reported to us that the use of 'unlimited' does not always accomplish the goal.)
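
To confirm that the startup script took effect, the limit can be inspected from a shell on the compute host. The following is a minimal sketch: classify_stack_limit is a hypothetical helper, and it flags 'unlimited' separately because of the report above that it does not always work.

```shell
# Hypothetical helper: classify a stack limit value as printed by `ulimit -s`
# (kilobytes, or the word "unlimited") against a target in kilobytes.
classify_stack_limit() {
    limit="$1"
    target="$2"
    if [ "$limit" = "unlimited" ]; then
        # Reported separately: 'unlimited' may not achieve the goal.
        echo "unlimited"
    elif [ "$limit" -ge "$target" ]; then
        echo "ok"
    else
        echo "too-small"
    fi
}

# Check the soft limit of the current shell against the value used above.
classify_stack_limit "$(ulimit -s)" 2097152
```

A result of "too-small" suggests that the startup script was not read, or that the hard limit is lower than the requested value.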

The second workaround is to block the use of the mca_btl_openib module by setting the environment variable OMPI_MCA_BTL to '^openib'. You can do this, for example, by adding an env setting to the entry for the host in the hosts file (schrodinger.hosts). However, this workaround applies only to jobs that run on a single node.
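
For a quick trial before editing the hosts file, the same variable can be exported in a Bourne-type shell, as sketched below. The variable name and value are taken from this article as written. Whether a given launch method passes the shell environment through to the compute process is an assumption you should verify. Quoting the value is a precaution, since some shells give the caret a special meaning.

```shell
# Variable name and value as given in this article; the leading caret
# excludes the named module instead of selecting it.
OMPI_MCA_BTL='^openib'
export OMPI_MCA_BTL

# Confirm the value that child processes will see.
printenv OMPI_MCA_BTL
```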

If you use an SGE queueing system and encounter memory limitations when using InfiniBand, you can also change the memory parameters. As the queue manager, run the command:

qconf -mconf

Then change the execd_params line to:

execd_params H_MEMORYLOCKED=infinity H_DESCRIPTORS=32000
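
After saving the change, you may want to confirm that it is present in the global configuration. The sketch below assumes that qconf is on the PATH and that the listing printed by qconf -sconf contains the execd_params line on a single line. execd_params_ok is a hypothetical helper that reads such a listing on standard input.

```shell
# Hypothetical helper: read a cluster configuration listing on stdin and
# succeed only if the execd_params line contains both settings.
execd_params_ok() {
    line=$(grep '^execd_params') || return 1
    case "$line" in
        *H_MEMORYLOCKED=infinity*) ;;
        *) return 1 ;;
    esac
    case "$line" in
        *H_DESCRIPTORS=32000*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, assuming qconf is available on the current host:
#   qconf -sconf | execd_params_ok && echo "execd_params updated"
```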
