Knowledge Base

My cluster is experiencing heavy traffic and packet drops on the head node, and jobs are failing. What can I do to improve performance?

Running NAT on a cluster head node can cause packet drops when the kernel has to track more connections than its connection tracking table can hold. Excluding local connections from connection tracking resolves the issue, as explained below.

Part of handling NAT is maintaining a connection tracking table of configurable size. When the table is full, the firewall starts dropping packets, and errors like the following appear in the system logs:

kernel: nf_conntrack: table full, dropping packet
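
To see how close the table is to its limit, the current entry count can be compared with the maximum. The following is a minimal sketch, assuming the nf_conntrack module is loaded and exposes its counters under /proc/sys/net/netfilter; the helper name conntrack_pct is made up for illustration.

```shell
#!/bin/sh
# Standard procfs entries, present only when nf_conntrack is loaded.
COUNT_FILE=/proc/sys/net/netfilter/nf_conntrack_count
MAX_FILE=/proc/sys/net/netfilter/nf_conntrack_max

# conntrack_pct COUNT MAX: print table usage as an integer percentage.
# (Hypothetical helper, not part of any standard tool.)
conntrack_pct() {
    echo $(( $1 * 100 / $2 ))
}

if [ -r "$COUNT_FILE" ] && [ -r "$MAX_FILE" ]; then
    count=$(cat "$COUNT_FILE")
    max=$(cat "$MAX_FILE")
    echo "conntrack entries: $count of $max ($(conntrack_pct "$count" "$max")%)"
else
    echo "nf_conntrack counters not found; is the module loaded?"
fi
```

A usage figure that stays near 100% during normal operation suggests the table is the bottleneck.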

On systems that serve as a dedicated firewall/NAT device, most connections have a source and destination other than the device itself. On such systems, increasing the table size alleviates the problem.
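
As a rough illustration of raising the limit, the sketch below prints the sysctl command for a new table size without running it. The key net.netfilter.nf_conntrack_max is the usual name on current kernels (older kernels may use a different key); the value 262144 and the helper name raise_cmd are assumptions chosen for the example, not recommendations.

```shell
#!/bin/sh
# raise_cmd NEW_MAX: print (but do not run) the command that sets a new limit.
# (Hypothetical helper; it only builds the command string.)
raise_cmd() {
    echo "sysctl -w net.netfilter.nf_conntrack_max=$1"
}

# 262144 is an example value only; choose one that fits available memory.
raise_cmd 262144
```

Running the printed command as root applies the change until the next reboot. To make it persistent, the same key and value would normally go in /etc/sysctl.conf.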

However, on machines like cluster head nodes, the majority of connections are to the machine itself (qstat, DNS, lmgrd, etc.), and only a minority are from compute nodes to outside machines. In this case, not tracking connections bound for the machine itself results in a substantially smaller tracking table.

The iptables rules below, which go in the raw table, can be used to exclude local connections from tracking.

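*raw
# Skip tracking for traffic to and from the head node's own private address
-A PREROUTING -i eth0 -d address -j NOTRACK
-A OUTPUT -o eth0 -s address -j NOTRACK

# Skip tracking for loopback traffic
-A PREROUTING -i lo -j NOTRACK
-A OUTPUT -o lo -j NOTRACK
# COMMIT ends the raw table section in iptables-restore format
COMMIT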

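where address is the private IP address of the cluster head node. The rules assume that eth0 is the interface on the private cluster network; substitute the actual interface name if it differs.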
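
Substituting the placeholders can be sketched as a small shell function that prints the rules for a given interface and address in iptables-restore format. The function name notrack_rules, the interface eth1, and the address 10.1.1.1 are hypothetical values used only for illustration.

```shell
#!/bin/sh
# notrack_rules IFACE ADDR: print raw-table rules that exempt local
# traffic from connection tracking. (Hypothetical helper.)
notrack_rules() {
    iface=$1
    addr=$2
    cat <<EOF
*raw
-A PREROUTING -i $iface -d $addr -j NOTRACK
-A OUTPUT -o $iface -s $addr -j NOTRACK
-A PREROUTING -i lo -j NOTRACK
-A OUTPUT -o lo -j NOTRACK
COMMIT
EOF
}

# Example values only; use the head node's private interface and address.
notrack_rules eth1 10.1.1.1
```

The output can be checked without changing the firewall by piping it to iptables-restore --test as root. Recent iptables documentation describes NOTRACK as equivalent to -j CT --notrack, so either form may appear depending on the iptables version in use.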

To ask a question or get help, please submit a support ticket or email us at help@schrodinger.com.
