A few days ago I had a problem with a relatively busy HTCondor pool. The central manager (which, in my case, is also responsible for running a scheduler daemon and submitting jobs) was completely unresponsive. Although the system load wasn’t too high, any attempt to use condor_submit, condor_rm or even condor_q and condor_status was timing out.

At first glance, the HTCondor logs on the central manager looked OK. But after some catting and grepping I spotted the following in /var/log/condor/SharedPortLog:

08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 Number of Active Workers 49
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:06:14 ForkWork: not forking because reached max workers 50
08/26/15 12:06:14 Number of Active Workers 50
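To see how often the limit is being hit, rather than eyeballing the log, something like the following can help. It is only a minimal sketch: it assumes the log lines look exactly like the excerpt above, and the function name saturation_per_minute is my own, not part of HTCondor.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, e.g.
# 08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
# (the exact format is an assumption based on that excerpt).
SATURATED = re.compile(
    r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}):\d{2} "
    r"ForkWork: not forking because reached max workers \d+"
)


def saturation_per_minute(lines):
    """Count 'reached max workers' messages per (date, HH:MM) bucket."""
    counts = Counter()
    for line in lines:
        match = SATURATED.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The excerpt from SharedPortLog shown above, used as sample input.
SAMPLE = """\
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 Number of Active Workers 49
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:06:14 ForkWork: not forking because reached max workers 50
08/26/15 12:06:14 Number of Active Workers 50
"""

if __name__ == "__main__":
    for (date, minute), hits in sorted(saturation_per_minute(SAMPLE.splitlines()).items()):
        print(date, minute, hits)
```

To run it against the real log, pass the open file instead of the sample, for example saturation_per_minute(open("/var/log/condor/SharedPortLog")).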

As we know, HTCondor uses port 9618 for communication (even when running HTCondor commands on the same host!). In order to allow the different daemons to use this port simultaneously, by default HTCondor uses a port sharing daemon (condor_shared_port).

In order to handle multiple requests, this daemon forks a worker process for each client. To reduce resource consumption, there is a built-in limit on the number of forked workers. Once this limit is reached, new clients have to wait until a worker is free to handle them. Since the operations (especially job submissions) aren’t very quick, this leads to timeouts and various failures.
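A back-of-the-envelope model shows why a small worker limit turns into timeouts. This is a toy sketch, not how condor_shared_port is actually implemented: it assumes all clients arrive at the same moment, every request takes the same time, and workers are handed out in batches. The function name and the numbers are made up for illustration.

```python
import math


def worst_case_wait(clients, max_workers, seconds_per_request):
    """Seconds the unluckiest client waits before a worker picks it up.

    Toy model: all clients arrive at once and each request takes the
    same time, so clients are served in batches of max_workers.
    """
    if max_workers <= 0:
        raise ValueError("max_workers must be positive in this model")
    batches = math.ceil(clients / max_workers)
    # The last batch waits for every batch before it to finish.
    return max(batches - 1, 0) * seconds_per_request


if __name__ == "__main__":
    # Hypothetical load: 200 simultaneous clients, 10 seconds per request.
    print(worst_case_wait(200, 50, 10))
    print(worst_case_wait(200, 500, 10))
```

With the hypothetical load above, a limit of 50 makes the last clients wait 30 seconds before they are even served, while a limit of 500 lets every client get a worker immediately.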

The bad news is that the documentation doesn’t say a single word about how to raise this limit. Grepping the configuration parameters turns up only a setting that controls the number of schedd workers.

Fortunately, we can find the following in the condor_shared_port source (shared_port_server.cpp):

forker.Initialize();
int max_workers = param_integer("SHARED_PORT_MAX_WORKERS",50,0);
forker.setMaxWorkers( max_workers );

So, SHARED_PORT_MAX_WORKERS is the parameter we need, and its default of 50 matches the “max workers 50” in the log. Setting the following in condor_config.local fixed the problem for me:

SHARED_PORT_MAX_WORKERS = 500

I love open source software. :)