A few days ago I had a problem with a relatively busy HTCondor pool. The central manager (which, in my case, also runs a scheduler daemon and accepts job submissions) was completely unresponsive. Although the system load wasn't too high, any attempt to use condor_rm or even condor_status timed out.
At first glance, the HTCondor logs on the central manager looked OK. But after some catting and grepping I spotted the following:
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 Number of Active Workers 49
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:06:14 ForkWork: not forking because reached max workers 50
08/26/15 12:06:14 Number of Active Workers 50
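If you suspect the same problem on your own pool, grepping the daemon log for the refusal message is enough to confirm it. Here is a minimal, self-contained sketch: the excerpt file and its path are invented for illustration, and on a real system you would grep the actual log under the directory that `condor_config_val LOG` reports:

```shell
# Save a short excerpt of the daemon log (illustrative scratch file).
cat > /tmp/shared_port_excerpt.log <<'EOF'
08/26/15 12:05:13 Number of Active Workers 50
08/26/15 12:05:13 ForkWork: not forking because reached max workers 50
08/26/15 12:06:14 ForkWork: not forking because reached max workers 50
EOF

# Count how often the daemon refused to fork a new worker.
grep -c "reached max workers" /tmp/shared_port_excerpt.log   # prints 2
```

A steadily growing count here, while clients are timing out, is a strong hint that the worker limit is the bottleneck.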
As we know, HTCondor uses port 9618 for communication (even when running HTCondor commands on the same host!). To allow the different daemons to use this port simultaneously, HTCondor by default runs a port sharing daemon, condor_shared_port.
To handle multiple requests, this daemon forks a worker process for each client. To limit resource consumption, there is a built-in cap on the number of forked workers. Once this cap is reached, new clients have to wait until a worker is free to handle them. Since the operations (especially job submits) aren't very quick, this leads to timeouts and various failures.
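The mechanism can be illustrated with a toy sketch. This is not HTCondor's code, just the pattern described above (worker reaping is omitted for brevity): a parent "forks" a worker per client and starts refusing once the cap is reached.

```shell
# Toy model of the fork-worker cap; all names here are made up.
MAX_WORKERS=2
active=0
serve_client() { sleep 1; }   # stand-in for a slow operation such as a submit

for client in a b c; do
  if [ "$active" -ge "$MAX_WORKERS" ]; then
    # This is the situation behind the "not forking" log lines above.
    echo "not forking for client $client: reached max workers $MAX_WORKERS"
  else
    serve_client &            # "fork" a worker for this client
    active=$((active+1))
  fi
done
wait                          # let the workers finish
# prints: not forking for client c: reached max workers 2
```

Because the workers are slow, the refused client can only retry or time out, which is exactly what condor_rm and condor_status were doing.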
The bad news is that the documentation doesn't contain a single word about how to increase this limit. Grepping the documented config parameters turns up only an unrelated knob, nothing about the shared port daemon's workers.
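One way to do that search yourself is to dump every parameter the running configuration knows about and grep the result; the guard below is just so the sketch degrades gracefully on machines without HTCondor installed:

```shell
# Search the live configuration for worker-related knobs (illustrative).
if command -v condor_config_val >/dev/null 2>&1; then
  hits=$(condor_config_val -dump | grep -ci worker)
  echo "worker-related parameters found: $hits"
else
  hits=0
  echo "HTCondor tools not installed"
fi
```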
Fortunately, a look at the condor_shared_port source reveals that SHARED_PORT_MAX_WORKERS is the parameter we need. Setting the following in condor_config.local fixed the problem for me:
SHARED_PORT_MAX_WORKERS = 500
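To check that the new value is picked up, the standard tools can be used. This is a sketch assuming a working HTCondor installation (the guard keeps it harmless elsewhere); if a condor_reconfig doesn't seem to take effect, restarting the daemons is the safe fallback:

```shell
# Verify the raised limit and ask the running daemons to re-read config.
if command -v condor_config_val >/dev/null 2>&1; then
  limit=$(condor_config_val SHARED_PORT_MAX_WORKERS)
  echo "SHARED_PORT_MAX_WORKERS = $limit"   # should now report 500
  condor_reconfig
else
  limit=unknown
  echo "HTCondor tools not installed; nothing to verify"
fi
```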
I love open source software. :)