After upgrading to 6.1.2, I can only run single-node jobs. How do I fix this?


Issue:  Only single-node jobs can be run. 

Cause:  This is often due to the "Trusted Client" list not being properly updated on the MOMs.  Each MOM will only interact with other MOMs it "trusts".  At startup, the Torque server normally provides this list to the compute nodes.  To see if this is the cause of the problem, go to a compute node that seems to have problems and run "momctl -d3" (diagnose).  The line titled "Trusted Client List:" should include ALL of the other compute nodes.  If it does not, this needs to be fixed.
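On a large cluster, comparing the "Trusted Client List:" line against the expected nodes by eye is error-prone. The check can be sketched in Python as below. This is a minimal sketch, not part of Torque: the function name is hypothetical, and it assumes only that the addresses appear as plain IP strings somewhere on the "Trusted Client List:" line. The exact layout of the momctl output may differ between versions.

```python
import re


def missing_trusted_clients(momctl_output, expected_ips):
    """Return the expected IPs that are absent from the Trusted Client List line.

    momctl_output is the text printed by "momctl -d3"; expected_ips is the
    list of MOM addresses you expect to be trusted (supplied by you).
    """
    trusted_line = ""
    for line in momctl_output.splitlines():
        if "Trusted Client List:" in line:
            # Keep only the part after the title.
            trusted_line = line.split("Trusted Client List:", 1)[1]
            break

    missing = []
    for ip in expected_ips:
        # Match the whole address, so 10.10.1.2 is not satisfied by 10.10.1.20.
        pattern = r"(?<![\d.])" + re.escape(ip) + r"(?!\d)"
        if not re.search(pattern, trusted_line):
            missing.append(ip)
    return missing
```

One way to feed it is to capture the command output on the compute node, for example with subprocess.run(["momctl", "-d3"], capture_output=True, text=True).stdout, and pass that text in together with your own list of MOM addresses. An empty result means every expected node is on the list.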

Solution:  The incomplete list is often caused by the Torque server binding its outbound sockets (an issue inadvertently introduced in 6.1.2).  There are two ways to fix it.

One is to rebuild Torque with the "configure" flag "--disable-bind-outbound-sockets", then reinstall the server (you do not need to reinstall the MOMs).  Restart the server and all MOMs and you should see complete client lists.

The other solution is a workaround that avoids rebuilding Torque.  Update each MOM's "config" file (<TORQUE HOME>/mom_priv/config) so it contains one line for each of the MOM nodes, with each line looking like this (using the actual IP addresses of the MOMs):

$pbsclient  10.10.1.2

Once the config file has been updated, the MOM needs to be restarted.  This can be quite an effort on a large cluster, but some type of distributed shell can often be used to simplify the work, especially since at most sites these config files are identical on every node.
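Because the same "$pbsclient" lines go into every MOM's config file, the new file contents can be generated once and then distributed. The following is a minimal Python sketch under stated assumptions: the function name is hypothetical, it works on the config text rather than touching files, and it simply appends any missing "$pbsclient" lines while leaving existing lines alone. Reading and writing <TORQUE HOME>/mom_priv/config and restarting the MOMs are left to your own tooling.

```python
def add_pbsclient_lines(config_text, mom_ips):
    """Return config_text with one "$pbsclient <ip>" line per MOM address.

    Addresses that already have a "$pbsclient" line are skipped, so running
    the function twice on the same input does not create duplicates.
    """
    lines = config_text.splitlines()

    # Collect addresses that are already listed as trusted clients.
    present = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "$pbsclient":
            present.add(parts[1])

    # Append a line for each MOM address that is not listed yet.
    for ip in mom_ips:
        if ip not in present:
            lines.append(f"$pbsclient  {ip}")
            present.add(ip)

    return "\n".join(lines) + "\n"
```

For example, calling add_pbsclient_lines(existing_text, ["10.10.1.2", "10.10.1.3"]) on a config that already trusts 10.10.1.2 adds only the line for 10.10.1.3. The addresses here are placeholders; use the real IP addresses of your MOM nodes.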

Either of the above solutions should resolve this issue, and "momctl -d3" will then show the complete list of MOMs.  At that point multi-node jobs should run again.

References:  Salesforce tickets 26378 and 26508, and JIRA tickets TRQ-4244 and TRQ-4247

Long-Term Solution:  This issue should go away with release 6.1.3, and may, in fact, be corrected with an earlier hotfix. 

Last update:
2018-08-09 17:27
Author:
Rob Greenbank
Revision:
1.1