How to fix Mass Job cancellation creates DDOS on pbs_server (around 4.2.x)



Issue: Mass job cancellation floods pbs_server with requests from the moms, in effect a DDoS (around 4.2.x)


Affected Versions: 4.2.5, 4.2.x


Symptom:

In some lower to mid versions of Torque, cancelling many jobs at once, whether in an array or as individual jobs, can leave Torque (pbs_server) sluggish or non-responsive. Many jobs reaching their walltime limit at the same time can trigger the same behavior. Even after you regain control of pbs_server, the cancelling jobs continue to hold resources, so new jobs cannot use them.

Typical message: Exiting loop because we passed our retry tolerance: 24
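To gauge how often the server is hitting this condition, the log can be scanned for the message above. The following is a minimal sketch: the function name is made up for illustration, the log location is left to the caller, and treating the number after the colon as variable is an assumption.

```python
import re

# Message text taken from the "Typical message" line; the trailing number is
# assumed to vary between occurrences.
RETRY_PATTERN = re.compile(
    r"Exiting loop because we passed our retry tolerance: (\d+)"
)


def count_retry_messages(log_lines):
    """Return how many of the given lines contain the retry-tolerance message."""
    return sum(1 for line in log_lines if RETRY_PATTERN.search(line))
```

An open log file can be passed directly, since iterating over a file object yields its lines.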


Solution:

The problem is that all the moms try to contact the server at once to cancel their jobs, which is more than pbs_server can handle. Even regular client commands like pbsnodes and qstat may fail. Here are the simple steps I use to regain control.

1 Shut down pbs_server. This may be easier said than done; in some cases I have had to kill it with signal 9 (SIGKILL). It needs to be down so that it can be restarted and control regained.

2 Now restart the server with pbs_server -c. This staggers the server's response to the moms and gives you a few minutes to work on the problem (the steps below). Sometimes you may need to repeat this to get enough time to attack the problem.
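Steps 1 and 2 can be sketched as a small helper that tries a polite stop first and only falls back to signal 9 if the process is still there. This is an illustration, not a tested procedure: the function names are hypothetical, the PID has to be found by the operator, and it assumes pbs_server exits on SIGTERM when it is able to respond at all.

```python
import os
import signal
import subprocess
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_with_escalation(pid, grace_seconds=30.0, poll_seconds=1.0,
                         send_signal=os.kill, is_alive=pid_alive,
                         sleep=time.sleep):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the name of the last signal that was sent.
    """
    send_signal(pid, signal.SIGTERM)
    waited = 0.0
    while waited < grace_seconds:
        if not is_alive(pid):
            return "SIGTERM"
        sleep(poll_seconds)
        waited += poll_seconds
    if not is_alive(pid):
        return "SIGTERM"
    send_signal(pid, signal.SIGKILL)  # the "sig 9" case from step 1
    return "SIGKILL"


def restart_staggered(run=subprocess.run):
    """Start pbs_server with -c, as described in step 2."""
    run(["pbs_server", "-c"], check=True)
```

The `send_signal`, `is_alive`, `sleep` and `run` parameters exist only so the logic can be exercised without touching a real server.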

Attacking the problem: since the jobs were being cancelled anyway, you should not feel bad about getting rid of them.

3 Assuming you just started with pbs_server -c, you should now be able to issue commands. Use qstat to look for jobs stuck in 'cancelling'.

4 Use qstat -f to see which node or nodes they are on.

5 'Offline' those nodes with pbsnodes -o

6 Restart those moms with pbsnodes -P to purge the junk jobs from those moms. In severe cases you may need to go to the job and array folders and remove any .jb, .ar or .sc files.

7 Slowly, preferably one by one, bring the nodes back online with pbsnodes -c to "clear" the offline status. You should see the culprit jobs clear off the pbs_server (Torque) and Moab, and new jobs should then start freely on those nodes.
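Steps 4 and 7 can be sketched in code: pull the node names out of captured qstat -f output, then clear the offline flag one node at a time with a pause in between. Everything here is an assumption for illustration. The function names are hypothetical, the parser expects the usual "Job Id:" / "attribute = value" layout with tab-indented continuation lines, and which job_state letters correspond to the stuck jobs is left to the caller ("E" in the example below is only a guess).

```python
import subprocess
import time


def parse_qstat_full(text):
    """Parse `qstat -f` text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None
    last_key = None
    for raw in text.splitlines():
        if raw.startswith("Job Id:"):
            current = jobs.setdefault(raw.split(":", 1)[1].strip(), {})
            last_key = None
        elif current is None or not raw.strip():
            continue
        elif raw.startswith("\t") and last_key is not None:
            # Long values wrap onto tab-indented continuation lines.
            current[last_key] += raw.strip()
        elif " = " in raw:
            key, value = raw.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def nodes_for_jobs(jobs, states):
    """Return sorted node names hosting jobs whose job_state is in states."""
    nodes = set()
    for attrs in jobs.values():
        if attrs.get("job_state") in states and "exec_host" in attrs:
            # exec_host entries look like node01/0+node01/1+node02/0
            for slot in attrs["exec_host"].split("+"):
                nodes.add(slot.split("/", 1)[0])
    return sorted(nodes)


def clear_offline_one_by_one(nodes, pause_seconds=60,
                             run=subprocess.run, sleep=time.sleep):
    """Run `pbsnodes -c <node>` for each node, pausing between nodes."""
    for node in nodes:
        run(["pbsnodes", "-c", node], check=True)
        sleep(pause_seconds)
```

For example, `nodes_for_jobs(parse_qstat_full(output), {"E"})` gives the list of nodes to offline in step 5 and to bring back in step 7. The 60 second pause is an arbitrary starting point, not a recommended value.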

In most cases this is all that is needed. In some of the more serious cases I have had to follow up in Moab by removing the individual jobs from the .ck file, since the developers do not recommend removing the .ck file itself.
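For the severe case mentioned in step 6, it helps to see what is left behind before removing anything. The sketch below only lists candidate files and deletes nothing. The function name is hypothetical, the directory must be supplied by the operator because the job and array folder locations depend on the installation, and matching the extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions named in step 6, compared in lower case.
LEFTOVER_SUFFIXES = {".jb", ".ar", ".sc"}


def find_leftover_files(directory, suffixes=LEFTOVER_SUFFIXES):
    """Return sorted paths of files in directory with a matching extension."""
    root = Path(directory)
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

Reviewing the returned list by hand before deleting keeps healthy jobs from being removed along with the stuck ones.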

Tags: pbs_server job cancel cancellation unresponsive DDOS
Last update: 2015-06-11 17:32
Author: Nathan Burton
Revision: 1.1