How can I have Moab cancel a job if a node fails?


Issue: A job continues to run long after a node it was using fails.

 

Symptom: After a node fails with a job on it, the job cannot be removed until it runs past its walltime or qdel -p is used to purge it.

 

Solution:


Torque will purge such a job after it has been deleted if you have job_force_cancel_time set.

job_force_cancel_time
Format: <INTEGER>
Default: Disabled
Description: If a job has been deleted and is still in the system after the configured number of seconds, the job is purged from the system. This is mostly useful when a job is running on a large number of nodes and one node goes down. The job cannot be deleted because the MOM on the failed node cannot be contacted, so the qdel fails and none of the other nodes can be reused. This parameter can be used to remedy such situations.
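
As a rough illustration, the parameter could be set with qmgr on the pbs_server host. This is only a sketch: the 300-second value, the FORCE_CANCEL_SECS and APPLY variables, and the build_directive helper are assumptions made for this example, not values or names from Torque or from this entry. The script prints the command by default and only runs it when APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical timeout in seconds; choose a value that suits your site.
FORCE_CANCEL_SECS="${FORCE_CANCEL_SECS:-300}"

# Hypothetical helper: build the qmgr directive that sets the parameter.
build_directive() {
    printf 'set server job_force_cancel_time = %s' "$1"
}

DIRECTIVE="$(build_directive "$FORCE_CANCEL_SECS")"

if [ "${APPLY:-0}" = "1" ]; then
    # Run on the pbs_server host as a Torque manager.
    qmgr -c "$DIRECTIVE"
    # Print the server configuration to confirm the value was stored.
    qmgr -c 'print server' | grep job_force_cancel_time
else
    # Dry run: show the command instead of executing it.
    echo "qmgr -c \"$DIRECTIVE\""
fi
```

With the parameter in place, a job that was deleted with qdel but is still in the system after that many seconds should be purged without a manual qdel -p.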

 

Note: Sites running any version earlier than 5.0.x will benefit considerably from upgrading to 5.0.x or later. A number of enhancements that help resolve these issues were introduced starting in 5.0.x.

Tags: cancel, force cancel, job cancel, node failure
Last update: 2015-08-27 17:14
Author: Jason Booth
Revision: 1.1