Why does a job not cancel when a node has failed?


 

Issue:

Moab is no longer cancelling jobs after node failure.

 

Symptom:

checkjob reports the node failure and the jobs enter the "Cancelling" state (as shown by showq), but they are never actually cancelled until someone cancels them manually (with mjobctl -F or qdel -p).


Solution:

 

Torque should do this on its own if the job_force_cancel_time server parameter is set.

job_force_cancel_time
Format: <INTEGER>
Default: Disabled
Description: If a job has been deleted and is still in the system after x seconds, the job is purged from the system. This is mostly useful when a job is running on a large number of nodes and one node goes down: the job cannot be deleted because the MOM cannot be contacted, so the qdel fails and none of the other nodes can be reused. This parameter can be used to remedy such situations.
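
As a rough sketch of how the parameter might be set, the commands below use qmgr on the Torque server host. The value of 300 seconds and the variable names FORCE_CANCEL_SECS and QMGR_CMD are illustrative assumptions, not recommendations from this article; pick a timeout that suits your site.

```shell
#!/bin/sh
# Assumed value: purge deleted jobs that are still present after 300 seconds.
FORCE_CANCEL_SECS=300
QMGR_CMD="set server job_force_cancel_time = ${FORCE_CANCEL_SECS}"

if command -v qmgr >/dev/null 2>&1; then
    # Apply the setting on the Torque server (requires manager privileges).
    qmgr -c "$QMGR_CMD"
    # Show the server configuration line to confirm the value was stored.
    qmgr -c 'print server' | grep job_force_cancel_time
else
    # qmgr is not on this host; only show what would have been run.
    echo "qmgr not found; would run: qmgr -c \"$QMGR_CMD\""
fi
```

Jobs that are already stuck in "Cancelling" may still need to be cleared by hand with mjobctl -F or qdel -p, as described under Symptom.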

 

 

Tags: cancel, node fail
Last update: 2015-08-26 18:31
Author: Jason Booth
Revision: 1.2