Issue: A job continues to run long after a node it was using fails.
Symptom: After a node running a job fails, the job cannot be removed until it runs past its walltime or qdel -p is used to purge it.
Torque will purge such jobs automatically if the job_force_cancel_time server parameter is set.
Description: If a job has been deleted and is still in the system after x seconds, the job will be purged from the system. This is mostly useful when a job is running on a large number of nodes and one node goes down. The job cannot be deleted because the MOM on the failed node cannot be contacted, so the qdel fails and none of the other nodes can be reused. This parameter can be used to remedy such situations.
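As a minimal sketch of how this might be applied, the commands below set the parameter through qmgr and then purge a job that is already stuck. The 300-second value and the JOB_ID variable are assumptions chosen for illustration only; pick a timeout that suits the site, and run the commands as a Torque manager on the server host.

```shell
# Sketch only: 300 seconds is an illustrative value, not a recommendation.
# Purge a deleted job if it is still in the system after this many seconds.
qmgr -c "set server job_force_cancel_time = 300"

# Check that the parameter now appears in the server configuration.
qmgr -c "print server" | grep job_force_cancel_time

# For a job that is already stuck, purge it by hand.
# JOB_ID is a placeholder; set it to the ID of the stuck job first.
qdel -p "$JOB_ID"
```

Because qdel -p removes the job from the server without confirming cleanup on the nodes, any leftover processes on the surviving nodes may still need to be checked by hand.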
Note: Sites running any version earlier than 5.0.x will benefit considerably from upgrading to 5.0.x or later. A number of enhancements that help resolve these issues were introduced starting in 5.0.x.
Tags: cancel, force cancel, job cancel, node failure