Moab/TORQUE is reporting a job that is no longer on the node


Issue: Moab/TORQUE is reporting a job that is no longer on the node.

Affected Versions: ALL

Symptom: Sometimes a resource manager or compute node experiences a transient failure, such as a network outage. In these cases Moab assumes that the job is still running on the compute node. If the outage lasts beyond the job's walltime, Moab continues to track the job as running, and pbs_server likewise keeps it in the running state. In rare cases pbs_mom was never able to send the job's completion notice (its "obituary", or obit) to the server. The result is a stale job record known as a phantom job.

Solution: There are a few ways to address this. Work through the steps below in order until the job is no longer reported.

  1. Try running qdel -p <job id> to purge the job.
    1. If the purge does not remove the job, try qdel -c, which cleans up unreported jobs from the server. Use it only if the scheduler is unable to purge unreported jobs. This option is available only to a batch operator or the batch administrator.
  2. SSH into the compute node and remove the job file under "/var/spool/torque/mom_priv/jobs/".
  3. SSH into the pbs_server host and remove the job file under "/var/spool/torque/server_priv/jobs/".
  4. If Moab still reports the phantom job, remove the associated checkpoint file for the job under "$MOABHOMEDIR/spool/".

You may need to restart Moab for the scheduler to pick up the change in the checkpoint file.
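
Before removing anything in steps 2 and 3, it can help to list exactly which spool entries belong to the job. The following is a minimal sketch, not part of TORQUE or Moab: list_job_files is a hypothetical helper, and it assumes that job file names begin with the job id followed by a dot, which may not hold on every TORQUE version. It only prints matches and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not a TORQUE command): print the spool entries that
# appear to belong to one job, so they can be reviewed before removal.
# Assumption: entries are named "<job id>.<something>".
# Usage: list_job_files <job id> [spool directory]
list_job_files() {
    job_id=$1
    # Default to the MOM spool directory named in step 2.
    spool_dir=${2:-/var/spool/torque/mom_priv/jobs}
    # Requiring the dot after the id keeps job 1234 from matching job 12345.
    for entry in "$spool_dir"/"$job_id".*; do
        # An unmatched glob stays literal, so print only entries that exist.
        if [ -e "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
    return 0
}
```

On the compute node this might be run as list_job_files <job id>, and on the pbs_server host as list_job_files <job id> /var/spool/torque/server_priv/jobs. If the output contains only the phantom job's entries, those are the candidates to remove by hand.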

 

Tags: completed jobs, phantom jobs
Last update: 2015-06-11 18:46
Author: Jason Booth
Revision: 1.1