Issue: Jobs exit almost as soon as they start and then restart over and over again.
Symptom: When Moab starts a job, the job almost immediately exits and restarts. showq may show the job in the queue and running, while qstat may show the job as queued.
In the checkjob output you may see a message like the following: "NODExyz did not respond in certain amount of time."
Solution: There are several reasons a job may fail to start or may end almost immediately. Below is a list of items that can be checked. If the job is being started and restarted, look at the pbs_mom logs. The pbs_mom passes through three stages before a job starts, and its logs show the job ID and each stage. In this example we are looking at the TMomFinalizeJob3 entries.
05/16/2016 09:32:19.089;01; pbs_mom.24849;Job;TMomFinalizeJob3;Job 350.support-mpi read start return code=0 session=22740
05/16/2016 09:32:19.089;01; pbs_mom.24849;Job;350.support-mpi;saving task (TMomFinalizeJob3)
05/16/2016 09:32:19.090;01; pbs_mom.24849;Job;TMomFinalizeJob3;job 350.support-mpi started, pid = 22740
The log entries above show a job that started successfully. They also show the PID associated with the job shell.
A failure also appears in the pbs_mom logs. For example, if a prologue or epilogue script fails, you will see entries like the following.
05/16/2016 09:37:53.043;01; pbs_mom.22878;Job;TMomFinalizeJob3;Job 351.support-mpi read start return code=-2 session=22885
05/16/2016 09:37:53.043;01; pbs_mom.22878;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information)
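When a pbs_mom log is long, it can help to pull out only the job starts that failed. The following is a minimal sketch, not an official TORQUE tool. It assumes the "read start return code" entries always look like the two examples above and that a nonzero return code means the start failed. The function names are hypothetical.

```python
import re

# Matches the "read start return code" entries written by TMomFinalizeJob3,
# assuming the format seen in the log excerpts above.
START_RE = re.compile(
    r"TMomFinalizeJob3;Job (?P<job>\S+) read start return code=(?P<rc>-?\d+)"
    r" session=(?P<session>\d+)"
)


def find_start_results(lines):
    """Return (job_id, return_code, session) for every matching log line."""
    results = []
    for line in lines:
        m = START_RE.search(line)
        if m:
            results.append(
                (m.group("job"), int(m.group("rc")), int(m.group("session")))
            )
    return results


def failed_starts(lines):
    """Keep only entries with a nonzero return code (assumed to mean failure)."""
    return [r for r in find_start_results(lines) if r[1] != 0]


if __name__ == "__main__":
    # Sample lines copied from the log excerpts in this article.
    sample = [
        "05/16/2016 09:32:19.089;01; pbs_mom.24849;Job;TMomFinalizeJob3;"
        "Job 350.support-mpi read start return code=0 session=22740",
        "05/16/2016 09:37:53.043;01; pbs_mom.22878;Job;TMomFinalizeJob3;"
        "Job 351.support-mpi read start return code=-2 session=22885",
    ]
    for job, rc, session in failed_starts(sample):
        print(f"{job}: start failed with return code {rc} (session {session})")
```

To run it against a real log, read the file with open(path) and pass the file object to failed_starts; the location of the mom_logs directory depends on your installation.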
The syslog then shows the reason for the prologue failure.
[root@support-mpi2 mom_logs]# grep "mom" /var/log/messages
May 16 09:37:53 support-mpi2 pbs_mom: LOG_ERROR::pelog_err, prolog/epilog failed, file: /var/spool/torque/mom_priv/prologue, exit: 1, nonzero p/e exit status
May 16 09:37:53 support-mpi2 pbs_mom: LOG_ERROR::handle_prologs, prolog failed
The log entries above show a prologue failure. The pbs_mom log says to check the syslog, which in our case is /var/log/messages. The prologue returned exit status 1, so the pbs_mom bailed out of the job.
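If many nodes need checking, the script name and exit status can be extracted from the pelog_err line automatically. This is a rough sketch that assumes the message layout matches the single syslog example above; other TORQUE versions may word it differently. parse_pelog_error is a hypothetical helper name.

```python
import re

# Pattern based only on the pelog_err syslog line shown above (an assumption).
PELOG_RE = re.compile(
    r"pbs_mom: LOG_ERROR::pelog_err, prolog/epilog failed, "
    r"file: (?P<file>[^,]+), exit: (?P<exit>-?\d+), (?P<reason>.*)$"
)


def parse_pelog_error(line):
    """Return the script path, exit status and reason, or None if no match."""
    m = PELOG_RE.search(line.rstrip("\n"))
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "exit": int(m.group("exit")),
        "reason": m.group("reason"),
    }


if __name__ == "__main__":
    # Sample line copied from the syslog excerpt in this article.
    line = (
        "May 16 09:37:53 support-mpi2 pbs_mom: LOG_ERROR::pelog_err, "
        "prolog/epilog failed, file: /var/spool/torque/mom_priv/prologue, "
        "exit: 1, nonzero p/e exit status"
    )
    print(parse_pelog_error(line))
```

The returned "file" value tells you which script to inspect, and "exit" is the status that script returned.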
In other cases, the user credentials may not match, or permissions on the work directory on the compute node may not allow the job to run. /tmp or $TORQUEHOME/spool may also be full.
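The disk-space check can be scripted and run on each compute node. The sketch below is only an illustration: it assumes the TORQUE home is /var/spool/torque (as the prologue path in the syslog example suggests) unless a TORQUEHOME environment variable is set, and the 95% threshold is an arbitrary choice. The function names are hypothetical.

```python
import os
import shutil


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use (used/total)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # some pseudo-filesystems report a zero size
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=95.0):
    """Return (path, percent_used) for existing directories at or above threshold."""
    report = []
    for path in paths:
        if not os.path.isdir(path):
            continue  # skip directories that do not exist on this node
        pct = percent_used(path)
        if pct >= threshold:
            report.append((path, pct))
    return report


if __name__ == "__main__":
    # Assumed default; adjust to match your installation.
    torque_home = os.environ.get("TORQUEHOME", "/var/spool/torque")
    for path, pct in nearly_full(["/tmp", os.path.join(torque_home, "spool")]):
        print(f"{path} is {pct:.1f}% full")
```

The figure can differ slightly from what df reports, because df excludes blocks reserved for root when computing its percentage.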