Why is my Torque prologue script unable to find the job file (.JB)?


Issue: Torque’s job file (.JB) does not appear to exist

 

Affected Version: All versions

 

Symptom: A prologue script attempts to extract information from the XML contained in Torque’s job file. This file normally appears in the jobs subdirectory of the mom_priv directory (example: “/var/spool/torque/mom_priv/jobs/7345.Moab-4881.JB”), but the script cannot find it. The script file (.SC) may or may not be there.

 

Solution: This is typically seen only on systems where that directory is NFS-mounted, although it can’t be ruled out for systems where that directory is part of the root filesystem. One cause may be delays in NFS caching, although there may be other causes as well. While it’s possible to configure NFS mount options to avoid that, doing so will also result in slower NFS performance. The best option is to account for this in the prologue script: have it wait on the file, giving it ample time to appear.

The delay before that file appears depends on multiple factors. In some cases the file may already be there. Other times it appears within a second of the prologue script being run. On other systems, however, it has been observed to take as long as 8 seconds to appear, and even longer delays cannot be ruled out. The system that took 8 seconds was a virtual machine running under OpenStack.

To handle this in the prologue script, use a loop. Within that loop, check for the file, and if it doesn’t yet exist, “sleep” for some period of time. This can be either a fixed length of time or increasingly longer times, but in any case the code should at some point give up, assume the file will not appear, and error out.

As an example, the following Python code is provided:

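import os
import sys
import time

# sys.argv[8] is expected to be the path to the job script (.SC);
# swap its extension to get the matching .JB file.
jobFile = os.path.splitext(sys.argv[8])[0] + ".JB"

# The .JB file might not be available yet; wait up to 32 seconds in total.
waitTimes = [0.5, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

for sleepSecs in waitTimes:
    if os.path.exists(jobFile):
        break
    time.sleep(sleepSecs)

try:
    with open(jobFile, "r") as f:
        jobXml = f.read()
except Exception as err:
    # The file is most likely really missing, so report it and abort.
    # The exit code is a placeholder; choose one that suits your site.
    sys.stderr.write("Cannot read job file %s: %s\n" % (jobFile, err))
    sys.exit(1)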
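
The example above uses increasingly longer sleeps. The fixed-interval alternative mentioned earlier could be sketched as a small helper, shown below. The function name wait_for_file and its timeout and interval parameters are hypothetical names chosen for illustration, and the default values are assumptions, not tested recommendations.

```python
import os
import time


def wait_for_file(path, timeout=32.0, interval=0.5):
    """Poll for path at a fixed interval.

    Returns True as soon as the file exists, or False once
    timeout seconds have passed without it appearing.
    (Hypothetical helper; the defaults are illustrative assumptions.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Give up: assume the file will not appear.
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A prologue script could call wait_for_file(jobFile) and, if it returns False, report the error and abort in whatever way suits the site. A fixed interval checks more often than the growing delays do, so a late file is noticed sooner, at the cost of more checks against the filesystem.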

Tags: job file, mom_priv/jobs
Last update: 2017-08-07 23:38
Author: Rob Greenbank
Revision: 1.0