Why is Moab seeing many "End of File" entries when trying to start jobs?


Issue:

Moab is seeing many "End of File" errors when trying to start jobs, and the errors repeat over and over.

Some log entries will look like the following:

2016-07-26T10:13:32.788-0400 29132 WARN MBF.c:MBFFirstFit:792 0x1100a785 job:6541378[10] Cannot start job 6541378[10]. (cannot start job 6541378[10] - RM failure, rc: 15033, msg: 'End of File')
2016-07-26T10:13:32.793-0400 29132 WARN MBF.c:MBFFirstFit:792 0x1100a785 job:6541378[11] Cannot start job 6541378[11]. (cannot start job 6541378[11] - RM failure, rc: 15033, msg: 'End of File')
2016-07-26T10:13:32.797-0400 29132 WARN MBF.c:MBFFirstFit:792 0x1100a785 job:6541378[12] Cannot start job 6541378[12]. (cannot start job 6541378[12] - RM failure, rc: 15033, msg: 'End of File')
2016-07-26T10:13:32.802-0400 29132 WARN MBF.c:MBFFirstFit:792 0x1100a785 job:6541378[13] Cannot start job 6541378[13]. (cannot start job 6541378[13] - RM failure, rc: 15033, msg: 'End of File')
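
To gauge how widespread the failures are, the log can be scanned for the RM failure detail shown in the entries above. The following is a minimal sketch, assuming the log lines keep the exact "RM failure, rc: ..., msg: '...'" layout seen here; the function name count_rm_failures and the SAMPLE list are illustrative and not part of Moab.

```python
import re
from collections import Counter

# Matches the parenthesized failure detail seen in the log excerpt,
# e.g. "cannot start job 6541378[10] - RM failure, rc: 15033, msg: 'End of File'".
FAILURE_RE = re.compile(
    r"cannot start job (?P<job>\S+) - RM failure, "
    r"rc: (?P<rc>\d+), msg: '(?P<msg>[^']*)'"
)


def count_rm_failures(lines):
    """Return a Counter keyed by (rc, msg) for each RM start failure found."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(int(match.group("rc")), match.group("msg"))] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "2016-07-26T10:13:32.788-0400 29132 WARN MBF.c:MBFFirstFit:792 0x1100a785 "
    "job:6541378[10] Cannot start job 6541378[10]. (cannot start job 6541378[10] "
    "- RM failure, rc: 15033, msg: 'End of File')",
    "2016-07-26T10:13:32.793-0400 29132 WARN MBF.c:MBFFirstFit:792 0x1100a785 "
    "job:6541378[11] Cannot start job 6541378[11]. (cannot start job 6541378[11] "
    "- RM failure, rc: 15033, msg: 'End of File')",
]

if __name__ == "__main__":
    for (rc, msg), n in count_rm_failures(SAMPLE).most_common():
        print(f"rc={rc} msg={msg!r} count={n}")
```

To run it against a real log, pass an open file object instead of SAMPLE; the log path depends on the local Moab installation.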


Solution:

The two issues that required code changes are:
MOAB-8496
TRQ-3539

This is most likely the result of a socket that has been closed. The API returns the error PBSE_PROTOCOL and sets the error message to "End of File". Despite this, Moab continues to attempt to start jobs over the same socket, so every subsequent attempt fails with the same PBSE_PROTOCOL / End of File error. The TORQUE API has been improved to return a more specific error than PBSE_PROTOCOL, and Moab has been improved so that it no longer attempts to start jobs using a socket that has been closed. Now, when Moab sees these errors, it opens a new socket for communication with TORQUE.
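
The behavioral change described above can be pictured with a generic sketch. This is not Moab or TORQUE code; every name in it (ClosedConnectionError, FakeConnection, start_jobs) is hypothetical and exists only to show the idea of replacing a closed connection rather than reusing it for each following job.

```python
class ClosedConnectionError(Exception):
    """Hypothetical stand-in for an 'End of File' error on a closed socket."""


class FakeConnection:
    """Hypothetical connection that can be marked closed, for illustration."""

    def __init__(self):
        self.closed = False

    def start_job(self, job_id):
        if self.closed:
            raise ClosedConnectionError("End of File")
        return f"started {job_id}"


def start_jobs(job_ids, connect, conn):
    """Start each job, opening a new connection when the current one is closed."""
    results = []
    for job_id in job_ids:
        try:
            results.append(conn.start_job(job_id))
        except ClosedConnectionError:
            # Open a fresh connection instead of reusing the dead one,
            # then retry this job once; later jobs use the new connection.
            conn = connect()
            results.append(conn.start_job(job_id))
    return results


if __name__ == "__main__":
    dead = FakeConnection()
    dead.closed = True
    print(start_jobs(["6541378[10]", "6541378[11]"], FakeConnection, dead))
```

Without the reconnect in the except branch, every job in the loop would fail with the same error, which is the pattern seen in the log entries.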

In addition, make sure you set the TORQUE server parameter "set server tcp_timeout=320", or a greater value depending on the number of jobs and the job environment size.
Also set Moab's RMCFG timeout with "RMCFG[] TIMEOUT=300". These settings give Moab and TORQUE the flexibility to handle a large influx of jobs, specifically while TORQUE builds up the job query and node query that are sent to Moab. When a large number of jobs is submitted, TORQUE has to read all the jobs, build the jobs table and message, and then send that to Moab. Compiling and sending that much information can take a long time.
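
As a rough illustration, the two settings might be applied as follows. The resource manager name "torque" inside RMCFG[] is an assumption; use the name already defined in your moab.cfg. The values are the ones recommended above and may need to be raised for larger workloads.

```
# On the TORQUE server host: set the timeout through qmgr.
qmgr -c "set server tcp_timeout = 320"

# In moab.cfg: "torque" is a placeholder for your resource manager name.
RMCFG[torque] TIMEOUT=300
```

The new server value can be checked with qmgr -c "print server". Moab typically needs to reread its configuration (for example, through a restart) before a moab.cfg change takes effect.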

Tags: 15033, cannot start job, End of File
Last update:
2016-08-08 16:52
Author:
Jason Booth
Revision:
1.0