Why does Moab misinterpret the hostlist from Torque, reserving nodes that are not part of the job?


**Warning: This is not intended for external audiences until it has received buy-off and been scrubbed for customer-specific info.**

Some issues have cropped up in 5.1 / 8.1 involving certain node naming schemes and the newer condensed RM queries. In short, both checkjob and showres show nodes dedicated to a job that, according to Torque's qstat -f, have nothing to do with that job.

In this example, the request was nodes=12:ppn=16, yet we saw reservations like the following. The Allocated Nodes line includes nodes with a task count of 1, which should be impossible when every node in the job is requested with ppn=16.


n0127 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0129 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0130 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0131 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0132 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0133 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0134 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0140 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0141 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0142 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0143 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0144 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0616 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0634 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0961 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0962 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0963 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0964 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0965 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0966 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0967 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0968 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0969 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0970 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0971 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0972 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0973 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0974 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0975 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0976 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0977 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0978 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0979 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0980 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0981 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0982 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0983 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0984 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0985 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0986 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0987 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0988 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0989 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0990 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0991 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0992 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0993 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0994 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0995 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0996 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0997 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0998 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n0999 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1000 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1001 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1002 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1003 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1004 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1005 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1006 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1007 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1008 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1432 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1433 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1434 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1435 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1436 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49
n1502 Job 1374630 Running 1 -2:09:56:50 10:00:00:00 Sun Aug 23 23:40:49

Allocated Nodes:
n[0127,0129-0134,0140-0144]*16:n[0570,0579-0580,0615-0616,0634,0961-1008,1432-1436,1502]*1
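
To see how far the allocation above strays from the request, it can help to expand the condensed expression and count nodes by task count. The following is only a rough sketch: the grammar (groups separated by ':', bracketed index lists, and a '*N' suffix for tasks per node) is inferred from the single line shown above rather than from any Moab specification, and the function names are made up for illustration.

```python
import re

# One group of the expression: name prefix, optional [index list], optional *tasks.
GROUP_RE = re.compile(r"([^\[\]*]+)(?:\[([^\]]*)\])?(?:\*(\d+))?")


def expand_allocated_nodes(expr):
    """Expand a condensed node expression into {node_name: tasks_per_node}."""
    tasks = {}
    for group in expr.strip().split(":"):
        match = GROUP_RE.fullmatch(group)
        if match is None:
            raise ValueError(f"unrecognized group: {group!r}")
        prefix, index_list, count = match.groups()
        count = int(count) if count else 1  # assumption: a missing *N means 1 task
        if index_list is None:
            tasks[prefix] = count  # plain host name without brackets
            continue
        for item in index_list.split(","):
            first, _, last = item.partition("-")
            last = last or first  # a single index is a range of length one
            width = len(first)  # preserve zero padding such as 0127
            for index in range(int(first), int(last) + 1):
                tasks[f"{prefix}{index:0{width}d}"] = count
    return tasks


def nodes_not_matching_ppn(tasks, ppn):
    """Return the nodes whose task count differs from the requested ppn."""
    return sorted(node for node, count in tasks.items() if count != ppn)


allocated = "n[0127,0129-0134,0140-0144]*16:n[0570,0579-0580,0615-0616,0634,0961-1008,1432-1436,1502]*1"
tasks = expand_allocated_nodes(allocated)
suspect = nodes_not_matching_ppn(tasks, ppn=16)
print(len(tasks) - len(suspect), "nodes with 16 tasks")  # 12
print(len(suspect), "nodes with some other task count")  # 60
```

For the line above, this gives the 12 nodes the job asked for plus 60 nodes carrying a single task each, which are the ones worth checking against Torque.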

This is VERY problematic, as it quickly consumes resources for job allocations that do not exist. For now, the workaround is to set this flag on the RMCFG line:

RMCFG[] FLAGS=NoCondensedQuery

As for running jobs that are already blocking large swaths of the cluster, you should be able to clear the stray reservations with 'releaseres' after a recycle.
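
Before releasing anything, it may be worth confirming which reserved nodes Torque really knows nothing about. The sketch below compares a list of node names taken from showres against the exec_host value from qstat -f. It assumes exec_host entries are joined with '+' and look like host/0 or host/0-15; check this against your Torque version. The function names and the sample values at the bottom are invented for illustration and are not taken from the affected site.

```python
def parse_exec_host(exec_host):
    """Return {host: slot_count} from a Torque exec_host string (assumed format)."""
    slots = {}
    for entry in exec_host.strip().split("+"):
        host, _, indices = entry.partition("/")
        count = 0
        if not indices:
            count = 1  # assumption: a bare host name stands for one slot
        else:
            for part in indices.split(","):
                first, _, last = part.partition("-")
                count += int(last or first) - int(first) + 1
        slots[host] = slots.get(host, 0) + count
    return slots


def nodes_unknown_to_torque(moab_nodes, exec_host):
    """Return nodes Moab has reserved for the job that Torque did not assign."""
    return sorted(set(moab_nodes) - set(parse_exec_host(exec_host)))


# Hypothetical sample values, for illustration only.
sample_exec_host = "n0127/0-15+n0129/0-15"
sample_moab_nodes = ["n0127", "n0129", "n0961"]
print(nodes_unknown_to_torque(sample_moab_nodes, sample_exec_host))
```

Any node this reports is a candidate for a stray reservation; nodes that appear in both lists belong to the job and should be left alone.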

Last update: 2015-08-27 17:22
Author: Nathan Burton
Revision: 1.0