W
- Where does showstats pull its information from?
Problem: Showstats: If you run it by itself, it will pull from the .moab.ck. If you run it with a -t 24:00:00, it will pull from ... - What does (Est/Avg Backlog) mean in showstats?
Issue: What does (Est/Avg Backlog) mean in showstats? Symptom: [root@moab ~]$ showstats moab active for 00:00:54 stats initialized on Wed May 20 13:19:37 2015 Eligible/Idle Jobs: 0/0 (0.000%) Active Jobs: 1 Successful/Completed ... - Why do preemption and JOBNODEMATCHPOLICY EXACTNODE not seem to work together?
Issue: Why do preemption and JOBNODEMATCHPOLICY EXACTNODE not seem to work together? Affected Versions: 7.2.8 and older, and 8.0.x. Symptom: When setting JOBNODEMATCHPOLICY EXACTNODE, preemption seems to not work ... - Why do my jobs appear in batchhold when plenty of resources are available?
Issue: Why do my jobs appear in batchhold when plenty of resources are available? Affected Versions: 5.0.1, 5.1.0, 4.2.10 Symptom: On a system where 4 procs are available you ... - Why will my job not start when there is no other job on the compute resource but CPU usage on that node is high?
Issue: Why will my job not start when there is no other job on the compute resource but CPU usage on that node is high? Affected Versions: ... - Why is Moab not responding to client commands?
Issue: Why is Moab not responding to client commands? Affected Versions: 7.2.x, 8.0.x, 8.1.x Symptom: When running client commands the command times out. Jobs appear to still ... - Why does MAM have double entries for a job charge?
Issue: Why does MAM have double entries for a job charge? Affected Versions: All versions before 8.1.0 Symptom: Id Type Instance Charge Stage User Group Account Organization Class QualityOfService ... - Why does Moab append a qos from the IDCFG and not overwrite it?
Issue: When using an identity manager, Moab appends the QOS after removing an old QOS and adding a new QOS. In other words it appends instead ... - Why does TORQUE leave many defunct moms behind?
Issue: Over time pbs_moms build up on compute nodes or Cray login nodes. Symptom: On the compute nodes or Cray login nodes you see many pbs_mom processes in ... - Why do Moab client commands time out even with UIMANAGEMENTPOLICY and CLIENTUIPORT set?
Issue: Moab client commands sometimes report timeouts even with UIMANAGEMENTPOLICY and CLIENTUIPORT set. Symptom: ERROR: client timed out after 60 seconds (hostname:port=moab:42559) Solution: Consider checking the moab.cfg on the submithost or host ... - Why are processes not being assigned/tracked with mpiexec?
Issue: TORQUE is only tracking resource usage of processes on the first node in the nodelist. Symptom: MPICH 3.1. RHEL/CentOS 6.x When using mpiexec to launch processes across multiple nodes, the ... - Why do I see "cannot establish connection - Operation now in progress" when running commands as a non-root user?
Issue: When running client commands as a non-root user, I see: ERROR: cannot establish connection - Operation now in progress (hostname:port=:42559) Symptom: You will see the keywords "Operation now ... - Why do I see Moab logs full of "No QueueTime has been specified for job"?
Issue: Moab logs are saying that several jobs have "No QueueTime has been specified for job" Symptom: I'm seeing several log errors in Moab like: 2015-08-10T09:47:09.708-0400 23496 WARN MRMJobLoad.c:MRMJobPostLoad:332 0x1100a918 ... - Why is Moab appending qlist to a user via the IDCFG when it should be overwriting?
Issue: Moab is appending qlist to a user via the IDCFG when it should be overwriting. Symptom: Moab supports an identity manager via the IDCFG line where a ... - What does the % column in mdiag -f stand for?
Issue: What does the % column in mdiag -f stand for? Symptom: mdiag -f FairShare Information Depth: 6 intervals Interval Length: 00:20:00 Decay Rate: 0.50 FS Policy: DEDICATEDPES System FS Settings: Target ... - Why does a Job not cancel when a node has failed?
Issue: Moab is no longer cancelling jobs after node failure. Symptom: Checkjob reports the node failure, and the job gets into a state "Cancelling" (shown with showq), but ... - Why does Moab misinterpret the hostlist from Torque, reserving nodes that are not part of the job?
**Warning: This is not intended for external audiences until it has received buy-off and been scrubbed for customer-specific info** There have been some issues ... - Why do I see "ALERT: Moab is configured to use Mongo, but no MONGOSERVER specified."?
Issue: Why do I see "ALERT: Moab is configured to use Mongo, but no MONGOSERVER specified."? Symptom: When running mdiag -R -v you see the following message: ALERT: Moab ... - What are common things to look at when troubleshooting datawarp?
Issue: What are common things to look at when troubleshooting datawarp? Solution: Double-check that "JOBMIGRATEPOLICY" is set to "JUSTINTIME". If this is set to "IMMEDIATE" then datawarp will not work ... - What is the difference between node locked gres and global node gres?
Issue: What is the difference between node locked gres and global node gres? Symptom: If Moab is set up to have a gres on a specific node, such as ... - What does AvgQH mean from showstats?
Issue: What does AvgQH mean from showstats? Solution: AvgQH* Average queue time (in hours) of jobs. In my case here are ... - Which rpm packages are necessary for an install of Moab?
Question: I'm getting ready to install the rpm for Moab and I need to know which rpms are actually needed for the install. Solution: The install ... - Why does my job run on only 1 node?
Issue: Why does my job run on only 1 node? Example: Resource_List.nodes = 10:ppn=1 The job then runs on only 1 node and not 10 different nodes. Here is ... - Why do my Moab/SLURM Jobs report "Invalid job credential" with srun?
Issue: Jobs that are launched with srun fail with: [jbooth@support-slurm ~]$ srun -N2 -l -t 30 /bin/hostname srun: Job is in held state, pending scheduler release srun: ... - Why do I see "could not locate requested gpu resources" when the job requested none?
Issue: PBS_SERVER occasionally reports the following in the server logs: "could not locate requested gpu resources". Symptom: When submitting jobs into TORQUE, some jobs appear ... - When using qsub -k what happens to the job output on the pbs_mom?
Issue: When using qsub -k what happens to the job output on the pbs_mom? Solution: With a -k the error and output are stored in /var/spool/torque/spool/ on the ... - What is the best way to monitor GPU usage in MAM?
Issue: What is the best way to monitor GPU usage in MAM? Solution: What you do is add a usage record to MAM of type Gpus. Moab passes ... - What attributes are supported with a fairshare tree?
Issue: What attributes are supported with a fairshare tree? Solution: Fairshare trees are a great way to compartmentalize credentials and priority. They are also a great way ... - Why is MAM yum package size significantly smaller?
Issue: MAM 9.0.1 appears to be 4.3 megs when running "yum info moab-accounting-manager". However, when looking at 9.0.2, MAM reports 1.4 M. [root@support-901 moab-hpc-suite-9.0.2-1469837953-el7]# yum info moab-accounting-manager Loaded plugins: ... - Why is Moab seeing many "End of File" entries when trying to start a job?
Issue: Moab is seeing many "End of File" when trying to start a job. This appears to be happening over and over again. Some log entries will look ... - What are the accounting "x attributes" in qstat -f?
Issue: What are the accounting "x attributes" in qstat -f? The Torque configure script (for building it) has this option: --enable-acct-x enable adding x attributes to accounting ... - What '#' directives does msub pass to qsub?
Issue: What '#' directives does msub pass to qsub? Example: In short, any #MSUB directives are converted to #PBS [john@support-mpi ~]$ cat pbs.sh #!/bin/bash #MSUB -l procs=3 #PBS -l walltime=35 sleep ... - Why does TORQUE send the mom hierarchy to down nodes?
Issue: Why does TORQUE send the mom hierarchy to down nodes? I have 6 nodes (out of 590) that are marked down and/or offline. I have 5000+ messages ... - Why does Moab provision every compute node in a multi-req request?
Question: Why does Moab provision every compute node in a multi-req request? Solution: Moab supports advanced multi-req resource requests within the same job using the msub/qsub "-L" syntax. For ... - What to check after a crash.
Question: Our machine running Moab and Torque recently crashed. Do I need to take any action before bringing moab and pbs_server back up to verify ... - Why can't I set a negative exit code for my jobs?
Question: Why can I not set a negative exit code for my jobs when Torque is able to do so? Answer: The exit codes come from ... - What was resolved in the Moab Workload Manager 9.1.0 release?
The following items were resolved in the Moab v9.1.0 release: Moab was not reporting the correct task count for jobs that allocate ALLPROCS. The systemd ... - What was resolved in the Moab Accounting Manager 9.1.0 release?
The following items were resolved in the Moab Accounting Manager v9.1.0 release: Events did not log notifications to the Notification table by default, as ... - Why is my Torque prologue script unable to find the job file (.JB)?
Issue: Torque’s job file (.JB) does not appear to exist Affected Version: All versions Symptom: A prologue script is attempting to mine information from the XML ... - Why is there not a checkpoint file for my job submitted from another host?
Issue: Missing checkpoint (.cp) file for job submitted from a submit host. Affected Version: All versions, Moab with Torque Symptom: A job is submitted from a ... - Why are licenses being overallocated for array jobs?
Issue: When submitting array jobs, all jobs run at once even if generic resources are not available. For resources like licenses, managed by FlexLM, this ... - Why doesn't my job start? (overlapping a reservation)
Issue: Users' jobs may overlap reservations, and users do not realize this. Affected Version: All versions Symptom: A system has a maintenance reservation scheduled, and a ... - What the expected response(s) are by Moab for these scripts: CREATEURL, STARTURL, PAUSEURL, RESUMEURL, UPDATEURL, CONTINUEURL, ENDURL, DELETEURL, and QUERYURL
Problem: I need to know for each of the following, what the expected response(s) are by Moab for these scripts: CREATEURL, STARTURL, PAUSEURL, RESUMEURL, UPDATEURL, CONTINUEURL, ENDURL, ... - Why don't my GPU jobs run while all other jobs do?
Issue: Known bugs in the hwloc library require workarounds with some Nvidia GPUs. Affected Version: All versions of Torque (this article specifically addresses 6.1.2) Symptom: A site ... - Why are spaces in user/group names causing problems?
Issue: One or more users are unable to run jobs, and are seeing authentication errors, usually for a group "doman_users", or possibly for a user ... - Why are my GPU nodes not being scheduled correctly?
Issue: Jobs requesting GPUs will not run due to lack of GPU resources, yet the GPUs are available. Cause: Most likely, Torque was compiled without the ... - Why is trqauthd failing to start during redhat/centos boot?
Issue: When the system boots, trqauthd is failing to start. Yet "systemctl start trqauthd" brings it up just fine. Symptoms: Running systemctl status trqauthd produces output ... - Why does "msub" give the error "qsub: illegal -d value"?
Symptoms: When jobs are submitted using "msub", the following error is observed, either returned on the command-line or in the "checkjob" messages: job submit failed - [qsub: ... - Why is my spark work directory filling up the hard drive?
There are circumstances that can cause the /opt/spark-2.1.2/work directory to contain so much data that it fills the hard drive. This happens because every ... - Why are workload queries timing out and job completions not being detected?
SYMPTOMS When using Torque, the workload query times out repeatedly, and after that Moab no longer sees job completions or jobs submitted through qsub. Raising the ... - Why can I no longer log into Viewpoint after an OS update?
Symptom: After using "yum update" to apply updates to the Viewpoint server, the user is no longer able to authenticate. Issue: The "tomcat" process may be running ... - What are the meanings of the fields returned by the "mstats" command?
Issue: The statistics for mstats are labeled with cryptic names, and are not descriptive Affected Version: All currently supported releases Problem: Running the mstats --xml command provides a lot of statistics, but ... - Why do I have incorrect job charges in Moab Accounting Manager?
Problem: Some of the charges in MAM are not reflecting the actual job run times, as reported by Torque. The "tracejob" output is showing the correct ... - What are the fields in the events statistics files?
These events files are found in the $MOABHOMEDIR/stats directory, and contain detailed logging of job-related events. The filename is of the format "events.<weekday>_<month>_<day>_<year>". For example, ...