Issue: Known bugs in the hwloc library require work-arounds with some Nvidia GPUs
Affected Version: All versions of Torque (this article specifically addresses 6.1.2)
Symptom: GPU jobs never run even though GPU nodes are available. The jobs may appear to start, then go back to idle.
Check the /var/log/messages file for errors similar to this: pbs_mom: LOG_ERROR::initializeGpu, could not open /sys/bus/pci/devices/00000000:02:00.0/local_cpulist
A bug in the "hwloc" library causes the device paths to be incorrect for some GPU devices. The authors of "hwloc" have indicated they will not fix this any time soon. It does not happen with all GPUs, just certain Nvidia ones.
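To confirm that a node is hitting this problem, it can help to pull the failing paths out of the log and see whether they exist. The following is a minimal Python sketch, not part of Torque. The function names are hypothetical, and the "short domain" rewrite is an assumption: it guesses that the bad path differs from the real one only in using an 8-digit PCI domain where Linux normally uses 4 digits (for example 0000:02:00.0). This article does not confirm that this is the exact mismatch.

```python
import os
import re

# Matches the pbs_mom error quoted in this article and captures the path.
ERROR_RE = re.compile(
    r"initializeGpu, could not open "
    r"(?P<path>/sys/bus/pci/devices/[0-9A-Fa-f:.]+/local_cpulist)"
)


def find_failed_gpu_paths(log_lines):
    """Return the unique paths that pbs_mom reported it could not open."""
    paths = []
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths


def short_domain_variant(path):
    """ASSUMPTION: rewrite an 8-hex-digit PCI domain with four leading
    zeros to the 4-digit form. Paths already in that form are unchanged."""
    return re.sub(r"/devices/0000([0-9A-Fa-f]{4}:)", r"/devices/\1", path)


def check_log(log_path="/var/log/messages"):
    """Return (reported_path, reported_exists, variant_path, variant_exists)
    for every failing path found in the log."""
    with open(log_path, errors="replace") as handle:
        failed = find_failed_gpu_paths(handle)
    results = []
    for path in failed:
        variant = short_domain_variant(path)
        results.append(
            (path, os.path.exists(path), variant, os.path.exists(variant))
        )
    return results
```

Running check_log() on an affected GPU node (with read access to /var/log/messages) returns one tuple per failing path. If the reported path is missing but the variant exists, that is consistent with the incorrect-path bug described above, although it is not proof.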
Solution: There is a patch that works around this issue for Torque 6.1.2. The patch must be applied, Torque rebuilt, and the pbs_mom daemons replaced with the newly built ones.
If you are not an Adaptive employee, please open a Salesforce ticket and refer to this knowledge base article in the ticket.
If you are an Adaptive employee, refer to the Jira ticket TRQ-4199. There is a patch attached to that ticket, but you may need to request a new patch for versions other than 6.1.2.
Tags: GPU, GPU jobs, GPUs