Issue: Jobs launched with srun fail with:
[jbooth@support-slurm ~]$ srun -N2 -l -t 30 /bin/hostname
srun: Job is in held state, pending scheduler release
srun: job 500007 queued and waiting for resources
srun: job 500007 has been allocated resources
srun: error: Task launch for 500007.0 failed on node support-sn1: Invalid job credential
srun: error: Task launch for 500007.0 failed on node support-sn2: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Symptom: Jobs start and then fail with "Invalid job credential".
Affected Version: All
Solution:
You will need to set up the job credential keys across the cluster so that the compute nodes accept the job credential. These are the key files named in slurm.conf, not SSH login keys.
See JobCredentialPrivateKey and JobCredentialPublicCertificate in the man page for slurm.conf:
JobCredentialPrivateKey
    Fully qualified pathname of a file containing a private key used for authentication by Slurm
    daemons. This parameter is ignored if CryptoType=crypto/munge.
JobCredentialPublicCertificate
    Fully qualified pathname of a file containing a public key used for authentication by Slurm
    daemons. This parameter is ignored if CryptoType=crypto/munge.
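If you are not using munge, the key pair referenced by these two parameters can be generated with OpenSSL. The following is a minimal sketch, not a definitive procedure: the scratch directory, the file names slurm.key and slurm.cert, and the 2048-bit key size are assumptions chosen for illustration.

```shell
#!/bin/sh
set -eu

# Assumption: generate the key pair in a scratch directory first, then
# copy the files to the paths you configure in slurm.conf.
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"
KEY="$KEY_DIR/slurm.key"     # hypothetical file name
CERT="$KEY_DIR/slurm.cert"   # hypothetical file name

# Keep the private key unreadable by other users while it is created.
umask 077

# Generate an RSA private key (the key size is an assumption).
openssl genrsa -out "$KEY" 2048

# Derive the matching public key from the private key.
openssl rsa -in "$KEY" -pubout -out "$CERT"

chmod 600 "$KEY"
chmod 644 "$CERT"

echo "private key: $KEY"
echo "public key:  $CERT"
```

After copying the files into place, point JobCredentialPrivateKey and JobCredentialPublicCertificate in slurm.conf at their final locations and restart the Slurm daemons. The public key must be the one derived from the private key in use. A mismatched or stale copy on a compute node is a likely cause of "Invalid job credential".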
If you are using "CryptoType=crypto/munge", make sure that munged is running on the controller and on all the compute nodes, and that munge.key is identical across the cluster.
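One way to check the munge setup is to encode a credential on one host and decode it on another. The sketch below wraps that round trip in small helper functions. The function names are hypothetical, and it assumes passwordless ssh from the host where you run it to each compute node.

```shell
#!/bin/sh
# Sketch only: the helper names are hypothetical and the node names in
# the usage example are taken from the error output above.

# Encode and decode a credential on the local host.
# Fails if munged is not running locally.
check_munge_local() {
    munge -n | unmunge >/dev/null
}

# Encode a credential locally and decode it on a remote node.
# Fails if munged is not running on either end or if the two hosts
# hold different copies of munge.key.
check_munge_remote() {
    munge -n | ssh "$1" unmunge >/dev/null
}

# Check every node given as an argument; return non-zero if any fails.
check_nodes() {
    rc=0
    for node in "$@"; do
        if check_munge_remote "$node"; then
            echo "$node: munge credential accepted"
        else
            echo "$node: munge credential REJECTED"
            rc=1
        fi
    done
    return "$rc"
}

# Usage example (run from the controller or a login node):
#   check_munge_local && check_nodes support-sn1 support-sn2
```

A node reported as REJECTED is the place to look first: confirm that munged is running there and replace its munge.key with the copy used by the rest of the cluster, then restart munged.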
Tags: Invalid job credential, srun