My MPI jobs fail to run when submitted through TORQUE, but they run correctly outside of TORQUE


Issue: My MPI jobs fail to run when submitted through TORQUE; however, they run correctly when launched outside of TORQUE.


Affected Versions: All


Symptom:

MPI_Init: ibv_create_cq() failed
MPI_Init: Can't initialize RDMA device
MPI_Init: Internal Error: Cannot initialize RDMA protocol
MPI Application rank <NN> exited before MPI_Init() with status 1
mpirun: Broken pipe

Solution:

When a batch job is created through TORQUE, the job's environment and ulimits are sometimes inherited from the pbs_mom daemon rather than from the OS environment, as they normally would be. In this case the "max memory size" limit (ulimit -m) is not sufficient for the MPI job to run correctly.
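To confirm that the inherited limit is the problem, you can submit a trivial job that prints the limit it actually received and compare it with an interactive shell on the same node. A minimal sketch (the output filename limits.out is arbitrary):

echo 'ulimit -m' | qsub -j oe -o limits.out
# When the job finishes, limits.out contains the "max memory size"
# the job inherited from pbs_mom; compare it with the value that
# "ulimit -m" prints in a login shell on the compute node.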

You can edit the init script for pbs_mom to set ulimit -m before pbs_mom is executed. This allows the mom to run with a memory limit greater than 64 (ulimit -m values are in kilobytes). Alternatively, if you start pbs_mom by hand, set the current shell's ulimit with the command "ulimit -m unlimited" before running "pbs_mom".
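For example, assuming the init script lives at /etc/init.d/pbs_mom and the daemon binary at /usr/local/sbin/pbs_mom (both paths are assumptions; adjust them for your installation), the change is roughly:

# In /etc/init.d/pbs_mom, just before the daemon is started:
ulimit -m unlimited        # raise "max memory size" for pbs_mom and
                           # for every job environment it spawns
/usr/local/sbin/pbs_mom    # assumed install path; use your own

# Or, when starting pbs_mom by hand:
ulimit -m unlimited
pbs_mom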

Tags: Cannot initialize RDMA protocol, mpi
Last update: 2015-06-11 00:00
Author: Jason Booth
Revision: 1.2
